SlideShare a Scribd company logo
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

The Main Principles of Text-to-Speech Synthesis
System
.R. Aida–Zade, C. Ardil and A.M. Sharifova
Such systems can be used in communication systems, in
information referral systems, it can be applied to help people
who lost seeing and reading ability, in acoustic dialogue of
users with computer and in others fields. In general, synthesis
of speech can be necessary in all the cases when the addressee
of the information is a person.

Abstract—In this paper, the main principles of text-to-speech
synthesis system are presented. Associated problems which arise
when developing speech synthesis system are described. Used
approaches and their application in the speech synthesis systems for
Azerbaijani language are shown.
Keywords—synthesis of Azerbaijani language, morphemes,
phonemes, sounds, sentence, speech synthesizer, intonation, accent,
pronunciation.

II. PREVIOUS WORKS
The earliest efforts to produce synthetic speech date as far
back as XVIII century. Despite the fact that the first attempts
were in the form of mechanical machines, we can say today
that these synthesizers were of a high quality. In 1779 in
St. Petersburg, Russian Professor Christian Kratzenshtein
explained physiological differences between five long vowels
(/a/, /e/, /i/, /o/, and /u/) and made an apparatus to produce
them artificially. In 1791 in Vienna, Wolfgang von Kempelen
introduced his "Acoustic-Mechanical Speech Machine". In
about mid 1800's Charles Wheatstone constructed his famous
version of von Kempelen's speaking machine.
There have been three generations of speech synthesis
systems [1]. During the first generation (1962-1977) formant
synthesis of phonemes was the dominant technology. This
technology made use of the rules based on phonetic
decomposition of sentence to formant frequency contours.
The intelligibility and naturalness were poor in such
synthesis. In the second generation of speech synthesis
methods (from 1977 to 1992) the diphones were represented
with the LPC parameters. It was shown that good
intelligibility of synthetic speech could be reliably obtained
from text input by concatenating the appropriate diphone
units. The intelligibility improved over formant synthesis, but
the naturalness of the synthetic speech remained low. The
third generation of speech synthesis technology is the period
from 1992 to the present day. This generation is marked by the
method of “unit selection synthesis” which was introduced
and perfected, by Sagisaka at ATR Labs. in Kyoto. The
resulting synthetic speech of this period was close to humangenerated speech in terms of intelligibility and naturalness.
Modern speech synthesis technologies involve quite
complicated and sophisticated methods and algorithms.
“Infovox” [2] speech synthesizer family is perhaps one of the
best known multilingual text-to-speech products available
today. The first commercial version, Infovox SA-101, was
developed in Sweden at the Royal Institute of Technology in
1982 and it is based on formant synthesis. The latest full
commercial version, Infovox 230, is available for American
and British English, Danish, Finnish, French, German,
Icelandic, Italian, Norwegian, Spanish, Swedish, and Dutch.

I. INTRODUCTION

International Science Index 37, 2010 waset.org/publications/8303

I

N the XXI century widespread use of computers opened a
new stage in information interchange between the user and
the computer. Among other things, an opportunity to input the
information to computer through speech, and to reproduce in
voice text information stored in the computer have been made
possible. The paper is dedicated to the second part of this
issue, i.e. to the computer-aided text-to-speech synthesis, that
is recognized as a very urgent problem. Nowadays the solution
of this problem can be applied in various fields. First of all, it
would be of great importance for people with weak eyesight.
In the modern world, it is practically impossible to live
without an information exchange. The people with weak
eyesight face with big problems while receiving the
information through reading. A lot of methods are used to
solve this problem. For example, the sound version of some
books is created. As a result, people with weak eyesight have
an opportunity to receive the information by listening. But
there can be a case when the sound version of the necessary
book couldn’t be found.
Therefore, the implementation of the speech technologies
for information exchange for users with weak eyesight is of a
crucial necessity. In fact, computer synthesis of speech opens
a new direction for an information transfer through the
computer. For today it is mainly possible through the monitor.
Synthesis of speech is the transformation of the text to
speech. This transformation is converting the text to the
synthetic speech that is as close to real speech as possible in
compliance with the pronunciation norms of special language.
TTS is intended to read electronic texts in the form of a book,
and also to vocalize texts with the use of speech synthesis.
When developing our system not only widely known modern
methods but also a new approach of processing speech signal
was used.
Kamil Aida-Zade is with the institute of Cybernetics of the National Academy
of Sciences, Baku, Azerbaijan.
Cemal Ardil is with the National Academy of Aviation, Baku, Azerbaijan.
Aida Sharifova is with the institute of Cybernetics of the National Academy
of Sciences, Baku, Azerbaijan.

1
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

Digital Equipment Corporation [3] (DEC) talk system is
originally descended from MITalk and Klattalk. The present
version of the system is available for American English,
German and Spanish and offers nine different voice
personalities, four male, four female and one child. The
present DECtalk system is based on digital formant synthesis.
AT&T Bell Laboratories [4] (Lucent Technologies) also has
very long traditions with speech synthesis. The first full textto-speech (TTS) system was demonstrated in Boston 1972 and
released in 1973. It was based on articulatory model
developed by Cecil Coker (Klatt 1987). The development
process of the present concatenative synthesis system was
started by Joseph Olive in mid 1970's (Bell Labs 1997). The
current system is available for English, French, Spanish,
Italian, German, Russian, Romanian, Chinese, and Japanese
(Mcbius et al. 1996).
Currently, different research teams are working at this
problem. Some of them are described in the Table 1.

MBROLA [20]
Turkish Synthesis
TTTS [21]

International Science Index 37, 2010 waset.org/publications/8303

Hadifix [6]
Waveform
Synthesis[7]
Multilingual TTS
system [8]
English Synthesis
Laureate [9]
YorkTalk [10]

SPRUCE [11]

French Synthesis
SpeechMill [12]

IMS Uni
Stuttgart
IKP Uni Bonn
TU Dresden
TI Uni
Duisburg
British
Telecom
University of
York
Essex Speech
Group

ICP [13]

The University
of Lausanne
Grenoble

CNET [14]

Lannion

Spanish Synthesis
University of Madrid [15]

LIMSI TTS System
LIMSI[16]
CNRS
Greek Synthesis
University of Patras [17]
DEMOSTHeNES[18]
Arabic Synthesis
Sakhr Software [19]

syllable-based
concatenative

Text to speech synthesis is converting the text to the
synthetic speech that is as close to real speech as possible
according to the pronunciation norms of special language.
Such systems are called text to speech (TTS) systems. Input
element of TTS system is a text, output element is synthetic
speech. There are two possible cases. When it is necessary to
pronounce the limited number of phrases (and their
pronouncing linearly does not vary), the necessary speech
material is simply recorded in advance. In this case, certain
problems are originated. For example in this approach, it is not
possible to sound the text, which is not known in advance. For
this purpose the pronounced text has to be kept in computer
memory. And it will lead to increase of the size of memory
required for information content. This will bring to essential
load of computer memory in case of much information and
can create certain problems in operation. The main approach
used in this paper is voicing of previously unknown text based
on a specific algorithm.
It is necessary to note that the approach to solving problem
of speech synthesis essentially depends on the language for
which it will be used [22], and that the majority of currently
available synthesizers basically were generated for UKEnglish, Spanish and Russian languages, and these
synthesizers had not been applied to the Azerbaijani language
yet. Azerbaijani language like as Turkish is an agglutinative
language. In the view of the specificity of the Azerbaijani
language, the special approach is required.
Every language has its own unique features. For example:
there are certain contradictions between letters and certain
sounds in English language. Thus, two different letters coming
together, sound differently than when they are used separately.
For example: letters (t), (h) separately do not sound the same
as in chain (th).
This is only one of problems faced in English language. In
other words, the place of the letters affect on how they should
be or should not be pronounced. Thus, according to the
phonetic rules of English language the first letter (k) of the
word (know) is not pronounced.
As well, Russian language has certain pronunciation
features. First of all, it should be noted that the letter (o) does
not always pronounced like sound (o). There are some features

unit selection
non-segemental
(formant)
synthesis
high level
synthesis (using
other synthesis
backends)
diphone
synthesis
diphone
synthesis
diphone
synthesis
concatenative
and formant
synthesis
diphone
synthesis
diphone
synthesis
diphone
synthesis

Cairo, Egypt

Fatih
University

III. SPECIFICATION OF LANGUAGE RESOURCES FOR
SPEECH SYNTHESIS

diphone
synthesis
mixed inventory
waveform
synthesis
formant
synthesis

University of
Athens

synthesis
diphone
synthesis

Despite existing various approaches, it is still difficult to tell
which of these approaches is more suitable or more useful. In
general, the problems faced during text-to-speech synthesis
depend on many factors: diversity of languages, specificity of
the pronunciation, accent, stress, intonation etc.

TABLE I TTS SYSTEMS

German Synthesis
German Festival [5]

Le Mons,
Belgium

concatenative

2
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

International Science Index 37, 2010 waset.org/publications/8303

based on phonetic rules of Russian language. For example: the
first letter (o) in the word (korova) is pronounced like (a).
Moreover, if we take into consideration that the letter ( ) is not
pronounced at all and only gives softness to the pronounced
word.
From the foresaid it is clear that it the synthesizer programs
developed especially for one language cannot be used in a
different language, because the specificity of one language is
not presumably typical for the others. Each program is based
on algorithms corresponding to the phonetic rules of certain
language. Till now there are no any programs of a synthesizer
type that take into consideration the specificity of Azerbaijani
language.
The Azerbaijani language has its specific features [23].
Some words aren’t pronounced as its written form in Azeri.
For example, the Azeri word “ail ” is pronounced like [ayil ],
“Aid ” like [Ayid ], “mü llim” like [m :lim]. As it is shown
the sound “y” is added to the first and second words, the
sounds “ü” and “l” aren’t pronounced in the second word.
Here is another example: the word “toqqa” is pronounced like
[tokqa], here the first sound “q” is changed into “k”.

necessary units from available acoustic units. Concatenation
of segments of written speech lays in the basis of concatenate
synthesis. As a rule, concatenate synthesis gives naturalness to
sounding of the synthesized speech. Nevertheless, the natural
fluctuations in speeches and the automated technologies of
segmentation of speech signals create noise in the generated
fragment and this decreases the naturalness of sounding.
An acoustic signal database (ASD), which consists of
fragments of a real acoustic signal, i.e. the elements of
concatenation (EC), is the basis of any system of synthesis of
the speech based on concatenation method. Dimension of
these elements can be various depending on a concrete way of
synthesis of speech, it can be phonemes, allophones, syllables,
diaphones, words etc [24].
In the system developed by us the elements of
concatenation are diaphones and various combinations of
vowels. But it is necessary to note that the study of generation
of one-syllabic words consisting of four letters (stol, dörd) is
still underway and that is why this words are included into the
base as indivisible units. The speech units used in creation
ASD are saved in WAV format. The process of creation of
ASD consists of the following stages:
Stage 1: In the initial stage, the speech database is created
on the basis of the main speech units of the donor speaker.
Stage 2: The speech units from speaker’s speech are
processed before being added into database. It’s done in the
following steps:
a) Speech signals were sampled at 16 kHz and it makes
possible to define the period T with a precision of 10-4.
b) Removal of surrounding noise from the recorded speech
units. For this purpose we use the algorithm of division of a
phrase realization into speech and pauses. It is supposed, that
the first 10 frames do not contain a speech signal. For this part
of signal we calculate mean value and dispersion of Et and Zt
and obtain statistical characteristics of noise [25].

IV. USED APPROACH
Two parameters, naturalness of sounding and intelligibility
of speech, are applied for the assessment of the quality of
synthesis system. One can say that naturalness of sounding of
a speech synthesizer depends on how many generated sounds
are close to natural human speech. By a intelligibility (ease for
understanding) of a speech synthesizer is meant the easiness of
artificial speech understanding. The ideal speech synthesizer
should possess both characteristics: naturalness of sounding
and intelligibility. Existing and being developed systems for
speech synthesis are aimed at improvement of these two
characteristics.
The idea of combination of concatenation methods and of
formant synthesis is the fundament of the system we
developed. The rough, primary basis of a formed acoustic
signal is created on the basis of concatenation of the fragments
of an acoustic signal taken from speech of the speaker, i.e., a
"donor". Then, this acoustic database is changed by the rules.
The purpose of these rules is to give the necessary prosodies
characteristics (frequency of the basic tone, duration and
energy) to the "stuck together" fragments of an acoustic signal.
The method of concatenation together with an adequate set
of base elements of compilation provides for qualitative
reproduction of spectral characteristics of a speech signal, and
the set of rules provides for the possibility of generating
natural intonation-prosodial mode of pronouncements.
Formant synthesis does not use any samples of human
speech. On the contrary, the speech message of the
synthesized speech is created by means of acoustic model.
Such parameters as own frequency, sounding and noise levels
vary in order to generate natural form of a signal of artificial
speech.
In systems of concatenate synthesis (earlier it was called
compilation), synthesis is carried out by sticking together

m

s 2 ( n)
p

E s ( m)

(1)

n m L 1

1
Z s (m)

sgn(s p (n)) sgn(s p (n 1))

m

Ln m L 1

2

(2)

where

sgn( s p ( n ))

1, s p ( n )
1, s p ( n )

0,

(3)
0.

where L is the number of frames of speech signal.
Then taking into account these characteristics and maximal
values of Et, Zt for realization of a given phrase we calculate
the threshold TE for short-time energy of a signal and the
threshold TZ for number of zeros of signal intensity. The
following formulas have been chosen experimentally:

3
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

TE

3 D ( E ,10 )

M ( E ,10 )

TZ

M ( Z ,10 )

k11maxL E t
t

(4)

3 D ( Z ,10 )

k 2 1maxL Z t
t

(5)

where
1 n

M (P, n)

n t 1

D (P, n)
n

International Science Index 37, 2010 waset.org/publications/8303

n

1

(6)

Pt

1t 1

( Pt

M ( P , n ))

2

Fig. 2. After application of the algorithm

As may be seen from the figures (Figure 1, 2), the
continuous parts containing speech are finally determined after
the algorithm application.
Stage 3: As described above, the ASD plays the main role
in speech synthesis. The information stored in AED is used in
different modules of synthesis. In our system, CE is stored in
.wav format, with 16 kHz frequency. Each wav file includes
the following elements:

(7)

If a frame X(t) contains speech, then we assign binary
variable bt to 1, otherwise we assign bt to 0. At first, it is
necessary to assign the value 1 to the frames with short-time
energy Et TE, and to assign the value 0 to the other frames.
Variables bt can accept only two values. Therefore the
filtration is reduced to the following procedure. Consecutively
for t=h+1,...,L-h the values of bt are replaced with 1, if
t h
i t h bi h .

0.
1.
2.
3.
4.

the description of CE
the count of speech signal parts – N
energy of speech signal – E
amplitude of CE - A
the frequency of crossing zero– Z

Otherwise the values of bt are replaced with zero.
1,

t h
i t h bi

h,

0,

t h
i t h bi

h,

bt

t

h

1 ,... L

Stage 4: At the following stage another corresponding
variants of each CE are created. In spite of the fact that it
increases the quantity of ASD elements, but at the same time it
makes possible to reduce the quantity of modules for
generation of a target signal.

h

(8)
t

h

1 ,... L

h.

As a result, the continuous parts containing speech are
determined. Then, we try to expand each part of this type. For
example, the part begins at the frame X (N1) and comes to an
end at the frame X (N2). Moving to the left from X(N1) (or to the
right from X (N2)) the algorithm compares the number of zeros
of intensity Zt with the threshold TZ. This moving should not
exceed 20 frames to the left of X (N1). If Zt has exceeded the
threshold three and more times, then the beginning of a speech
part is transferred to the place where Zt exceeds the threshold
for the first time. Otherwise the frame X(N1) could be
considered as the beginning of the speech part. The same is
valid for X(N2). If two parts are overlapped, they can be
combined into one part. Thus, the continuous parts containing
speech are finally determined. Such parts will be called
realizations of words.

Fig. 3. The structure of typical system
Fig. 1. Before application of the algorithm

These stages are used only in the beginning of process of
creating CE database for ASD. In the subsequent stages we do
not turn to them any more.

4
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

International Science Index 37, 2010 waset.org/publications/8303

The structure of the majority of systems of speech
synthesis, as well as the structure of our system of automatic
synthesis can be presented by the following flow chart (Fig. 3)
[26].
As may be seen from the shown diagram, there are two
blocks in our system: the block of linguistic processing and
the voicing module. At first, the input text is processed in the
Linguistic block and the obtained phonemic transcriptor is
passed to the second block, i.e., to the Voicing block of
system. In the Voicing block after certain stages the obtained
speech signal is sounded.

The following non-standard representations are described in
the Table 2.
TABLE II POSSIBLE TOKEN TYPE IN TEXT

TYPE
Decimal
Numbers
Ratios
Ordinal
numbers
Roman
Numerals
Alphanumeric
strings
Phone
Number:

4.1 Block of linguistic processing
4.1.1 Text input
The sounded text can be entered in any form. The size or font
type is of no importance. The main requirement is that the text
must be in Azerbaijani language.
4.1.2 Initial text processing
For forming of transcriptional record, the input text should
be shown as sequence of accentuated spelling words separated
by space and allowed punctuation marks. Such text can
conditionally be named as "normalized". Text normalization is
a very important issue in TTS systems. The general structure
of normalizer is explained in Figure 4. This module has
several stages as it is shown in the figure.
Stage 1: Spell-checking of the text
The spell-checkers are used in some cases (modules of
correction of spelling and punctuation errors). The module
helps to correct spelling errors in the text thereby to avoid
voicing of these errors.
Stage 2: A pre-processing module
A pre-processing module organizes the input sentences
into manageable lists of words. First, text normalization
isolates words in the text. For the most part this is as trivial as
looking for a sequence of alphabetic characters, allowing for
an occasional apostrophe and hyphen.
It identifies numbers, abbreviations, acronyms, idiomatics
and transforms them into full text when needed.
Stage 3: Number Expansion
Text normalization then searches for numbers, times,
dates, and other symbolic representations [27]. These are
analyzed and converted to words. (Example: "$54.32" is
converted to "fifty four dollars and thirty two cents, "$200"
appears in a document it may be spoken as "two hundred
dollars". Similarly, "1/2" may be spoken as "half", "January
second", "February first", "one of two") Someone needs to
code up the rules for the conversion of these symbols into
words, since they differ depending upon the language and
context.

Input text

Spellchecking
A pre-processing
Number Expansion
Punctuation analyze
NW-s observation
Disambiguation

Count:
Date :
Year
Time
Mathematical:

TEXT
1,2

SPEECH
One and two tenth

1/2
1-st

One second
first

VI, X

sixth, tenth

10a

Ten a power of a

(+9941
2)5813
433
25
01.11.1
999
1989
10:30
pm
2+1=3

Plus double nine, four, one,
two, five, eight, one, three,
four, double, three
Twenty five
First of November nineteen
ninety-nine
Nineteen eighty nine
Half past ten post meridiem
Two plus one is equal to three

Stage 4: Punctuation analyze
Whatever remains is punctuation. The normalizer will have
rules dictating if the punctuation causes a word to be spoken
or if it is silent. (Example: Periods at the end of sentences are
not spoken, but a period in an Internet address is spoken as
"dot.")
In normal writing, sentence boundaries are often signaled
by terminal punctuation from the set: full stop, exclamation
mark, question mark or comma {. ! ? ,} followed by white
spaces. In reading a long sentence, speakers will normally
break up the sentence into several phrases, each of which can
be said to stand alone as an intonation unit. If punctuation is
used liberally so that there are relatively few words between
the commas, semicolons or periods, then a reasonable guess at
an appropriate phrasing would be simply to break the sentence
at the punctuation marks though this is not always appropriate.
Hence, determining the sentence break and naming the type of
sentence has to be done so as to apply the prosodic rules.
In natural speech, speakers normally and naturally give
pauses between sentences. The average duration of pauses in a
natural speech has been observed and a lookup table (Table 3.)
is generated. Finally, the lookup table is used to insert pauses
between sentences that improve naturalness.
TABLE III OBSERVATION NATURAL SPEECH

Sentence Type
Affirmative (.)
Exclamatory (!)
Interrogative(?)
Comma (,)

List of word
in normalized
form

Fig. 4. Text Normalization System

5

Duration in seconds
1
0.9
0.8
0.5
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

This block creates information base for the creation of
phonemic transcription.

Text
Paragraph
Sentence
Word
Syllable
Letter

Stage 5: NSW-s observation
The aim of this stage is converting None Standard Words
(NSWs) into their standard word pronunciations. In Table 4.
some of NSWs are explained.

Speech
Phonoparagraph
Utterance
Phonowords
Diphone
Phoneme

TABLE IV NSW-S OBSERVATION

International Science Index 37, 2010 waset.org/publications/8303

NSW
Abbreviations
ASOA
AR
BSU
Internet addresses
http://www.Microsoft.com
Mail address
ibrahim@mail
Money
$
AZN

Fig. 5. Difference between text and speech

Their explained.

4.1.4 Formation of prosodies characteristics
The voice-frequency, accent and rhythmic characteristics
belong to the prosodies characteristics of utterance. The
frequency of the basic tone, energy and duration are their
physical analogues. These characteristics are informative for
formation of the control information for the subsequent
generation of an acoustic signal.
4.1.5 Creation of phonemic transcription (P )
Phonemic transcription forms the sound transcription relevant
to the inputted text, based on standard rules of reading in the
Azerbaijani language.
At this stage, it is necessary to assign each word of the text
(each word form) the data on its pronunciation, i.e. to
transform each word into a chain of phonemes or, in other
words, to create its phonemic transcription. In many languages,
as well as in Azerbaijani, there are sufficiently regular reading
rules, i.e., the rules of conformity between letters and
phonemes (sounds). The rules of reading are very irregular in
English language, and therefore the task of this block for
English synthesis becomes more complicated. In any case,
there are serious problems at definition of pronunciation of
proper names, the loanwords, new words, acronyms and
abbreviations.
It is not possible to simply store a transcription for all words
of the language because of great volume of the vocabulary and
because of contextual changes of a pronunciation of the same
word in a phrase. In addition, it is necessary to consider the
cases of graphic homonymy correctly: the same sequence of
alphabetic symbols in various contexts can represent two
different words/word forms and can be read differently. Often,
this problem can be solved by grammatical analysis; however,
sometimes only wider use of semantic information helps.

Azerbaijan State Oil Academy
Azerbaijan Republic
Baku State University
w w w dot Microsoft dot com
ibrahim at mail dot ru
Dollar
Manat

Once the text has been normalized and simplified into a
series of words, it is passed onto the next module, homograph
disambiguation.
Stage 6: Disambiguation
In some system disambiguation module is generally
handled by hand-crafted context-dependent rules [28].
However, such hand-crafted rules are very difficult to write,
maintain, and adapt to new domains.
The simple cases are the ones that can be disambiguated
within the word. In that case, the pronunciation can be
annotated in the dictionary, and so long as the word parsing is
correct, the right pronunciation will be chosen.
Currently we have only implemented the contextual
disambiguation. We will continue to implement other cases.
By the end of this step the text to be spoken has been
converted completely into tokens.
4.1.3 Linguistic analysis: the syntactic, morphemic
analysis
Linguistic analysis of the text takes place after the
normalization process. By using morphological, syntactic
characteristics of the Azerbaijani language the text is
partitioned into sub-layers.
Text and speech signal have clearly defined hierarchical
nature. In view of hierarchical representation, we can conclude
that for the qualitative construction of systems of speech
synthesis it is necessary to develop a model of mechanism of
speech formation. In the system, initially we should define the
flow of the information which should proceed according to the
scheme presented in Fig. 5.

4.2 The voicing block
4.2.1 An acoustic database: the retrieval in ASD
As it was already mentioned processing the EC and creation of
ASD takes place at the initial stage. But at the first stage of the
voicing block the retrieval is only made in ASD. The phonemic
chain is created by using generated phonemic transcription from
EC in ASD.
4.2.2 Calculation of acoustic parameters of a speech
signal
This block generates the created phonemic chain on the basis
of prosodic characteristics.

6
World Academy of Science, Engineering and Technology
International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010

The purpose of the rules of this block includes definition of
energy, time and voice-frequency characteristics that should be
assigned to the sound units forming a phonetic transcription of the
synthesized phrase.

The punctuations are removed in the preprocessing step just
to eliminate some inconsistencies and obtain the core system. In
the future versions of the TTS, the text can be synthesized in
accordance with the punctuations for considering the emotions
and intonations as partially achieved in some of the researches
[30]. The synthesis of a sentence ending with a question mark can
have an interrogative intonation and synthesis of a sentence
ending with an exclamation mark can be an amazing intonation.
In addition to these, other punctuations can be helpful for
approximating the synthesized speech to its human speech form
such as pausing at the end of the sentences ending with full stop
and also pausing after the punctuation comma.

4.2.3 Generation of a speech signal.
The joining of speech units occurs independently of the size of
EC. Thus, there can be rather sensitive distortions of a speech
signal for hearing. To prevent this effect, local smoothing of left
(i) and right (j) joined waves is carried out by the following
algorithm [29]:
1. From the last (zero) reading of left (ith) joined wave we count
the 3rd reading for which new average value Si3m is calculated
from values of ith and jth waves by the formula:

REFERENCES
[1]

Si3m = 1/9 * (Si7+ … +Si3 … + Si0+Sj0)

(9)

[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]

International Science Index 37, 2010 waset.org/publications/8303

2. Then, we reiterate the process according to the following
recurrent scheme until we receive the last new value for zero
reading of the ith wave:
Si2m = 1/9 * (Si6+ … +Si2 … + Sj0+Sj1)
Si1m = 1/9 * (Si5+ … +Si1 … + Sj1+Sj2)
Si0m = 1/9 * (Si4+ … +Si0 … + Sj2+Sj3)

(10)

3. Then, the new values of the jth wave are calculated:
Sj0m = 1/9 * (Si3+ … Sj0m … + Sj3+Sj4)
Sj1m = 1/9 * (Si2+ … Sj1m … + Sj4+Sj5)
Sj2m = 1/9 * (Si1+ … Sj2m … + Sj5+Sj6)

(11)

4. The process ends after reception of new value for the 4th
reading of the jth wave:
Sj3m = 1/9 * (Si0 + Sj0+ … Sj3m … + Sj6+Sj7)

(12)
[23]

4.2.4 Voicing of an output signal
Using available EC from the received sequence of speech
units is sounded.

[24]
[25]

V. CONCLUSION

[26]

On the abovementioned grounds, the voicing of words of any
text in Azerbaijani language is carried out with the help of a
limited base of EC.
In this study the framework of a TTS system for Azerbaijani
language is built. Although the system uses simple techniques it
provides promising results for Azerbaijani language, since the
selected approach, namely the concatenative method, is very well
suited for Azerbaijani language. The system can be improved by
improving the quality of the speech files recorded.
In particular, the work on intonation is not finished because
segmentation was made manually and there is noticeable noise in
voicing. It is planned to apply independent segmentation and to
improve the quality of synthesis in the future.

[27]
[28]
[29]

[30]

7

Lawrence R. Rabiner and Ronald W. Schafer , Introduction to Digital
Speech Processing , Now Publishers ,USA, 2007.
http://www.infovox.se
http://www.digital.com/
http://www.bell-labs.com/project/tts
http://www.ims.uni-stuttgart.de/phonetik/synthesis
http://www.ikp.uni-bonn.de/~tpo/Hadifix.html
http://www.et.tu-dresden.de/ita/ita.html
http://www.fb9-ti.uni-duisburg.de/demos/speech.html
http://www.labs.bt.com/innovate/speech/ laureate/index.htm
http://www.york.ac.uk/~lang4/Yorktalk.html
http://www.essex.ac.uk/speech/research/spruce/demo-1/demo-1.html
http://www.unil.ch/imm/docs/LAIP/LAIPTTS.html
http://www.icp.grenet.fr/ICP/index.uk.html
http://www.cnet.fr/cnet/lannion.html
http://lorien.die.upm.es/research/synthesis/synthesis.html
http://www.limsi.fr/Recherche/TLP/theme1.html
http://www.clab.ee.upatras.gr/
http://www.di.uoa.gr/speech/synthesis/demosthenes
http://demo.sakhr.com/tts/tts.asp
http://tcts.fpms.ac.be/synthesis/
http://fatih.edu.tr
R.Ayda-zade, A.M.Sharifova. “The analysis of approaches of computer
synthesis Azerbaijani speech”. Transactions of Azerbaijan National
Academy of sciences. ”Informatics and control problems”. Volume
XXVI, 2. Baku, 2006, p.227-231. (in Azerbaijani)
Mammadov N. The theoretical principles of Azerbaijan linguistics.
Baki: Maarif, 1971, 366 p. (in Azerbaijani)
Akhundov A. The phonetics of Azerbaijani language. Baki: Maarif,
1984, 392 p. (in Azerbaijani)
Sagisaka Y. Spoken Output Technologies. Overview// Survey of the
state of the art in human language technology. Cambridge, 1997.
Sharifova A.M The Computer Synhtesis of the Azerbaijani Speech /
(Azerbaijani). Application of information-communication technologies
in science and education. International conference. Baku, 2007. Volume
II, p. 47-52.
http://www.clsp.jhu.edu/ws99/projects/normal/
Yarowsky D., "Homograph Disambiguation in Text-to-Speech
Synthesis" 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz,
NY, pp. 244-247.
Lobanov B.M. Retrospective review of researches and workings out of
Laboratory of recognition and speech synthesis. “Automatic recognition
and speech synthesis”, I C NAS Belarus, Minsk, 2000.-S.6-23. (In
Russian)
Kurematsu, M., Hakura, J., Fujita, H., The Framework of the Speech
Communication System with Emotion Processing, Proceedings of the
6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge
Engineering and Data Bases, Corfu Island, Greece, February 16-19,
2007, 46-52

More Related Content

What's hot (20)

PPTX
Computational linguistics
AdnanBaloch15
 
PPTX
Computational linguistics
1101989
 
PDF
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...
Syeful Islam
 
PPTX
Computational linguistics
Vahid Saffarian
 
PDF
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
Syeful Islam
 
PPT
Sanskrit and Computational Linguistic
Jaganadh Gopinadhan
 
PDF
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
kevig
 
PDF
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
kevig
 
PDF
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
kevig
 
PPTX
COMPUTATIONAL LINGUISTICS
Rahul Motipalle
 
PDF
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
kevig
 
PDF
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
ijnlc
 
PDF
Efficient Intralingual Text To Speech Web Podcasting And Recording
IOSR Journals
 
PPTX
Language Grid
lindh
 
PPTX
C.s & c.m
Bilal Yaseen
 
PDF
BIDIRECTIONAL MACHINE TRANSLATION BETWEEN TURKISH AND TURKISH SIGN LANGUAGE: ...
ijnlc
 
PDF
Design and Implementation of a Language Assistant for English – Arabic Texts
IJCSIS Research Publications
 
PDF
The Impact of Technology (BBM & WhatsApp Application) on English Language Use...
DrAshraf Salem
 
PDF
A Computational Model of Yoruba Morphology Lexical Analyzer
Waqas Tariq
 
Computational linguistics
AdnanBaloch15
 
Computational linguistics
1101989
 
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...
Syeful Islam
 
Computational linguistics
Vahid Saffarian
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
Syeful Islam
 
Sanskrit and Computational Linguistic
Jaganadh Gopinadhan
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
kevig
 
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
kevig
 
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
kevig
 
COMPUTATIONAL LINGUISTICS
Rahul Motipalle
 
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
kevig
 
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
ijnlc
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
IOSR Journals
 
Language Grid
lindh
 
C.s & c.m
Bilal Yaseen
 
BIDIRECTIONAL MACHINE TRANSLATION BETWEEN TURKISH AND TURKISH SIGN LANGUAGE: ...
ijnlc
 
Design and Implementation of a Language Assistant for English – Arabic Texts
IJCSIS Research Publications
 
The Impact of Technology (BBM & WhatsApp Application) on English Language Use...
DrAshraf Salem
 
A Computational Model of Yoruba Morphology Lexical Analyzer
Waqas Tariq
 

Viewers also liked (6)

PDF
Upfc supplementary-controller-design-using-real-coded-genetic-algorithm-for-d...
Cemal Ardil
 
PDF
Adaptive non-linear-filtering-technique-for-image-restoration
Cemal Ardil
 
PDF
Adaptive filters
Mustafa Khaleel
 
PPSX
Performance analysis of adaptive noise canceller for an ecg signal
Raj Kumar Thenua
 
PPT
Adaptive filter
Sivaranjan Goswami
 
PPTX
ECG Noise cancelling
salamy88
 
Upfc supplementary-controller-design-using-real-coded-genetic-algorithm-for-d...
Cemal Ardil
 
Adaptive non-linear-filtering-technique-for-image-restoration
Cemal Ardil
 
Adaptive filters
Mustafa Khaleel
 
Performance analysis of adaptive noise canceller for an ecg signal
Raj Kumar Thenua
 
Adaptive filter
Sivaranjan Goswami
 
ECG Noise cancelling
salamy88
 
Ad

Similar to The main-principles-of-text-to-speech-synthesis-system (20)

PDF
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
kevig
 
PDF
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
ijnlc
 
PDF
Development of text to speech system for yoruba language
Alexander Decker
 
PDF
Integration of Phonotactic Features for Language Identification on Code-Switc...
kevig
 
PDF
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
kevig
 
PDF
Language
Guido Wachsmuth
 
PDF
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
Nathan Mathis
 
PPTX
visH (fin).pptx
tefflontrolegdy
 
PDF
Survey On Speech Synthesis
CSCJournals
 
PDF
11 terms in Corpus Linguistics1 (2)
ThennarasuSakkan
 
DOC
12EEE032- text 2 voice
Nsaroj kumar
 
PDF
Lit mtap
Andrea Ferracani
 
PDF
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
gerogepatton
 
PDF
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
gerogepatton
 
PDF
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
ijaia
 
PDF
American Standard Sign Language Representation Using Speech Recognition
paperpublications3
 
PDF
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
PDF
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET Journal
 
PDF
An expert system for automatic reading of a text written in standard arabic
ijnlc
 
PDF
Summer Research Project (Anusaaraka) Report
Anwar Jameel
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
kevig
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
ijnlc
 
Development of text to speech system for yoruba language
Alexander Decker
 
Integration of Phonotactic Features for Language Identification on Code-Switc...
kevig
 
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
kevig
 
Language
Guido Wachsmuth
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
Nathan Mathis
 
visH (fin).pptx
tefflontrolegdy
 
Survey On Speech Synthesis
CSCJournals
 
11 terms in Corpus Linguistics1 (2)
ThennarasuSakkan
 
12EEE032- text 2 voice
Nsaroj kumar
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
gerogepatton
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
gerogepatton
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
ijaia
 
American Standard Sign Language Representation Using Speech Recognition
paperpublications3
 
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
paperpublications3
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET Journal
 
An expert system for automatic reading of a text written in standard arabic
ijnlc
 
Summer Research Project (Anusaaraka) Report
Anwar Jameel
 
Ad

More from Cemal Ardil (20)

PDF
The feedback-control-for-distributed-systems
Cemal Ardil
 
PDF
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
Cemal Ardil
 
PDF
Sonic localization-cues-for-classrooms-a-structural-model-proposal
Cemal Ardil
 
PDF
Robust fuzzy-observer-design-for-nonlinear-systems
Cemal Ardil
 
PDF
Response quality-evaluation-in-heterogeneous-question-answering-system-a-blac...
Cemal Ardil
 
PDF
Reduction of-linear-time-invariant-systems-using-routh-approximation-and-pso
Cemal Ardil
 
PDF
Real coded-genetic-algorithm-for-robust-power-system-stabilizer-design
Cemal Ardil
 
PDF
Performance of-block-codes-using-the-eigenstructure-of-the-code-correlation-m...
Cemal Ardil
 
PDF
Optimal supplementary-damping-controller-design-for-tcsc-employing-rcga
Cemal Ardil
 
PDF
Optimal straight-line-trajectory-generation-in-3 d-space-using-deviation-algo...
Cemal Ardil
 
PDF
On the-optimal-number-of-smart-dust-particles
Cemal Ardil
 
PDF
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
Cemal Ardil
 
PDF
On the-approximate-solution-of-a-nonlinear-singular-integral-equation
Cemal Ardil
 
PDF
On problem-of-parameters-identification-of-dynamic-object
Cemal Ardil
 
PDF
Numerical modeling-of-gas-turbine-engines
Cemal Ardil
 
PDF
New technologies-for-modeling-of-gas-turbine-cooled-blades
Cemal Ardil
 
PDF
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Cemal Ardil
 
PDF
Multivariate high-order-fuzzy-time-series-forecasting-for-car-road-accidents
Cemal Ardil
 
PDF
Multistage condition-monitoring-system-of-aircraft-gas-turbine-engine
Cemal Ardil
 
PDF
Multi objective-optimization-with-fuzzy-based-ranking-for-tcsc-supplementary-...
Cemal Ardil
 
The feedback-control-for-distributed-systems
Cemal Ardil
 
System overflow blocking-transients-for-queues-with-batch-arrivals-using-a-fa...
Cemal Ardil
 
Sonic localization-cues-for-classrooms-a-structural-model-proposal
Cemal Ardil
 
Robust fuzzy-observer-design-for-nonlinear-systems
Cemal Ardil
 
Response quality-evaluation-in-heterogeneous-question-answering-system-a-blac...
Cemal Ardil
 
Reduction of-linear-time-invariant-systems-using-routh-approximation-and-pso
Cemal Ardil
 
Real coded-genetic-algorithm-for-robust-power-system-stabilizer-design
Cemal Ardil
 
Performance of-block-codes-using-the-eigenstructure-of-the-code-correlation-m...
Cemal Ardil
 
Optimal supplementary-damping-controller-design-for-tcsc-employing-rcga
Cemal Ardil
 
Optimal straight-line-trajectory-generation-in-3 d-space-using-deviation-algo...
Cemal Ardil
 
On the-optimal-number-of-smart-dust-particles
Cemal Ardil
 
On the-joint-optimization-of-performance-and-power-consumption-in-data-centers
Cemal Ardil
 
On the-approximate-solution-of-a-nonlinear-singular-integral-equation
Cemal Ardil
 
On problem-of-parameters-identification-of-dynamic-object
Cemal Ardil
 
Numerical modeling-of-gas-turbine-engines
Cemal Ardil
 
New technologies-for-modeling-of-gas-turbine-cooled-blades
Cemal Ardil
 
Neuro -fuzzy-networks-for-identification-of-mathematical-model-parameters-of-...
Cemal Ardil
 
Multivariate high-order-fuzzy-time-series-forecasting-for-car-road-accidents
Cemal Ardil
 
Multistage condition-monitoring-system-of-aircraft-gas-turbine-engine
Cemal Ardil
 
Multi objective-optimization-with-fuzzy-based-ranking-for-tcsc-supplementary-...
Cemal Ardil
 

Recently uploaded (20)

PDF
Introducing and Operating FME Flow for Kubernetes in a Large Enterprise: Expe...
Safe Software
 
PDF
DoS Attack vs DDoS Attack_ The Silent Wars of the Internet.pdf
CyberPro Magazine
 
PDF
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
PDF
GDG Cloud Southlake #44: Eyal Bukchin: Tightening the Kubernetes Feedback Loo...
James Anderson
 
PDF
Understanding The True Cost of DynamoDB Webinar
ScyllaDB
 
PDF
Sound the Alarm: Detection and Response
VICTOR MAESTRE RAMIREZ
 
PPTX
Securing Model Context Protocol with Keycloak: AuthN/AuthZ for MCP Servers
Hitachi, Ltd. OSS Solution Center.
 
PDF
Java 25 and Beyond - A Roadmap of Innovations
Ana-Maria Mihalceanu
 
PDF
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 
PPTX
CapCut Pro PC Crack Latest Version Free Free
josanj305
 
PDF
Optimizing the trajectory of a wheel loader working in short loading cycles
Reno Filla
 
PPTX
Paycifi - Programmable Trust_Breakfast_PPTXT
FinTech Belgium
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PDF
Modern Decentralized Application Architectures.pdf
Kalema Edgar
 
PDF
Dev Dives: Accelerating agentic automation with Autopilot for Everyone
UiPathCommunity
 
PDF
TrustArc Webinar - Navigating APAC Data Privacy Laws: Compliance & Challenges
TrustArc
 
PDF
Quantum Threats Are Closer Than You Think – Act Now to Stay Secure
WSO2
 
PDF
FME in Overdrive: Unleashing the Power of Parallel Processing
Safe Software
 
PDF
How to Comply With Saudi Arabia’s National Cybersecurity Regulations.pdf
Bluechip Advanced Technologies
 
PPTX
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 
Introducing and Operating FME Flow for Kubernetes in a Large Enterprise: Expe...
Safe Software
 
DoS Attack vs DDoS Attack_ The Silent Wars of the Internet.pdf
CyberPro Magazine
 
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
GDG Cloud Southlake #44: Eyal Bukchin: Tightening the Kubernetes Feedback Loo...
James Anderson
 
Understanding The True Cost of DynamoDB Webinar
ScyllaDB
 
Sound the Alarm: Detection and Response
VICTOR MAESTRE RAMIREZ
 
Securing Model Context Protocol with Keycloak: AuthN/AuthZ for MCP Servers
Hitachi, Ltd. OSS Solution Center.
 
Java 25 and Beyond - A Roadmap of Innovations
Ana-Maria Mihalceanu
 
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 
CapCut Pro PC Crack Latest Version Free Free
josanj305
 
Optimizing the trajectory of a wheel loader working in short loading cycles
Reno Filla
 
Paycifi - Programmable Trust_Breakfast_PPTXT
FinTech Belgium
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
Modern Decentralized Application Architectures.pdf
Kalema Edgar
 
Dev Dives: Accelerating agentic automation with Autopilot for Everyone
UiPathCommunity
 
TrustArc Webinar - Navigating APAC Data Privacy Laws: Compliance & Challenges
TrustArc
 
Quantum Threats Are Closer Than You Think – Act Now to Stay Secure
WSO2
 
FME in Overdrive: Unleashing the Power of Parallel Processing
Safe Software
 
How to Comply With Saudi Arabia’s National Cybersecurity Regulations.pdf
Bluechip Advanced Technologies
 
01_Approach Cyber- DORA Incident Management.pptx
FinTech Belgium
 

The main-principles-of-text-to-speech-synthesis-system

  • 1. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 The Main Principles of Text-to-Speech Synthesis System .R. Aida–Zade, C. Ardil and A.M. Sharifova Such systems can be used in communication systems, in information referral systems, it can be applied to help people who lost seeing and reading ability, in acoustic dialogue of users with computer and in others fields. In general, synthesis of speech can be necessary in all the cases when the addressee of the information is a person. Abstract—In this paper, the main principles of text-to-speech synthesis system are presented. Associated problems which arise when developing speech synthesis system are described. Used approaches and their application in the speech synthesis systems for Azerbaijani language are shown. Keywords—synthesis of Azerbaijani language, morphemes, phonemes, sounds, sentence, speech synthesizer, intonation, accent, pronunciation. II. PREVIOUS WORKS The earliest efforts to produce synthetic speech date as far back as XVIII century. Despite the fact that the first attempts were in the form of mechanical machines, we can say today that these synthesizers were of a high quality. In 1779 in St. Petersburg, Russian Professor Christian Kratzenshtein explained physiological differences between five long vowels (/a/, /e/, /i/, /o/, and /u/) and made an apparatus to produce them artificially. In 1791 in Vienna, Wolfgang von Kempelen introduced his "Acoustic-Mechanical Speech Machine". In about mid 1800's Charles Wheatstone constructed his famous version of von Kempelen's speaking machine. There have been three generations of speech synthesis systems [1]. During the first generation (1962-1977) formant synthesis of phonemes was the dominant technology. This technology made use of the rules based on phonetic decomposition of sentence to formant frequency contours. The intelligibility and naturalness were poor in such synthesis. In the second generation of speech synthesis methods (from 1977 to 1992) the diphones were represented with the LPC parameters. It was shown that good intelligibility of synthetic speech could be reliably obtained from text input by concatenating the appropriate diphone units. The intelligibility improved over formant synthesis, but the naturalness of the synthetic speech remained low. The third generation of speech synthesis technology is the period from 1992 to the present day. This generation is marked by the method of “unit selection synthesis” which was introduced and perfected, by Sagisaka at ATR Labs. in Kyoto. The resulting synthetic speech of this period was close to humangenerated speech in terms of intelligibility and naturalness. Modern speech synthesis technologies involve quite complicated and sophisticated methods and algorithms. “Infovox” [2] speech synthesizer family is perhaps one of the best known multilingual text-to-speech products available today. The first commercial version, Infovox SA-101, was developed in Sweden at the Royal Institute of Technology in 1982 and it is based on formant synthesis. The latest full commercial version, Infovox 230, is available for American and British English, Danish, Finnish, French, German, Icelandic, Italian, Norwegian, Spanish, Swedish, and Dutch. I. INTRODUCTION International Science Index 37, 2010 waset.org/publications/8303 I N the XXI century widespread use of computers opened a new stage in information interchange between the user and the computer. Among other things, an opportunity to input the information to computer through speech, and to reproduce in voice text information stored in the computer have been made possible. The paper is dedicated to the second part of this issue, i.e. to the computer-aided text-to-speech synthesis, that is recognized as a very urgent problem. Nowadays the solution of this problem can be applied in various fields. First of all, it would be of great importance for people with weak eyesight. In the modern world, it is practically impossible to live without an information exchange. The people with weak eyesight face with big problems while receiving the information through reading. A lot of methods are used to solve this problem. For example, the sound version of some books is created. As a result, people with weak eyesight have an opportunity to receive the information by listening. But there can be a case when the sound version of the necessary book couldn’t be found. Therefore, the implementation of the speech technologies for information exchange for users with weak eyesight is of a crucial necessity. In fact, computer synthesis of speech opens a new direction for an information transfer through the computer. For today it is mainly possible through the monitor. Synthesis of speech is the transformation of the text to speech. This transformation is converting the text to the synthetic speech that is as close to real speech as possible in compliance with the pronunciation norms of special language. TTS is intended to read electronic texts in the form of a book, and also to vocalize texts with the use of speech synthesis. When developing our system not only widely known modern methods but also a new approach of processing speech signal was used. Kamil Aida-Zade is with the institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan. Cemal Ardil is with the National Academy of Aviation, Baku, Azerbaijan. Aida Sharifova is with the institute of Cybernetics of the National Academy of Sciences, Baku, Azerbaijan. 1
  • 2. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 Digital Equipment Corporation [3] (DEC) talk system is originally descended from MITalk and Klattalk. The present version of the system is available for American English, German and Spanish and offers nine different voice personalities, four male, four female and one child. The present DECtalk system is based on digital formant synthesis. AT&T Bell Laboratories [4] (Lucent Technologies) also has very long traditions with speech synthesis. The first full textto-speech (TTS) system was demonstrated in Boston 1972 and released in 1973. It was based on articulatory model developed by Cecil Coker (Klatt 1987). The development process of the present concatenative synthesis system was started by Joseph Olive in mid 1970's (Bell Labs 1997). The current system is available for English, French, Spanish, Italian, German, Russian, Romanian, Chinese, and Japanese (Mcbius et al. 1996). Currently, different research teams are working at this problem. Some of them are described in the Table 1. MBROLA [20] Turkish Synthesis TTTS [21] International Science Index 37, 2010 waset.org/publications/8303 Hadifix [6] Waveform Synthesis[7] Multilingual TTS system [8] English Synthesis Laureate [9] YorkTalk [10] SPRUCE [11] French Synthesis SpeechMill [12] IMS Uni Stuttgart IKP Uni Bonn TU Dresden TI Uni Duisburg British Telecom University of York Essex Speech Group ICP [13] The University of Lausanne Grenoble CNET [14] Lannion Spanish Synthesis University of Madrid [15] LIMSI TTS System LIMSI[16] CNRS Greek Synthesis University of Patras [17] DEMOSTHeNES[18] Arabic Synthesis Sakhr Software [19] syllable-based concatenative Text to speech synthesis is converting the text to the synthetic speech that is as close to real speech as possible according to the pronunciation norms of special language. Such systems are called text to speech (TTS) systems. Input element of TTS system is a text, output element is synthetic speech. There are two possible cases. When it is necessary to pronounce the limited number of phrases (and their pronouncing linearly does not vary), the necessary speech material is simply recorded in advance. In this case, certain problems are originated. For example in this approach, it is not possible to sound the text, which is not known in advance. For this purpose the pronounced text has to be kept in computer memory. And it will lead to increase of the size of memory required for information content. This will bring to essential load of computer memory in case of much information and can create certain problems in operation. The main approach used in this paper is voicing of previously unknown text based on a specific algorithm. It is necessary to note that the approach to solving problem of speech synthesis essentially depends on the language for which it will be used [22], and that the majority of currently available synthesizers basically were generated for UKEnglish, Spanish and Russian languages, and these synthesizers had not been applied to the Azerbaijani language yet. Azerbaijani language like as Turkish is an agglutinative language. In the view of the specificity of the Azerbaijani language, the special approach is required. Every language has its own unique features. For example: there are certain contradictions between letters and certain sounds in English language. Thus, two different letters coming together, sound differently than when they are used separately. For example: letters (t), (h) separately do not sound the same as in chain (th). This is only one of problems faced in English language. In other words, the place of the letters affect on how they should be or should not be pronounced. Thus, according to the phonetic rules of English language the first letter (k) of the word (know) is not pronounced. As well, Russian language has certain pronunciation features. First of all, it should be noted that the letter (o) does not always pronounced like sound (o). There are some features unit selection non-segemental (formant) synthesis high level synthesis (using other synthesis backends) diphone synthesis diphone synthesis diphone synthesis concatenative and formant synthesis diphone synthesis diphone synthesis diphone synthesis Cairo, Egypt Fatih University III. SPECIFICATION OF LANGUAGE RESOURCES FOR SPEECH SYNTHESIS diphone synthesis mixed inventory waveform synthesis formant synthesis University of Athens synthesis diphone synthesis Despite existing various approaches, it is still difficult to tell which of these approaches is more suitable or more useful. In general, the problems faced during text-to-speech synthesis depend on many factors: diversity of languages, specificity of the pronunciation, accent, stress, intonation etc. TABLE I TTS SYSTEMS German Synthesis German Festival [5] Le Mons, Belgium concatenative 2
  • 3. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 International Science Index 37, 2010 waset.org/publications/8303 based on phonetic rules of Russian language. For example: the first letter (o) in the word (korova) is pronounced like (a). Moreover, if we take into consideration that the letter ( ) is not pronounced at all and only gives softness to the pronounced word. From the foresaid it is clear that it the synthesizer programs developed especially for one language cannot be used in a different language, because the specificity of one language is not presumably typical for the others. Each program is based on algorithms corresponding to the phonetic rules of certain language. Till now there are no any programs of a synthesizer type that take into consideration the specificity of Azerbaijani language. The Azerbaijani language has its specific features [23]. Some words aren’t pronounced as its written form in Azeri. For example, the Azeri word “ail ” is pronounced like [ayil ], “Aid ” like [Ayid ], “mü llim” like [m :lim]. As it is shown the sound “y” is added to the first and second words, the sounds “ü” and “l” aren’t pronounced in the second word. Here is another example: the word “toqqa” is pronounced like [tokqa], here the first sound “q” is changed into “k”. necessary units from available acoustic units. Concatenation of segments of written speech lays in the basis of concatenate synthesis. As a rule, concatenate synthesis gives naturalness to sounding of the synthesized speech. Nevertheless, the natural fluctuations in speeches and the automated technologies of segmentation of speech signals create noise in the generated fragment and this decreases the naturalness of sounding. An acoustic signal database (ASD), which consists of fragments of a real acoustic signal, i.e. the elements of concatenation (EC), is the basis of any system of synthesis of the speech based on concatenation method. Dimension of these elements can be various depending on a concrete way of synthesis of speech, it can be phonemes, allophones, syllables, diaphones, words etc [24]. In the system developed by us the elements of concatenation are diaphones and various combinations of vowels. But it is necessary to note that the study of generation of one-syllabic words consisting of four letters (stol, dörd) is still underway and that is why this words are included into the base as indivisible units. The speech units used in creation ASD are saved in WAV format. The process of creation of ASD consists of the following stages: Stage 1: In the initial stage, the speech database is created on the basis of the main speech units of the donor speaker. Stage 2: The speech units from speaker’s speech are processed before being added into database. It’s done in the following steps: a) Speech signals were sampled at 16 kHz and it makes possible to define the period T with a precision of 10-4. b) Removal of surrounding noise from the recorded speech units. For this purpose we use the algorithm of division of a phrase realization into speech and pauses. It is supposed, that the first 10 frames do not contain a speech signal. For this part of signal we calculate mean value and dispersion of Et and Zt and obtain statistical characteristics of noise [25]. IV. USED APPROACH Two parameters, naturalness of sounding and intelligibility of speech, are applied for the assessment of the quality of synthesis system. One can say that naturalness of sounding of a speech synthesizer depends on how many generated sounds are close to natural human speech. By a intelligibility (ease for understanding) of a speech synthesizer is meant the easiness of artificial speech understanding. The ideal speech synthesizer should possess both characteristics: naturalness of sounding and intelligibility. Existing and being developed systems for speech synthesis are aimed at improvement of these two characteristics. The idea of combination of concatenation methods and of formant synthesis is the fundament of the system we developed. The rough, primary basis of a formed acoustic signal is created on the basis of concatenation of the fragments of an acoustic signal taken from speech of the speaker, i.e., a "donor". Then, this acoustic database is changed by the rules. The purpose of these rules is to give the necessary prosodies characteristics (frequency of the basic tone, duration and energy) to the "stuck together" fragments of an acoustic signal. The method of concatenation together with an adequate set of base elements of compilation provides for qualitative reproduction of spectral characteristics of a speech signal, and the set of rules provides for the possibility of generating natural intonation-prosodial mode of pronouncements. Formant synthesis does not use any samples of human speech. On the contrary, the speech message of the synthesized speech is created by means of acoustic model. Such parameters as own frequency, sounding and noise levels vary in order to generate natural form of a signal of artificial speech. In systems of concatenate synthesis (earlier it was called compilation), synthesis is carried out by sticking together m s 2 ( n) p E s ( m) (1) n m L 1 1 Z s (m) sgn(s p (n)) sgn(s p (n 1)) m Ln m L 1 2 (2) where sgn( s p ( n )) 1, s p ( n ) 1, s p ( n ) 0, (3) 0. where L is the number of frames of speech signal. Then taking into account these characteristics and maximal values of Et, Zt for realization of a given phrase we calculate the threshold TE for short-time energy of a signal and the threshold TZ for number of zeros of signal intensity. The following formulas have been chosen experimentally: 3
  • 4. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 TE 3 D ( E ,10 ) M ( E ,10 ) TZ M ( Z ,10 ) k11maxL E t t (4) 3 D ( Z ,10 ) k 2 1maxL Z t t (5) where 1 n M (P, n) n t 1 D (P, n) n International Science Index 37, 2010 waset.org/publications/8303 n 1 (6) Pt 1t 1 ( Pt M ( P , n )) 2 Fig. 2. After application of the algorithm As may be seen from the figures (Figure 1, 2), the continuous parts containing speech are finally determined after the algorithm application. Stage 3: As described above, the ASD plays the main role in speech synthesis. The information stored in AED is used in different modules of synthesis. In our system, CE is stored in .wav format, with 16 kHz frequency. Each wav file includes the following elements: (7) If a frame X(t) contains speech, then we assign binary variable bt to 1, otherwise we assign bt to 0. At first, it is necessary to assign the value 1 to the frames with short-time energy Et TE, and to assign the value 0 to the other frames. Variables bt can accept only two values. Therefore the filtration is reduced to the following procedure. Consecutively for t=h+1,...,L-h the values of bt are replaced with 1, if t h i t h bi h . 0. 1. 2. 3. 4. the description of CE the count of speech signal parts – N energy of speech signal – E amplitude of CE - A the frequency of crossing zero– Z Otherwise the values of bt are replaced with zero. 1, t h i t h bi h, 0, t h i t h bi h, bt t h 1 ,... L Stage 4: At the following stage another corresponding variants of each CE are created. In spite of the fact that it increases the quantity of ASD elements, but at the same time it makes possible to reduce the quantity of modules for generation of a target signal. h (8) t h 1 ,... L h. As a result, the continuous parts containing speech are determined. Then, we try to expand each part of this type. For example, the part begins at the frame X (N1) and comes to an end at the frame X (N2). Moving to the left from X(N1) (or to the right from X (N2)) the algorithm compares the number of zeros of intensity Zt with the threshold TZ. This moving should not exceed 20 frames to the left of X (N1). If Zt has exceeded the threshold three and more times, then the beginning of a speech part is transferred to the place where Zt exceeds the threshold for the first time. Otherwise the frame X(N1) could be considered as the beginning of the speech part. The same is valid for X(N2). If two parts are overlapped, they can be combined into one part. Thus, the continuous parts containing speech are finally determined. Such parts will be called realizations of words. Fig. 3. The structure of typical system Fig. 1. Before application of the algorithm These stages are used only in the beginning of process of creating CE database for ASD. In the subsequent stages we do not turn to them any more. 4
  • 5. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 International Science Index 37, 2010 waset.org/publications/8303 The structure of the majority of systems of speech synthesis, as well as the structure of our system of automatic synthesis can be presented by the following flow chart (Fig. 3) [26]. As may be seen from the shown diagram, there are two blocks in our system: the block of linguistic processing and the voicing module. At first, the input text is processed in the Linguistic block and the obtained phonemic transcriptor is passed to the second block, i.e., to the Voicing block of system. In the Voicing block after certain stages the obtained speech signal is sounded. The following non-standard representations are described in the Table 2. TABLE II POSSIBLE TOKEN TYPE IN TEXT TYPE Decimal Numbers Ratios Ordinal numbers Roman Numerals Alphanumeric strings Phone Number: 4.1 Block of linguistic processing 4.1.1 Text input The sounded text can be entered in any form. The size or font type is of no importance. The main requirement is that the text must be in Azerbaijani language. 4.1.2 Initial text processing For forming of transcriptional record, the input text should be shown as sequence of accentuated spelling words separated by space and allowed punctuation marks. Such text can conditionally be named as "normalized". Text normalization is a very important issue in TTS systems. The general structure of normalizer is explained in Figure 4. This module has several stages as it is shown in the figure. Stage 1: Spell-checking of the text The spell-checkers are used in some cases (modules of correction of spelling and punctuation errors). The module helps to correct spelling errors in the text thereby to avoid voicing of these errors. Stage 2: A pre-processing module A pre-processing module organizes the input sentences into manageable lists of words. First, text normalization isolates words in the text. For the most part this is as trivial as looking for a sequence of alphabetic characters, allowing for an occasional apostrophe and hyphen. It identifies numbers, abbreviations, acronyms, idiomatics and transforms them into full text when needed. Stage 3: Number Expansion Text normalization then searches for numbers, times, dates, and other symbolic representations [27]. These are analyzed and converted to words. (Example: "$54.32" is converted to "fifty four dollars and thirty two cents, "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two") Someone needs to code up the rules for the conversion of these symbols into words, since they differ depending upon the language and context. Input text Spellchecking A pre-processing Number Expansion Punctuation analyze NW-s observation Disambiguation Count: Date : Year Time Mathematical: TEXT 1,2 SPEECH One and two tenth 1/2 1-st One second first VI, X sixth, tenth 10a Ten a power of a (+9941 2)5813 433 25 01.11.1 999 1989 10:30 pm 2+1=3 Plus double nine, four, one, two, five, eight, one, three, four, double, three Twenty five First of November nineteen ninety-nine Nineteen eighty nine Half past ten post meridiem Two plus one is equal to three Stage 4: Punctuation analyze Whatever remains is punctuation. The normalizer will have rules dictating if the punctuation causes a word to be spoken or if it is silent. (Example: Periods at the end of sentences are not spoken, but a period in an Internet address is spoken as "dot.") In normal writing, sentence boundaries are often signaled by terminal punctuation from the set: full stop, exclamation mark, question mark or comma {. ! ? ,} followed by white spaces. In reading a long sentence, speakers will normally break up the sentence into several phrases, each of which can be said to stand alone as an intonation unit. If punctuation is used liberally so that there are relatively few words between the commas, semicolons or periods, then a reasonable guess at an appropriate phrasing would be simply to break the sentence at the punctuation marks though this is not always appropriate. Hence, determining the sentence break and naming the type of sentence has to be done so as to apply the prosodic rules. In natural speech, speakers normally and naturally give pauses between sentences. The average duration of pauses in a natural speech has been observed and a lookup table (Table 3.) is generated. Finally, the lookup table is used to insert pauses between sentences that improve naturalness. TABLE III OBSERVATION NATURAL SPEECH Sentence Type Affirmative (.) Exclamatory (!) Interrogative(?) Comma (,) List of word in normalized form Fig. 4. Text Normalization System 5 Duration in seconds 1 0.9 0.8 0.5
  • 6. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 This block creates information base for the creation of phonemic transcription. Text Paragraph Sentence Word Syllable Letter Stage 5: NSW-s observation The aim of this stage is converting None Standard Words (NSWs) into their standard word pronunciations. In Table 4. some of NSWs are explained. Speech Phonoparagraph Utterance Phonowords Diphone Phoneme TABLE IV NSW-S OBSERVATION International Science Index 37, 2010 waset.org/publications/8303 NSW Abbreviations ASOA AR BSU Internet addresses http://www.Microsoft.com Mail address ibrahim@mail Money $ AZN Fig. 5. Difference between text and speech Their explained. 4.1.4 Formation of prosodies characteristics The voice-frequency, accent and rhythmic characteristics belong to the prosodies characteristics of utterance. The frequency of the basic tone, energy and duration are their physical analogues. These characteristics are informative for formation of the control information for the subsequent generation of an acoustic signal. 4.1.5 Creation of phonemic transcription (P ) Phonemic transcription forms the sound transcription relevant to the inputted text, based on standard rules of reading in the Azerbaijani language. At this stage, it is necessary to assign each word of the text (each word form) the data on its pronunciation, i.e. to transform each word into a chain of phonemes or, in other words, to create its phonemic transcription. In many languages, as well as in Azerbaijani, there are sufficiently regular reading rules, i.e., the rules of conformity between letters and phonemes (sounds). The rules of reading are very irregular in English language, and therefore the task of this block for English synthesis becomes more complicated. In any case, there are serious problems at definition of pronunciation of proper names, the loanwords, new words, acronyms and abbreviations. It is not possible to simply store a transcription for all words of the language because of great volume of the vocabulary and because of contextual changes of a pronunciation of the same word in a phrase. In addition, it is necessary to consider the cases of graphic homonymy correctly: the same sequence of alphabetic symbols in various contexts can represent two different words/word forms and can be read differently. Often, this problem can be solved by grammatical analysis; however, sometimes only wider use of semantic information helps. Azerbaijan State Oil Academy Azerbaijan Republic Baku State University w w w dot Microsoft dot com ibrahim at mail dot ru Dollar Manat Once the text has been normalized and simplified into a series of words, it is passed onto the next module, homograph disambiguation. Stage 6: Disambiguation In some system disambiguation module is generally handled by hand-crafted context-dependent rules [28]. However, such hand-crafted rules are very difficult to write, maintain, and adapt to new domains. The simple cases are the ones that can be disambiguated within the word. In that case, the pronunciation can be annotated in the dictionary, and so long as the word parsing is correct, the right pronunciation will be chosen. Currently we have only implemented the contextual disambiguation. We will continue to implement other cases. By the end of this step the text to be spoken has been converted completely into tokens. 4.1.3 Linguistic analysis: the syntactic, morphemic analysis Linguistic analysis of the text takes place after the normalization process. By using morphological, syntactic characteristics of the Azerbaijani language the text is partitioned into sub-layers. Text and speech signal have clearly defined hierarchical nature. In view of hierarchical representation, we can conclude that for the qualitative construction of systems of speech synthesis it is necessary to develop a model of mechanism of speech formation. In the system, initially we should define the flow of the information which should proceed according to the scheme presented in Fig. 5. 4.2 The voicing block 4.2.1 An acoustic database: the retrieval in ASD As it was already mentioned processing the EC and creation of ASD takes place at the initial stage. But at the first stage of the voicing block the retrieval is only made in ASD. The phonemic chain is created by using generated phonemic transcription from EC in ASD. 4.2.2 Calculation of acoustic parameters of a speech signal This block generates the created phonemic chain on the basis of prosodic characteristics. 6
  • 7. World Academy of Science, Engineering and Technology International Journal of Computer, Information Science and Engineering Vol:4 No:1, 2010 The purpose of the rules of this block includes definition of energy, time and voice-frequency characteristics that should be assigned to the sound units forming a phonetic transcription of the synthesized phrase. The punctuations are removed in the preprocessing step just to eliminate some inconsistencies and obtain the core system. In the future versions of the TTS, the text can be synthesized in accordance with the punctuations for considering the emotions and intonations as partially achieved in some of the researches [30]. The synthesis of a sentence ending with a question mark can have an interrogative intonation and synthesis of a sentence ending with an exclamation mark can be an amazing intonation. In addition to these, other punctuations can be helpful for approximating the synthesized speech to its human speech form such as pausing at the end of the sentences ending with full stop and also pausing after the punctuation comma. 4.2.3 Generation of a speech signal. The joining of speech units occurs independently of the size of EC. Thus, there can be rather sensitive distortions of a speech signal for hearing. To prevent this effect, local smoothing of left (i) and right (j) joined waves is carried out by the following algorithm [29]: 1. From the last (zero) reading of left (ith) joined wave we count the 3rd reading for which new average value Si3m is calculated from values of ith and jth waves by the formula: REFERENCES [1] Si3m = 1/9 * (Si7+ … +Si3 … + Si0+Sj0) (9) [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] International Science Index 37, 2010 waset.org/publications/8303 2. Then, we reiterate the process according to the following recurrent scheme until we receive the last new value for zero reading of the ith wave: Si2m = 1/9 * (Si6+ … +Si2 … + Sj0+Sj1) Si1m = 1/9 * (Si5+ … +Si1 … + Sj1+Sj2) Si0m = 1/9 * (Si4+ … +Si0 … + Sj2+Sj3) (10) 3. Then, the new values of the jth wave are calculated: Sj0m = 1/9 * (Si3+ … Sj0m … + Sj3+Sj4) Sj1m = 1/9 * (Si2+ … Sj1m … + Sj4+Sj5) Sj2m = 1/9 * (Si1+ … Sj2m … + Sj5+Sj6) (11) 4. The process ends after reception of new value for the 4th reading of the jth wave: Sj3m = 1/9 * (Si0 + Sj0+ … Sj3m … + Sj6+Sj7) (12) [23] 4.2.4 Voicing of an output signal Using available EC from the received sequence of speech units is sounded. [24] [25] V. CONCLUSION [26] On the abovementioned grounds, the voicing of words of any text in Azerbaijani language is carried out with the help of a limited base of EC. In this study the framework of a TTS system for Azerbaijani language is built. Although the system uses simple techniques it provides promising results for Azerbaijani language, since the selected approach, namely the concatenative method, is very well suited for Azerbaijani language. The system can be improved by improving the quality of the speech files recorded. In particular, the work on intonation is not finished because segmentation was made manually and there is noticeable noise in voicing. It is planned to apply independent segmentation and to improve the quality of synthesis in the future. [27] [28] [29] [30] 7 Lawrence R. Rabiner and Ronald W. Schafer , Introduction to Digital Speech Processing , Now Publishers ,USA, 2007. http://www.infovox.se http://www.digital.com/ http://www.bell-labs.com/project/tts http://www.ims.uni-stuttgart.de/phonetik/synthesis http://www.ikp.uni-bonn.de/~tpo/Hadifix.html http://www.et.tu-dresden.de/ita/ita.html http://www.fb9-ti.uni-duisburg.de/demos/speech.html http://www.labs.bt.com/innovate/speech/ laureate/index.htm http://www.york.ac.uk/~lang4/Yorktalk.html http://www.essex.ac.uk/speech/research/spruce/demo-1/demo-1.html http://www.unil.ch/imm/docs/LAIP/LAIPTTS.html http://www.icp.grenet.fr/ICP/index.uk.html http://www.cnet.fr/cnet/lannion.html http://lorien.die.upm.es/research/synthesis/synthesis.html http://www.limsi.fr/Recherche/TLP/theme1.html http://www.clab.ee.upatras.gr/ http://www.di.uoa.gr/speech/synthesis/demosthenes http://demo.sakhr.com/tts/tts.asp http://tcts.fpms.ac.be/synthesis/ http://fatih.edu.tr R.Ayda-zade, A.M.Sharifova. “The analysis of approaches of computer synthesis Azerbaijani speech”. Transactions of Azerbaijan National Academy of sciences. ”Informatics and control problems”. Volume XXVI, 2. Baku, 2006, p.227-231. (in Azerbaijani) Mammadov N. The theoretical principles of Azerbaijan linguistics. Baki: Maarif, 1971, 366 p. (in Azerbaijani) Akhundov A. The phonetics of Azerbaijani language. Baki: Maarif, 1984, 392 p. (in Azerbaijani) Sagisaka Y. Spoken Output Technologies. Overview// Survey of the state of the art in human language technology. Cambridge, 1997. Sharifova A.M The Computer Synhtesis of the Azerbaijani Speech / (Azerbaijani). Application of information-communication technologies in science and education. International conference. Baku, 2007. Volume II, p. 47-52. http://www.clsp.jhu.edu/ws99/projects/normal/ Yarowsky D., "Homograph Disambiguation in Text-to-Speech Synthesis" 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz, NY, pp. 244-247. Lobanov B.M. Retrospective review of researches and workings out of Laboratory of recognition and speech synthesis. “Automatic recognition and speech synthesis”, I C NAS Belarus, Minsk, 2000.-S.6-23. (In Russian) Kurematsu, M., Hakura, J., Fujita, H., The Framework of the Speech Communication System with Emotion Processing, Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, Corfu Island, Greece, February 16-19, 2007, 46-52