
Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages

Jitesh Kumawat (22/301)


Chitransh Porwal (22/277)
Ayush Sharma (23/304)
Vikram Singh Fojdar (23/741)

Under the Guidance of


Prof. (Dr.) Harish Sharma, HOD

Dept. of Computer Science and Engineering


Rajasthan Technical University, Kota
Group 01 (RTU) BTech Project August 28, 2024 1 / 22
Overview

1 Introduction

2 Leveraging Signal Processing Cues

3 Proposed system

4 Segmentation

5 Experiments and Results

6 Conclusion

7 References



Introduction

In India, spoken languages belong to several language families, the major ones being
Indo-Aryan and Dravidian.
Nevertheless, the amount of available digital resources in terms of parallel speech and text
corpora is very small [Post et al., 2012, Joy et al., 2014]. There is no single language
spoken in the entire country. Hence, in the context of speech synthesis, separate TTS
systems are needed for these languages.
Objective of Phonetic Segmentation:
Accurate phonetic segmentation is essential for building high-quality
text-to-speech (TTS) synthesis systems.
Proper segmentation of speech into phonetic units improves the
naturalness and intelligibility of synthesized speech.
Challenges in Indian Languages:
Indian languages have complex phonetic and prosodic structures,
making accurate segmentation challenging.
Traditional phonetic segmentation methods struggle with the diversity
and richness of Indian languages’ phonetic inventories.



Limitations of GMM-HMM Segmentation

1. Inaccurate Boundaries
The main drawback of GMM-HMM based forced alignment is that the phoneme boundaries
are not explicitly modeled, resulting in inaccurate boundary locations.

2. Manual Correction Required


The HMM-based forced alignment often requires manual correction after segmentation,
which is time-consuming and error-prone when building large multilingual TTS corpora.

3. Lack of Labeled Data


For Indian languages, hand-labeled data like the TIMIT corpus is not readily available, making it
challenging to train machine learning models to refine the HMM boundaries.



Leveraging Signal Processing Cues

Accurate phonetic segmentation becomes a problem when only the phoneme sequences
are available and not their boundary locations. For syllable-timed languages, signal
processing cues that are agnostic to the speaker can be used to obtain syllable boundaries.

Boundary Refinement
For syllable-timed languages, minimum phase group delay (GD) based processing of short-term energy (STE) is used for obtaining syllable boundaries.
Although GD based segmentation gives accurate syllable boundaries, it introduces a number of spurious boundaries for syllables starting or ending with a fricative or nasal, and for syllables that start with a semivowel or affricate.
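The GD-based processing of STE can be sketched as follows. This is a minimal illustration, not the exact algorithm of the cited work: frame sizes, scaling, and the cepstral windowing details are assumptions. The idea is that the inverted STE contour is treated as a magnitude spectrum, and the group delay of its minimum-phase equivalent is computed; peaks of that function mark candidate syllable boundaries.

```python
import numpy as np

def short_term_energy(x, frame_len=400, hop=160):
    """Frame-wise short-term energy of a speech signal."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def min_phase_group_delay(ste):
    """Treat the inverted STE contour as a magnitude spectrum and
    return the group delay of its minimum-phase equivalent.
    Peaks in this function are candidate syllable boundaries."""
    # Invert so energy valleys (syllable boundaries) become spectral peaks
    inv = 1.0 / (ste + 1e-8)
    # Symmetrize to form a valid magnitude spectrum
    spec = np.concatenate([inv, inv[::-1]])
    # Real cepstrum, then causal windowing to get the minimum-phase part
    cep = np.fft.ifft(np.log(spec + 1e-12)).real
    n = len(cep)
    win = np.zeros(n)
    win[0] = 1.0
    win[1:n // 2] = 2.0
    # Log spectrum of the minimum-phase signal; its imaginary part is the phase
    log_spec = np.fft.fft(cep * win)
    phase = np.unwrap(log_spec.imag)
    # Group delay = -d(phase)/d(omega)
    return -np.diff(phase)[:len(ste)]
```

A peak-picking step over the returned contour (with a threshold, also unspecified here) would then yield the boundary frames.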



Contd....

Additionally, spectral flux is used to address inaccurate syllable boundaries in the context of fricatives, affricates, nasals and semivowels. A modified version of spectral flux, called sub-band spectral flux (SBSF), is used as a cue for boundary correction.
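A minimal sketch of spectral flux restricted to a sub-band, assuming a magnitude spectrogram as input; the exact sub-band edges and the rectification choice are illustrative assumptions, not the settings of the cited work:

```python
import numpy as np

def sub_band_spectral_flux(mag, lo_bin, hi_bin):
    """Frame-to-frame spectral flux within one frequency sub-band.
    mag: (n_frames, n_bins) magnitude spectrogram.
    Half-wave rectification keeps only spectral increases, so the flux
    peaks at transitions into high-energy bands (e.g. the high-frequency
    noise of fricatives)."""
    band = mag[:, lo_bin:hi_bin]
    d = np.diff(band, axis=0)
    return np.sum(np.maximum(d, 0.0) ** 2, axis=1)
```

Computed over a high-frequency band, this gives a boundary cue that reacts to fricative onsets where plain STE does not.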
Boundary Correction Rules:
The correction of the boundary between two syllables, syllable 1 and syllable 2, is performed on the basis of the end phone of syllable 1 and the start phone of syllable 2. The following correction rules are applied for obtaining accurate syllable boundaries:
Rule 1: The boundary between syllable 1 and syllable 2 is corrected using STE if syllable 1 does not end with a fricative or nasal and syllable 2 does not begin with a fricative, affricate, nasal or semivowel.
Rule 2: The boundary between syllable 1 and syllable 2 is corrected using SBSF if either the end phone of syllable 1 or the start phone of syllable 2 is a fricative or an affricate, but not both.
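The two rules can be expressed as a small cue-selection function. The phone sets below are illustrative subsets (the full inventories are language-specific and not listed here), and the fallback of keeping the uncorrected HMM boundary when neither rule fires is an assumption:

```python
# Illustrative phone classes; real inventories are language-specific
FRICATIVES = {"s", "sh", "f", "h"}
AFFRICATES = {"ch", "jh"}
NASALS = {"m", "n", "ng"}
SEMIVOWELS = {"y", "w", "l", "r"}

def correction_cue(end_phone, start_phone):
    """Pick the signal processing cue used to correct the boundary between
    syllable 1 (ending in end_phone) and syllable 2 (starting with start_phone)."""
    # Rule 1: STE when syllable 1 does not end in a fricative/nasal and
    # syllable 2 does not start with a fricative/affricate/nasal/semivowel
    if (end_phone not in FRICATIVES | NASALS and
            start_phone not in FRICATIVES | AFFRICATES | NASALS | SEMIVOWELS):
        return "STE"
    # Rule 2: SBSF when exactly one side is a fricative or an affricate
    fric_aff = FRICATIVES | AFFRICATES
    if (end_phone in fric_aff) != (start_phone in fric_aff):
        return "SBSF"
    # Otherwise keep the uncorrected (HMM) boundary -- an assumption here
    return "HMM"
```

For example, a stop-to-stop junction ("k" then "p") selects STE, while a fricative-to-stop junction ("s" then "t") selects SBSF.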



Contd...

Figure: Boundary correction using STE and SBSF.


Source: https://www.iitm.ac.in/donlab/is2017/seg.php
Proposed system

Figure: Block diagram of proposed system


Proposed system

Block 1
Syllable boundary correction using signal processing cues (GD of STE, and SBSF) after GMM-HMM flat start initialisation is shown in this block.

Block 2
Acoustic models are built by training the neural networks with the GMM-HMM monophone alignment (also known as HMM-based phone alignment) as the initial alignment. The DNN-HMM/CNN-HMM models are then trained iteratively to get accurate final phone boundaries.

Block 3
The proposed framework, where the boundaries obtained using DNNs/CNNs are further corrected using signal processing cues, is shown in Block III.



Segmentation

Segmentation of speech data is performed at phone level using the following methods:

1 DNN-HMM without any boundary correction

2 CNN-HMM without any boundary correction

3 DNN-HMM with boundary correction based on signal processing cues

4 CNN-HMM with boundary correction based on signal processing cues



Algorithm 1: DNN-HMM segmentation
1. Input Features: 40-dimensional filter bank features are used as input to the DNN. The features are spliced over 11 frames to add context information.

2. RBM Pretraining (6 layers):

A layer-by-layer training of RBMs is performed.
The first layer is a Gaussian-Bernoulli RBM and is trained with an initial learning rate of 0.01.
The remaining layers are Bernoulli-Bernoulli RBMs and are trained with an initial learning rate of 0.4.

3. DNN Training:
The DNN weights are initialized layer by layer with the pre-trained RBM weights.
The DNN is trained using stochastic gradient descent with backpropagation.
A mini-batch size of 256 is used for training.
After each epoch, the network is tested on the validation data to determine whether
to accept or reject the model. If the model is rejected, the learning rate is halved for the
next epoch.
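Two details of the algorithm above can be made concrete: frame splicing over 11 frames, and the accept/reject learning-rate schedule. This is a sketch; the edge-padding behaviour of the splicer is an assumption, since the source does not specify it:

```python
import numpy as np

def splice(feats, context=5):
    """Splice each frame with its +/-context neighbours (11 frames total
    for context=5), repeating the first/last frame at the edges.
    feats: (n_frames, dim) array -> (n_frames, (2*context+1)*dim)."""
    n, d = feats.shape
    padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                        feats,
                        np.repeat(feats[-1:], context, axis=0)])
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

def next_lr(lr, prev_val_err, val_err):
    """Accept/reject rule from step 3: if validation error did not improve,
    the epoch's model is rejected and the learning rate is halved."""
    return lr if val_err < prev_val_err else lr * 0.5
```

For 40-dimensional filter bank features, splicing produces 11 x 40 = 440-dimensional network inputs.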



Segmentation

Figure source: https://www.iitm.ac.in/donlab/is2017/seg.php



Segmentation
Algorithm 2: CNN-HMM segmentation
1. Input Features: 40-dimensional filter bank features with 3 pitch coefficients are given as input to the network. The features are spliced over 11 frames to add context information for CNN training.

2. Convolutional layers:

Two convolutional layers are used, with 1024 nodes in each layer.
The convolutional window is of dimension 8.
A pooling window of size 3, with no overlap between pooling windows, is used in the pooling layer.
The first and second convolutional layers use 256 and 128 feature maps, respectively.

3. Fully connected layers:

Four fully connected layers with 1024 nodes in each hidden layer are used.
The fully connected layers are trained by first performing pre-training, followed by
iterative training using the features extracted through the CNN layers.
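Assuming convolution and pooling are applied along the 40 filter-bank bins (the axis is not stated explicitly above), the layer sizes work out as follows; this is a dimension walkthrough, not the actual network code:

```python
def conv_len(n, kernel):
    """Output length of a valid (no-padding, stride-1) convolution."""
    return n - kernel + 1

def pool_len(n, pool):
    """Output length of non-overlapping pooling of size `pool`."""
    return n // pool

bins = 40                  # filter-bank dimension at the input
bins = conv_len(bins, 8)   # first conv, window 8  -> 33
bins = pool_len(bins, 3)   # pooling, size 3, no overlap -> 11
bins = conv_len(bins, 8)   # second conv, window 8 -> 4
# With 128 feature maps in the second layer, the fully connected
# stack would see 4 * 128 = 512 convolutional features per frame.
```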



Segmentation

Figure source: https://www.iitm.ac.in/donlab/is2017/seg.php



Segmentation
3: DNN-HMM with boundary correction (DNN-BC)

The initial monophone alignment (GMM-HMM flat start initialisation) is modified with GD-corrected phone boundaries. This corrected monophone alignment becomes the initial alignment for DNN training.
After each iteration, the phone alignment is corrected as explained in Block III of Figure 2.
After the 8th iteration, the boundaries are again corrected using GD-corrected boundaries to get the final phone alignment.

4: CNN-HMM with boundary correction (CNN-BC)

Boundary correction is performed similarly to DNN-HMM with boundary correction, as explained above.



Experiments and Results

1 Datasets Used

The experiments are conducted on five Indian languages.

A subset of Indic database is used for the experiments.

The details of the data sets used are given in Table 1.

The utterances are recorded by a single native speaker of the corresponding language in a noise-free studio environment at a sampling rate of 48 kHz, 16 bits per sample. For grapheme-to-phoneme conversion of the native text, a unified parser for Indian languages is used.





Conclusion

Parametric speech synthesis systems also require accurate segmentation of the training data at
phone level for training a good model.

Acoustic modeling using DNNs has shown great promise in the context of ASR for many languages. Nevertheless, the phone boundaries are still inaccurate for speech synthesis systems.

In this paper, an attempt is made to improve the boundaries obtained in a DNN-HMM/CNN-HMM system using signal processing cues.

TTS systems are built using the obtained phoneme segments.

Sample test utterances used for the evaluation are available at https://www.iitm.ac.in/donlab/is2017/seg.php.



Future Work

Some of the possible extensions for future work are listed below.

More language-specific rules can be explored.

New languages can be added to the unified parser.

Since the parser uses a common label set (CLS), TTS systems with code-switching and code-mixing across languages can be created.

Only around 10-25 per cent of the boundaries are detected correctly using signal processing cues. A detailed acoustic analysis may reveal more rules and techniques.



References
[1] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech
Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

[2] A. Sethy and S. S. Narayanan, “Refined speech segmentation for concatenative speech
synthesis.” in INTERSPEECH, 2002.

[3] A. W. Black and J. Kominek, “Optimizing segment label boundaries for statistical speech synthesis,” in ICASSP, 2009, pp. 3785–3788.

[4] Y.-J. Kim and A. Conkie, “Automatic segmentation combining an HMM-based approach and spectral boundary correction,” in ICSLP, 2002, pp. 145–148.

[5] V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.

[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke,
P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech
recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol.
29, no. 6, pp. 82–97, 2012.

[7] H.-Y. Lo and H.-M. Wang, “Phonetic boundary refinement using support vector machine,”
in ICASSP, vol. 4, 2007, pp. IV–933.



References

[8] K.-S. Lee, “MLP-based phone boundary refining for a TTS database,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 981–989, 2006.

[9] A. Stolcke, N. Ryant, V. Mitra, J. Yuan, W. Wang, and M. Liberman, “Highly accurate
phonetic segmentation using boundary correction models and system fusion,” in ICASSP, 2014,
pp. 5552–5556.
[10] V. K. Prasad, T. Nagarajan, and H. A. Murthy, “Automatic segmentation of continuous speech using minimum phase group delay functions.”


The End



Questions
Part-A (2 Marks)
Q.1 What are the major language families in India, and why is it challenging to create a single
TTS system for all Indian languages?

Q.2 Define phonetic segmentation?

Q.3 What are the main limitations of GMM-HMM based segmentation for Indian Languages?

Q.4 How can signal processing cues be leveraged for phonetic segmentation in syllable-timed
Indian Languages?

Q.5 What is sub-band spectral flux (SBSF)?

Q.6 What are the main characteristics of the CNN-HMM segmentation approach?

Q.7 Draw schematic diagram of CNN.

Q.8 What is the RBM in DNN?

Q.9 Why is accurate phonetic segmentation important for TTS systems?

Q.10 How does minimum phase group delay (GD) help in segmentation?



Questions
Part-B (5 Marks)
Q.1 What are the rules for boundary correction? Explain with example.

Q.2 What is the role of acoustic cues in the proposed framework?

Q.3 What are the main components involved in the TTS process?

Part-C (10 Marks)

Q.1 Explain Iterative approach of Signal processing cues in tandem with DNN/CNN-HMM? Draw
block diagram.

Q.2 Write DNN-HMM and CNN-HMM algorithms for phonetic segmentation of speech data?



Answers

Part-A
Q.1 What are the major language families in India, and why is it challenging to create a
single TTS system for all Indian languages?

Ans. The major language families in India are Indo-Aryan and Dravidian. Since no single language
is spoken across the entire country, creating a single TTS system is challenging. Each language has its
own phonetic and prosodic structure, requiring separate TTS systems for accurate and natural speech
synthesis.

Q.2 Define phonetic segmentation?

Ans. Segmentation is the process of finding the time-aligned sequence of phones/syllables in a speech utterance.

Q.3 What are the main limitations of GMM-HMM based segmentation for Indian Languages?

Ans. The main limitations of GMM-HMM based segmentation are inaccurate phoneme boundaries,
the need for manual correction, and the lack of hand-labeled data. These limitations hinder the
segmentation process, making it inefficient and error-prone.

Q.4 How can signal processing cues be leveraged for phonetic segmentation in syllable-timed
Indian Languages?

Ans. Signal processing cues, such as minimum phase group delay (GD) processing of short-term energy (STE), can be used to estimate syllable boundaries when only the phoneme sequences, and not the boundary locations, are available.

Q.5 What is sub-band spectral flux (SBSF)?

Ans. Spectral flux is used to address inaccurate syllable boundaries in the context of fricatives, affricates, nasals and semivowels. A modified version of spectral flux, called sub-band spectral flux (SBSF), is used as a cue for boundary correction.

Q.6 What are the main characteristics of the CNN-HMM segmentation approach?

Ans. CNN-HMM segmentation uses 40-dimensional filter bank features with 3 pitch coefficients,
spliced over 11 frames. It has two convolutional layers with 1024 nodes each, using an 8-sized
convolutional window.

Q.7 Draw schematic diagram of CNN.


Q.8 What is the RBM in DNN?

Ans. A Restricted Boltzmann Machine (RBM) is a type of generative stochastic neural network that is
often used as a building block for training Deep Neural Networks (DNNs), especially in unsupervised
or semi-supervised learning tasks. In DNNs, RBMs are commonly used in the pre-training phase to
initialize the weights of the network, which helps improve the convergence and performance of the
model.
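A minimal numpy sketch of the contrastive-divergence (CD-1) update used in such pretraining, for a Bernoulli-Bernoulli layer with the 0.4 learning rate from Algorithm 1 (the Gaussian-Bernoulli first layer differs slightly and is omitted; the seed and shapes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.4, rng=None):
    """One CD-1 update for a Bernoulli-Bernoulli RBM.
    v0: (batch, visible) data; W: (visible, hidden) weights;
    b, c: visible and hidden biases. Returns updated parameters."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: hidden probabilities given the data
    h0 = sigmoid(v0 @ W + c)
    h_sample = (rng.random(h0.shape) < h0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction)
    v1 = sigmoid(h_sample @ W.T + b)
    h1 = sigmoid(v1 @ W + c)
    # Gradient approximation: data statistics minus model statistics
    W = W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
    b = b + lr * (v0 - v1).mean(axis=0)
    c = c + lr * (h0 - h1).mean(axis=0)
    return W, b, c
```

In pretraining, each layer's RBM is trained this way on the previous layer's hidden activations, and the learned `W` initializes the corresponding DNN layer.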

Q.9 Why is accurate phonetic segmentation important for TTS systems?

Ans. Phonetic segmentation is crucial because it directly impacts the accuracy of phoneme
boundaries, which in turn affects the naturalness and intelligibility of the synthesized speech.

Q.10 How does minimum phase group delay (GD) help in segmentation?

Ans. GD-based processing of short-term energy (STE) helps in obtaining accurate syllable boundaries
but introduces spurious boundaries for certain sounds.

Part-B

Q.1 What are the rules for boundary correction? Explain with example.

Ans. The correction of the boundary between two syllables, syllable 1 and syllable 2, is performed on the basis of the end phone of syllable 1 and the start phone of syllable 2. The following correction rules are applied for obtaining accurate syllable boundaries:
Rule 1: The boundary between syllable 1 and syllable 2 is corrected using STE if syllable 1 does not end with a fricative or nasal and syllable 2 does not begin with a fricative, affricate, nasal or semi-vowel.

Rule 2: The boundary between syllable 1 and syllable 2 is corrected using SBSF if either the end phone of syllable 1 or the start phone of syllable 2 is a fricative or an affricate, but not both.

Example for Rule 1:

• Consider the syllables “ka” (Syllable 1) and “pa” (Syllable 2) from the word "kapal". The boundary between "ka" and "pa" needs to be corrected.

i. Syllable 1: "ka" ends with the sound /k/, which is a stop consonant (not a fricative or nasal).
ii. Syllable 2: "pa" begins with the sound /p/, which is also a stop consonant (not a fricative,
affricate, nasal, or semi-vowel).
• Boundary Correction: Since neither Syllable 1 nor Syllable 2 involves fricatives, nasals, or
affricates, we apply STE (Short-Time Energy) to correct the boundary. This correction is based on
identifying where the energy drops after "ka" and rises again before "pa".

Example for Rule 2:

• Consider the syllables “fas” (Syllable 1) and “ter” (Syllable 2) from the word "faster".

i. Syllable 1: "fas" ends with the sound /s/, which is a fricative.


ii. Syllable 2: "ter" starts with the sound /t/, which is a stop consonant (not a fricative or
affricate).

• Boundary Correction: Since Syllable 1 ends with a fricative (/s/), we apply SBSF (sub-band spectral flux) to correct the boundary. This is because fricatives have distinctive high-frequency energy, and SBSF helps capture the precise transition from the /s/ sound in "fas" to the stop /t/ sound in "ter".

Q.2 What is the role of acoustic cues in the proposed framework?

Ans.

• Enhances Phoneme Segmentation: Acoustic cues play a critical role in improving the accuracy of
phoneme segmentation within the framework.

• Uses Group Delay (GD): The framework leverages minimum phase group delay (GD) processing of short-term energy (STE), together with sub-band spectral flux (SBSF), to improve syllable boundary detection.

• Identifies Syllable Boundaries: These acoustic cues help in identifying potential syllable
boundaries within the speech signal.

• Refines Phone Alignment: By incorporating GD-corrected syllable boundaries, the framework refines the initial phone alignments, leading to more precise phoneme segmentation.

• Bridges the Gap: The integration of acoustic cues bridges the gap between the initial phone
alignment obtained from the GMM-HMM method and the final, more accurate phoneme boundaries.

Q.3 What are the main components involved in the TTS process?

Ans. The main components of a text-to-speech (TTS) system include:

1. Text Analysis: This stage processes the input text to understand its structure, including punctuation, sentence boundaries, and homographs.

2. Linguistic Processing: This component converts the analyzed text into phonetic representations, determining pronunciation and stress patterns.

3. Prosody Generation: This involves adding intonation, rhythm, and timing to the speech output, enhancing naturalness.

4. Speech Synthesis: The final component generates the actual audio. Techniques include concatenative synthesis (stringing together pre-recorded speech segments) and parametric synthesis (using algorithms to create speech).
Part-C

Q.1 Explain Iterative approach of Signal processing cues in tandem with DNN/CNN-HMM?
Draw block diagram.

Ans. Block-I: Syllable boundary correction using signal processing cues (GD of STE, and SBSF) after GMM-HMM flat start initialisation is shown in Block I of Figure 2.

• GD of STE and SBSF for Syllable Boundaries:
Group delay (GD) processing of short-term energy (STE) and sub-band spectral flux (SBSF) is used to accurately detect syllable boundaries for a subset of syllables.

• GMM-HMM Flat Start Segmentation:
The initial syllable boundaries are obtained through GMM-HMM flat start segmentation, followed by correction using GD of STE and SBSF.

• GD-Corrected Phone Boundaries:
The boundaries of the last phone within the corrected syllable boundaries are marked as GD-corrected phone boundaries.

• Syllable-to-Phone Dictionary:
A syllable-to-phone dictionary (syldict) is used to map syllable sequences to corresponding phoneme sequences.

Figure 2: Block diagram of proposed system

Block-II: DNN/CNN segmentation without boundary correction

• Acoustic models are built by training the neural networks with the GMM-HMM monophone
alignment (also known as HMM-based phone alignment) as the initial alignment.
• The DNN-HMM/CNN-HMM are then trained iteratively to get accurate final phone boundaries.
This is shown in Block II of Figure 2.
• The number of iterations is set to 8 empirically as the phone boundaries do not change much
afterward.

Block-III: DNN/CNN segmentation with boundary correction

• Similar to segmentation using deep networks, GMM-HMM monophone alignment is used as the
initial phone alignment. These phone alignments are corrected, either forward or backward,
using GD corrected phone boundaries.
• The boundary corrected phone alignments are then used for training neural networks.
• The alignments obtained after deep network training are again corrected using GD corrected
phone boundaries and this process is repeated 8 times iteratively.
• After the 8th iteration, the phone alignment obtained from the deep networks is corrected again using GD-corrected phone boundaries, as shown in Figure 2.
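The alternation described for Block-III can be summarized as a loop. Here `train_fn` (retrain the DNN/CNN and realign) and `correct_fn` (snap the alignment to GD-corrected phone boundaries) are stand-in callbacks for the actual steps, so this is a sketch of the control flow only:

```python
def iterative_segmentation(align, train_fn, correct_fn, iters=8):
    """Block-III control flow: correct the initial GMM-HMM alignment,
    then alternate deep-network training with GD-based boundary
    correction; the correction after the final iteration yields the
    final phone alignment."""
    align = correct_fn(align)       # correct the GMM-HMM flat start alignment
    for _ in range(iters):
        align = train_fn(align)     # retrain the network, produce a new alignment
        align = correct_fn(align)   # apply GD-corrected phone boundaries
    return align
```

With `iters=8`, the correction step runs once before training and once after each of the 8 training passes, matching the description above.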

Q.2 Write DNN-HMM and CNN-HMM algorithms for phonetic segmentation of speech data?
Ans.
Algorithm 1: DNN-HMM segmentation
1. Input Features: 40 dimensional filter bank features are used as input for DNN. The
features are spliced over 11 frames to add context information to DNN.

2. RBM Pretraining (6 layers):


• A layer by layer training of RBM is performed.
• The first layer of RBM is a Gaussian-Bernoulli layer and is trained with an initial learning
rate of 0.01.
• The rest of the layers are Bernoulli-Bernoulli layers and are trained with an initial learning
rate of 0.4.
• The momentum parameter is set to 0.9 and 20 epochs are used for training each layer.

3. DNN Training (6 layers):


• The DNN weights are layer by layer initialized with the pre-trained RBM weights.
• The DNN is trained using stochastic gradient descent using back propagation.
• A mini-batch size of 256 is used for training.
• After each epoch, the network is tested on the error-validation data to determine whether to
accept or reject the model. If the model is rejected the learning rate is halved for the next
epoch.

Algorithm 2: CNN-HMM segmentation

1. Input Features: 40 dimensional filter bank features with 3 pitch coefficients are given as
input to the network. The features are spliced over 11 frames to add context information
into CNN training.

2. Convolutional layer:
• Two convolutional layers are used with 1024 nodes in each layer.
• The convolutional window is of dimension 8.
• A pooling window of size 3 and no overlap of pooling window is used in pooling layer.
• The first and second convolutional layers use 256 and 128 feature maps, respectively.

3. Fully connected layer:


• Four fully connected layers with 1024 nodes in each hidden layer are used.
• The fully connected layers are trained by first performing pre-training, followed by iterative training using the features extracted through the CNN layers.
