Deep Learning Techniques in Tandem With Signal Processing For Phonetic Segmentation
1 Introduction
3 Proposed system
4 Segmentation
6 Conclusion
7 References
In India, spoken languages belong to several language families, the major ones being
Indo-Aryan and Dravidian.
However, the amount of available digital resources, in terms of parallel speech and text
corpora, is very small [Post et al., 2012; Joy et al., 2014]. No single language is
spoken across the entire country. Hence, in the context of speech synthesis, separate TTS
systems are needed for these languages.
Objective of Phonetic Segmentation:
• Accurate phonetic segmentation is essential for building high-quality text-to-speech (TTS) synthesis systems.
• Proper segmentation of speech into phonetic units improves the naturalness and intelligibility of synthesized speech.
Challenges in Indian Languages:
• Indian languages have complex phonetic and prosodic structures, making accurate segmentation challenging.
• Traditional phonetic segmentation methods struggle with the diversity and richness of Indian languages' phonetic inventories.
1. Inaccurate Boundaries
The main drawback of GMM-HMM based forced alignment is that the phoneme boundaries
are not explicitly modeled, resulting in inaccurate boundary locations.
Accurate phonetic segmentation becomes a problem when only the phoneme sequences
are available and not their boundary locations. For syllable-timed languages, signal
processing cues that are agnostic to the speaker can be used to obtain syllable boundaries.
Boundary Refinement
For syllable-timed languages, minimum phase group delay (GD) based processing of the
short-term energy (STE) contour is used for obtaining syllable boundaries.
Although GD based segmentation gives accurate syllable boundaries, it introduces a
number of spurious boundaries for syllables starting or ending with a fricative or
nasal, and for syllables that start with a semivowel or affricate. A sketch of the
GD-based procedure is given below.
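As an illustration, here is a minimal Python/NumPy sketch of GD-based processing of the STE contour. The frame length, hop, window-scale factor, and the simple peak picking are illustrative assumptions, not the exact values or post-processing used in the source.

```python
import numpy as np
from scipy.signal import find_peaks

def short_term_energy(x, frame_len=400, hop=160):
    # Frame-wise energy; 25 ms frames with 10 ms hop at 16 kHz (assumed).
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop:i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def min_phase_group_delay(env, window_scale=0.8):
    # Treat the contour as one half of an even magnitude "spectrum", keep
    # the causal part of its inverse DFT (a minimum-phase-like signal),
    # and compute the group delay of that signal.
    spec = np.concatenate([env, env[::-1]])
    c = np.real(np.fft.ifft(spec))              # pseudo time-domain signal
    n = len(c)
    causal = np.zeros(n)
    causal[: int(window_scale * n // 2)] = 1.0  # causal lifter (resolution knob)
    x = c * causal
    X = np.fft.fft(x)
    Y = np.fft.fft(np.arange(n) * x)            # FFT of n*x[n], used for d(phase)/dw
    gd = np.real(X * np.conj(Y)) / (np.abs(X) ** 2 + 1e-12)
    return gd[: len(env)]

def syllable_boundaries(signal):
    ste = short_term_energy(signal)
    # Invert the STE contour so that energy valleys (syllable boundaries)
    # become peaks of the group delay function.
    env = ste.max() - ste
    gd = min_phase_group_delay(env / (env.max() + 1e-12))
    peaks, _ = find_peaks(gd)
    return peaks                                # boundary frame indices
```

The window-scale factor trades resolution against smoothness: a larger causal window resolves closely spaced syllables but admits more spurious peaks.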
Block 1
Syllable boundary correction using signal processing cues (GD of STE, and SBSF) after
GMM-HMM flat start initialisation is shown in this block.
Block 2
Acoustic models are built by training the neural networks with the GMM-HMM monophone
alignment (also known as HMM-based phone alignment) as the initial alignment. The
DNN-HMM/CNN-HMM models are then trained iteratively to get accurate final phone boundaries.
Block 3
The proposed framework, where the boundaries obtained using DNNs/CNNs are further
corrected using signal processing cues, is shown in this block.
Segmentation of speech data is performed at phone level using the following methods:
3. DNN Training:
The DNN weights are initialized layer by layer with the pre-trained RBM weights.
The DNN is trained using stochastic gradient descent with backpropagation.
A mini-batch size of 256 is used for training.
After each epoch, the network is tested on the cross-validation data to determine whether
to accept or reject the model. If the model is rejected, the learning rate is halved for the
next epoch, as sketched below.
Source: https://www.iitm.ac.in/donlab/is2017/seg.php
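For illustration, a minimal PyTorch-style sketch of this accept/reject schedule (halving the learning rate whenever validation loss fails to improve). The initial learning rate, epoch count, and the model/loader objects are assumptions, not values from the source recipe.

```python
import torch
import torch.nn as nn

def train_with_lr_halving(model, train_loader, val_loader, n_epochs=20, lr=0.008):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_val = float("inf")
    best_state = {k: v.clone() for k, v in model.state_dict().items()}
    for epoch in range(n_epochs):
        model.train()
        for feats, targets in train_loader:       # mini-batches of 256
            opt.zero_grad()
            loss = loss_fn(model(feats), targets)
            loss.backward()                       # backpropagation
            opt.step()                            # stochastic gradient descent
        # Test on validation data: accept the model if loss improved,
        # otherwise reject it and halve the learning rate for the next epoch.
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(f), t).item() for f, t in val_loader)
        if val < best_val:
            best_val = val
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            model.load_state_dict(best_state)     # reject: roll back weights
            for g in opt.param_groups:
                g["lr"] /= 2                      # halve learning rate
    return model
```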
2. Convolutional layer:
Two convolutional layers are used, with 1024 nodes in each layer.
The convolutional window is of dimension 8.
A pooling window of size 3, with no overlap between pooling windows, is used in the
pooling layer.
The first and second convolutional layers use 256 and 128 feature maps, respectively.
A model sketch is given below.
3. Fully connected layer:
Source: https://www.iitm.ac.in/donlab/is2017/seg.php
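A minimal PyTorch sketch of a CNN front end consistent with the description above. The input arrangement (43 coefficients per frame, i.e. 40 filter-bank + 3 pitch, along the sequence axis, with the 11 spliced frames as channels), the placement of the 1024-node layer in the fully connected stage, and the number of output targets are all assumptions; the original recipe may arrange these differently.

```python
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    def __init__(self, n_targets=144):             # n_targets is hypothetical
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(11, 256, kernel_size=8),     # 256 feature maps, window 8
            nn.ReLU(),
            nn.MaxPool1d(3),                       # pool size 3, no overlap
            nn.Conv1d(256, 128, kernel_size=8),    # 128 feature maps, window 8
            nn.ReLU(),
        )
        self.fc = nn.Sequential(                   # fully connected stage
            nn.Flatten(),
            nn.Linear(128 * 5, 1024), nn.ReLU(),   # 1024-node hidden layer
            nn.Linear(1024, n_targets),
        )

    def forward(self, x):                          # x: (batch, 11, 43)
        return self.fc(self.conv(x))
```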
1 Datasets Used
Parametric speech synthesis systems also require accurate segmentation of the training data at
the phone level for training a good model.
Acoustic modeling using DNNs has shown great promise in the context of ASR for many
languages. Nevertheless, the phone boundaries obtained are still not accurate enough for
speech synthesis systems.
Sample test utterances used for the evaluation are available at
https://www.iitm.ac.in/donlab/is2017/seg.php.
Some possible extensions for future work are listed below.
• Since the parser uses CLS, TTS systems with code-switching and code-mixing across languages can be created.
• Only around 10-25 percent of the boundaries are detected correctly using signal processing cues alone. A detailed acoustic analysis may reveal more rules and techniques.
[2] A. Sethy and S. S. Narayanan, "Refined speech segmentation for concatenative speech synthesis," in INTERSPEECH, 2002.
[3] A. W. Black and J. Kominek, "Optimizing segment label boundaries for statistical speech synthesis," in ICASSP, 2009, pp. 3785-3788.
[4] Y.-J. Kim and A. Conkie, "Automatic segmentation combining an HMM-based approach and spectral boundary correction," in ICSLP, 2002, pp. 145-148.
[5] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351-356, 1990.
[6] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[7] H.-Y. Lo and H.-M. Wang, "Phonetic boundary refinement using support vector machine," in ICASSP, vol. 4, 2007, pp. IV-933.
[8] K.-S. Lee, "MLP-based phone boundary refining for a TTS database," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 981-989, 2006.
[9] A. Stolcke, N. Ryant, V. Mitra, J. Yuan, W. Wang, and M. Liberman, "Highly accurate phonetic segmentation using boundary correction models and system fusion," in ICASSP, 2014, pp. 5552-5556.
[10] V. K. Prasad, T. Nagarajan, and H. A. Murthy, "Automatic segmentation of continuous speech using minimum phase group delay functions," Speech Communication, vol. 42, no. 3-4, pp. 429-446, 2004.
Part-A
Q.1 What are the major language families in India, and why is it challenging to create a
single TTS system for all Indian languages?
Ans. The major language families in India are Indo-Aryan and Dravidian. Since no single language
is spoken across the entire country, creating a single TTS system is challenging. Each language has its
own phonetic and prosodic structure, requiring separate TTS systems for accurate and natural speech
synthesis.
Q.2 What is segmentation?
Ans. Segmentation is the process of finding the time-aligned sequence of phones/syllables in a
speech utterance.
Q.3 What are the main limitations of GMM-HMM based segmentation for Indian Languages?
Ans. The main limitations of GMM-HMM based segmentation are inaccurate phoneme boundaries,
the need for manual correction, and the lack of hand-labeled data. These limitations make the
segmentation process inefficient and error-prone.
Q.4 How can signal processing cues be leveraged for phonetic segmentation in syllable-timed
Indian Languages?
Ans. Signal processing cues, such as minimum phase group delay (GD) of short-term energy (STE),
can be used to estimate syllable boundaries without requiring explicit boundary labels.
Q.5 How are inaccurate syllable boundaries handled for fricatives, affricates, nasals and
semivowels?
Ans. Spectral flux is used to address the issue of inaccurate syllable boundaries in the context of
fricatives, affricates, nasals and semivowels. A modified version of spectral flux, called sub-band
spectral flux (SBSF), is used as a cue for boundary correction, as sketched below.
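A minimal NumPy sketch of sub-band spectral flux, assuming a magnitude spectrogram as input and an illustrative three-band split; the actual band edges, weighting, and flux definition in the source may differ.

```python
import numpy as np

def sub_band_spectral_flux(S, bands=((0, 64), (64, 128), (128, 257))):
    # S: magnitude spectrogram of shape (n_frames, n_bins).
    # Returns per-band frame-to-frame flux of shape (n_frames - 1, n_bands).
    flux = np.zeros((S.shape[0] - 1, len(bands)))
    for j, (lo, hi) in enumerate(bands):
        d = S[1:, lo:hi] - S[:-1, lo:hi]          # spectral change per band
        flux[:, j] = np.sqrt((d ** 2).sum(axis=1))
    return flux
```

Restricting the flux to sub-bands is what makes it useful near fricatives and affricates, whose energy changes are concentrated in the higher bands.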
Q.6 What are the main characteristics of the CNN-HMM segmentation approach?
Ans. CNN-HMM segmentation uses 40-dimensional filter bank features with 3 pitch coefficients,
spliced over 11 frames. It has two convolutional layers with 1024 nodes each, using an 8-sized
convolutional window.
Q.7 What is a Restricted Boltzmann Machine (RBM), and how is it used in DNN training?
Ans. A Restricted Boltzmann Machine (RBM) is a type of generative stochastic neural network that is
often used as a building block for training Deep Neural Networks (DNNs), especially in unsupervised
or semi-supervised learning tasks. In DNNs, RBMs are commonly used in the pre-training phase to
initialize the weights of the network, which helps improve the convergence and performance of the
model.
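For illustration, a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a Bernoulli RBM, the usual pre-training step; array sizes and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_h, b_v, lr=0.01):
    # v0: batch of visible vectors (batch, n_visible); W: (n_visible, n_hidden).
    # Positive phase: hidden probabilities and samples driven by the data.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step (reconstruct visibles, re-infer hiddens).
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # Update parameters from the difference of data and model correlations.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    b_v += lr * (v0 - pv1).mean(axis=0)
    return W, b_h, b_v
```

Stacking RBMs trained this way, layer by layer, yields the initial DNN weights mentioned in the DNN training section.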
Q.8 Why is accurate phonetic segmentation important for speech synthesis?
Ans. Phonetic segmentation is crucial because it directly impacts the accuracy of phoneme
boundaries, which in turn affects the naturalness and intelligibility of the synthesized speech.
Q.10 How does minimum phase group delay (GD) help in segmentation?
Ans. GD-based processing of short-term energy (STE) helps in obtaining accurate syllable boundaries
but introduces spurious boundaries for certain sounds.
Part-B
Q.1 What are the rules for boundary correction? Explain with example.
Ans. The correction of the boundary between two syllables, syllable 1 and syllable 2, is performed
on the basis of the end phone of syllable 1 and the start phone of syllable 2. The following
correction rules are applied for obtaining accurate syllable boundaries:
Rule 1: The boundary between syllable 1 and syllable 2 is corrected using STE if syllable 1 does not
end with a fricative or nasal and syllable 2 does not begin with a fricative, affricate, nasal or
semivowel.
Rule 2: The boundary between syllable 1 and syllable 2 is corrected using SBSF if either the end
phone of syllable 1 or the start phone of syllable 2 is a fricative or an affricate, but not both.
• Consider the syllables "ka" (syllable 1) and "pa" (syllable 2) from the word "kapal". The
boundary between "ka" and "pa" needs to be corrected.
i. Syllable 1: "ka" ends with the vowel /a/, which is not a fricative or nasal.
ii. Syllable 2: "pa" begins with the sound /p/, a stop consonant (not a fricative,
affricate, nasal, or semivowel).
• Boundary Correction: Since neither the end phone of syllable 1 nor the start phone of
syllable 2 is a fricative, nasal, affricate or semivowel, we apply STE (short-term energy) to
correct the boundary. This correction is based on identifying where the energy drops after
"ka" and rises again before "pa".
• Consider the syllables "fas" (syllable 1) and "ter" (syllable 2) from the word "faster".
• Boundary Correction: Since syllable 1 ends with a fricative (/s/) and syllable 2 does not
begin with one, we apply SBSF (sub-band spectral flux) to correct the boundary. This is because
fricatives have distinctive high-frequency energy, and SBSF helps capture the precise transition
from the /s/ sound in "fas" to the stop /t/ sound in "ter".
Q.2 What is the role of acoustic cues in the proposed framework?
Ans.
• Enhances Phoneme Segmentation: Acoustic cues play a critical role in improving the accuracy of
phoneme segmentation within the framework.
• Uses Group Delay (GD): The framework leverages group delay (GD) processing of short-term
energy (STE) together with sub-band spectral flux (SBSF) to improve syllable boundary
detection.
• Identifies Syllable Boundaries: These acoustic cues help in identifying potential syllable
boundaries within the speech signal.
• Bridges the Gap: The integration of acoustic cues bridges the gap between the initial phone
alignment obtained from the GMM-HMM method and the final, more accurate phoneme boundaries.
Q.3 What are the main components involved in the TTS process?
Ans.
1. Text Analysis: This stage processes the input text to understand its structure, including
punctuation, sentence boundaries, and homographs.
2. Linguistic Processing: This component converts the analyzed text into phonetic representations,
determining pronunciation and stress patterns.
3. Prosody Generation: This involves adding intonation, rhythm, and timing to the speech output,
enhancing naturalness.
4. Speech Synthesis: The final component generates the actual audio. Techniques include
concatenative synthesis (stringing together pre-recorded speech segments) and parametric synthesis
(using algorithms to create speech).
Part-C
Q.1 Explain Iterative approach of Signal processing cues in tandem with DNN/CNN-HMM?
Draw block diagram.
Ans. Block I: Syllable boundary correction using signal processing cues (GD of STE, and SBSF)
after GMM-HMM flat start initialisation is shown in Block I of Figure 2.
• Acoustic models are built by training the neural networks with the GMM-HMM monophone
alignment (also known as HMM-based phone alignment) as the initial alignment.
• The DNN-HMM/CNN-HMM models are then trained iteratively to get accurate final phone boundaries.
This is shown in Block II of Figure 2.
• The number of iterations is set to 8 empirically, as the phone boundaries do not change much
afterward.
• Similar to segmentation using deep networks, GMM-HMM monophone alignment is used as the
initial phone alignment. These phone alignments are corrected, either forward or backward,
using GD corrected phone boundaries.
• The boundary corrected phone alignments are then used for training neural networks.
• The alignments obtained after deep network training are again corrected using GD corrected
phone boundaries, and this process is repeated 8 times iteratively.
• After the 8th iteration, the phone alignment obtained from the deep networks is corrected once
more using GD corrected phone boundaries, as shown in Figure 2. A high-level sketch of this
loop is given below.
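A high-level Python sketch of the iterative procedure. The helper functions (flat_start_align, gd_boundary_correct, train_dnn_hmm, forced_align) are hypothetical placeholders standing in for the GMM-HMM, signal processing, and deep network stages; this outlines the control flow of Figure 2, not a real toolkit API.

```python
N_ITERS = 8  # boundaries change little after 8 iterations (empirical)

def iterative_segmentation(speech, transcripts):
    # Block I: GMM-HMM flat-start alignment, then boundary correction
    # using signal processing cues (GD of STE, and SBSF).
    alignment = flat_start_align(speech, transcripts)    # GMM-HMM monophone
    alignment = gd_boundary_correct(alignment, speech)   # forward/backward correction

    # Blocks II/III: retrain the deep network on corrected alignments,
    # realign, and re-apply the signal processing correction, iteratively.
    for _ in range(N_ITERS):
        model = train_dnn_hmm(speech, alignment)         # DNN-HMM or CNN-HMM
        alignment = forced_align(model, speech, transcripts)
        alignment = gd_boundary_correct(alignment, speech)
    return alignment                                     # final phone boundaries
```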
Q.2 Write DNN-HMM and CNN-HMM algorithms for phonetic segmentation of speech data?
Ans.
Algorithm 1: DNN-HMM segmentation
1. Input Features: 40-dimensional filter bank features are used as input to the DNN. The
features are spliced over 11 frames to add context information for the DNN.
Algorithm 2: CNN-HMM segmentation
1. Input Features: 40-dimensional filter bank features with 3 pitch coefficients are given as
input to the network. The features are spliced over 11 frames to add context information
to the CNN training.
2. Convolutional layer:
• Two convolutional layers are used, with 1024 nodes in each layer.
• The convolutional window is of dimension 8.
• A pooling window of size 3, with no overlap between pooling windows, is used in the pooling layer.
• The first and second convolutional layers use 256 and 128 feature maps, respectively.