2015 CS 069
N.A.K.H.S.Karunathilaka
2015/CS/069
.......................................... ................................................
Date Signature of the Student
.......................................... ................................................
Date Signature of the Supervisor
Abstract
Preface
This document has been written in partial fulfillment of the requirements
of the B.Sc. in Computer Science (Hons) Final Year Project in Computer Science (SCS4124). I was engaged in researching and writing this dissertation from
January 2019 to February 2020.
The basis for this research originally stemmed from my passion for contributing
to research relevant to my native language, Sinhala. Such work also contributes
to the preservation of the language. After discussing with my supervisor,
Mr. V. Welgama, we identified the problems that have emerged due to the unavailability of automatic speech recognition for the Sinhala language. As the world
moves further into the digital age, the importance of automatic speech recognition
rises, since it enables natural interfaces for both literate and illiterate users through
hands-free technology and can even aid hearing-impaired people.
To the best of my knowledge, research on ASR for the Sinhala language
using deep learning approaches has not been carried out so far. First, we identified
the main factors that make the ASR task challenging: the lack of resources and
the morphological richness of the Sinhala language. As described in Chapter 3, we
selected the DNN-HMM approach and designed the model architecture so that it
makes the best use of the available resources. We first pre-processed the data set
and trained the statistical baseline model. Thereafter I experimented with different
deep neural networks one by one and, based on the observations, modified and tuned
the architectures to obtain optimal performance. When analyzing the results it was
observed that a rich text corpus and more training data are necessary requirements
for a robust ASR system.
With the constant guidance and supervision of my supervisor and co-supervisor,
further conclusions were drawn on training and evaluating the models. This piece of
research would be a great source of knowledge for future research on Sinhala ASR
systems.
Acknowledgement
Contents
Declaration i
Abstract ii
Preface iii
Acknowledgement iv
Contents vii
List of Figures ix
List of Tables x
Acronyms xi
1 Introduction 1
1.1 Background to the Research . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Problem and Research Questions . . . . . . . . . . . . . . 2
1.3 Justification for the research . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.1 Speech Corpus . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.2 Pre-processing Raw Data . . . . . . . . . . . . . . . . . . . 5
1.4.3 Deep Neural Networks(DNNs) . . . . . . . . . . . . . . . . . 5
1.4.4 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 Delimitations of Scope . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Literature Review 8
2.1 Review on Sound characteristics and feature extraction . . . . . . 8
2.2 Review on different approaches to ASR . . . . . . . . . . . . . . . . 8
2.3 Review on Deep Neural Network algorithms for acoustic modeling 9
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Design 17
3.1 Acoustic Model (AM) . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.2 Getting alignments from GMM-HMM model . . . . . . . . . 19
Training algorithms for GMM-HMM model . . . . . . . . . 21
Determining hyper-parameters for triphone models . . . . . 21
3.1.3 DNN training . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Mini-batch Stochastic Gradient Descent . . . . . . . . . . . 22
Dropouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
RBM Pre-training . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4 Implementation 26
4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Implementing the baseline model - (GMM-HMM) . . . . . . . . . . 28
4.3 Implementing DNN models . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Pre-trained DNN model . . . . . . . . . . . . . . . . . . . . 29
4.3.2 Non pre-trained DNN model . . . . . . . . . . . . . . . . . . 29
4.3.3 TDNN models . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Research Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.1.1 Word Error Rate (WER) . . . . . . . . . . . . . . . . . . . 33
5.2 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.1 Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2.2 Results of GMM-HMM model . . . . . . . . . . . . . . . . . 35
5.2.3 Results of DNN models . . . . . . . . . . . . . . . . . . . . 35
5.2.4 Evaluation of results . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusions 43
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Conclusions about research questions and objectives . . . . . . . . . 43
6.3 Conclusions about research problem . . . . . . . . . . . . . . . . . . 45
6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.5 Implications for further research . . . . . . . . . . . . . . . . . . . . 46
References 47
Appendices 51
A Model Specifications 52
B Decoded text 55
List of Figures
2.1 WER(%) for the 11 shows from GMM-HMM and DNN-HMM KATS
systems (Fohr et al., 2017) . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Speech recognition results of different strategies of constructing deep
LSTM networks. (Li and Wu, 2014) . . . . . . . . . . . . . . . . . . 12
2.3 Results of deep grid LSTMs on four different speech corpora(AMI,
HKUST, GALE Mandarin, Arabic MGB).(Hsu et al., 2016) . . . . 12
2.4 The results on the Kaldi baseline model using a multi-layer percep-
tron (MLP).(Markovnikov et al., 2018) . . . . . . . . . . . . . . . . 13
2.5 The performance of BLSTM over LSTM and other models.(Markovnikov
et al., 2018) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Baseline vs TDNN on various LVCSR tasks with different amount
of training data. (Peddinti et al., 2015) . . . . . . . . . . . . . . . . 15
5.1 Four translated example sentences based on baseline GMM-HMM,
pre-trained DNN, non-pre-trained DNN, TDNN, and TDNN+LSTM.
Phrases in the bold green text show the exact matching compared
to the correct test sentence. Phrases in the bold red text show
the words that are incorrectly translated by the models while the
phrases highlighted in yellow show the word segmentation issues
and slight deviations . . . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Tables
Acronyms
Chapter 1
Introduction
Statistical approaches like GMM-HMM were the state of the art in speech
recognition in its early days. Recently, Deep Neural Networks (DNNs) have become
the flagship approach and have proven it by improving recognition results significantly (Hinton
et al., 2012). Moreover, DNNs have the capability to generalize and the ability
to discover and learn complex structures (Deng et al., 2013), (Hinton et al., 2012).
The use of deep learning (Du et al., 2016) neural architectures such as Deep Neural
Networks (DNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural
Networks (RNNs) for the speech recognition of English and European languages
has shown significant improvements compared with the conventional GMM-HMM
approach (Fohr et al., 2017), (Markovnikov et al., 2018).
a morphologically rich language, which makes the task more difficult, since such
languages may produce a vast number of word forms for a given root form.
The main research question addressed in this research is as follows.
• What deep neural architectures will perform well for Sinhala ASR with limited resources?
There is a vast range of deep neural architectures extended for tasks related to
speech recognition, voice detection, etc. The question arises as to which structures would provide the most accurate transcriptions to achieve our goal.
When addressing this question, two more essential facts need to be considered: the limited data set and the complexity of the Sinhala language. With
time constraints, it is often hard to collect a sufficient amount of resources.
Hence, by addressing this research question, this work will be able to identify the specifications that a deep neural network model should possess to
achieve the best possible accuracy with limited resources.
a minimal data set for training and testing. Thus their validity is again
questionable.
There exist numerous deep architectures that could be applied to the task of
speech recognition, as the possibilities are almost endless. This project intends to
implement a sufficient number of deep architectures for acoustic modeling in the
intended ASR system. Thus, those who are interested in this field can get an
overview of the performance of different deep architectures and of which architecture or approach is best suited for an under-resourced, morphologically rich
language like Sinhala. The project also intends to compare the performance of
deep neural architectures with the existing standard statistical methods. Therefore, other researchers can build a more advanced ASR system for the
Sinhala language.
1.4 Methodology
There are several approaches to developing an ASR system, such as GMM-HMM,
DNN-GMM-HMM, DNN-HMM (Pallavi Saikia and Open Learning, 2017), and end-to-end DNN (Zhang et al., 2017). Among them, two significant approaches are
DNN-HMM, a hybrid architecture which uses deep neural networks for acoustic
modeling along with a Hidden Markov Model (HMM), and end-to-end DNN, which
takes acoustic features as input and outputs transcriptions directly, thus relying
solely on a deep learning architecture.
The end-to-end approach (E2E ASR) is a single integrated approach with a much simpler
training pipeline, and it reduces training and decoding time. However, current
E2E ASR systems need orders of magnitude more training data than hybrid ASR
systems to achieve a similar word error rate (WER). When the training data is
limited, they have a propensity to overfit it. Thus, achieving high performance
becomes quite expensive, since a large amount of speech data must be fed to the
model and higher computing power is also needed. The DNN-HMM hybrid systems
comprise an acoustic model, a language model, and a pronunciation model. For
limited computation power and training data, this hybrid architecture can achieve
better results (Pallavi Saikia and Open Learning, 2017), (Fohr et al., 2017). Therefore,
this project intends to follow the DNN-HMM hybrid approach (Pallavi Saikia and
Open Learning, 2017), as it combines the strong learning power of DNNs with the
sequential modeling of HMMs.
The raw speech signals are first pre-processed into a vector of numeric values, a
process also known as feature extraction. For this, the raw speech signal is divided
into small portions, typically 25 ms frames shifted by 10 ms each time. A transformation
is then applied, since human hearing perception is not linear with frequency.
This process can be performed with Mel-Frequency Cepstral Coefficients
(MFCCs) or filter banks.
DNNs work similarly to the neurons in our brain. They consist of highly interconnected units known as neurons and form a data processing element. This
project mainly focuses on deep neural networks that support supervised learning.
Experiments with Deep Neural Networks (DNNs), Time-Delay Neural Networks
(TDNN/1-d CNN), and Long Short-Term Memory (LSTM) networks have been carried
out for acoustic modeling. These models take a window of frames containing
real-valued acoustic features as input and estimate the likelihoods of phones. The
output layer of these deep neural networks is then integrated with the Hidden
Markov Model (HMM). Using the observation probabilities produced by the
neural network, the HMM maps them to a sequence of phones. The DNN models are
trained and fine-tuned empirically by analyzing the recognition accuracy on the validation data set. The analysis is performed quantitatively by calculating the Word
Error Rate (WER) of the models.
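To make this metric concrete, the following is a minimal sketch of how WER can be computed by aligning a hypothesis against a reference transcription with edit distance; the two romanized word sequences are hypothetical placeholders, and the actual scoring in this work was performed by the toolkit used for the experiments.

```python
# Minimal WER sketch: Levenshtein distance between reference and hypothesis words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical romanized example: one substitution and one deletion over four words.
print(wer("mama gedara yanawa dan", "mama gedara giyaa"))   # 0.5
```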
A language model is built from a large amount of text data and aims to compute
the probability of a sequence of words in order to find the best word sequence for
the acoustic model output. In this research, the well-known n-gram modeling technique
is used to model the language. A corpus containing more than 20K sentences,
collected from phonetically balanced corpora, was used to generate the 220K-entry
grapheme lexicon.
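As a concrete illustration of the n-gram technique mentioned above, the following is a minimal count-based bigram sketch with add-one smoothing; the two sentences are hypothetical placeholders, and the actual language model in this work was built over the text corpus described above with standard n-gram tooling.

```python
from collections import Counter

# Hypothetical romanized sentences standing in for the text corpus.
corpus = [
    "<s> mama gedara yanawa </s>",
    "<s> mama paasal yanawa </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
vocabulary_size = len(unigrams)

def bigram_prob(w1: str, w2: str) -> float:
    # P(w2 | w1) with add-one (Laplace) smoothing over the vocabulary.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocabulary_size)

def sentence_prob(sentence: str) -> float:
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p

print(sentence_prob("<s> mama gedara yanawa </s>"))
```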
1.6 Definitions
• The GMM-HMM model is referred to by several names in this document, such as "baseline," "statistical," and "traditional."
1.7 Delimitations of Scope
Listed below are the boundaries faced when addressing the research question:
• Achieving 100% performance with the limited data set is not guaranteed,
since deep neural networks inherently perform better with large data sets.
1.8 Conclusion
This chapter laid the foundations for the dissertation. It introduced the general
focus area and the more specific research problem and research question addressed
in this research. Then the research was justified by analyzing the significance of
the study, the important factors in the methodology were briefly described and
justified, the dissertation was outlined, and the limitations were given. On
these foundations, the dissertation can proceed with a detailed description of the
research.
Chapter 2
Literature Review
a set of pre-recorded words or templates in order to find the best match (Saksamudre et al., 2015). Inefficiency in terms of both storage and computational
power, together with tedious speaker dependence, are the drawbacks of this method. The
knowledge-based approach uses phonetic, linguistic, and spectrogram information
(Saksamudre et al., 2015), but it suffers from the limitation that it
requires expert knowledge of the language. The dynamic time warping (DTW)
approach measures the similarity between two sequences which may vary in time
or speed, and the algorithm effectively copes with different vocalization
speeds (Saksamudre et al., 2015). Generally, DTW works well only for isolated word
recognition, which is a limitation. The best-known approach, which has been the state
of the art in speech recognition for several years, is the statistical approach.
There, the most widely used method is the traditional GMM-HMM hybrid system.
that the number of layers has only a slight influence on the speech recognition results.
Interestingly, the results show that increasing the number of hidden units leads
to an increase in WER, which may be caused by the limited amount of training
data. The results also showed that DNN-based acoustic models performed well
compared with the baseline GMM-HMM models, with a reduction of 5% in WER.
An overview of different architectures and training procedures for DNN-based
acoustic models for the task of recognizing French speech is presented in a research
paper by Dominique Fohr et al. (2017) (Fohr et al., 2017). The network is
implemented as an MLP with six hidden layers of 2048 neurons per layer. The
experimental dataset consists of 300 hours of manually transcribed shows from
French-speaking radio stations, of which 250 h were recorded in a studio and 50 h over
the telephone. As shown in figure 2.1, the results reveal a significant difference in
performance between the baseline GMM-HMM model and the DNN hybrid model,
which suggests that DNN-based acoustic models achieve better classification and
generalization ability.
Figure 2.1: WER(%) for the 11 shows from GMM-HMM and DNN-HMM KATS
systems (Fohr et al., 2017)
American English speakers. The results showed that pre-training is much more helpful
in deep neural nets than in shallow ones, mainly when limited amounts of labeled
training data are available. It also reduced over-fitting and the time required for
discriminative fine-tuning with back-propagation.
They have experimented with the AMUAV database (Samudravijaya et al.,
2000), and the results demonstrate that CD-DNN-HMMs outperform the conventional CD-GMM-HMM model, providing an improvement in word error rate of
3.1% over the traditional triphone model. Some recent studies (Kimanuka
and Buyuk, 2018), (Deka et al., 2018), (Saurav et al., 2018) that investigated low-resourced
languages such as Turkish, Assamese, and Bengali show
that the use of deep neural networks for acoustic modeling results in lower
WER compared with baseline GMM-HMM models.
The following literature review attempts to demonstrate the use of different
advanced neural network architectures for speech recognition and to clarify the best
results obtained from them.
Recently, many pieces of research have investigated Recurrent
Neural Networks (RNNs) for the task of speech recognition because of their ability
to utilize dynamically changing temporal information. Although deep RNNs have
been argued to be able to model temporal relationships at different time granularities, they suffer from vanishing gradient problems.
An exploration of novel approaches to constructing deep long short-term memory (LSTM) based deep recurrent neural networks is presented in the research
work by Xiangang Li (Li and Wu, 2014). The evaluations of different LSTM
networks have been carried out on a large-vocabulary Mandarin Chinese conversational
telephone speech recognition task. The results, as shown in figure 2.2, reveal that
deep LSTM architectures outperform standard shallow LSTM networks and DNNs.
Figure 2.2: Speech recognition results of different strategies of constructing deep
LSTM networks. (Li and Wu, 2014)
In the research study by Wei-Ning Hsu et al. (Hsu et al., 2016), they
experimented with deep grid LSTMs (Kalchbrenner et al., 2015) on four different
speech corpora (AMI, HKUST, GALE Mandarin, Arabic MGB), using the Kaldi
toolkit to generate Mel-scale log filter bank coefficients along with first and second
derivatives and a tri-gram language model. The Computational Network Toolkit
(CNTK) was used for the rest of the neural network training. They used the
prioritized Grid LSTM (pGLSTM) model to prioritize the depth dimension over
the temporal one, providing more updated information to the depth dimension.
The performances of the baseline models and the proposed models are summarized in figure
2.3. The results show that the different grid LSTM architectures
outperform the vanilla LSTM model and suggest that prioritizing the depth
dimension is essential for achieving better performance.
Figure 2.3: Results of deep grid LSTMs on four different speech corpora(AMI,
HKUST, GALE Mandarin, Arabic MGB).(Hsu et al., 2016)
for recognizing Russian speech. The acoustic models of their proposed work are
implemented as CNNs, modifications of LSTMs, Residual Networks, and Region-based
Convolutional Neural Networks (RCNNs). They experimented with the
models on more than 30 h of Russian speech. The authors have done a significant
job of analyzing each of the models and their results to come up with a best-suited
model for Russian speech recognition. The results on the Kaldi baseline model using
a multi-layer perceptron (MLP), as in figure 2.4, reveal that the best outcome is
achieved when the activation function is set to p-norm.
Figure 2.4: The results on the Kaldi baseline model using a multi-layer perceptron
(MLP). (Markovnikov et al., 2018)
They compared the results obtained from their LSTM model with a Bidirectional Long Short-Term Memory (BLSTM) model, which used Kaldi's nnet3
configuration. The performance of BLSTM over LSTM can be viewed in figure 2.5. According to the results, except for the CNN model (24.96%), the other models,
LSTM (23.32%), BLSTM (23.08%), PLSTM (24.12%), ResNet (22.17%),
and RCNN (22.56%), surpassed the baseline model (24.26%). A new model
combining an RCNN, a residual unit, max-pooling, and a BLSTM was implemented after
analyzing the best of the above models; it obtained the lowest WER of 22.07%,
a reduction of 7.5% WER compared to the Kaldi baseline.
Figure 2.5: The performance of BLSTM over LSTM and other mod-
els.(Markovnikov et al., 2018)
The experimental results have revealed that, although ResNet shows the best
results, it has been the slowest out of all.
A combination of CNN and BLSTM architectures for acoustic modeling is presented
in (Markovnikov et al., 2018) for Microsoft's conversational speech
recognition system on the Switchboard and CallHome domains. They applied
three CNN layers to the acoustic features at a time t and then applied six BLSTM layers to the
resulting time sequence. Unlike in the original BLSTM model, they included
the context of each time point as an input feature in the model.
A Time Delay Neural Network (TDNN) architecture for speech recognition was presented in
the research work by Peddinti, Povey, and Khudanpur (2015) (Peddinti et al., 2015), where the
researchers achieved significant improvements. Their proposed TDNN architecture
models long-term temporal dependencies with training times comparable to standard
feed-forward DNNs. During the training phase, the researchers used a sub-sampling
technique to reduce computation. The work presents results on several LVCSR
tasks to show the effectiveness of the TDNN architecture in learning broader
temporal dependencies in both small and large data scenarios. The results obtained
on the Switchboard task show a relative improvement of 6% in WER over the
baseline DNN model, which is significant. The performance of TDNNs on different
LVCSR tasks is indicated in figure 2.6.
Figure 2.6: Baseline vs TDNN on various LVCSR tasks with different amount of
training data. (Peddinti et al., 2015)
tional resources meet.
In the research paper by Irina Kipyatkova in 2017 (Subasa, n.d.), they
studied an application of time-delay neural networks (TDNNs) in acoustic modeling
for large-vocabulary continuous Russian speech recognition and compared it with a
baseline DNN model with p-norm activation functions implemented according to
Dan's implementation in Kaldi. Training of the acoustic models was carried out on
a Russian speech corpus containing phonetically balanced phrases with a duration
of 30 h. They created several TDNNs with a diverse number of hidden layers,
different temporal contexts, and splice indexes. The TDNN achieved the lowest
WER of 19.04% with five hidden layers and a time context of [−8, 8]. The use of
models with larger temporal contexts led to an increase in WER, which may also
be caused by over-training. The results show that the TDNN model surpassed
the results obtained by the baseline DNN with a relative WER reduction of 9%.
2.4 Summary
As a summary of the literature review presented in this chapter, the following points
can be noted.
Chapter 3
Design
As stated before, speech recognition systems characterize the acoustic information of a given audio signal and recognize its text version. Thus, the recognizer
needs to segment the audio signal into successive frames, output the corresponding
phone for each frame, and then transcribe the recognized phones into text. This
research design involves developing a method that takes the utterances of speakers
as audio signals and produces the corresponding text.
Figure 3.1 demonstrates the high-level architecture of the proposed solution for
speech recognition of the Sinhala language. The design process of the research
is described in detail below.
Figure 3.1: High-level architecture of the research design
1. acoustic model
2. lexicon
3. language model
$$ W^{*} = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} \qquad (3.1) $$
where $W^{*}$ is the most likely word sequence for the observed acoustic feature sequence $O$, $P(O \mid W)$ is estimated by the acoustic model, and $P(W)$ by the language model.
First, the raw audio files are pre-processed to obtain a vector of numeric values
containing the acoustic information, often referred to as 'Mel-frequency spectrograms.'
The speech features are extracted as 13-dimensional Mel-Frequency Cepstral
Coefficient (MFCC) features and normalized by standard cepstral mean and variance
normalization. For that, the raw audio data are segmented into 25 ms
frames shifted by 10 ms each time. Then, for each frame, a windowing function is
applied to extract the data. As the next step, a Fourier Transform is applied to
convert the samples from the time domain to the frequency domain, which is used
to compute the power spectrum.
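The following is a minimal sketch of this framing and MFCC computation using the librosa library, purely for illustration; the file name is hypothetical, and in this work the features were produced by Kaldi's own feature extraction pipeline.

```python
import librosa
import numpy as np

# Load a (hypothetical) 16 kHz recording.
signal, sr = librosa.load("speaker01_utt001.wav", sr=16000)

frame_length = int(0.025 * sr)   # 25 ms frames -> 400 samples
hop_length = int(0.010 * sr)     # 10 ms shift  -> 160 samples

# 13 MFCCs per frame: each frame is windowed, transformed with an FFT to get the
# power spectrum, warped onto the Mel scale, and reduced by a DCT of the log energies.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=frame_length, hop_length=hop_length)

# Cepstral mean and variance normalization per coefficient over the utterance.
mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
print(mfcc.shape)                # (13, number_of_frames)
```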
To tackle the problem of limited data available for the Sinhala language, the DNNs
are not trained directly from utterance-level transcriptions, since a DNN requires a
good initial approximation. Even with useful data, DNN training is tricky because
it is not guaranteed to converge to an optimal point. Therefore, as the first phase
of the acoustic modeling, a GMM-HMM model is trained on the same data set to
generate the alignments for the audio signals, as in figure 3.2. Then, the DNN
training is bootstrapped using the labeled frames (phoneme-to-audio alignments)
generated by the GMM-HMM system. Thus, when the audio frames are fed into
the input layer of the DNN model, the net assigns a phoneme label to each frame,
which is compared with the phoneme label obtained from the GMM-HMM alignments.
The final alignments of the GMM-HMM model are taken after generating mono-
phone HMMs and thereafter triphone HMMs.
A pass of the alignment process is repeated after each training stage to optimize the correspondence between the text transcriptions and the audio, and also
to ensure that the latest model at each stage has up-to-date alignments.
Standard delta+delta-delta and LDA+MLLT training algorithms were used
to obtain better alignments.
• Delta+delta-delta
This algorithm computes the delta and double-delta features, which represent
the first and second derivatives of the features, respectively. The delta
features are computed over a window of the original MFCC features, while the
double-delta features are computed over a window of the delta features (a minimal
sketch of this computation appears after this list).
• LDA+MLLT
LDA+MLLT stands for Linear Discriminant Analysis followed by a Maximum Likelihood Linear Transform. LDA builds HMM states for the feature vectors, but
with a reduced feature space for all data. MLLT takes the reduced feature
space output by LDA and derives a unique transformation for each speaker.
This process is considered a speaker normalization step that minimizes
the differences among speakers.
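The sketch below illustrates the delta and double-delta computation referenced in the first item of the list above; the window width of 2 and the random input are assumptions for illustration, and Kaldi's own add-deltas implementation is what is actually used in training.

```python
import numpy as np

def deltas(feat, window=2):
    """First temporal derivative of feat (num_frames, dim), computed over a
    symmetric window with the standard regression formula."""
    num_frames, _ = feat.shape
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(feat)
    for t in range(num_frames):
        for n in range(1, window + 1):
            out[t] += n * (padded[t + window + n] - padded[t + window - n])
    return out / denom

def add_delta_delta(mfcc):
    """Stack static, delta, and delta-delta features: (num_frames, 3 * dim)."""
    d = deltas(mfcc)          # first derivative of the MFCCs
    dd = deltas(d)            # second derivative: delta of the deltas
    return np.hstack([mfcc, d, dd])

frames = np.random.randn(100, 13)      # hypothetical 13-dim MFCC frames
print(add_delta_delta(frames).shape)   # (100, 39)
```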
Training of the triphone models takes two parameters into consideration for fine-tuning
the model towards the best alignments: the number of leaves in the decision tree
(HMM states) and the total number of Gaussians across all states in the model. The
number-of-leaves parameter sets the maximum number of leaves in the decision
tree, while the number of Gaussians sets the maximum number of Gaussians distributed across the leaves. The number of Gaussians per leaf can be calculated by
dividing the total number of Gaussians by the number of leaves. Since 47 Sinhala
phonemes are used in this project, modeling each triphone separately would require
47 × 47 × 47, nearly 100,000, models, which would be computationally infeasible.
These parameters were set using information gathered from the literature review and then fine-tuned for our model, changing one parameter at a time. In
triphone training, the current phone, the preceding phone, and the following phone
are all considered. Thus, at least 47 × 3 HMM states are required to model the contextual variation in the triphone model. Therefore, training of the models was started
with 250 HMM states.
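The arithmetic behind these hyper-parameters can be sketched as follows; the leaf and Gaussian counts shown are illustrative values from the explored range, not the final tuned settings.

```python
num_phones = 47
print(num_phones ** 3)                 # 103823 separate triphone models -> infeasible
print(num_phones * 3)                  # 141 states as a lower bound for triphone contexts

# Hypothetical tuning point: tied states (leaves) and total Gaussians.
num_leaves, total_gaussians = 250, 2000
print(total_gaussians / num_leaves)    # 8.0 Gaussians per leaf on average
```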
As the final stage of the acoustic model, the resulting alignments and features
are passed through different deep learning architectures, such as a feed-forward
network (DNN), a Time Delay Neural Network (TDNN), and hybrid architectures
of TDNN and LSTM (TDNN-LSTM). Since DNNs inherently perform well with
large datasets, for our dataset, techniques such as mini-batch Stochastic
Gradient Descent, RBM pre-training, early stopping, and dropout were applied to
enhance the performance of the models without over-fitting.
Mini-batch Stochastic Gradient Descent
The data set is divided into small batches, and the model error on each batch is calculated to update the model coefficients. With this gradient descent method, the
models combine the robustness of stochastic gradient descent with the efficiency of training in small batches.
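A minimal sketch of this mini-batch update loop is shown below for a plain linear model; the data, batch size, and learning rate are placeholders, and the actual networks are trained by Kaldi's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # hypothetical features
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

w = np.zeros(20)
learning_rate, batch_size = 0.05, 32

for epoch in range(10):
    order = rng.permutation(len(X))              # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        error = xb @ w - yb                      # model error on this mini-batch
        grad = xb.T @ error / len(idx)           # gradient of the mean squared error
        w -= learning_rate * grad                # coefficient update from this batch
```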
Dropouts
During training, dropout randomly deactivates a fraction of the hidden units at each update, which reduces co-adaptation between units and helps prevent over-fitting on the limited data set.
RBM Pre-training
According to the Kaldi nnet1 setup (Hinton et al., 2012), before training the feed-forward network, an unsupervised pre-training process is applied to the training
data. The resulting pre-trained Deep Belief Network is then passed into DNN
training. Pre-training adds robustness to the system by consistently giving
better generalization.
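To show what one update of this unsupervised stage looks like, the following is a minimal sketch of contrastive-divergence (CD-1) training for a single binary RBM layer; the layer sizes, learning rate, and mini-batch are placeholders, and the actual pre-training is carried out by the Kaldi nnet1 recipe (which, among other differences, uses a Gaussian-Bernoulli RBM for the first layer).

```python
import numpy as np

rng = np.random.default_rng(0)
visible_dim, hidden_dim, lr = 39, 128, 0.01       # placeholder sizes
W = 0.01 * rng.normal(size=(visible_dim, hidden_dim))
b_v, b_h = np.zeros(visible_dim), np.zeros(hidden_dim)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    """One contrastive-divergence update from a mini-batch v0 (batch, visible_dim)."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and a sample given the data.
    h0_prob = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one reconstruction step.
    v1_prob = sigmoid(h0 @ W.T + b_v)
    h1_prob = sigmoid(v1_prob @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)

cd1_step(rng.random((32, visible_dim)))           # hypothetical mini-batch
```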
3.2 Lexicon
The primary forms of lexicons found in natural language processing are phoneme
and grapheme lexicons. A phoneme-based lexicon takes into consideration the
different pronunciations of each written word and thus may have several sequences per
word, while a grapheme lexicon does not take pronunciation into account.
The lexicon proposed for this research is a grapheme lexicon. The grapheme lexicon
used in this work contains over 220K entries with one unique grapheme
sequence per word. This lexicon was created using the transcripts of the UCSC
LTRL phonetically balanced corpus. The extracted Sinhala words were encoded into
English letters and modified according to the Sinhala transliteration rules. For
this, the "Subasa" Sinhala transliteration software developed by UCSC LTRL was
used, modifying its Java scripts to prepare a lexicon in the standard Kaldi format.
Figure 3.3 depicts the Sinhala transliteration scheme, which includes 28 distinct
consonants, 19 distinct vowels (one of which is represented by a consonant character),
and 17 distinct modifiers (15 of which are represented by vowel characters).
Altogether, the distinct phoneme set contains 47 symbols.
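The sketch below shows how such grapheme lexicon entries can be assembled, one grapheme sequence per word, in the two-column lexicon.txt layout expected by Kaldi; the transliterated words, the symbol-inventory fragment, the silence/noise entries, and the greedy longest-match splitting are illustrative assumptions, since the actual lexicon was produced with the modified Subasa tool.

```python
# Sketch: split transliterated words into grapheme symbols by greedy longest match.
# The symbol inventory below is a small hypothetical fragment of the 47-symbol set.
symbols = ["aa", "th", "k", "m", "a", "g", "e", "d", "r", "y", "n", "w"]
symbols.sort(key=len, reverse=True)              # try longer symbols first

def to_grapheme_sequence(word):
    sequence, i = [], 0
    while i < len(word):
        for s in symbols:
            if word.startswith(s, i):
                sequence.append(s)
                i += len(s)
                break
        else:
            raise ValueError(f"no symbol matches {word[i:]}")
    return sequence

words = ["amma", "gedara"]                       # hypothetical transliterated words
with open("lexicon.txt", "w") as f:
    f.write("!SIL SIL\n<UNK> SPN\n")             # silence and spoken-noise entries (symbol names assumed)
    for w in sorted(set(words)):
        f.write(w + " " + " ".join(to_grapheme_sequence(w)) + "\n")
```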
Figure 3.3: Sinhala transliteration scheme
3.4 Summary
In this chapter, the high-level architecture and the overall design for addressing the
research question were discussed in detail. The main components of the design and
each of their contributions to the ASR system were stated. The next
chapter, Chapter 4, will discuss the implementation of these components in detail.
Chapter 4
Implementation
The following sections will present the implementations performed in each step.
Mainly, the experiments were carried out in the Kaldi speech recognition toolkit
(Povey et al., 2011), which is freely available under the Apache License.
• Data Preparation
Figure 4.1: Part of the generated text file
Figure 4.3 shows a part of the lexicon, which was generated in the Kaldi format
by modifying the "Subasa" transliteration software, as stated in the design chapter.
SIL and UNK refer to silence and spoken noise, respectively.
4.2 Implementing the baseline model - (GMM-
HMM)
A detailed theoretical description of the steps involved in the implementation of
the GMM-HMM model was presented in Chapter 3. Figure 4.4 presents the first
part of the implementation steps of the monophone and triphone passes in the
GMM-HMM model in an abstract way.
4.3.1 Pre-trained DNN model
Layer-wise pre-training is a technique still used to help neural nets converge
faster and better. If the pre-training process is done correctly, it can put the model
into better regions of the function space that allow for better generalization, regularizing
the architecture in a local, dataset-dependent way. According to (Hinton et al.,
2012), supervised optimization from pre-trained weights consistently yields better
performance than optimization from randomly initialized weights in a neural network.
The only difference from standard neural network training is that the starting
point in parameter space is obtained after unsupervised pre-training.
Karel's nnet1 (Hinton et al., 2012) sample setup in the Kaldi toolkit,
which has been implemented according to (Hinton, 2010), was used for the pre-training
process. The pre-training was done in an unsupervised manner on the training
data set using a stack of Restricted Boltzmann Machines, also known as
a "Deep Belief Network."
After training the DNN models, the next model selected was a Time Delay Neural
Network, as TDNNs represent a mapping between past and present values. Although
the same memory capture can be achieved through RNNs, when learning long-term
dependencies with RNNs the "vanishing/exploding" gradient problem occurs,
which means that as the error signals are propagated backward through the
network's structure they tend to vanish or explode.
In recent research such as (Liu et al., 2019), (Peddinti et al., 2015), and (Huang
et al., 2019), time-delay neural networks with sub-sampling have been proposed
for effective modeling of long temporal contexts of speech. In a TDNN, the upper
layers deal with information from a wider temporal context and thus can learn
wider temporal relationships.
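The frame-splicing idea behind a TDNN layer can be sketched as follows: each output frame is a function of the input frames at fixed temporal offsets, so stacked layers see progressively wider contexts. The offsets, dimensions, and random weights are illustrative only and do not reproduce the exact layer configuration used in this work.

```python
import numpy as np

def tdnn_layer(frames, weight, offsets):
    """frames: (num_frames, in_dim); weight: (len(offsets) * in_dim, out_dim).
    Each output frame is a ReLU of a linear function of the input frames at the
    given temporal offsets, i.e. a 1-d convolution over time."""
    num_frames, _ = frames.shape
    shift = max(abs(o) for o in offsets)
    padded = np.pad(frames, ((shift, shift), (0, 0)), mode="edge")
    spliced = np.hstack([padded[shift + o : shift + o + num_frames] for o in offsets])
    return np.maximum(spliced @ weight, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 40))                      # 200 frames of 40-dim features
w1 = 0.05 * rng.normal(size=(3 * 40, 256))
w2 = 0.05 * rng.normal(size=(3 * 256, 256))
h1 = tdnn_layer(x, w1, offsets=[-2, 0, 2])          # each output sees 5 input frames
h2 = tdnn_layer(h1, w2, offsets=[-3, 0, 3])         # effective context grows to 11 frames
print(h2.shape)                                     # (200, 256)
```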
Two TDNN network types were experimented with in the Kaldi toolkit. The first
network splices frames with an asymmetric left and right context, while the second
is a factorized form of the TDNN with symmetric left and right contexts (time-strides),
introduced in (Povey et al., 2018). The significant difference in the factorized TDNN is that it uses ResNet-type skip connections rather
than skip-splicing. The experiments on the first network type were carried
out by changing the number of relu-renormalized (Rectified Linear Unit, re-normalized)
layers, varying the left and right contexts of the TDNN layers, and varying the hidden
dimensions in each layer.
Both network types were trained on 40-dimensional MFCC acoustic
features. In addition, 100-dimensional i-vectors were extracted for the audio
according to the run_ivector_common script provided in the Wall Street
Journal Kaldi recipe. An i-vector is a mapping from a variable-length speech segment
to a fixed-dimensional representation that captures long-term characteristics
of the audio, such as the speaker characteristics or the recording device. In ASR, it
provides an additional input alongside the MFCC acoustic features to the TDNN
acoustic models, which helps the network learn to be robust to speaker and channel
variations.
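A minimal sketch of how the utterance-level i-vector is combined with the frame-level features as network input is shown below; the arrays are random placeholders, and in Kaldi the i-vector is supplied through a separate input node rather than by literal concatenation, so this is only a conceptual view.

```python
import numpy as np

num_frames = 500
mfcc = np.random.randn(num_frames, 40)     # 40-dim MFCC features per frame
ivector = np.random.randn(100)             # one 100-dim i-vector for the utterance

# Conceptually, the same utterance-level i-vector accompanies every frame,
# giving the acoustic model a 140-dimensional input per frame.
network_input = np.hstack([mfcc, np.tile(ivector, (num_frames, 1))])
print(network_input.shape)                 # (500, 140)
```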
The next choice of neural net architecture was a TDNN network followed
by an LSTM layer. In recent literature, experiments conducted for Russian
(Markovnikov et al., 2018) and Mandarin Chinese (Li and Wu, 2014) speech
using combinations of LSTMs have resulted in higher accuracies. The experiments
on this hybrid architecture were conducted using both the TDNN and the factored TDNN
network settings.
The implementation steps related to TDNN model training are depicted in
figure 4.5.
Figure 4.5: Implementation steps related to TDNN model training
• "Antpc" server
Training of all deep neural architectures and decoding of the models were
carried out on a single GPU, a GeForce RTX 2080 Ti with 10.8 GB of memory, provided by
the Antpc server. A GPU-based instance is used to access NVIDIA GPUs,
thereby accelerating the deep learning training process by leveraging CUDA.
4.5 Summary
Throughout this chapter, the technical aspects related to the implementation of the
baseline model and four other deep neural networks were discussed. In addition,
the tools and GPU specifications that enabled the experiments were addressed
briefly. Chapter 5 will present the results and evaluate the effects of these
implementations for Sinhala ASR.
Chapter 5
The first section of this chapter discusses the evaluation metrics used in
this research. The second part presents a detailed evaluation of the results obtained
from the baseline model and the DNN models.
• substitutions (sub)
• deletions (del) and
• insertions (ins)
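Combining these three error types gives the standard definition of WER used throughout this chapter:

$$ \mathrm{WER} = \frac{S + D + I}{N} \times 100\% $$

where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted words in the decoded hypothesis, and $N$ is the total number of words in the reference transcription.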
Training the models involves a total data set from 70 speakers, of whom 50 are female
and 20 are male. The data set is split approximately in the ratio 8:1:1 into train, validation, and
test sets. The training data set has audio recordings from 40
female and 16 male speakers, with a total of 12,295 utterances, amounting to
25 h of speech data. As the validation data set, 1,050 speech utterances from five
female and two male speakers are taken for fine-tuning the models. Testing the
models involves a data set from five female and two male speakers, who
utter 1,050 speech sentences altogether. Each of the validation and test
data sets is 1.6 hours long. The overall details of the data sets are given in
table 5.1.
5.2.2 Results of GMM-HMM model
Table 5.3: Results of pre-trained DNN models
(columns: # hidden layers, # hidden units per layer, # epochs, activation function, WER% on validation set, WER% on test set)
The results show the improvement gained over the baseline GMM-HMM
model: every pre-trained DNN model lowers the WER by 2% or more.
The lowest, or best, WER is observed to be 39.92%, which is 2.72% lower than
the best WER of the baseline model. However, the results do not clearly show how
the unsupervised generative model affects the final performance of our targeted supervised model. For this reason, as described in Chapter 4, a regular deep
neural network with randomly initialized weights was also implemented. The results
generated for the non-pre-trained models with the same network configurations as
the pre-trained models are depicted in table 5.4.
Table 5.4: Results of non pre-trained DNN models
(columns: # hidden layers, # hidden units per layer, # epochs, activation function, WER% on validation set, WER% on test set)
The results observed from the DNN models without pre-training are slightly
better than those of the pre-trained models under the same network configurations. The best WER was found to be 39.69%, which is only 0.23% lower
than the best pre-trained DNN result. However, the results from both DNN
models surpass the performance of the baseline model. Another set of experiments
was conducted by changing the activation function of these models from the sigmoid to
the tanh function, but the results observed were higher WERs for the same network
configurations, with only the learning rate changed.
Since the results obtained so far still show a high WER, the next experiments
were conducted on a different deep network structure, the TDNN, as described
in detail in Chapter 4. The results of the experiments on the two TDNN network types
are summarized in table 5.5.
As depicted in table 5.5, the lowest WER so far, 35.16%, was observed from TDNN
network type 2. Compared to the baseline, pre-trained, and
non-pre-trained models, the WER has decreased by 7.48%, 4.76%, and 4.53%,
respectively.
Another variation of the TDNN network was experimented with by layering an LSTM
on top of the TDNN layers. Experiments were conducted by layering the LSTM
on both the TDNN and the factored TDNN layers. An experiment was conducted
by layering an LSTM on top of the best-scoring (35.16%) TDNN model; however,
the WER increased to 35.87%. The overall results observed from these
experiments are shown in table 5.6.
Table 5.6: Results of TDNN+LSTM models
4 TDNN layers + 1 LSTM layer; 512-dim TDNN, 384-dim LSTM; 4.43; WER 36.23
256-dim TDNN, 384-dim LSTM; 4.80; WER 36.33
13 TDNNF layers + 1 LSTM layer; 256-dim TDNNF, 384-dim LSTM; 4.35; WER 35.87
The results from TDNN+LSTM do not surpass the performance of the TDNN-only
models. This may be due to the increase in the complexity of the models when
an LSTM layer is added; the number of parameters grows substantially, since an LSTM
makes four different projections from its input. Therefore, further experiments
adding LSTM layers were not conducted. However, the WERs might change slightly
when tuned with different network configurations.
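For reference, the four projections mentioned above are the input, forget, and output gates and the cell candidate of a standard LSTM cell, each with its own input and recurrent weight matrices:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t).
\end{aligned}
$$

Since each of $W_i$, $W_f$, $W_o$, and $W_c$ (together with the corresponding recurrent matrices) is a separate projection of the layer input, adding even a single LSTM layer roughly quadruples the parameter count of that layer compared with a feed-forward layer of the same width.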
In this research, four variations of deep neural networks and a statistical baseline
model were evaluated for the task of speech recognition of the Sinhala language.
A comprehensive evaluation of these models is conducted in this section.
Table 5.7 shows a summary of the best performance, i.e. the lowest WER, obtained
from each model on the test data set.
Table 5.7: Summary of the best WERs obtained from all the models
(columns: DNN model, WER% on test set)
According to table 5.7, the TDNN network has shown the lowest WER, which
means it is the best network setting observed in the experiments conducted in
this research. Even the hybrid TDNN+LSTM architecture shows a lower WER
than the regular DNNs, which highlights the fact that TDNNs perform much
better in speech recognition tasks.
To clearly show how the performance of the models practically affects
the test audio files, a comparison of four translated sentences is made in
figure 5.1.
Figure 5.1: Four translated example sentences based on the baseline GMM-HMM, pre-trained DNN, non-pre-trained DNN, TDNN, and TDNN+LSTM models. Phrases in the
bold green text show exact matches with the correct test sentence.
Phrases in the bold red text show the words that are incorrectly translated by the
models, while the phrases highlighted in yellow show the word segmentation issues
and slight deviations
Considering test sentence 1, all the models have failed to correctly
translate the words "ඇඹ හ" and "ක ෙ ෙරා්ටය". When analyzing the sentence, it was
found that these words do not appear in the text corpus, although they are included in the
lexicon. Therefore, the models have failed to find any relationship involving these
words, which has led to incorrect translations. This signifies that the richness of
the text corpus, along with the lexicon, is an essential factor for developing a
robust speech recognition system. However, the TDNN-type models have been able
to partially translate those words, which is a noteworthy feature. The same scenario
occurs in test sentence 4, where all the models have failed to
identify the word "හෑ වට ".
5.3 Summary
The results from each of the experimented models were presented in detail in this chapter.
Together with the results, an analysis of the WER scores and translated sentences,
comparing different experiments, was also discussed. The potential conclusions
that can be drawn from these results are presented in Chapter 6.
Chapter 6
Conclusions
6.1 Introduction
This thesis is on developing an automatic speech recognition system for the Sinhala
language using deep learning techniques. Initially, this research started with
an in-depth look at the literature on Automatic Speech Recognition (ASR). In
Chapter 2, a comprehensive literature review was conducted to identify the gaps
in this area of study. It became evident that little research has been done in the
field of Sinhala speech recognition, the main reason being the limited resources
available for such studies.
This chapter provides an overall picture of the conclusions drawn from the whole
research work we conducted.
fewer resources, the training of the deep neural networks was performed on the alignments generated using the baseline model rather than directly from
utterance-level transcriptions.
In this research, we have trained four types of deep neural networks, namely a pre-trained DNN, a non-pre-trained regular DNN, a TDNN, and a TDNN followed by an LSTM
(TDNN+LSTM). When evaluating the results, the highest WER, 42.64%, was observed from
the baseline GMM-HMM model. This result is used for evaluating the performance of the deep neural networks against the statistical approach. It was
observed that the results obtained from the pre-trained and regular DNNs are comparatively similar. As in table 5.3, the best WER from the pre-trained model,
39.92%, is obtained using three hidden layers with 256 hidden units, a learning rate
of 0.008, and the sigmoid activation function. When experimented with under
the same learning rate and activation function, the best WER of the regular DNN
was 39.69%, obtained when training six hidden layers with 512 hidden units.
However, these two DNN models show only a small improvement of 2.72% and 2.95%,
respectively, over the baseline model. In table 5.5, it can be observed
that the TDNN model is dramatically better than the previous three models, including the baseline model. This demonstrates the effectiveness of TDNN network settings
for the task of speech recognition. The best WER achieved by the TDNN
model, 35.16%, is obtained from the factored TDNN setting with 13
TDNN layers, each having 256 hidden units. The improvement of the best TDNN
model over the baseline and the previous DNN models is 7.48%, 4.76%, and 4.53%,
respectively. Later, when trained with the hybrid TDNN+LSTM architecture as
in table 5.6, the WER obtained from the TDNN increased slightly, by 0.71%. This may
be due to the model complexity increasing relative to the limited data set.
A significant improvement in the WERs can be observed especially when using
the factored form of TDNN layers, where the architecture uses ResNet-type skip connections
rather than skip-splicing. In Chapter 5, when the performance was analyzed using
example sentences from each model, it was observed that if the models can learn
broader temporal contexts, they are more likely to identify the correct words for a given
audio frame. Thus, we could identify the Time Delay Neural Network (TDNN)
as one of the deep neural architectures that perform well for Sinhala speech
recognition, even with a limited data set of about 25 hours. The results obtained
show that the deep neural network architectures exceed the performance of the
baseline GMM-HMM model by up to 7.48% absolute WER on the test data set.
6.4 Limitations
A moderate vocabulary of Sinhala speech data was used for all the training conducted in this research. As this research aims at developing an ASR system that
works for the general domain, it needs an adequate amount of speech
data as well as a rich text corpus. The data set used was created by
UCSC LTRL and was extended further by collecting recordings from time to
time. Therefore, the training of the models had to be repeated several times with the
extended data set, which required a lot of time and effort.
Initially, the research experiments had to be conducted using a personal GCP
account with minimal computational power until the University provided the
computational facility. Even then, as the demand for computational
resources increased, jobs had to be kept in the waiting queue for several days.
Since it is impractical to train every possible network configuration for
a particular model, the number of hidden layers was incrementally increased from 2 to 7,
observing 256, 512, and 1024 hidden dimensions in the DNNs. In the TDNN networks, the
number of layers and hidden units was varied based on the results.
These network configurations were mostly based on the literature related to DNN
training.
Understanding the Kaldi coding style, I/O internals, data structures, and process communication style is a tedious task compared to other frameworks
written in higher-level languages such as Java. It took several months to become
thoroughly familiar with the framework. The reason for this is that the Kaldi core is written
in C++, and the executable programs are consolidated in bash scripts, which are
barely readable.
References
Deng, L., Hinton, G. and Kingsbury, B. (2013), New types of deep neural network
learning for speech recognition and related applications: An overview, pp. 8599–
8603.
Du, X., Cai, Y., Wang, S. and Zhang, L. (2016), Overview of deep learning, in ‘2016
31st Youth Academic Annual Conference of Chinese Association of Automation
(YAC)’, pp. 159–164.
Fohr, D., Mella, O. and Illina, I. (2017), New Paradigm in Speech Recognition:
Deep Neural Networks, in ‘IEEE International Conference on Information Sys-
tems and Economic Intelligence’, Marrakech, Morocco.
URL: https://hal.archives-ouvertes.fr/hal-01484447
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. and Kingsbury, B. (2012), 'Deep neural
networks for acoustic modeling in speech recognition: The shared views of four
research groups', Signal Processing Magazine, IEEE 29, 82–97.
Hsu, W.-N., Zhang, Y. and Glass, J. (2016), A prioritized grid long short-term
memory rnn for speech recognition, pp. 467–473.
Huang, X., Zhang, W., Xu, X., Yin, R. and Chen, D. (2019), ‘Deeper time delay
neural networks for effective acoustic modelling’, Journal of Physics: Conference
Series 1229, 012076.
Kalchbrenner, N., Danihelka, I. and Graves, A. (2015), ‘Grid long short-term mem-
ory’.
Li, X. and Wu, X. (2014), ‘Constructing long short-term memory based deep
recurrent neural networks for large vocabulary speech recognition’, CoRR
abs/1410.4281.
URL: http://arxiv.org/abs/1410.4281
Liu, B., Zhang, W., Xu, X. and Chen, D. (2019), ‘Time delay recurrent neu-
ral network for speech recognition’, Journal of Physics: Conference Series
1229, 012078.
Markovnikov, N., Kipyatkova, I., Karpov, A. and Filchenkov, A. (2018), Deep
neural networks in russian speech recognition, pp. 54–67.
Peddinti, V., Povey, D. and Khudanpur, S. (2015), A time delay neural network ar-
chitecture for efficient modeling of long temporal contexts, in ‘INTERSPEECH’.
Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M. and Khudan-
pur, S. (2018), Semi-orthogonal low-rank matrix factorization for deep neural
networks, pp. 3743–3747.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G. and
Veselý, K. (2011), 'The Kaldi speech recognition toolkit', IEEE 2011 Workshop
on Automatic Speech Recognition and Understanding.
Saurav, J., Amin, S., Kibria, S. and Rahman, M. (2018), Bangla speech recognition
for voice search, pp. 1–4.
wikiSinhala (n.d.).
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y. and
Courville, A. C. (2017), ‘Towards end-to-end speech recognition with deep con-
volutional neural networks’, CoRR abs/1701.02720.
URL: http://arxiv.org/abs/1701.02720
Appendices
Appendix A
Model Specifications
The network specifications of the pre-training process are depicted in table A.1.
Table A.1 (columns: # RBM layers, # hidden units per layer, RBM learning rate, lower RBM learning rate)
The network specifications of the best TDNN model are represented in table
A.2.
Table A.2: Network specifications of the best TDNN model
Network type: 2; hidden dimensions: 256; # epochs: 10
TDNN layer information:
relu-batchnorm-layer name=tdnn1
tdnnf-layer name=tdnnf2 time-stride=1
tdnnf-layer name=tdnnf3 time-stride=1
tdnnf-layer name=tdnnf4 time-stride=1
tdnnf-layer name=tdnnf5 time-stride=0
tdnnf-layer name=tdnnf6 time-stride=3
tdnnf-layer name=tdnnf7 time-stride=3
tdnnf-layer name=tdnnf8 time-stride=3
tdnnf-layer name=tdnnf9 time-stride=3
tdnnf-layer name=tdnnf10 time-stride=3
tdnnf-layer name=tdnnf11 time-stride=3
tdnnf-layer name=tdnnf12 time-stride=3
tdnnf-layer name=tdnnf13 time-stride=3
Table A.3: Network specifications of the TDNN+LSTM model
TDNN dimensions: 256; LSTM dimensions: 384; # epochs: 10
Layer information:
relu-batchnorm-layer name=tdnn1
tdnnf-layer name=tdnnf2 time-stride=1
tdnnf-layer name=tdnnf3 time-stride=1
tdnnf-layer name=tdnnf4 time-stride=1
tdnnf-layer name=tdnnf5 time-stride=0
tdnnf-layer name=tdnnf6 time-stride=3
tdnnf-layer name=tdnnf7 time-stride=3
tdnnf-layer name=tdnnf8 time-stride=3
tdnnf-layer name=tdnnf9 time-stride=3
tdnnf-layer name=tdnnf10 time-stride=3
tdnnf-layer name=tdnnf11 time-stride=3
tdnnf-layer name=tdnnf12 time-stride=3
tdnnf-layer name=tdnnf13 time-stride=3
lstm-layer name=lstm3 decay-time=20 delay=-3
Appendix B
Decoded text
Figure B.1: Three other translated sentences based on the baseline GMM-HMM, pre-trained DNN, non-pre-trained DNN, TDNN, and TDNN+LSTM models. Phrases in the
bold green text show exact matches with the correct test sentence.
Phrases in the bold red text show the words that are incorrectly translated by the
models, while the phrases highlighted in yellow show the word segmentation issues
and slight deviations