The full set of phone-pair distances describes the speaker's overall accent. These features should thus robustly represent pronunciation in a way that is compact and independent of speaker attributes. Kyriakopoulos et al. [13] proposed a model-based phone distance feature system for pronunciation assessment. Gaussian models are built for each phone and the K-L divergences between them are used to determine the speaker's accent quality. This system is taken as the baseline in this paper.

Consider a speaker n whose utterances have been recognised and time-aligned to a series of I_n phone instances π_0, π_1, ..., π_{I_n−1}, each corresponding to one of the 47 English phones ω_0, ω_1, ..., ω_46 (/aa/, /ax/, /ah/ etc.¹).

(¹ based on the ARPABET phone set [15])

Each phone instance π_i is itself a sequence of T_i frames x_0^{(i)}, x_1^{(i)}, ..., x_{T_i−1}^{(i)}, where x_t^{(i)} is the vector of perceptual linear prediction (PLP) features extracted from frame t. Each phone ω_φ is represented by the parameters of a multivariate Gaussian model (μ_φ^{(n)}, Σ_φ^{(n)}), which predicts the PLP features produced each time the speaker utters an instance of it:

    p(x_t^{(i)} | π_i = ω_φ) = N(x_t^{(i)}; μ_φ^{(n)}, Σ_φ^{(n)})    (1)

Distances between models are represented using the symmetric Kullback-Leibler (K-L) divergence:

    D_{φ,ψ}^{(n)} = (1/2) [ KL(N(μ_φ^{(n)}, Σ_φ^{(n)}) || N(μ_ψ^{(n)}, Σ_ψ^{(n)})) + KL(N(μ_ψ^{(n)}, Σ_ψ^{(n)}) || N(μ_φ^{(n)}, Σ_φ^{(n)})) ]    (2)

Each speaker is thus represented by 1081 scalar phone-pair distances D_{0,1}^{(n)}, D_{0,2}^{(n)}, ..., D_{45,46}^{(n)}, together forming the vector D_n. These features are then used to train a deep neural network (DNN) to predict human-assigned proficiency scores:

    s_n = f(D_n)    (3)
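To make the baseline features concrete, a minimal numpy sketch of equation (2) is given below, assuming full-covariance Gaussian phone models and using the closed-form Gaussian K-L divergence (variable names are illustrative, not from the paper):

import numpy as np

def kl_gauss(mu0, sig0, mu1, sig1):
    # Closed-form KL( N(mu0, sig0) || N(mu1, sig1) ) for k-dimensional Gaussians
    k = mu0.shape[0]
    inv1 = np.linalg.inv(sig1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(sig0)
    _, logdet1 = np.linalg.slogdet(sig1)
    return 0.5 * (np.trace(inv1 @ sig0) + diff @ inv1 @ diff - k
                  + logdet1 - logdet0)

def sym_kl(mu_a, sig_a, mu_b, sig_b):
    # Symmetric K-L divergence of equation (2)
    return 0.5 * (kl_gauss(mu_a, sig_a, mu_b, sig_b)
                  + kl_gauss(mu_b, sig_b, mu_a, sig_a))

# The 1081 phone-pair distances D_n for speaker n (phones indexed 0..46)
# would then be, given per-phone estimates mu[.] and sig[.]:
#   [sym_kl(mu[a], sig[a], mu[b], sig[b])
#    for a, b in itertools.combinations(range(47), 2)]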
3. Siamese phone distance features

This paper proposes an alternative phone distance feature approach to the model-based method described in the previous section. The generative Gaussian model of each phone is replaced with a feature vector projected up from the frame sequences. This feature extractor is tunable and can be fine-tuned to the task of automatic assessment, unlike the more general model-based approach. This also addresses two issues with the model-based phone distance feature approach above. By localising the phone representation to the level of the individual phone instance, the amount of data needed to extract the features is reduced. In addition, the features should be more interpretable.
The first step is to project the frame sequence x_0^{(i)}, x_1^{(i)}, ..., x_{T_i−1}^{(i)} of each phone instance i to a fixed-length vector representation h_i by passing it through a bi-directional Long Short Term Memory (LSTM) network:

    h_t^{(f,i)} = f(x_t^{(i)}, h_{t−1}^{(f,i)}, λ^{(f,i)})    (4)

    h_t^{(b,i)} = f(x_t^{(i)}, h_{t+1}^{(b,i)}, λ^{(b,i)})    (5)

The standard mechanism for obtaining a fixed-length vector h_i from the forward and backward sequences of a Recurrent Neural Network (RNN) is to concatenate the two sequences' final time steps (Figure 2, left):

    h_i = [h_{T_i−1}^{(f,i)T}, h_0^{(b,i)T}]^T    (6)

This approach is problematic in this case, however, since the first and last frames of each phone instance are actually the least representative of its content (since they lie on the alignment boundary). Two alternative methods are therefore considered (see the sketch after this list):

1. Using the middle, rather than final, frame of each pass (Figure 2, middle):

    h_i = [h_{T_i/2}^{(f,i)T}, h_{T_i/2}^{(b,i)T}]^T    (7)

2. Using an attention mechanism to weight the importance of each frame (Figure 2, right):

    h_i = Σ_{t=0}^{T_i−1} α_t^{(f,i)} h_t^{(f,i)} + Σ_{t=0}^{T_i−1} α_t^{(b,i)} h_t^{(b,i)}    (8)

where

    α_t^{(f,i)} = exp(c_t^{(f,i)}) / Σ_{t'=0}^{T_i−1} (exp(c_{t'}^{(f,i)}) + exp(c_{t'}^{(b,i)}))    (9)

    c_t^{(f,i)} = f(h_t^{(f,i)}, λ_α)    (10)

and similarly for α_t^{(b,i)} (with the same λ_α).

[Figure 2: Standard, centre and attention-based sequence-to-vector bi-RNNs.]
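The three sequence-to-vector variants can be illustrated with a short PyTorch sketch, assuming the forward and backward state sequences h_f, h_b of shape (T, d) have already been produced by the bi-LSTM of equations (4)-(5); the callable score stands in for f(·, λ_α) of equation (10), e.g. a single shared torch.nn.Linear(d, 1) (this sketch and its names are illustrative, not the paper's implementation):

import torch

def pool_final(h_f, h_b):
    # eq (6): last forward state and first backward state, concatenated
    return torch.cat([h_f[-1], h_b[0]], dim=-1)

def pool_centre(h_f, h_b):
    # eq (7): centre time step of each direction, concatenated
    T = h_f.shape[0]
    return torch.cat([h_f[T // 2], h_b[T // 2]], dim=-1)

def pool_attention(h_f, h_b, score):
    # eqs (8)-(10): a single softmax over the scores of both directions,
    # so the weights of eq (9) are normalised jointly
    c = torch.cat([score(h_f), score(h_b)], dim=0).squeeze(-1)  # (2T,)
    alpha = torch.softmax(c, dim=0)
    return (alpha.unsqueeze(-1) * torch.cat([h_f, h_b], dim=0)).sum(dim=0)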
Whichever of the three methods is used, the resultant network must be trained across all instances and all speakers to map frame sequences to a space in which distances between phones are most indicative of proficiency.

Having defined a projection from the original audio frame sequences to a phone instance feature space, it is now necessary to use these projections to learn a phone distance metric. To this end, Siamese networks are employed. A Siamese network is composed of two copies of the same neural network, each fed with one of the elements of a pair of input samples. These identical networks project the samples into an embedding space. A measure of distance is then computed between the two samples depending on their relation label (same or different). The error is propagated evenly to the two copies. The architecture is based on the LSTM Siamese architecture for learning difference metrics between pairs of variable-length sequences presented by Mueller and Thyagarajan [16].

Consider a pair of phone instances π_i and π_j, taken from the same speaker n, of phones ω_φ and ω_ψ respectively. Using the method from the previous section, they are projected to vectors h_i and h_j. The distance between the two instances can therefore be represented as:

    d_{i,j} = ||h_i − h_j||_2    (11)
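The weight sharing at the heart of the Siamese arrangement can be made concrete with a minimal PyTorch sketch: one encoder (layer sizes such as n_plp=13 and d=64 are illustrative assumptions, not from the paper) embeds both instances of a pair before the Euclidean distance of equation (11) is taken, so gradients from the distance flow into a single shared set of weights:

import torch
from torch import nn

class SiameseEncoder(nn.Module):
    # One encoder applied to both elements of a pair (weights shared)
    def __init__(self, n_plp=13, d=64):
        super().__init__()
        self.lstm = nn.LSTM(n_plp, d, bidirectional=True, batch_first=True)
        self.att = nn.Linear(d, 1)  # shared attention scorer, lambda_alpha

    def embed(self, x):             # x: (1, T, n_plp) frames of one instance
        out, _ = self.lstm(x)                 # (1, T, 2d)
        h_f, h_b = out[0].chunk(2, dim=-1)    # forward / backward, (T, d) each
        c = torch.cat([self.att(h_f), self.att(h_b)]).squeeze(-1)
        alpha = torch.softmax(c, dim=0)
        return (alpha.unsqueeze(-1) * torch.cat([h_f, h_b])).sum(dim=0)

    def forward(self, x_i, x_j):
        # eq (11): Euclidean distance between the two embeddings
        return torch.dist(self.embed(x_i), self.embed(x_j), p=2)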
Two approaches are considered for training this metric to approximate the distance between the two instances. The first is simply to train d_{i,j} to approximate the model-based K-L divergence distance D_{φ,ψ}^{(n)} for phones φ and ψ in speaker n. The second is to use d_{i,j} to predict whether the two instances are instances of the same phone:

    c_{ij} = { 1, ω_φ = ω_ψ
             { 0, ω_φ ≠ ω_ψ    (12)

This is done by passing d_{i,j} through a sigmoid to derive the probability p_{ij} that c_{ij} = 1:

    p_{ij} = 2 / (1 + exp(−d_{ij})) − 1    (13)
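A direct transcription of equations (12) and (13) in Python (illustrative only; note that for d_{ij} ≥ 0 the shifted sigmoid takes values in [0, 1)):

import math

def pair_target(phone_i: int, phone_j: int) -> int:
    # eq (12): c_ij = 1 when the two instances carry the same phone label
    return 1 if phone_i == phone_j else 0

def pair_probability(d_ij: float) -> float:
    # eq (13): shifted sigmoid of the embedding distance,
    # 0 at d_ij = 0 and approaching 1 as d_ij grows
    return 2.0 / (1.0 + math.exp(-d_ij)) - 1.0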
Given the large number of possible instance pairs for each speaker n, M pairs (l_0^{(n)}, r_0^{(n)}), (l_1^{(n)}, r_1^{(n)}), ..., (l_{M−1}^{(n)}, r_{M−1}^{(n)}) are sampled for use in training (in the experiments in this paper M = 100). In the K-L training case sampling is completely random, while in the binary case M/2 pairs of instances of the same phone and M/2 pairs of instances of different phones are sampled.

If the M pairs have corresponding phone labels (φ_0^{(n)}, ψ_0^{(n)}), (φ_1^{(n)}, ψ_1^{(n)}), ..., (φ_{M−1}^{(n)}, ψ_{M−1}^{(n)}), the objective function over all N speakers in the training set becomes, for K-L training:

    min { Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} log| d_{l_m^{(n)}, r_m^{(n)}} − D_{φ_m^{(n)}, ψ_m^{(n)}}^{(n)} | }    (14)

and for binary training:

    min { Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} { log(p_{l_m^{(n)}, r_m^{(n)}}),     φ_m^{(n)} = ψ_m^{(n)}
                                      { log(1 − p_{l_m^{(n)}, r_m^{(n)}}), φ_m^{(n)} ≠ ψ_m^{(n)}    (15)
The vector h_i for each instance i can now be said to represent a space in which Euclidean distance is representative of the conceptual distance between phones. This representation must now be combined from the phone instance to the overall phone level for each speaker and finally projected to predict speaker score.
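The pair sampling and the two training criteria can be sketched as follows (illustrative; the binary case is written in the usual negative log-likelihood form implied by equation (15), and small epsilon terms are added for numerical stability):

import random
import torch

def sample_pairs(instances, M=100, balanced=True):
    # instances: list of (frames, phone_id) for one speaker.
    # Binary criterion: M/2 same-phone then M/2 different-phone pairs;
    # K-L criterion: fully random pairs (balanced=False).
    pairs = []
    while len(pairs) < M:
        l, r = random.sample(instances, 2)
        if balanced and (l[1] == r[1]) != (len(pairs) < M // 2):
            continue
        pairs.append((l, r))
    return pairs

def kl_pair_loss(d_ij, D_target, eps=1e-6):
    # eq (14) summand: log absolute error against the model-based distance
    return torch.log(torch.abs(d_ij - D_target) + eps)

def binary_pair_loss(p_ij, c_ij, eps=1e-6):
    # eq (15) summand as a negative log-likelihood of the same/different label
    p_ij = p_ij.clamp(eps, 1 - eps)
    return -(c_ij * torch.log(p_ij) + (1 - c_ij) * torch.log(1 - p_ij))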
4. Predicting grade

Once trained, the systems described in the previous section can extract distance features at the level of phone instance pairs. The next step is to integrate these systems into an end-to-end system for predicting a speaker's proficiency score.
The projected phone instance features h_i^{(n)} for all instances π_i^{(n)} of phone ω_φ in speaker n can be combined to derive an overall phone vector h_φ^{(n)} in one of two ways (see the sketch after this list):

1. By simple averaging:

    h_φ^{(n)} = (1 / Σ_{i:π_i=ω_φ} 1) Σ_{i:π_i=ω_φ} h_i^{(n)}    (16)

2. Via another attention mechanism:

    h_φ^{(n)} = Σ_{i:π_i=ω_φ} α_i^{(n)} h_i^{(n)}    (17)

    α_i^{(n)} = exp(c_i^{(n)}) / Σ_{i:π_i=ω_φ} exp(c_i^{(n)})    (18)

    c_i^{(n)} = g(h_i^{(n)})    (19)

where g is a word-level LSTM, since each instance's surrounding context is expected to affect how important it is to score.
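The two combination methods can be sketched as below (illustrative; for simplicity the scorer g here runs only over the instances of the phone being pooled, whereas the paper's g is a word-level LSTM over the surrounding context, and the size d=128 is an assumption):

import torch
from torch import nn

class PhoneAggregator(nn.Module):
    # Combines the instance vectors h_i of one phone into h_phi, eqs (16)-(19)
    def __init__(self, d=128, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        self.g = nn.LSTM(d, 1, batch_first=True)  # per-instance scalar scores

    def forward(self, h):                # h: (I, d), instances of one phone
        if not self.use_attention:
            return h.mean(dim=0)         # eq (16): simple averaging
        c, _ = self.g(h.unsqueeze(0))    # eq (19): scores c_i from g
        alpha = torch.softmax(c[0, :, 0], dim=0)      # eq (18)
        return (alpha.unsqueeze(-1) * h).sum(dim=0)   # eq (17)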
The distance feature between each pair of phones ω_φ, ω_ψ can now be calculated as:

    d_{φ,ψ}^{(n)} = ||h_φ^{(n)} − h_ψ^{(n)}||_2    (20)

resulting in 1081 scalar distances d_{0,1}^{(n)}, d_{0,2}^{(n)}, ..., d_{45,46}^{(n)}, together forming the vector d_n, which is passed through a feed-forward layer to predict the score:

    s_n = f(d_n)    (21)

End-to-end training of the whole system (after initialisation using the Siamese network training) with the MSE criterion can now be performed using all N speakers in the training set:

    min { (1/N) Σ_{n=0}^{N−1} (s_n − f(d_n))^2 }    (22)
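Equations (20)-(22) amount to a pairwise-distance layer followed by a regression head; a minimal PyTorch sketch of one training step is given below (illustrative; in the full system the gradient would also flow back through the aggregation and Siamese encoder stages, whereas here only the grader's parameters are updated):

import itertools
import torch
from torch import nn

def phone_pair_distances(h_phone):   # h_phone: (47, d) per-phone vectors
    # eq (20): the 1081 pairwise Euclidean distances, stacked into d_n
    return torch.stack([torch.dist(h_phone[a], h_phone[b], p=2)
                        for a, b in itertools.combinations(range(47), 2)])

grader = nn.Linear(1081, 1)          # feed-forward score predictor, eq (21)
optimiser = torch.optim.SGD(grader.parameters(), lr=1e-3)

def mse_step(h_phone, human_score):
    # eq (22): one squared-error gradient step for a single speaker
    pred = grader(phone_pair_distances(h_phone)).squeeze()
    loss = (pred - human_score) ** 2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()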
5. Experimental Setup

The preceding sections of this paper have described a system for predicting pronunciation proficiency score based on the audio frames and aligned phone sequence of the utterances produced by a candidate. The data for initialising, training and testing this system are obtained from candidate responses to the spoken component of the Business Language Testing Service (BULATS) for foreign learners of English, provided by Cambridge English Language Assessment. The BULATS speaking test has five sections, all related to business scenarios [17]. Section A consists of short responses to prompted questions. Candidates read 8 sentences aloud in Section B. Sections C-E consist of spontaneous responses of several sentences in length to a series of spoken and visual prompts. Candidates are scored on a scale from 0 to 30, based on their overall proficiency, and it is this score that the system is predicting.

The systems are trained on a gender- and proficiency-balanced mixed-L1 dataset (TRN) consisting of 994 speakers (first languages Polish, Vietnamese, Arabic, Dutch, French and Thai), scored on their overall proficiency (not just pronunciation) by human graders, and evaluated on a held-out evaluation set (EVL) consisting of 226 speakers of a similar mix of L1s, gender and proficiency, with scores provided by expert human graders.

As discussed in Section 2, the first step before passing the data through the system is recognising the text being spoken and aligning the audio to a sequence of phones. Both these tasks are performed using an automatic speech recogniser (ASR). Due to the incorrect pronunciations, grammar and rhythm related to the speaker's proficiency level and first language (L1), the accuracy of standard commercial "off-the-shelf" ASR systems is too low for non-native learner English. Instead, the ASR system from Kyriakopoulos et al. [13] (also described in van Dalen et al. [18]), which is trained on non-native learners of English, is used. This ASR has an overall word error rate (WER) of 47.5% and a phone error rate (PER) of 33.9%, evaluated against crowd-sourced transcriptions of EVL.
6. Experimental Results

Sections 3 and 4 describe a pronunciation assessment system in two stages. First, an LSTM of either the standard, centre or attention variety is trained, in a Siamese architecture, to extract phone instance features to predict, for a given phone instance pair, either a binary or a K-L divergence distance metric. Next, the trained LSTM is integrated into an end-to-end score prediction system, using either an averaging or attention mechanism to move from the phone instance level to the phone level.

Given the above, it is necessary to evaluate, using the setup described in Section 5, first, how well the system performs at the initialisation tasks; second, which of the proposed architectures (standard vs. centre vs. attention LSTM, binary vs. K-L training, averaging vs. attention combination) performs best; and third, how the best architecture performs relative to the baseline.

First, the Siamese networks presented in Section 3 are trained on randomly selected pairs from all the speakers in the TRN data set. They are evaluated on similar pairs from the EVL data set for the two tasks of binary (same vs. different) classification and predicting the K-L divergences from the baseline model. Table 1 shows the results for the standard bi-LSTM configuration, where the final time steps are concatenated to form the fixed-length phone instance representation h_i.

Criterion   Binary Accuracy   K-L Performance
Binary      75.0%             0.599
K-L         68.0%             0.789

Table 1: Performance of standard bi-LSTM configuration Siamese networks

As expected, the networks perform better on the task they are trained for than on the other task. Both systems perform well, suggesting that the Siamese networks are capable of extracting interpretable distance metrics indicative of both the clustering together of instances of the same phone and the distances between distributions of different phones. Further, the fact that the system trained for each task also performs reasonably on the other task suggests that these two concepts of distance are closely related, as was to be expected.

The experiments are now repeated for all three bi-LSTM variants (standard, centre and attention) of the phone instance representation to determine which is best. As expected, and as can be seen in Tables 2 and 3, the attention LSTM performs the best overall, with the centre time steps configuration better for matched criteria than the standard configuration. The attention bi-LSTM is slower to train but is used going forward due to this better representation.

Criterion   Std     Cen     Att
Binary      75.0%   77.5%   77.3%
K-L         68.0%   67.4%   69.0%

Table 2: Binary accuracy of standard (Std), centre (Cen) and attention (Att) bi-LSTM Siamese network configurations

Criterion   Std     Cen     Att
Binary      0.599   0.587   0.602
K-L         0.789   0.792   0.788

Table 3: K-L performance of standard (Std), centre (Cen) and attention (Att) bi-LSTM Siamese network configurations

Having established that the Siamese networks do seem to be extracting valid distance features, these features are now employed to predict proficiency scores. The system is connected end-to-end using the averaging and attention methods and further trained to predict grade. The mean squared error (MSE) results of these experiments are presented in Table 4.

                 MSE
Initialisation   Average   Attention
Binary           19.7      17.6
K-L              16.4      14.2

Table 4: Performance (mean squared error of predicted to human-assigned scores) of Siamese systems, trained using binary and K-L divergence criteria, with the averaging or attention combination, each trained on TRN and evaluated on EVL

All systems are able to predict score with a reasonable degree of accuracy and, as expected, adding the attention mechanism improves performance. The systems initialised using the K-L divergences from the baseline method outperform those initialised using the binary classifier, which is to be expected given the superior granularity of K-L divergences compared to the binary variant.

Finally, the performance of the best architecture (attention LSTM, K-L criterion, attention combination) is compared to that of the baseline (Table 5). The system outperforms the baseline in terms of MSE but is comparable for PCC. This can be explained given that the new system is optimised end-to-end for minimum MSE, whereas for the baseline only the grader is optimised for minimum MSE.

System              MSE    PCC
Baseline            14.8   0.785
End-to-end system   14.2   0.780

Table 5: Performance (mean squared error and Pearson correlation coefficient of predicted to human-assigned scores) of the baseline and the best end-to-end Siamese system, each trained on TRN and evaluated on EVL

7. Conclusions

Phone distance features have previously been shown to be a good indicator of pronunciation quality, allowing their use in assessing the proficiency of a non-native learner's speech, in particular for the assessment of spontaneous spoken responses. This paper has proposed an alternative to model-based approaches to phone distances, based on Siamese networks.

It was first shown how Siamese networks can be used to extract distance metrics between pairs of phone instances. These can be used to predict whether the two are instances of the same or different phones, as well as the relative entropies between the distributions of their two phones if they are instances of different phones. The distance measure is tunable, performing differently on different tasks depending on how it is trained. The latter paradigm, which involves calculating model-based relative entropies before training, was shown to produce a superior distance measure. Of the three different architectures considered, that involving an attention mechanism outperformed the standard and centre-based sequence-to-vector architectures.

Finally, the networks were also used to develop a proficiency grader, alternatively using an averaging and an attention mechanism to move from the phone instance level to the phone level, with the latter proving superior. Trained in an end-to-end fashion, this grader was able to predict human-assigned proficiency scores with performance surpassing model-based phone distance features.
8. References
[1] D. Graddol, English Next. British Council, 2006.
[2] S. M. Witt, “Automatic error detection in pronunciation training:
Where we are and where we need to go,” in Proc. IS ADEPT,
vol. 6, 2012.
[3] S. Detey, L. Fontan, and T. Pellegrini, “Traitement de la prononciation en langue étrangère: approches didactiques, méthodes automatiques et enjeux pour l'apprentissage,” Revue Traitement Automatique des Langues, 2016.
[4] R. van Dalen, K. Knill, and M. Gales, “Automatically Grading
Learners’ English Using a Gaussian Process,” in SLaTE, 2015.
[5] Council of Europe, Common European Framework of Reference
for Languages: Learning, Teaching, Assessment. Cambridge
University Press, 2001.
[6] N. Moustroufas and V. Digalakis, “Automatic pronunciation eval-
uation of foreign speakers using unknown text,” CSL, vol. 21, pp.
219–230, 2007.
[7] A. Metallinou and J. Cheng, “Using Deep Neural Networks to
Improve Proficiency Assessment for Children English Language
Learners,” in INTERSPEECH, 2014, pp. 1468–1472.
[8] M. Nicolao, A. V. Beeston, and T. Hain, “Automatic assessment
of English learner pronunciation using discriminative classifiers,”
in ICASSP. IEEE, 2015, pp. 5351–5355.
[9] S. Witt and S. Young, “Phone-level pronunciation scoring and
assessment for interactive language learning,” SPEECHCOM,
vol. 30, pp. 95–108, 2000.
[10] H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, “Combi-
nation of machine scores for automatic grading of pronunciation
quality,” SPEECHCOM, 2000.
[11] N. Minematsu, S. Asakawa, and K. Hirose, “Structural represen-
tation of the pronunciation and its use for CALL,” in SLT, 2006,
pp. 126–129.
[12] S. Kasahara, N. Minematsu, H. Shen, D. Saito, and K. Hirose, “Structure-based prediction of English pronunciation distances and its analytical investigation,” in ICIST. IEEE, 2014, pp. 332–336.
[13] K. Kyriakopoulos, M. Gales, and K. Knill, “Automatic Character-
isation of the Pronunciation of Non-native English Speakers using
Phone Distance Features,” in SLATE, 2017.
[14] K. Knill, M. Gales, K. Kyriakopoulos, A. Ragni, and Y. Wang,
“Use of Graphemic Lexicons for Spoken Language Assessment,”
in INTERSPEECH, 2017, pp. 2774–2778.
[15] D. Jurafsky and J. H. Martin, Speech and Language Processing.
New Jersey: Prentice-Hall, 2000.
[16] J. Mueller and A. Thyagarajan, “Siamese Recurrent Architectures
for Learning Sentence Similarity,” in AAAI, 2016, pp. 2786–2792.
[17] L. Chambers and K. Ingham, “The BULATS online speak-
ing test,” Research Notes, vol. 43, pp. 21–25, 2011.
[Online]. Available: http://www.cambridgeenglish.org/images/
23161-research-notes-43.pdf
[18] R. C. van Dalen, K. M. Knill, P. Tsiakoulis, and M. J. F. Gales,
“Improving multiple-crowd-sourced transcriptions using a speech
recogniser,” in ICASSP, Apr 2015.