The full set of phone-pair distances describes the speaker's overall accent. These features should thus robustly represent pronunciation in a way that is compact and independent of speaker attributes. Kyriakopoulos et al. [13] proposed a model-based phone distance feature system for pronunciation assessment. Gaussian models are built for each phone and the K-L divergences between them are used to determine the speaker's accent quality. This system is taken as the baseline in this paper.

Consider a speaker n whose utterances have been recognised and time-aligned to a series of I_n phone instances π_0, π_1, ..., π_{I_n−1}, each corresponding to one of the 47 English phones ω_0, ω_1, ..., ω_46 (/aa/, /ax/, /ah/ etc.¹).

(¹ based on the ARPABET phone set [15])

Each phone instance π_i is itself a sequence of T_i frames x_0^{(i)}, x_1^{(i)}, ..., x_{T_i−1}^{(i)}, where x_t^{(i)} is the vector of perceptual linear prediction (PLP) features extracted from frame t. Each phone ω_φ is represented by the parameters of a multivariate Gaussian model (μ_φ^{(n)}, Σ_φ^{(n)}), which predicts the PLP features produced each time the speaker utters an instance of it:

    p(x_t^{(i)} | π_i = ω_φ) = N(x_t^{(i)}; μ_φ^{(n)}, Σ_φ^{(n)})    (1)

Distances between models are represented using the symmetric Kullback-Leibler (K-L) divergence:

    D_{φ,ψ}^{(n)} = (1/2) [ KL(N(μ_φ^{(n)}, Σ_φ^{(n)}) || N(μ_ψ^{(n)}, Σ_ψ^{(n)})) + KL(N(μ_ψ^{(n)}, Σ_ψ^{(n)}) || N(μ_φ^{(n)}, Σ_φ^{(n)})) ]    (2)

Each speaker is thus represented by 1081 scalar phone-pair distances D_{0,1}^{(n)}, D_{0,2}^{(n)}, ..., D_{45,46}^{(n)}, together forming the vector D_n. These features are then used to train a deep neural network (DNN) to predict human-assigned proficiency scores:

    s_n = f(D_n)    (3)
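To make the baseline features concrete, a minimal numpy sketch of equation (2) is given below, assuming full-covariance Gaussian phone models and using the closed-form Gaussian K-L divergence (variable names are illustrative, not from the paper):

import numpy as np

def kl_gauss(mu0, sig0, mu1, sig1):
    # Closed-form KL( N(mu0, sig0) || N(mu1, sig1) ) for k-dimensional Gaussians
    k = mu0.shape[0]
    inv1 = np.linalg.inv(sig1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(sig0)
    _, logdet1 = np.linalg.slogdet(sig1)
    return 0.5 * (np.trace(inv1 @ sig0) + diff @ inv1 @ diff - k
                  + logdet1 - logdet0)

def sym_kl(mu_a, sig_a, mu_b, sig_b):
    # Symmetric K-L divergence of equation (2)
    return 0.5 * (kl_gauss(mu_a, sig_a, mu_b, sig_b)
                  + kl_gauss(mu_b, sig_b, mu_a, sig_a))

# The 1081 phone-pair distances D_n for speaker n (phones indexed 0..46)
# would then be, given per-phone estimates mu[.] and sig[.]:
#   [sym_kl(mu[a], sig[a], mu[b], sig[b])
#    for a, b in itertools.combinations(range(47), 2)]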
3. Siamese phone distance features

This paper proposes an alternative phone distance feature approach to the model-based method described in the previous section. The generative Gaussian model of each phone is replaced with a feature vector projected up from the frame sequences. This feature extractor is tunable and can be fine-tuned to the task of automatic assessment, unlike the more general model-based approach. This also addresses two issues with the model-based phone distance feature approach above. By localising the phone representation to the level of the individual phone instance, the amount of data needed to extract the features is reduced. In addition, the features should be more interpretable.
The first step is to project the frame sequence x_0^{(i)}, x_1^{(i)}, ..., x_{T_i−1}^{(i)} of each phone instance i to a fixed-length vector representation h_i by passing it through a bi-directional Long Short Term Memory (LSTM) network:

    h_t^{(f,i)} = f(x_t^{(i)}, h_{t−1}^{(f,i)}, λ^{(f,i)})    (4)

    h_t^{(b,i)} = f(x_t^{(i)}, h_{t+1}^{(b,i)}, λ^{(b,i)})    (5)

The standard mechanism for obtaining a fixed-length vector h_i from the forward and backward sequences of a Recurrent Neural Network (RNN) is to concatenate the two sequences' final time steps (Figure 2, left):

    h_i = [h_{T_i−1}^{(f,i)T}, h_0^{(b,i)T}]^T    (6)

This approach is problematic in this case, however, since the first and last frames of each phone instance are actually the least representative of its content (since they lie on the alignment boundary). Two alternative methods are therefore considered (see the sketch after this list):

1. Using the middle, rather than final, frame of each pass (Figure 2, middle):

    h_i = [h_{T_i/2}^{(f,i)T}, h_{T_i/2}^{(b,i)T}]^T    (7)

2. Using an attention mechanism to weight the importance of each frame (Figure 2, right):

    h_i = Σ_{t=0}^{T_i−1} α_t^{(f,i)} h_t^{(f,i)} + Σ_{t=0}^{T_i−1} α_t^{(b,i)} h_t^{(b,i)}    (8)

where

    α_t^{(f,i)} = exp(c_t^{(f,i)}) / Σ_{t'=0}^{T_i−1} (exp(c_{t'}^{(f,i)}) + exp(c_{t'}^{(b,i)}))    (9)

    c_t^{(f,i)} = f(h_t^{(f,i)}, λ_α)    (10)

and similarly for α_t^{(b,i)} (with the same λ_α).

[Figure 2: Standard, centre and attention-based sequence-to-vector bi-RNNs.]
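The three sequence-to-vector variants can be illustrated with a short PyTorch sketch, assuming the forward and backward state sequences h_f, h_b of shape (T, d) have already been produced by the bi-LSTM of equations (4)-(5); the callable score stands in for f(·, λ_α) of equation (10), e.g. a single shared torch.nn.Linear(d, 1) (this sketch and its names are illustrative, not the paper's implementation):

import torch

def pool_final(h_f, h_b):
    # eq (6): last forward state and first backward state, concatenated
    return torch.cat([h_f[-1], h_b[0]], dim=-1)

def pool_centre(h_f, h_b):
    # eq (7): centre time step of each direction, concatenated
    T = h_f.shape[0]
    return torch.cat([h_f[T // 2], h_b[T // 2]], dim=-1)

def pool_attention(h_f, h_b, score):
    # eqs (8)-(10): a single softmax over the scores of both directions,
    # so the weights of eq (9) are normalised jointly
    c = torch.cat([score(h_f), score(h_b)], dim=0).squeeze(-1)  # (2T,)
    alpha = torch.softmax(c, dim=0)
    return (alpha.unsqueeze(-1) * torch.cat([h_f, h_b], dim=0)).sum(dim=0)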
Whichever of the three methods is used, the resultant network must be trained across all instances and all speakers to map frame sequences to a space in which distances between phones are most indicative of proficiency.

Having defined a projection from the original audio frame sequences to a phone instance feature space, it is now necessary to use these projections to learn a phone distance metric. To this end, Siamese networks are employed. A Siamese network is composed of two copies of the same neural network, each fed with one of the elements of a pair of input samples. These identical networks project the samples into an embedding space. A measure of distance is then computed between the two samples depending on their relation label (same or different). The error is propagated evenly to the two copies. The architecture is based on the LSTM Siamese architecture for learning difference metrics between pairs of variable-length sequences presented by Mueller and Thyagarajan [16].

Consider a pair of phone instances π_i and π_j, taken from the same speaker n, of phones ω_φ and ω_ψ respectively. Using the method from the previous section, they are projected to vectors h_i and h_j. The distance between the two instances can therefore be represented as:

    d_{i,j} = ||h_i − h_j||_2    (11)
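The weight sharing at the heart of the Siamese arrangement can be made concrete with a minimal PyTorch sketch: one encoder (layer sizes such as n_plp=13 and d=64 are illustrative assumptions, not from the paper) embeds both instances of a pair before the Euclidean distance of equation (11) is taken, so gradients from the distance flow into a single shared set of weights:

import torch
from torch import nn

class SiameseEncoder(nn.Module):
    # One encoder applied to both elements of a pair (weights shared)
    def __init__(self, n_plp=13, d=64):
        super().__init__()
        self.lstm = nn.LSTM(n_plp, d, bidirectional=True, batch_first=True)
        self.att = nn.Linear(d, 1)  # shared attention scorer, lambda_alpha

    def embed(self, x):             # x: (1, T, n_plp) frames of one instance
        out, _ = self.lstm(x)                 # (1, T, 2d)
        h_f, h_b = out[0].chunk(2, dim=-1)    # forward / backward, (T, d) each
        c = torch.cat([self.att(h_f), self.att(h_b)]).squeeze(-1)
        alpha = torch.softmax(c, dim=0)
        return (alpha.unsqueeze(-1) * torch.cat([h_f, h_b])).sum(dim=0)

    def forward(self, x_i, x_j):
        # eq (11): Euclidean distance between the two embeddings
        return torch.dist(self.embed(x_i), self.embed(x_j), p=2)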
Two approaches are considered for training this metric to approximate the distance between the two instances. The first is simply to train d_{i,j} to approximate the model-based K-L divergence distance D_{φ,ψ}^{(n)} for phones φ and ψ in speaker n. The second is to use d_{i,j} to predict whether the two instances are instances of the same phone:

    c_{ij} = { 1, ω_φ = ω_ψ
             { 0, ω_φ ≠ ω_ψ    (12)

This is done by passing d_{i,j} through a sigmoid to derive the probability p_{ij} that c_{ij} = 1:

    p_{ij} = 2 / (1 + exp(−d_{ij})) − 1    (13)
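A direct transcription of equations (12) and (13) in Python (illustrative only; note that for d_{ij} ≥ 0 the shifted sigmoid takes values in [0, 1)):

import math

def pair_target(phone_i: int, phone_j: int) -> int:
    # eq (12): c_ij = 1 when the two instances carry the same phone label
    return 1 if phone_i == phone_j else 0

def pair_probability(d_ij: float) -> float:
    # eq (13): shifted sigmoid of the embedding distance,
    # 0 at d_ij = 0 and approaching 1 as d_ij grows
    return 2.0 / (1.0 + math.exp(-d_ij)) - 1.0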
Given the large number of possible instance pairs for each speaker n, M pairs (l_0^{(n)}, r_0^{(n)}), (l_1^{(n)}, r_1^{(n)}), ..., (l_{M−1}^{(n)}, r_{M−1}^{(n)}) are sampled for use in training (in the experiments in this paper M = 100). In the K-L training case sampling is completely random, while in the binary case M/2 pairs of instances of the same phone and M/2 pairs of instances of different phones are sampled.

If the M pairs have corresponding phone labels (φ_0^{(n)}, ψ_0^{(n)}), (φ_1^{(n)}, ψ_1^{(n)}), ..., (φ_{M−1}^{(n)}, ψ_{M−1}^{(n)}), the objective function over all N speakers in the training set becomes, for K-L training:

    min { Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} log| d_{l_m^{(n)}, r_m^{(n)}} − D_{φ_m^{(n)}, ψ_m^{(n)}}^{(n)} | }    (14)

and for binary training:

    min { Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} { log(p_{l_m^{(n)}, r_m^{(n)}}),     φ_m^{(n)} = ψ_m^{(n)}
                                      { log(1 − p_{l_m^{(n)}, r_m^{(n)}}), φ_m^{(n)} ≠ ψ_m^{(n)}    (15)
The vector h_i for each instance i can now be said to represent a space in which Euclidean distance is representative of the conceptual distance between phones. This representation must now be combined from the phone instance to the overall phone level for each speaker and finally projected to predict speaker score.
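The pair sampling and the two training criteria can be sketched as follows (illustrative; the binary case is written in the usual negative log-likelihood form implied by equation (15), and small epsilon terms are added for numerical stability):

import random
import torch

def sample_pairs(instances, M=100, balanced=True):
    # instances: list of (frames, phone_id) for one speaker.
    # Binary criterion: M/2 same-phone then M/2 different-phone pairs;
    # K-L criterion: fully random pairs (balanced=False).
    pairs = []
    while len(pairs) < M:
        l, r = random.sample(instances, 2)
        if balanced and (l[1] == r[1]) != (len(pairs) < M // 2):
            continue
        pairs.append((l, r))
    return pairs

def kl_pair_loss(d_ij, D_target, eps=1e-6):
    # eq (14) summand: log absolute error against the model-based distance
    return torch.log(torch.abs(d_ij - D_target) + eps)

def binary_pair_loss(p_ij, c_ij, eps=1e-6):
    # eq (15) summand as a negative log-likelihood of the same/different label
    p_ij = p_ij.clamp(eps, 1 - eps)
    return -(c_ij * torch.log(p_ij) + (1 - c_ij) * torch.log(1 - p_ij))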
4. Predicting grade

Once trained, the systems described in the previous section can extract distance features at the level of phone instance pairs. The next step is to integrate these systems into an end-to-end system for predicting a speaker's proficiency score.
The projected phone instance features h_i^{(n)} for all instances π_i^{(n)} of phone ω_φ in speaker n can be combined to derive an overall phone vector h_φ^{(n)} in one of two ways (see the sketch after this list):

1. By simple averaging:

    h_φ^{(n)} = (1 / Σ_{i:π_i=ω_φ} 1) Σ_{i:π_i=ω_φ} h_i^{(n)}    (16)

2. Via another attention mechanism:

    h_φ^{(n)} = Σ_{i:π_i=ω_φ} α_i^{(n)} h_i^{(n)}    (17)

    α_i^{(n)} = exp(c_i^{(n)}) / Σ_{i:π_i=ω_φ} exp(c_i^{(n)})    (18)

    c_i^{(n)} = g(h_i^{(n)})    (19)

where g is a word-level LSTM, since each instance's surrounding context is expected to affect how important it is to score.
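The two combination methods can be sketched as below (illustrative; for simplicity the scorer g here runs only over the instances of the phone being pooled, whereas the paper's g is a word-level LSTM over the surrounding context, and the size d=128 is an assumption):

import torch
from torch import nn

class PhoneAggregator(nn.Module):
    # Combines the instance vectors h_i of one phone into h_phi, eqs (16)-(19)
    def __init__(self, d=128, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        self.g = nn.LSTM(d, 1, batch_first=True)  # per-instance scalar scores

    def forward(self, h):                # h: (I, d), instances of one phone
        if not self.use_attention:
            return h.mean(dim=0)         # eq (16): simple averaging
        c, _ = self.g(h.unsqueeze(0))    # eq (19): scores c_i from g
        alpha = torch.softmax(c[0, :, 0], dim=0)      # eq (18)
        return (alpha.unsqueeze(-1) * h).sum(dim=0)   # eq (17)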
The distance feature between each pair of phones ω_φ, ω_ψ can now be calculated as:

    d_{φ,ψ}^{(n)} = ||h_φ^{(n)} − h_ψ^{(n)}||_2    (20)

resulting in 1081 scalar distances d_{0,1}^{(n)}, d_{0,2}^{(n)}, ..., d_{45,46}^{(n)}, together forming the vector d_n, which is passed through a feed-forward layer to predict the score:

    s_n = f(d_n)    (21)

End-to-end training of the whole system (after initialisation using the Siamese network training) with the MSE criterion can now be performed using all N speakers in the training set:

    min { (1/N) Σ_{n=0}^{N−1} (s_n − f(d_n))^2 }    (22)
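Equations (20)-(22) amount to a pairwise-distance layer followed by a regression head; a minimal PyTorch sketch of one training step is given below (illustrative; in the full system the gradient would also flow back through the aggregation and Siamese encoder stages, whereas here only the grader's parameters are updated):

import itertools
import torch
from torch import nn

def phone_pair_distances(h_phone):   # h_phone: (47, d) per-phone vectors
    # eq (20): the 1081 pairwise Euclidean distances, stacked into d_n
    return torch.stack([torch.dist(h_phone[a], h_phone[b], p=2)
                        for a, b in itertools.combinations(range(47), 2)])

grader = nn.Linear(1081, 1)          # feed-forward score predictor, eq (21)
optimiser = torch.optim.SGD(grader.parameters(), lr=1e-3)

def mse_step(h_phone, human_score):
    # eq (22): one squared-error gradient step for a single speaker
    pred = grader(phone_pair_distances(h_phone)).squeeze()
    loss = (pred - human_score) ** 2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()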
5. Experimental Setup

The preceding sections of this paper have described a system for predicting pronunciation proficiency score based on the audio frames and aligned phone sequence of the utterances produced by a candidate. The data for initialising, training and testing this system are obtained from candidate responses to the spoken component of the Business Language Testing Service (BULATS) for foreign learners of English, provided by Cambridge English Language Assessment. The BULATS speaking test has five sections, all related to business scenarios [17]. Section A consists of short responses to prompted questions. Candidates read 8 sentences aloud in Section B. Sections C-E consist of spontaneous responses of several sentences in length to a series of spoken and visual prompts. Candidates are scored on a scale from 0 to 30, based on their overall proficiency, and it is this score that the system is predicting.

The systems are trained on a gender- and proficiency-balanced mixed-L1 dataset (TRN) consisting of 994 speakers (first languages Polish, Vietnamese, Arabic, Dutch, French and Thai), scored on their overall proficiency (not just pronunciation) by human graders, and evaluated on a held-out evaluation set (EVL) consisting of 226 speakers of a similar mix of L1s, gender and proficiency, with scores provided by expert human graders.

As discussed in Section 2, the first step before passing the data through the system is recognising the text being spoken and aligning the audio to a sequence of phones. Both these tasks are performed using an automatic speech recogniser (ASR). Due to the incorrect pronunciations, grammar and rhythm related to the speaker's proficiency level and first language (L1), the accuracy of standard commercial "off-the-shelf" ASR systems is too low for non-native learner English. Instead, the ASR system from Kyriakopoulos et al. [13] (also described in van Dalen et al. [18]), which is trained on non-native learners of English, is used. This ASR has an overall word error rate (WER) of 47.5% and a phone error rate (PER) of 33.9%, evaluated against crowd-sourced transcriptions of EVL.
6. Experimental Results

Sections 3 and 4 describe a pronunciation assessment system in two stages. First, an LSTM of either the standard, centre or attention variety is trained, in a Siamese architecture, to extract phone instance features to predict, for a given phone instance pair, either a binary or a K-L divergence distance metric. Next, the trained LSTM is integrated into an end-to-end score prediction system, using either an averaging or attention mechanism to move from the phone instance level to the phone level.

Given the above, it is necessary to evaluate, using the setup described in Section 5, first, how well the system performs at the initialisation tasks; second, which of the proposed architectures (standard vs. centre vs. attention LSTM, binary vs. K-L training, averaging vs. attention combination) performs best; and third, how the best architecture performs relative to the baseline.

First, the Siamese networks presented in Section 3 are trained on randomly selected pairs from all the speakers in the TRN data set. They are evaluated on similar pairs from the EVL data set for the two tasks of binary (same vs. different) classification and predicting the K-L divergences from the baseline model. Table 1 shows the results for the standard bi-LSTM configuration, where the final time steps are concatenated to form the fixed-length phone instance representation h_i.

Criterion   Binary Accuracy   K-L Performance
Binary      75.0%             0.599
K-L         68.0%             0.789

Table 1: Performance of standard bi-LSTM configuration Siamese networks

As expected, the networks perform better on the task they are trained for than on the other task. Both systems perform well, suggesting that the Siamese networks are capable of extracting interpretable distance metrics indicative of both the clustering together of instances of the same phone and the distances between distributions of different phones. Further, the fact that the system trained for each task also performs reasonably on the other task suggests that these two concepts of distance are closely related, as was to be expected.

The experiments are now repeated for all three bi-LSTM variants (standard, centre and attention) of the phone instance representation to determine which is best. As expected, and as can be seen in Tables 2 and 3, the attention LSTM performs the best overall, with the centre time steps configuration better for matched criteria than the standard configuration. The attention bi-LSTM is slower to train but is used going forward due to this better representation.

Criterion   Std     Cen     Att
Binary      75.0%   77.5%   77.3%
K-L         68.0%   67.4%   69.0%

Table 2: Binary accuracy of standard (Std), centre (Cen) and attention (Att) bi-LSTM Siamese network configurations

Criterion   Std     Cen     Att
Binary      0.599   0.587   0.602
K-L         0.789   0.792   0.788

Table 3: K-L performance of standard (Std), centre (Cen) and attention (Att) bi-LSTM Siamese network configurations

Having established that the Siamese networks do seem to be extracting valid distance features, these features are now employed to predict proficiency scores. The system is connected end-to-end using the averaging and attention methods and further trained to predict grade. The mean squared error (MSE) results of these experiments are presented in Table 4.

                 MSE
Initialisation   Average   Attention
Binary           19.7      17.6
K-L              16.4      14.2

Table 4: Performance (mean squared error of predicted to human-assigned scores) of Siamese systems, trained using binary and K-L divergence criteria, with the averaging or attention combination, each trained on TRN and evaluated on EVL

All systems are able to predict score with a reasonable degree of accuracy and, as expected, adding the attention mechanism improves performance. The systems initialised using the K-L divergences from the baseline method outperform those initialised using the binary classifier, which is to be expected given the superior granularity of K-L divergences compared to the binary variant.

Finally, the performance of the best architecture (attention LSTM, K-L criterion, attention combination) is compared to that of the baseline (Table 5). The system outperforms the baseline in terms of MSE but is comparable for PCC. This can be explained given that the new system is optimised end-to-end for minimum MSE, whereas for the baseline only the grader is optimised for minimum MSE.

System              MSE    PCC
Baseline            14.8   0.785
End-to-end system   14.2   0.780

Table 5: Performance (mean squared error and Pearson correlation coefficient of predicted to human-assigned scores) of the baseline and the best end-to-end Siamese system, each trained on TRN and evaluated on EVL

7. Conclusions

Phone distance features have previously been shown to be a good indicator of pronunciation quality, allowing their use in assessing the proficiency of a non-native learner's speech, in particular for the assessment of spontaneous spoken responses. This paper has proposed an alternative to model-based approaches to phone distances, based on Siamese networks.

It was first shown how Siamese networks can be used to extract distance metrics between pairs of phone instances. These can be used to predict whether the two are instances of the same or different phones, as well as the relative entropies between the distributions of their two phones if they are instances of different phones. The distance measure is tunable, performing differently on different tasks depending on how it is trained. The latter paradigm, which involves calculating model-based relative entropies before training, was shown to produce a superior distance measure. Of the three different architectures considered, that involving an attention mechanism outperformed the standard and centre-based sequence-to-vector architectures.

Finally, the networks were also used to develop a proficiency grader, alternatively using an averaging and an attention mechanism to move from the phone instance level to the phone level, with the latter proving superior. Trained in an end-to-end fashion, this grader was able to predict human-assigned proficiency scores with performance surpassing model-based phone distance features.
8. References
[1] D. Graddol, English Next. British Council, 2006.
[2] S. M. Witt, “Automatic error detection in pronunciation training:
Where we are and where we need to go,” in Proc. IS ADEPT,
vol. 6, 2012.
[3] S. Detey, L. Fontan, and T. Pellegrini, “Traitement de la prononciation en langue étrangère: approches didactiques, méthodes automatiques et enjeux pour l'apprentissage,” Revue Traitement Automatique des Langues, 2016.
[4] R. van Dalen, K. Knill, and M. Gales, “Automatically Grading
Learners’ English Using a Gaussian Process,” in SLaTE, 2015.
[5] Council of Europe, Common European Framework of Reference
for Languages: Learning, Teaching, Assessment. Cambridge
University Press, 2001.
[6] N. Moustroufas and V. Digalakis, “Automatic pronunciation eval-
uation of foreign speakers using unknown text,” CSL, vol. 21, pp.
219–230, 2007.
[7] A. Metallinou and J. Cheng, “Using Deep Neural Networks to
Improve Proficiency Assessment for Children English Language
Learners,” in INTERSPEECH, 2014, pp. 1468–1472.
[8] M. Nicolao, A. V. Beeston, and T. Hain, “Automatic assessment
of English learner pronunciation using discriminative classifiers,”
in ICASSP. IEEE, 2015, pp. 5351–5355.
[9] S. Witt and S. Young, “Phone-level pronunciation scoring and
assessment for interactive language learning,” SPEECHCOM,
vol. 30, pp. 95–108, 2000.
[10] H. Franco, L. Neumeyer, V. Digalakis, and O. Ronen, “Combi-
nation of machine scores for automatic grading of pronunciation
quality,” SPEECHCOM, 2000.
[11] N. Minematsu, S. Asakawa, and K. Hirose, “Structural represen-
tation of the pronunciation and its use for CALL,” in SLT, 2006,
pp. 126–129.
[12] S. Kasahara, N. Minematsu, H. Shen, D. Saito, and K. Hirose, “Structure-based prediction of English pronunciation distances and its analytical investigation,” in ICIST. IEEE, 2014, pp. 332–336.
[13] K. Kyriakopoulos, M. Gales, and K. Knill, “Automatic Character-
isation of the Pronunciation of Non-native English Speakers using
Phone Distance Features,” in SLATE, 2017.
[14] K. Knill, M. Gales, K. Kyriakopoulos, A. Ragni, and Y. Wang,
“Use of Graphemic Lexicons for Spoken Language Assessment,”
in INTERSPEECH, 2017, pp. 2774–2778.
[15] D. Jurafsky and J. H. Martin, Speech and Language Processing.
New Jersey: Prentice-Hall, 2000.
[16] J. Mueller and A. Thyagarajan, “Siamese Recurrent Architectures
for Learning Sentence Similarity,” in AAAI, 2016, pp. 2786–2792.
[17] L. Chambers and K. Ingham, “The BULATS online speak-
ing test,” Research Notes, vol. 43, pp. 21–25, 2011.
[Online]. Available: http://www.cambridgeenglish.org/images/
23161-research-notes-43.pdf
[18] R. C. van Dalen, K. M. Knill, P. Tsiakoulis, and M. J. F. Gales,
“Improving multiple-crowd-sourced transcriptions using a speech
recogniser,” in ICASSP, Apr 2015.