Attention-Based Models For Text-Dependent Speaker Verification
1. INTRODUCTION

Speaker verification (SV) is the process of verifying, based on a set of reference enrollment utterances, whether a verification utterance belongs to a known speaker. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. In this study, we focus on the "OK Google" and "Hey Google" global passwords, as they relate to the Voice Match feature of Google Home [1, 2].

I-vector [3] based systems in combination with verification back-ends such as Probabilistic Linear Discriminant Analysis (PLDA) [4] have been the dominant paradigm of SV in previous years. More recently, with the rise of deep learning [5] in various machine learning applications, more effort has focused on using neural networks for speaker verification. Currently, the most promising approaches are end-to-end integrated architectures that simulate the enrollment-verification two-stage process during training.

For example, in [6] the authors propose architectures that resemble the components of an i-vector + PLDA system. Such an architecture allowed the network parameters to be bootstrapped from pretrained i-vector and PLDA models for better performance. However, such an initialization stage also constrained the type of network architectures that could be used: only Deep Neural Networks (DNNs) can be initialized from classical i-vector and PLDA models. In [7], we have shown that Long Short-Term Memory (LSTM) networks [8] can achieve better performance than DNNs for integrated end-to-end architectures in TD-SV scenarios.

∗ The author did this work during his internship at Google.

2. BASELINE ARCHITECTURE

Our end-to-end training architecture [7] is described in Fig. 1. For each training step, a tuple of one evaluation utterance x_{j∼} and N enrollment utterances x_{kn} (for n = 1, ..., N) is fed into our LSTM network: {x_{j∼}, (x_{k1}, ..., x_{kN})}, where x represents the features (log-mel-filterbank energies) from a fixed-length segment, j and k represent the speakers of the utterances, and j may or may not equal k. The tuple includes a single utterance from speaker j and N different utterances from speaker k. We call a tuple positive if x_{j∼} and the N enrollment utterances are from the same speaker, i.e., j = k, and negative otherwise. We generate positive and negative tuples alternately.
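To make the tuple construction concrete, the following is a minimal Python sketch of one way to generate alternating positive and negative tuples. The function names, the `utterances_by_speaker` mapping, and the default choice of N are illustrative assumptions, not part of the system in [7].

```python
import random

def build_tuple(utterances_by_speaker, n_enroll, positive):
    """Build one tuple {x_(j~), (x_k1, ..., x_kN)} of feature segments.

    utterances_by_speaker: dict mapping speaker id -> list of
    log-mel-filterbank feature arrays from fixed-length segments.
    """
    speakers = list(utterances_by_speaker)
    k = random.choice(speakers)                        # enrollment speaker k
    j = k if positive else random.choice([s for s in speakers if s != k])
    enrollment = random.sample(utterances_by_speaker[k], n_enroll)  # N utterances
    evaluation = random.choice(utterances_by_speaker[j])            # x_(j~)
    return evaluation, enrollment

def tuple_stream(utterances_by_speaker, n_enroll=4):
    """Yield positive and negative tuples alternately."""
    positive = True
    while True:
        yield build_tuple(utterances_by_speaker, n_enroll, positive)
        positive = not positive
```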
For each utterance, let the output of the LSTM's last layer at frame t be a fixed-dimensional vector h_t, where 1 ≤ t ≤ T. We take the last-frame output as the d-vector ω = h_T (Fig. 2a), and build a new tuple: {ω_{j∼}, (ω_{k1}, ..., ω_{kN})}. The centroid of the tuple (ω_{k1}, ..., ω_{kN}) represents the voiceprint built from the N enrollment utterances, and is defined as follows:

    c_k = E_n[\omega_{kn}] = \frac{1}{N} \sum_{n=1}^{N} \frac{\omega_{kn}}{\|\omega_{kn}\|_2}.   (1)

The similarity is defined using the cosine similarity function:

    s = w \cdot \cos(\omega_{j\sim}, c_k) + b,   (2)

with learnable w and b. The tuple-based end-to-end loss is finally defined as:

    L_T(\omega_{j\sim}, c_k) = \delta(j, k)\,\sigma(s) + \big(1 - \delta(j, k)\big)\big(1 - \sigma(s)\big).   (3)
Fig. 1: Our baseline end-to-end training architecture as introduced in [7].

Here σ(x) = 1/(1 + e^{−x}) is the standard sigmoid function and δ(j, k) equals 1 if j = k, and 0 otherwise. The end-to-end loss function encourages a larger value of s when k = j, and a smaller value of s when k ≠ j. Considering the updates for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [14].
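As a minimal NumPy sketch of Eqs. (1)-(3), assuming the d-vectors have already been extracted by the LSTM; the variable and function names are ours, and w and b are the scalar parameters of Eq. (2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def centroid(enroll_dvectors):
    """Eq. (1): mean of the length-normalized enrollment d-vectors."""
    normalized = [w_kn / np.linalg.norm(w_kn) for w_kn in enroll_dvectors]
    return np.mean(normalized, axis=0)

def similarity(eval_dvector, c_k, w, b):
    """Eq. (2): cosine similarity scaled by learnable w and shifted by b."""
    cos = np.dot(eval_dvector, c_k) / (
        np.linalg.norm(eval_dvector) * np.linalg.norm(c_k))
    return w * cos + b

def tuple_e2e_loss(eval_dvector, enroll_dvectors, same_speaker, w, b):
    """Eq. (3): tuple-based end-to-end loss for one tuple."""
    s = similarity(eval_dvector, centroid(enroll_dvectors), w, b)
    delta = 1.0 if same_speaker else 0.0          # delta(j, k)
    return delta * sigmoid(s) + (1.0 - delta) * (1.0 - sigmoid(s))
```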
3.1. Basic attention layer

In our baseline end-to-end training, we directly take the last-frame output as the d-vector ω = h_T. Alternatively, we could learn a scalar score e_t ∈ R for the LSTM output h_t at each frame t:

    e_t = f(h_t), \quad t = 1, ..., T.   (4)

Then we can compute the normalized weights α_t ∈ [0, 1] from these scores:

    \alpha_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)},   (5)

such that \sum_{t=1}^{T} \alpha_t = 1. Finally, as shown in Fig. 2b, we form the d-vector ω as the weighted average of the LSTM outputs at all frames:

    \omega = \sum_{t=1}^{T} \alpha_t h_t.   (6)

Fig. 2: (a) LSTM-based d-vector baseline [7]. (b) Basic attention layer.

Different scoring functions f(·) can be used in Eq. (4), including:

• Shared-parameter linear attention, where the m-dimensional vector w and scalar b are the same for all frames:

    e_t = f_{SL}(h_t) = w^T h_t + b.   (9)

• Non-linear attention, where W_t is an m′ × m matrix, and b_t and v_t are m′-dimensional vectors. The dimension m′ can be tuned on a development dataset:

    e_t = f_{NL}(h_t) = v_t^T \tanh(W_t h_t + b_t).   (10)

• Shared-parameter non-linear attention, where the same W, b and v are used for all frames; see the sketch after this list.
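The following NumPy sketch combines the basic attention layer of Eqs. (4)-(6) with the shared-parameter non-linear scoring function; the array shapes and parameter names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(e):
    """Eq. (5): normalize the scores so the weights sum to 1."""
    e = e - np.max(e)                 # subtract max for numerical stability
    exp_e = np.exp(e)
    return exp_e / np.sum(exp_e)

def attention_dvector(h, v, W, b):
    """Attention-weighted d-vector, Eq. (6).

    h: (T, m) LSTM outputs, one row per frame.
    W: (m', m), b: (m',), v: (m',) are shared across all frames, so
    e_t = v^T tanh(W h_t + b) is the shared-parameter non-linear score.
    """
    scores = np.array([v @ np.tanh(W @ h_t + b) for h_t in h])   # e_t, Eq. (4)
    alphas = softmax(scores)                                      # alpha_t, Eq. (5)
    return alphas @ h                                             # sum_t alpha_t h_t, Eq. (6)
```

Swapping in the shared-parameter linear score of Eq. (9) only changes the score line to `scores = h @ w + b`, with an m-dimensional w and a scalar b.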
[Figure 4 panels, top to bottom: no pooling; sliding window maxpooling; global top-K maxpooling.]

Fig. 4: Different pooling methods on attention weights. The t-th pixel corresponds to the weight α_t, and a brighter intensity means a larger value of the weight.

• Global top-K maxpooling: Only keep the largest K values in the weights, and set all other values to 0.
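As a small illustration of the global top-K maxpooling step (a sketch under our own naming; whether the pooled weights are renormalized afterwards is not specified above, so they are left as-is):

```python
import numpy as np

def global_top_k_maxpool(alphas, k):
    """Keep only the K largest attention weights; set all others to 0."""
    pooled = np.zeros_like(alphas)
    top_k = np.argsort(alphas)[-k:]     # indices of the K largest weights
    pooled[top_k] = alphas[top_k]
    return pooled

# Example: the pooled weights replace alpha_t when forming the d-vector in Eq. (6).
```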
4. EXPERIMENTS

4.1. Datasets and basic setup
Table 1: Evaluation EER(%): Non-attention baseline model vs. basic attention layer using different scoring functions.
Table 2: Evaluation EER(%): Basic attention layer vs. variants, all using fSNL as the scoring function.

Test data    Basic fSNL    Cross-layer    Divided-layer
OK → OK          0.78          0.81           0.75
OK → Hey         2.66          2.61           2.44
Hey → OK         2.08          2.03           2.07
Hey → Hey        1.01          0.97           0.99
Average          1.63          1.61           1.56
6. REFERENCES

[1] Yury Pinsky, "Tomato, tomahto. Google Home now supports multiple users," https://www.blog.google/products/assistant/tomato-tomahto-google-home-now-supports-multiple-users, 2017.

[2] Mihai Matei, "Voice Match will allow Google Home to recognize your voice," https://www.androidheadlines.com/2017/10/voice-match-will-allow-google-home-to-recognize-your-voice.html, 2017.

[3] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[4] Daniel Garcia-Romero and Carol Y Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Interspeech, 2011, pp. 249–252.

[5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[6] Johan Rohdin, Anna Silnova, Mireia Diez, Oldrich Plchot, Pavel Matejka, and Lukas Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," arXiv preprint arXiv:1710.02369, 2017.

[7] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5115–5119.

[8] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[9] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4087–4091.

[10] Rohit Prabhavalkar, Raziel Alvarez, Carolina Parada, Preetum Nakkiran, and Tara N Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4704–4708.

[11] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[12] Minh-Thang Luong, Hieu Pham, and Christopher D Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.

[13] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.

[14] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[15] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.

[17] Quan Wang, Carlton Downey, Li Wan, Philip Mansfield, and Ignacio Lopez Moreno, "Speaker diarization with LSTM," arXiv preprint arXiv:1710.10468, 2017.