
ATTENTION-BASED MODELS FOR TEXT-DEPENDENT SPEAKER VERIFICATION

F A Rezaur Rahman Chowdhury∗, Quan Wang, Ignacio Lopez Moreno, Li Wan

Washington State University, USA        Google Inc., USA

[email protected]        {quanw, elnota, liwan}@google.com

ABSTRACT

Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.

Index Terms— Attention-based model, sequence summarization, speaker recognition, pooling, LSTM

1. INTRODUCTION

Speaker verification (SV) is the process of verifying, based on a set of reference enrollment utterances, whether a verification utterance belongs to a known speaker. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. In this study, we focus on the "OK Google" and "Hey Google" global passwords, as they relate to the Voice Match feature of Google Home [1, 2].

I-vector [3] based systems in combination with verification back-ends such as Probabilistic Linear Discriminant Analysis (PLDA) [4] have been the dominant paradigm of SV in previous years. More recently, with the rise of deep learning [5] in various machine learning applications, more effort has focused on using neural networks for speaker verification. Currently, the most promising approaches are end-to-end integrated architectures that simulate the enrollment-verification two-stage process during training.

For example, in [6] the authors propose architectures that resemble the components of an i-vector + PLDA system. Such an architecture allowed the network parameters to be bootstrapped from pretrained i-vector and PLDA models for better performance. However, this initialization stage also constrained the type of network architectures that could be used: only Deep Neural Networks (DNNs) can be initialized from classical i-vector and PLDA models. In [7], we have shown that Long Short-Term Memory (LSTM) networks [8] can achieve better performance than DNNs for integrated end-to-end architectures in TD-SV scenarios.

However, one challenge in the architecture introduced in [7] is that silence and background noise are not well captured. Though our speaker verification runs on a short 800ms window that is segmented by the keyword detector [9, 10], the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built only from the frames corresponding to phonemes. Thus, we propose to use an attention layer [11, 12, 13] as a soft mechanism to emphasize the most relevant elements of the input sequence.

This paper is organized as follows. In Sec. 2, we first briefly review our LSTM-based d-vector baseline approach trained with the end-to-end architecture [7]. In Sec. 3, we introduce how we add the attention mechanism to our baseline architecture, covering different scoring functions, layer variants, and weights pooling methods. In Sec. 4 we set up experiments to compare attention-based models against our baseline model, and present the EER results on our testing set. Conclusions are drawn in Sec. 5.

∗ The author did this work during his internship at Google.

2. BASELINE ARCHITECTURE

Our end-to-end training architecture [7] is described in Fig. 1. For each training step, a tuple of one evaluation utterance x_{j∼} and N enrollment utterances x_{kn} (for n = 1, ..., N) is fed into our LSTM network: {x_{j∼}, (x_{k1}, ..., x_{kN})}, where x represents the features (log-mel-filterbank energies) from a fixed-length segment, j and k represent the speakers of the utterances, and j may or may not equal k. The tuple includes a single utterance from speaker j and N different utterances from speaker k. We call a tuple positive if x_{j∼} and the N enrollment utterances are from the same speaker, i.e., j = k, and negative otherwise. We generate positive and negative tuples alternately.

For each utterance, let the output of the LSTM's last layer at frame t be a fixed-dimensional vector h_t, where 1 ≤ t ≤ T. We take the last frame output as the d-vector ω = h_T (Fig. 2a), and build a new tuple: {ω_{j∼}, (ω_{k1}, ..., ω_{kN})}. The centroid of tuple (ω_{k1}, ..., ω_{kN}) represents the voiceprint built from N utterances, and is defined as follows:

    c_k = E_n[ω_{kn}] = (1/N) Σ_{n=1}^{N} ω_{kn} / ||ω_{kn}||_2 .    (1)

The similarity is defined using the cosine similarity function:

    s = w · cos(ω_{j∼}, c_k) + b ,    (2)

with learnable w and b. The tuple-based end-to-end loss is finally defined as:

    L_T(ω_{j∼}, c_k) = δ(j, k) (1 − σ(s)) + (1 − δ(j, k)) σ(s) .    (3)

Here σ(x) = 1/(1 + e^{−x}) is the standard sigmoid function, and δ(j, k) equals 1 if j = k and 0 otherwise. The end-to-end loss function encourages a larger value of s when k = j, and a smaller value of s when k ≠ j. Considering the update for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [14].
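To make Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the tuple-based end-to-end loss for a single tuple. The function and variable names (tuple_e2e_loss, omega_eval, omega_enroll) are illustrative rather than taken from the paper's codebase, and the learnable scalars w and b are passed in as plain constants here, whereas in the real system they are trained jointly with the LSTM.

```python
import numpy as np

def tuple_e2e_loss(omega_eval, omega_enroll, same_speaker, w=10.0, b=-5.0):
    """Tuple-based end-to-end loss of Eqs. (1)-(3).

    omega_eval:   (d,)  d-vector of the evaluation utterance omega_{j~}.
    omega_enroll: (N, d) d-vectors of the N enrollment utterances of speaker k.
    same_speaker: True if j == k (positive tuple), False otherwise.
    w, b:         learnable scale and offset of Eq. (2); fixed here for illustration.
    """
    # Eq. (1): centroid of length-normalized enrollment d-vectors.
    normalized = omega_enroll / np.linalg.norm(omega_enroll, axis=1, keepdims=True)
    c_k = normalized.mean(axis=0)

    # Eq. (2): scaled cosine similarity between omega_{j~} and the centroid c_k.
    cos_sim = np.dot(omega_eval, c_k) / (np.linalg.norm(omega_eval) * np.linalg.norm(c_k))
    s = w * cos_sim + b

    # Eq. (3): loss is small when sigma(s) is large for positive tuples
    # and small for negative tuples.
    sigma = 1.0 / (1.0 + np.exp(-s))
    delta = 1.0 if same_speaker else 0.0
    return delta * (1.0 - sigma) + (1.0 - delta) * sigma
```

During actual training, w, b, and the LSTM parameters would all be updated by backpropagation on this loss rather than held fixed.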

Fig. 1: Our baseline end-to-end training architecture as introduced in [7]. (Diagram not reproduced; it shows input features from the evaluation and enrollment utterances feeding an LSTM, the LSTM outputs forming the speaker representation and, by averaging, the speaker model, followed by a cosine similarity score function and a logistic regression that produces the accept/reject decision.)

3. ATTENTION-BASED MODEL

3.1. Basic attention layer

In our baseline end-to-end training, we directly take the last frame output as the d-vector ω = h_T. Alternatively, we could learn a scalar score e_t ∈ R for the LSTM output h_t at each frame t:

    e_t = f(h_t),    t = 1, ..., T.    (4)

Then we can compute the normalized weights α_t ∈ [0, 1] using these scores:

    α_t = exp(e_t) / Σ_{i=1}^{T} exp(e_i),    (5)

such that Σ_{t=1}^{T} α_t = 1. Finally, as shown in Fig. 2b, we form the d-vector ω as the weighted average of the LSTM outputs at all frames:

    ω = Σ_{t=1}^{T} α_t h_t.    (6)

Fig. 2: (a) LSTM-based d-vector baseline [7]. (b) Basic attention layer. (Diagram not reproduced; in (b) the LSTM outputs are combined by the normalized weights into the d-vector.)

3.2. Scoring functions

By using different scoring functions f(·) in Eq. (4), we get different attention layers:

• Bias-only attention, where b_t is a scalar. Note that this attention does not depend on the LSTM output h_t:

    e_t = f_BO(h_t) = b_t.    (7)

• Linear attention, where w_t is an m-dimensional vector and b_t is a scalar:

    e_t = f_L(h_t) = w_t^T h_t + b_t.    (8)

• Shared-parameter linear attention, where the m-dimensional vector w and scalar b are the same for all frames:

    e_t = f_SL(h_t) = w^T h_t + b.    (9)

• Non-linear attention, where W_t is an m′ × m matrix, and b_t and v_t are m′-dimensional vectors. The dimension m′ can be tuned on a development dataset:

    e_t = f_NL(h_t) = v_t^T tanh(W_t h_t + b_t).    (10)

• Shared-parameter non-linear attention, where the same W, b and v are used for all frames:

    e_t = f_SNL(h_t) = v^T tanh(W h_t + b).    (11)

In all of the above scoring functions, all parameters are trainable within the end-to-end architecture [7].
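As an illustration of Eqs. (4)-(6) combined with the shared-parameter non-linear scoring function of Eq. (11), here is a minimal NumPy sketch. The names and shapes (attentive_d_vector, W, b, v) are assumptions for illustration; in the real system these parameters are learned jointly with the LSTM.

```python
import numpy as np

def attentive_d_vector(h, W, b, v):
    """Basic attention layer with shared-parameter non-linear scoring (Eq. 11).

    h: (T, m)        LSTM outputs h_t for one utterance segment.
    W: (m_prime, m)  projection matrix shared across frames.
    b: (m_prime,)    bias shared across frames.
    v: (m_prime,)    scoring vector shared across frames.
    Returns the d-vector (m,) and the attention weights (T,).
    """
    # Eq. (11): one scalar score per frame.
    e = np.tanh(h @ W.T + b) @ v             # shape (T,)

    # Eq. (5): softmax over frames gives normalized weights alpha_t.
    e = e - e.max()                           # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # shape (T,), sums to 1

    # Eq. (6): d-vector is the attention-weighted average of LSTM outputs.
    omega = alpha @ h                         # shape (m,)
    return omega, alpha


# Toy usage: T = 80 frames and m = m_prime = 64, matching the setup in Sec. 4.
rng = np.random.default_rng(0)
h = rng.standard_normal((80, 64))
W, b, v = rng.standard_normal((64, 64)), np.zeros(64), rng.standard_normal(64)
omega, alpha = attentive_d_vector(h, W, b, v)
```

The other scoring functions of Eqs. (7)-(10) would only change the first line of the function body, e.g. a single shared dot product for f_SL or per-frame parameters for f_L and f_NL.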

3.3. Attention layer variants

Apart from the basic attention layer described in Sec. 3.1, here we introduce two variants: cross-layer attention and divided-layer attention.

Fig. 3: Two variants of the attention layer: (a) cross-layer attention; (b) divided-layer attention. (Diagram not reproduced.)

For cross-layer attention (Fig. 3a), the scores e_t and weights α_t are not computed from the outputs of the last LSTM layer {h_t}_{1≤t≤T}, but from the outputs of an intermediate LSTM layer {h′_t}_{1≤t≤T}, e.g. the second-to-last layer:

    e_t = f(h′_t).    (12)

However, the d-vector ω is still the weighted average of the last layer outputs h_t.

For divided-layer attention (Fig. 3b), we double the dimension of the last layer LSTM output h_t, and equally divide its dimension into two parts: part-a h_t^a and part-b h_t^b. We use part-a to build the d-vector, while using part-b to learn the scores:

    e_t = f(h_t^b),    (13)

    ω = Σ_{t=1}^{T} α_t h_t^a.    (14)
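Below is a minimal NumPy sketch of divided-layer attention (Eqs. (13)-(14)) under the same assumptions as the previous snippet: the last-layer output is split into two equal halves, part-b scores the frames, and part-a is averaged into the d-vector. Names and shapes are illustrative.

```python
import numpy as np

def divided_layer_d_vector(h_double, W, b, v):
    """Divided-layer attention (Eqs. 13-14).

    h_double: (T, 2m) last-layer LSTM outputs with doubled dimension.
    W, b, v:  shared-parameter non-linear scoring parameters, applied to part-b.
    """
    m = h_double.shape[1] // 2
    h_a, h_b = h_double[:, :m], h_double[:, m:]   # part-a and part-b

    # Eq. (13): scores are learned from part-b only.
    e = np.tanh(h_b @ W.T + b) @ v
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    # Eq. (14): the d-vector is built from part-a only.
    return alpha @ h_a
```

Cross-layer attention (Eq. (12)) differs only in where the scores come from: e_t is computed from an intermediate layer's outputs h′_t, while the weighted average is still taken over the last layer's h_t.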
3.4. Weights pooling

Another variation of the basic attention layer is that, instead of directly using the normalized weights α_t to average the LSTM outputs, we can optionally perform maxpooling on the attention weights. This additional pooling mechanism can potentially make our network more robust to temporal variations of the input signals. We have experimented with two maxpooling methods (Fig. 4); a minimal sketch of both follows the list below:

• Sliding window maxpooling: We run a sliding window on the weights, and for each window, only keep the largest value and set the other values to 0.

• Global top-K maxpooling: Only keep the largest K values in the weights, and set all other values to 0.

Fig. 4: Different pooling methods on attention weights. The t-th pixel corresponds to the weight α_t, and a brighter intensity means a larger value of the weight. (Diagram not reproduced; it illustrates no pooling, sliding window maxpooling, and global top-K maxpooling along the time axis.)
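Here is a minimal NumPy sketch of the two pooling methods applied to the normalized attention weights. The window size, step, and K follow the values reported in Sec. 4.4 (window 10, step 5, K = 5); the function names are illustrative, and whether the kept weights are renormalized afterwards is not specified in the paper, so no renormalization is applied here.

```python
import numpy as np

def sliding_window_maxpool(alpha, window=10, step=5):
    """Keep only the largest weight inside each sliding window; zero the rest."""
    pooled = np.zeros_like(alpha)
    for start in range(0, len(alpha), step):
        idx = np.arange(start, min(start + window, len(alpha)))
        if idx.size:
            keep = idx[np.argmax(alpha[idx])]
            pooled[keep] = alpha[keep]
    return pooled

def global_top_k_maxpool(alpha, k=5):
    """Keep only the K largest weights; set all other values to 0."""
    pooled = np.zeros_like(alpha)
    top = np.argsort(alpha)[-k:]
    pooled[top] = alpha[top]
    return pooled
```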

4. EXPERIMENTS

4.1. Datasets and basic setup

To fairly compare different attention techniques, we use the same training and testing datasets for all our experiments.

Our training dataset is a collection of anonymized user voice queries, which is a mixture of "OK Google" and "Hey Google". It has around 150M utterances from around 630K speakers. Our testing dataset is a manual collection consisting of 665 speakers. It is divided into two enrollment sets and two verification sets, one for each of "OK Google" and "Hey Google". The enrollment and evaluation sets contain, respectively, an average of 4.5 and 10 utterances per speaker.

We report the speaker verification Equal Error Rate (EER) on the four combinations of enrollment set and verification set.

Our baseline model is a 3-layer LSTM, where each layer has dimension 128, with a projection layer [15] of dimension 64. On top of the LSTM is a linear layer of dimension 64. The acoustic parametrization consists of 40-dimensional log-mel-filterbank coefficients computed over a window of 25ms with 15ms of overlap. The same acoustic features are used for both keyword detection [10] and speaker verification.

The keyword spotting system isolates segments of length T = 80 frames (800ms) that only contain the global password, and these segments form the tuples mentioned above. The two keywords are mixed together using the MultiReader technique introduced in [16].
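For reference, the reported metric can be estimated in a simple way (an assumed illustration, not the paper's evaluation code): sweep a decision threshold over genuine and impostor similarity scores and find the point where the false accept and false reject rates are equal.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        frr = np.mean(genuine_scores < th)    # false reject rate
        far = np.mean(impostor_scores >= th)  # false accept rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```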

4.2. Basic attention layer

First, we compare the baseline model with the basic attention layer (Sec. 3.1) using different scoring functions (Sec. 3.2). The results are shown in Table 1. As we can see, while bias-only and linear attention bring little improvement to the EER, non-linear attention improves the performance significantly, especially with shared parameters. (For the intermediate dimension of the non-linear scoring functions, we use m′ = 64, such that W_t and W are square matrices.)

Table 1: Evaluation EER(%): Non-attention baseline model vs. basic attention layer using different scoring functions.

Enroll → Verify             Baseline   f_BO    f_L     f_SL    f_NL    f_SNL
OK Google → OK Google       0.88       0.85    0.81    0.80    0.79    0.78
OK Google → Hey Google      2.77       2.97    2.74    2.75    2.69    2.66
Hey Google → OK Google      2.19       2.30    2.28    2.23    2.14    2.08
Hey Google → Hey Google     1.05       1.04    1.03    1.03    1.00    1.01
Average                     1.72       1.79    1.72    1.70    1.66    1.63

4.3. Variants

To compare the basic attention layer with the two variants (Sec. 3.3), we use the scoring function that performed best in the previous experiment: the shared-parameter non-linear scoring function f_SNL. From the results in Table 2, we can see that divided-layer attention performs slightly better than basic attention and cross-layer attention, at the cost that the dimension of the last LSTM layer is doubled. (In our experiments, for cross-layer attention, scores are learned from the second-to-last layer.)

Table 2: Evaluation EER(%): Basic attention layer vs. variants, all using f_SNL as the scoring function.

Enroll → Verify    Basic f_SNL   Cross-layer   Divided-layer
OK → OK            0.78          0.81          0.75
OK → Hey           2.66          2.61          2.44
Hey → OK           2.08          2.03          2.07
Hey → Hey          1.01          0.97          0.99
Average            1.63          1.61          1.56

4.4. Weights pooling

To compare the different pooling methods on the attention weights introduced in Sec. 3.4, we use divided-layer attention with the shared-parameter non-linear scoring function. For sliding window maxpooling, we experimented with different window sizes and steps, and found that a window size of 10 frames and a step of 5 frames perform best in our evaluations. Also, for global top-K maxpooling, we found that the performance is best when K = 5. The results are shown in Table 3. We can see that sliding window maxpooling further improves the EER.

Table 3: Evaluation EER(%): Different pooling methods for attention weights, all using f_SNL and divided-layer attention.

Enroll → Verify    No pooling   Sliding window   Top-K
OK → OK            0.75         0.72             0.72
OK → Hey           2.44         2.37             2.63
Hey → OK           2.07         1.88             1.99
Hey → Hey          0.99         0.95             0.94
Average            1.56         1.48             1.57

We also visualize the attention weights of a training batch for the different pooling methods in Fig. 5. An interesting observation is that, when there is no pooling, we can see a clear 4-strand or 3-strand pattern in the batch. This pattern corresponds to the 4-phoneme "O-kay-Goo-gle" or 3-phoneme "Hey-Goo-gle" structure of the keywords.

Fig. 5: Visualized attention weights for different pooling methods. In each image, the x-axis is time and the y-axis is for different utterances in a training batch. (a) No pooling; (b) Sliding window maxpooling, where the window size is 10 and the step is 5; (c) Global top-K maxpooling, where K = 5. (Images not reproduced.)

When we apply sliding window maxpooling or global top-K maxpooling, the attention weights are much larger near the end of the utterance, which is easy to understand: the LSTM has accumulated more information at the end of the utterance than at the beginning, and is thus more confident in producing the d-vector.

5. CONCLUSIONS

In this paper, we experimented with different attention mechanisms for our keyword-based text-dependent speaker verification system [7]. From our experimental results, the best practice is to: (1) use a shared-parameter non-linear scoring function; (2) use a divided-layer attention connection to the last layer output of the LSTM; and (3) apply sliding window maxpooling on the attention weights. After combining all these best practices, we improved the EER of our baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement. The same attention mechanisms, especially those using shared-parameter scoring functions, could potentially be used to improve text-independent speaker verification models [16] and speaker diarization systems [17].

6. REFERENCES

[1] Yury Pinsky, "Tomato, tomahto. Google Home now supports multiple users," https://www.blog.google/products/assistant/tomato-tomahto-google-home-now-supports-multiple-users, 2017.

[2] Mihai Matei, "Voice Match will allow Google Home to recognize your voice," https://www.androidheadlines.com/2017/10/voice-match-will-allow-google-home-to-recognize-your-voice.html, 2017.

[3] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[4] Daniel Garcia-Romero and Carol Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Interspeech, 2011, pp. 249–252.

[5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[6] Johan Rohdin, Anna Silnova, Mireia Diez, Oldrich Plchot, Pavel Matejka, and Lukas Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," arXiv preprint arXiv:1710.02369, 2017.

[7] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119.

[8] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[9] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.

[10] Rohit Prabhavalkar, Raziel Alvarez, Carolina Parada, Preetum Nakkiran, and Tara N. Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4704–4708.

[11] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[12] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.

[13] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.

[14] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[15] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.

[17] Quan Wang, Carlton Downey, Li Wan, Philip Mansfield, and Ignacio Lopez Moreno, "Speaker diarization with LSTM," arXiv preprint arXiv:1710.10468, 2017.
