
ATTENTION-BASED MODELS FOR TEXT-DEPENDENT SPEAKER VERIFICATION

F A Rezaur Rahman Chowdhury∗, Quan Wang, Ignacio Lopez Moreno, Li Wan

Washington State University, USA        Google Inc., USA

[email protected]        {quanw, elnota, liwan}@google.com

ABSTRACT

Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.

Index Terms— Attention-based model, sequence summarization, speaker recognition, pooling, LSTM

1. INTRODUCTION

Speaker verification (SV) is the process of verifying, based on a set of reference enrollment utterances, whether a verification utterance belongs to a known speaker. One subtask of SV is global password text-dependent speaker verification (TD-SV), which refers to the set of problems for which the transcripts of reference enrollment and verification utterances are constrained to a specific phrase. In this study, we focus on the "OK Google" and "Hey Google" global passwords, as they relate to the Voice Match feature of Google Home [1, 2].

I-vector [3] based systems in combination with verification back-ends such as Probabilistic Linear Discriminant Analysis (PLDA) [4] have been the dominant paradigm of SV in previous years. More recently, with the rise of deep learning [5] in various machine learning applications, more effort has focused on using neural networks for speaker verification. Currently, the most promising approaches are end-to-end integrated architectures that simulate the enrollment-verification two-stage process during training.

For example, in [6] the authors propose architectures that resemble the components of an i-vector + PLDA system. Such an architecture allowed the network parameters to be bootstrapped from pretrained i-vector and PLDA models for better performance. However, this initialization stage also constrained the type of network architectures that could be used: only Deep Neural Networks (DNNs) can be initialized from classical i-vector and PLDA models. In [7], we have shown that Long Short-Term Memory (LSTM) networks [8] can achieve better performance than DNNs for integrated end-to-end architectures in TD-SV scenarios.

However, one challenge in the architecture introduced in [7] is that silence and background noise are not well captured. Though our speaker verification runs on a short 800ms window that is segmented by the keyword detector [9, 10], the phonemes are usually surrounded by frames of silence and background noise. Ideally, the speaker embedding should be built only from the frames corresponding to phonemes. Thus, we propose to use an attention layer [11, 12, 13] as a soft mechanism to emphasize the most relevant elements of the input sequence.

This paper is organized as follows. In Sec. 2, we first briefly review our LSTM-based d-vector baseline approach trained with the end-to-end architecture [7]. In Sec. 3, we introduce how we add the attention mechanism to our baseline architecture, covering different scoring functions, layer variants, and weights pooling methods. In Sec. 4 we set up experiments to compare attention-based models against our baseline model, and present the EER results on our testing set. Conclusions are drawn in Sec. 5.

∗ The author did this work during his internship at Google.

2. BASELINE ARCHITECTURE

Our end-to-end training architecture [7] is described in Fig. 1. For each training step, a tuple of one evaluation utterance x_{j∼} and N enrollment utterances x_{kn} (for n = 1, ..., N) is fed into our LSTM network: {x_{j∼}, (x_{k1}, ..., x_{kN})}, where x represents the features (log-mel-filterbank energies) from a fixed-length segment, j and k represent the speakers of the utterances, and j may or may not equal k. The tuple includes a single utterance from speaker j and N different utterances from speaker k. We call a tuple positive if x_{j∼} and the N enrollment utterances are from the same speaker, i.e., j = k, and negative otherwise. We generate positive and negative tuples alternately.

For each utterance, let the output of the LSTM's last layer at frame t be a fixed-dimensional vector h_t, where 1 ≤ t ≤ T. We take the last frame output as the d-vector ω = h_T (Fig. 2a), and build a new tuple: {ω_{j∼}, (ω_{k1}, ..., ω_{kN})}. The centroid of tuple (ω_{k1}, ..., ω_{kN}) represents the voiceprint built from N utterances, and is defined as follows:

    c_k = E_n[ω_{kn}] = (1/N) Σ_{n=1}^{N} ω_{kn} / ||ω_{kn}||_2 .    (1)

The similarity is defined using the cosine similarity function:

    s = w · cos(ω_{j∼}, c_k) + b ,    (2)

with learnable w and b. The tuple-based end-to-end loss is finally defined as:

    L_T(ω_{j∼}, c_k) = δ(j, k) (1 − σ(s)) + (1 − δ(j, k)) σ(s) .    (3)

Here σ(x) = 1/(1 + e^{−x}) is the standard sigmoid function, and δ(j, k) equals 1 if j = k and 0 otherwise. The end-to-end loss function encourages a larger value of s when k = j, and a smaller value of s when k ≠ j. Considering the update for both positive and negative tuples, this loss function is very similar to the triplet loss in FaceNet [14].
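To make Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the tuple-based end-to-end loss for a single tuple. The function and variable names (tuple_e2e_loss, omega_eval, omega_enroll) are illustrative rather than taken from the paper's codebase, and the learnable scalars w and b are passed in as plain constants here, whereas in the real system they are trained jointly with the LSTM.

```python
import numpy as np

def tuple_e2e_loss(omega_eval, omega_enroll, same_speaker, w=10.0, b=-5.0):
    """Tuple-based end-to-end loss of Eqs. (1)-(3).

    omega_eval:   (d,)  d-vector of the evaluation utterance omega_{j~}.
    omega_enroll: (N, d) d-vectors of the N enrollment utterances of speaker k.
    same_speaker: True if j == k (positive tuple), False otherwise.
    w, b:         learnable scale and offset of Eq. (2); fixed here for illustration.
    """
    # Eq. (1): centroid of length-normalized enrollment d-vectors.
    normalized = omega_enroll / np.linalg.norm(omega_enroll, axis=1, keepdims=True)
    c_k = normalized.mean(axis=0)

    # Eq. (2): scaled cosine similarity between omega_{j~} and the centroid c_k.
    cos_sim = np.dot(omega_eval, c_k) / (np.linalg.norm(omega_eval) * np.linalg.norm(c_k))
    s = w * cos_sim + b

    # Eq. (3): loss is small when sigma(s) is large for positive tuples
    # and small for negative tuples.
    sigma = 1.0 / (1.0 + np.exp(-s))
    delta = 1.0 if same_speaker else 0.0
    return delta * (1.0 - sigma) + (1.0 - delta) * sigma
```

During actual training, w, b, and the LSTM parameters would all be updated by backpropagation on this loss rather than held fixed.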

Fig. 1: Our baseline end-to-end training architecture as introduced in [7]. (Diagram not reproduced; it shows input features from the evaluation and enrollment utterances feeding an LSTM, the LSTM outputs forming the speaker representation and, by averaging, the speaker model, followed by a cosine similarity score function and a logistic regression that produces the accept/reject decision.)

3. ATTENTION-BASED MODEL

3.1. Basic attention layer

In our baseline end-to-end training, we directly take the last frame output as the d-vector ω = h_T. Alternatively, we could learn a scalar score e_t ∈ R for the LSTM output h_t at each frame t:

    e_t = f(h_t),    t = 1, ..., T.    (4)

Then we can compute the normalized weights α_t ∈ [0, 1] using these scores:

    α_t = exp(e_t) / Σ_{i=1}^{T} exp(e_i),    (5)

such that Σ_{t=1}^{T} α_t = 1. Finally, as shown in Fig. 2b, we form the d-vector ω as the weighted average of the LSTM outputs at all frames:

    ω = Σ_{t=1}^{T} α_t h_t.    (6)

Fig. 2: (a) LSTM-based d-vector baseline [7]. (b) Basic attention layer. (Diagram not reproduced; in (b) the LSTM outputs are combined by the normalized weights into the d-vector.)

3.2. Scoring functions

By using different scoring functions f(·) in Eq. (4), we get different attention layers:

• Bias-only attention, where b_t is a scalar. Note that this attention does not depend on the LSTM output h_t:

    e_t = f_BO(h_t) = b_t.    (7)

• Linear attention, where w_t is an m-dimensional vector and b_t is a scalar:

    e_t = f_L(h_t) = w_t^T h_t + b_t.    (8)

• Shared-parameter linear attention, where the m-dimensional vector w and scalar b are the same for all frames:

    e_t = f_SL(h_t) = w^T h_t + b.    (9)

• Non-linear attention, where W_t is an m′ × m matrix, and b_t and v_t are m′-dimensional vectors. The dimension m′ can be tuned on a development dataset:

    e_t = f_NL(h_t) = v_t^T tanh(W_t h_t + b_t).    (10)

• Shared-parameter non-linear attention, where the same W, b and v are used for all frames:

    e_t = f_SNL(h_t) = v^T tanh(W h_t + b).    (11)

In all of the above scoring functions, all parameters are trainable within the end-to-end architecture [7].
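As an illustration of Eqs. (4)-(6) combined with the shared-parameter non-linear scoring function of Eq. (11), here is a minimal NumPy sketch. The names and shapes (attentive_d_vector, W, b, v) are assumptions for illustration; in the real system these parameters are learned jointly with the LSTM.

```python
import numpy as np

def attentive_d_vector(h, W, b, v):
    """Basic attention layer with shared-parameter non-linear scoring (Eq. 11).

    h: (T, m)        LSTM outputs h_t for one utterance segment.
    W: (m_prime, m)  projection matrix shared across frames.
    b: (m_prime,)    bias shared across frames.
    v: (m_prime,)    scoring vector shared across frames.
    Returns the d-vector (m,) and the attention weights (T,).
    """
    # Eq. (11): one scalar score per frame.
    e = np.tanh(h @ W.T + b) @ v             # shape (T,)

    # Eq. (5): softmax over frames gives normalized weights alpha_t.
    e = e - e.max()                           # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # shape (T,), sums to 1

    # Eq. (6): d-vector is the attention-weighted average of LSTM outputs.
    omega = alpha @ h                         # shape (m,)
    return omega, alpha


# Toy usage: T = 80 frames and m = m_prime = 64, matching the setup in Sec. 4.
rng = np.random.default_rng(0)
h = rng.standard_normal((80, 64))
W, b, v = rng.standard_normal((64, 64)), np.zeros(64), rng.standard_normal(64)
omega, alpha = attentive_d_vector(h, W, b, v)
```

The other scoring functions of Eqs. (7)-(10) would only change the first line of the function body, e.g. a single shared dot product for f_SL or per-frame parameters for f_L and f_NL.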

3.3. Attention layer variants

Apart from the basic attention layer described in Sec. 3.1, here we introduce two variants: cross-layer attention and divided-layer attention.

Fig. 3: Two variants of the attention layer: (a) cross-layer attention; (b) divided-layer attention. (Diagram not reproduced.)

For cross-layer attention (Fig. 3a), the scores e_t and weights α_t are not computed from the outputs of the last LSTM layer {h_t}_{1≤t≤T}, but from the outputs of an intermediate LSTM layer {h′_t}_{1≤t≤T}, e.g. the second-to-last layer:

    e_t = f(h′_t).    (12)

However, the d-vector ω is still the weighted average of the last layer outputs h_t.

For divided-layer attention (Fig. 3b), we double the dimension of the last layer LSTM output h_t, and equally divide its dimension into two parts: part-a h_t^a and part-b h_t^b. We use part-a to build the d-vector, while using part-b to learn the scores:

    e_t = f(h_t^b),    (13)

    ω = Σ_{t=1}^{T} α_t h_t^a.    (14)
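Below is a minimal NumPy sketch of divided-layer attention (Eqs. (13)-(14)) under the same assumptions as the previous snippet: the last-layer output is split into two equal halves, part-b scores the frames, and part-a is averaged into the d-vector. Names and shapes are illustrative.

```python
import numpy as np

def divided_layer_d_vector(h_double, W, b, v):
    """Divided-layer attention (Eqs. 13-14).

    h_double: (T, 2m) last-layer LSTM outputs with doubled dimension.
    W, b, v:  shared-parameter non-linear scoring parameters, applied to part-b.
    """
    m = h_double.shape[1] // 2
    h_a, h_b = h_double[:, :m], h_double[:, m:]   # part-a and part-b

    # Eq. (13): scores are learned from part-b only.
    e = np.tanh(h_b @ W.T + b) @ v
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    # Eq. (14): the d-vector is built from part-a only.
    return alpha @ h_a
```

Cross-layer attention (Eq. (12)) differs only in where the scores come from: e_t is computed from an intermediate layer's outputs h′_t, while the weighted average is still taken over the last layer's h_t.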
3.4. Weights pooling

Another variation of the basic attention layer is that, instead of directly using the normalized weights α_t to average the LSTM outputs, we can optionally perform maxpooling on the attention weights. This additional pooling mechanism can potentially make our network more robust to temporal variations of the input signals. We have experimented with two maxpooling methods (Fig. 4); a minimal sketch of both follows the list below:

• Sliding window maxpooling: We run a sliding window on the weights, and for each window, only keep the largest value and set the other values to 0.

• Global top-K maxpooling: Only keep the largest K values in the weights, and set all other values to 0.

Fig. 4: Different pooling methods on attention weights. The t-th pixel corresponds to the weight α_t, and a brighter intensity means a larger value of the weight. (Diagram not reproduced; it illustrates no pooling, sliding window maxpooling, and global top-K maxpooling along the time axis.)
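Here is a minimal NumPy sketch of the two pooling methods applied to the normalized attention weights. The window size, step, and K follow the values reported in Sec. 4.4 (window 10, step 5, K = 5); the function names are illustrative, and whether the kept weights are renormalized afterwards is not specified in the paper, so no renormalization is applied here.

```python
import numpy as np

def sliding_window_maxpool(alpha, window=10, step=5):
    """Keep only the largest weight inside each sliding window; zero the rest."""
    pooled = np.zeros_like(alpha)
    for start in range(0, len(alpha), step):
        idx = np.arange(start, min(start + window, len(alpha)))
        if idx.size:
            keep = idx[np.argmax(alpha[idx])]
            pooled[keep] = alpha[keep]
    return pooled

def global_top_k_maxpool(alpha, k=5):
    """Keep only the K largest weights; set all other values to 0."""
    pooled = np.zeros_like(alpha)
    top = np.argsort(alpha)[-k:]
    pooled[top] = alpha[top]
    return pooled
```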

4. EXPERIMENTS

4.1. Datasets and basic setup

To fairly compare different attention techniques, we use the same training and testing datasets for all our experiments.

Our training dataset is a collection of anonymized user voice queries, which is a mixture of "OK Google" and "Hey Google". It has around 150M utterances from around 630K speakers. Our testing dataset is a manual collection consisting of 665 speakers. It is divided into two enrollment sets and two verification sets, one for each of "OK Google" and "Hey Google". The enrollment and evaluation sets contain, respectively, an average of 4.5 and 10 utterances per speaker.

We report the speaker verification Equal Error Rate (EER) on the four combinations of enrollment set and verification set.

Our baseline model is a 3-layer LSTM, where each layer has dimension 128, with a projection layer [15] of dimension 64. On top of the LSTM is a linear layer of dimension 64. The acoustic parametrization consists of 40-dimensional log-mel-filterbank coefficients computed over a window of 25ms with 15ms of overlap. The same acoustic features are used for both keyword detection [10] and speaker verification.

The keyword spotting system isolates segments of length T = 80 frames (800ms) that only contain the global password, and these segments form the tuples mentioned above. The two keywords are mixed together using the MultiReader technique introduced in [16].
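For reference, the reported metric can be estimated in a simple way (an assumed illustration, not the paper's evaluation code): sweep a decision threshold over genuine and impostor similarity scores and find the point where the false accept and false reject rates are equal.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        frr = np.mean(genuine_scores < th)    # false reject rate
        far = np.mean(impostor_scores >= th)  # false accept rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```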

4.2. Basic attention layer

First, we compare the baseline model with the basic attention layer (Sec. 3.1) using different scoring functions (Sec. 3.2). The results are shown in Table 1. As we can see, while bias-only and linear attention bring little improvement to the EER, non-linear attention improves the performance significantly, especially with shared parameters. (For the intermediate dimension of the non-linear scoring functions, we use m′ = 64, such that W_t and W are square matrices.)

Table 1: Evaluation EER(%): Non-attention baseline model vs. basic attention layer using different scoring functions.

Enroll → Verify             Baseline   f_BO    f_L     f_SL    f_NL    f_SNL
OK Google → OK Google       0.88       0.85    0.81    0.80    0.79    0.78
OK Google → Hey Google      2.77       2.97    2.74    2.75    2.69    2.66
Hey Google → OK Google      2.19       2.30    2.28    2.23    2.14    2.08
Hey Google → Hey Google     1.05       1.04    1.03    1.03    1.00    1.01
Average                     1.72       1.79    1.72    1.70    1.66    1.63

4.3. Variants

To compare the basic attention layer with the two variants (Sec. 3.3), we use the scoring function that performed best in the previous experiment: the shared-parameter non-linear scoring function f_SNL. From the results in Table 2, we can see that divided-layer attention performs slightly better than basic attention and cross-layer attention, at the cost that the dimension of the last LSTM layer is doubled. (In our experiments, for cross-layer attention, scores are learned from the second-to-last layer.)

Table 2: Evaluation EER(%): Basic attention layer vs. variants, all using f_SNL as the scoring function.

Enroll → Verify    Basic f_SNL   Cross-layer   Divided-layer
OK → OK            0.78          0.81          0.75
OK → Hey           2.66          2.61          2.44
Hey → OK           2.08          2.03          2.07
Hey → Hey          1.01          0.97          0.99
Average            1.63          1.61          1.56

4.4. Weights pooling

To compare the different pooling methods on the attention weights introduced in Sec. 3.4, we use divided-layer attention with the shared-parameter non-linear scoring function. For sliding window maxpooling, we experimented with different window sizes and steps, and found that a window size of 10 frames and a step of 5 frames perform best in our evaluations. Also, for global top-K maxpooling, we found that the performance is best when K = 5. The results are shown in Table 3. We can see that sliding window maxpooling further improves the EER.

Table 3: Evaluation EER(%): Different pooling methods for attention weights, all using f_SNL and divided-layer attention.

Enroll → Verify    No pooling   Sliding window   Top-K
OK → OK            0.75         0.72             0.72
OK → Hey           2.44         2.37             2.63
Hey → OK           2.07         1.88             1.99
Hey → Hey          0.99         0.95             0.94
Average            1.56         1.48             1.57

We also visualize the attention weights of a training batch for the different pooling methods in Fig. 5. An interesting observation is that, when there is no pooling, we can see a clear 4-strand or 3-strand pattern in the batch. This pattern corresponds to the 4-phoneme "O-kay-Goo-gle" or 3-phoneme "Hey-Goo-gle" structure of the keywords.

Fig. 5: Visualized attention weights for different pooling methods. In each image, the x-axis is time and the y-axis is for different utterances in a training batch. (a) No pooling; (b) Sliding window maxpooling, where the window size is 10 and the step is 5; (c) Global top-K maxpooling, where K = 5. (Images not reproduced.)

When we apply sliding window maxpooling or global top-K maxpooling, the attention weights are much larger near the end of the utterance, which is easy to understand: the LSTM has accumulated more information at the end of the utterance than at the beginning, and is thus more confident in producing the d-vector.

5. CONCLUSIONS

In this paper, we experimented with different attention mechanisms for our keyword-based text-dependent speaker verification system [7]. From our experimental results, the best practice is to: (1) use a shared-parameter non-linear scoring function; (2) use a divided-layer attention connection to the last layer output of the LSTM; and (3) apply sliding window maxpooling on the attention weights. After combining all these best practices, we improved the EER of our baseline LSTM model from 1.72% to 1.48%, which is a 14% relative improvement. The same attention mechanisms, especially those using shared-parameter scoring functions, could potentially be used to improve text-independent speaker verification models [16] and speaker diarization systems [17].

6. REFERENCES

[1] Yury Pinsky, "Tomato, tomahto. Google Home now supports multiple users," https://www.blog.google/products/assistant/tomato-tomahto-google-home-now-supports-multiple-users, 2017.

[2] Mihai Matei, "Voice Match will allow Google Home to recognize your voice," https://www.androidheadlines.com/2017/10/voice-match-will-allow-google-home-to-recognize-your-voice.html, 2017.

[3] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[4] Daniel Garcia-Romero and Carol Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Interspeech, 2011, pp. 249–252.

[5] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[6] Johan Rohdin, Anna Silnova, Mireia Diez, Oldrich Plchot, Pavel Matejka, and Lukas Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," arXiv preprint arXiv:1710.02369, 2017.

[7] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5115–5119.

[8] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[9] Guoguo Chen, Carolina Parada, and Georg Heigold, "Small-footprint keyword spotting using deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.

[10] Rohit Prabhavalkar, Raziel Alvarez, Carolina Parada, Preetum Nakkiran, and Tara N. Sainath, "Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4704–4708.

[11] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[12] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.

[13] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.

[14] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[15] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[16] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint arXiv:1710.10467, 2017.

[17] Quan Wang, Carlton Downey, Li Wan, Philip Mansfield, and Ignacio Lopez Moreno, "Speaker diarization with LSTM," arXiv preprint arXiv:1710.10468, 2017.
