
Deep learning

13.2. Attention Mechanisms

François Fleuret
https://fleuret.org/dlc/
The most classical version of attention is a context attention with a dot product as the attention function, as used by Vaswani et al. (2017) for their transformer models. We will come back to them.

Using the terminology of Graves et al. (2014), attention is an averaging of values associated with keys that match a query. Hence the keys used for computing attention and the values to average are different quantities.




Given a query sequence Q ∈ ℝ^{T×D}, a key sequence K ∈ ℝ^{T′×D}, and a value sequence V ∈ ℝ^{T′×D′}, compute an attention matrix A ∈ ℝ^{T×T′} by matching the Qs to the Ks, and weight V with it to get the result sequence Y ∈ ℝ^{T×D′}:

∀i, A_i = softmax(K Q_i / √D)
    Y_i = V⊤ A_i,

or, in matrix form,

A = softmax_row(Q K⊤ / √D)
Y = A V.

The queries and keys have the same dimension D, and there are as many keys T′ as there are values. The result Y has as many rows T as there are queries, and they are of the same dimension D′ as the values.
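As a concrete illustration, here is a minimal sketch of this computation in PyTorch (my own example, not part of the slides), with single sequences rather than batches:

import math
import torch

def attention(Q, K, V):
    # Q: (T, D), K: (T', D), V: (T', D'); returns Y: (T, D')
    A = torch.softmax(Q @ K.t() / math.sqrt(Q.size(1)), dim = 1)   # (T, T'), rows sum to 1
    return A @ V

Q, K, V = torch.randn(5, 8), torch.randn(7, 8), torch.randn(7, 3)
print(attention(Q, K, V).size())   # torch.Size([5, 3])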


[Tensors are depicted here transposed for ease of representation.]

[Figure: step-by-step illustration of the computation A_i = softmax(K Q_i / √D) followed by Y_i = V⊤ A_i, showing the tensors Q, K, V and the result Y.]


[Diagram: standard attention — the transposed product of Q and K goes through a row-wise softmax to give A, which is then multiplied with V to produce Y.]

A = softmax_row(Q K⊤ / √D)
Y = A V.

Standard attention


It may be useful to mask the attention matrix, for instance in the case of
self-attention, for computational reasons, or to make the model causal for
auto-regression.

[Figure: attention matrices with queries as rows and keys as columns; masked entries are set to 0.]

Full attention: no masking.
Local attention: |i − j| > ∆ ⇒ A_{i,j} = 0.
Causal attention: j > i ⇒ A_{i,j} = 0.
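These masks can be implemented, for instance, by setting the masked scores to −∞ before the softmax. A possible sketch in PyTorch (illustrative, not the implementation used later in this lecture):

import torch

T, delta = 6, 2
scores = torch.randn(T, T)                          # Q K^T / sqrt(D), queries as rows, keys as columns

i = torch.arange(T)[:, None]
j = torch.arange(T)[None, :]

local_mask  = (i - j).abs() > delta                 # |i - j| > delta  =>  A[i, j] = 0
causal_mask = j > i                                 #      j > i      =>  A[i, j] = 0

A_local  = scores.masked_fill(local_mask,  float('-inf')).softmax(dim = 1)
A_causal = scores.masked_fill(causal_mask, float('-inf')).softmax(dim = 1)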


Attention layers



A standard attention layer takes as input two sequences X and X′ and computes the tensors K, V, and Q as per-row linear functions:

Q = X W_Q
K = X′ W_K
V = X′ W_V
A = softmax_row(Q K⊤ / √D)
Y = A V

[Figure: the attention layer as a block diagram — Q is computed from X; K and V are computed from X (left) or from a second sequence X′ (right); A combines Q and K, and Y is the output.]

When X = X′, this is self attention, otherwise it is cross attention.

Multi-head attention combines several such operations in parallel, and Y is the concatenation of the results along the feature dimension, to which one more linear transformation is applied.
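As an illustration, such a layer could be written as follows in PyTorch (a single-head sketch with my own naming, not the implementation used later in this lecture):

import math
import torch
from torch import nn

class AttentionLayer(nn.Module):
    def __init__(self, dim_x, dim_xp, dim_qk, dim_v):
        super().__init__()
        self.w_q = nn.Linear(dim_x, dim_qk, bias = False)
        self.w_k = nn.Linear(dim_xp, dim_qk, bias = False)
        self.w_v = nn.Linear(dim_xp, dim_v, bias = False)

    def forward(self, x, xp = None):
        xp = x if xp is None else xp                 # self attention when X' is not provided
        Q, K, V = self.w_q(x), self.w_k(xp), self.w_v(xp)
        A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim = -1)
        return A @ V                                 # (N, T, dim_v)

layer = AttentionLayer(64, 64, 64, 64)
y = layer(torch.randn(2, 10, 64))                          # self attention
y = layer(torch.randn(2, 10, 64), torch.randn(2, 7, 64))   # cross attention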



Given a permutation σ and a 2d tensor X, let us use the following notation for the permutation of the rows: σ(X)_i = X_{σ(i)}.

The standard attention operation is invariant to a permutation of the keys and values:

Y(Q, σ(K), σ(V)) = Y(Q, K, V),

and equivariant to a permutation of the queries, that is, the resulting tensor is permuted similarly:

Y(σ(Q), K, V) = σ(Y(Q, K, V)).

Consequently self attention and cross attention are equivariant to permutations of X, and cross attention is invariant to permutations of X′.
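These properties are easy to check numerically; here is a self-contained sketch (my own, not from the slides):

import math
import torch

def attention(Q, K, V):
    A = torch.softmax(Q @ K.t() / math.sqrt(Q.size(1)), dim = 1)
    return A @ V

Q, K, V = torch.randn(5, 8), torch.randn(7, 8), torch.randn(7, 3)

sigma = torch.randperm(7)                                  # permutation of the keys and values
print(torch.allclose(attention(Q, K[sigma], V[sigma]), attention(Q, K, V)))   # True (invariance)

tau = torch.randperm(5)                                    # permutation of the queries
print(torch.allclose(attention(Q[tau], K, V), attention(Q, K, V)[tau]))       # True (equivariance)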


To illustrate the behavior of such an attention layer, we consider a toy sequence-to-sequence problem with sequences composed of two triangular and two rectangular patterns.

The target averages the heights in each pair of shapes.

[Figure: an example input sequence and the corresponding target.]
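The exact data generator is not shown in the slides; a hypothetical sketch of such a toy problem, with made-up widths, heights and positions, could look like:

import torch

def toy_problem(nb, seq_len = 100, width = 10, max_height = 25):
    # Hypothetical generator, for illustration only. Each sequence contains two
    # triangular and two rectangular bumps at random non-overlapping positions;
    # the target replaces the height of each pair of same-shape bumps by their average.
    ramp = torch.linspace(0, 1, width)
    profiles = [1 - (ramp - 0.5).abs() * 2, torch.ones(width)]   # triangle, rectangle
    input, targets = torch.zeros(nb, seq_len), torch.zeros(nb, seq_len)
    for n in range(nb):
        starts = torch.randperm(seq_len // (2 * width))[:4] * 2 * width
        shapes = torch.tensor([0, 0, 1, 1])[torch.randperm(4)]
        heights = 1 + torch.rand(4) * (max_height - 1)
        for k in range(4):
            s = starts[k].item()
            pair_mean = heights[shapes == shapes[k]].mean()
            input[n, s:s+width] = heights[k] * profiles[shapes[k].item()]
            targets[n, s:s+width] = pair_mean * profiles[shapes[k].item()]
    return input, targets

train_input, train_targets = toy_problem(1000)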


Some training examples.

[Figure: four training examples, each showing an input sequence and its target.]


We first test a 1d convolutional network, with no attention mechanism.

Sequential(
(0): Conv1d(1, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): Conv1d(64, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(3): ReLU()
(4): Conv1d(64, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(5): ReLU()
(6): Conv1d(64, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(7): ReLU()
(8): Conv1d(64, 1, kernel_size=(5,), stride=(1,), padding=(2,))
)

nb_parameters 62337



Training is done with the MSE loss and Adam.

batch_size = 100

optimizer = torch.optim.Adam(model.parameters(), lr = 1e-3)

mse_loss = nn.MSELoss()

mu, std = train_input.mean(), train_input.std()

for e in range(args.nb_epochs):
    for input, targets in zip(train_input.split(batch_size),
                              train_targets.split(batch_size)):
        output = model((input - mu) / std)
        loss = mse_loss(output, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


[Figure: training MSE as a function of the number of epochs (log scale) for the model without attention.]


[Figure: four test examples showing the input sequence and the output of the convolutional model without attention.]


The poor performance of this model is not surprising given its inability to
transport information from “far away” in the signal. Using more layers, global
channel averaging, or fully connected layers could possibly solve the problem.

However it is more natural to equip the model with the ability to combine
information from parts of the signal that it actively identifies as relevant.

This is exactly what an attention layer would do.



We implement our own self attention layer with tensors of size N × C × T so that the products by W_Q, W_K, and W_V can be implemented as convolutions.

To compute Q K⊤ and A V we need a batch matrix product, which is provided by torch.matmul().


>>> a = torch.rand(11, 9, 2, 3)
>>> b = torch.rand(11, 9, 3, 4)
>>> m = a.matmul(b)
>>> m.size()
torch.Size([11, 9, 2, 4])
>>>
>>> m[7, 1]
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
[0.4966, 0.5515, 0.4631, 0.6616]])
>>> a[7, 1].mm(b[7, 1])
tensor([[0.8839, 1.0253, 0.7473, 1.1397],
[0.4966, 0.5515, 0.4631, 0.6616]])
>>>
>>> m[3, 0]
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
[0.6259, 0.5570, 1.1012, 1.2319]])
>>> a[3, 0].mm(b[3, 0])
tensor([[0.6906, 0.7657, 0.9310, 0.7547],
[0.6259, 0.5570, 1.1012, 1.2319]])





class SelfAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim, key_dim):
        super().__init__()
        self.conv_Q = nn.Conv1d(in_dim, key_dim, kernel_size = 1, bias = False)
        self.conv_K = nn.Conv1d(in_dim, key_dim, kernel_size = 1, bias = False)
        self.conv_V = nn.Conv1d(in_dim, out_dim, kernel_size = 1, bias = False)

    def forward(self, x):
        Q = self.conv_Q(x)                                   # (N, key_dim, T)
        K = self.conv_K(x)                                   # (N, key_dim, T)
        V = self.conv_V(x)                                   # (N, out_dim, T)
        A = Q.transpose(1, 2).matmul(K).softmax(2)           # (N, T, T), softmax over the keys
        y = A.matmul(V.transpose(1, 2)).transpose(1, 2)      # (N, out_dim, T)
        return y

Note that for simplicity it is single-head attention, and the 1/√D scaling is missing.

The computation of the attention matrix A and the layer's output Y could also be expressed somewhat more clearly with Einstein summations (see lecture 1.5. "High dimension tensors") as

A = torch.einsum('nct,ncs->nts', Q, K).softmax(2)
y = torch.einsum('nts,ncs->nct', A, V)
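A variant of the same layer that adds the missing 1/√D scaling and an optional causal mask could look like this (my own sketch, not the model trained in these slides):

import math
import torch
from torch import nn

class ScaledSelfAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim, key_dim, causal = False):
        super().__init__()
        self.conv_Q = nn.Conv1d(in_dim, key_dim, kernel_size = 1, bias = False)
        self.conv_K = nn.Conv1d(in_dim, key_dim, kernel_size = 1, bias = False)
        self.conv_V = nn.Conv1d(in_dim, out_dim, kernel_size = 1, bias = False)
        self.causal = causal

    def forward(self, x):
        Q, K, V = self.conv_Q(x), self.conv_K(x), self.conv_V(x)
        S = Q.transpose(1, 2).matmul(K) / math.sqrt(Q.size(1))           # (N, T, T) scores
        if self.causal:
            t = torch.arange(x.size(2), device = x.device)
            S = S.masked_fill(t[None, :] > t[:, None], float('-inf'))    # j > i => masked
        A = S.softmax(2)
        return A.matmul(V.transpose(1, 2)).transpose(1, 2)               # (N, out_dim, T)

y = ScaledSelfAttentionLayer(64, 64, 64, causal = True)(torch.randn(2, 64, 100))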


Sequential(
(0): Conv1d(1, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(1): ReLU()
(2): Conv1d(64, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(3): ReLU()
(4): SelfAttentionLayer(in_dim=64, out_dim=64, key_dim=64)
(5): Conv1d(64, 64, kernel_size=(5,), stride=(1,), padding=(2,))
(6): ReLU()
(7): Conv1d(64, 1, kernel_size=(5,), stride=(1,), padding=(2,))
)

nb_parameters 54081



[Figure: training MSE as a function of the number of epochs (log scale) for the models without and with attention.]


[Figure: four test examples showing the input sequence and the output of the model with attention.]


[Figure: for three test examples, the attention matrix of the self-attention layer, shown with the corresponding input and output sequences.]


⚠ Because it is invariant to a permutation of the keys and values, such an attention layer disregards the absolute location of the values.

Our toy problem does not require taking the positioning in the tensor into account. We can modify it with a target where the pairs to average are the two rightmost and the two leftmost shapes.

[Figure: an example input sequence and the corresponding target for the modified task.]


Some training examples.

[Figure: four training examples for the modified task, each showing an input sequence and its target.]


[Figure: training MSE as a function of the number of epochs (log scale) for the model with attention and no positional encoding, on the modified task.]


[Figure: four test examples on the modified task, showing the input sequence and the output of the model with attention and no positional encoding.]


The poor performance of this model is not surprising given its inability to take
into account positions in the attention layer.

We can fix this by providing the model with a positional encoding.

>>> len = 20
>>> c = math.ceil(math.log(len) / math.log(2.0))
>>> o = 2**torch.arange(c).unsqueeze(1)
>>> pe = (torch.arange(len).unsqueeze(0).div(o, rounding_mode = 'floor')) % 2
>>> pe
tensor([[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
[0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

Such a tensor can simply be channel-concatenated to the input batch:

>>> pe = pe[None].float()
>>> input = torch.cat((input, pe.expand(input.size(0), -1, -1)), 1)

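Wrapped into a small helper function (my own sketch; the name is made up), this positional encoding can be appended to any N × C × T batch:

import math
import torch

def with_positional_encoding(input):
    # input: (N, C, T); returns (N, C + ceil(log2(T)), T) with binary position channels appended.
    T = input.size(2)
    c = math.ceil(math.log(T) / math.log(2.0))
    o = 2 ** torch.arange(c).unsqueeze(1)
    pe = (torch.arange(T).unsqueeze(0).div(o, rounding_mode = 'floor')) % 2
    pe = pe[None].float().to(input.device)
    return torch.cat((input, pe.expand(input.size(0), -1, -1)), 1)

print(with_positional_encoding(torch.randn(4, 1, 100)).size())   # torch.Size([4, 8, 100])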


[Figure: training MSE as a function of the number of epochs (log scale), with attention, with and without positional encoding.]


[Figure: four test examples on the modified task, showing the input sequence and the output of the model with attention and positional encoding.]


[Figure: for three test examples on the modified task, the attention matrix of the self-attention layer of the model with positional encoding, shown with the corresponding input and output sequences.]


The End
References

A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
