03 Attention
Last Time: Variable-length computation graph with shared weights
[Figure: an RNN unrolled through time. The shared weights W are reused at every step: hidden states h0 → h1 → h2 → ... → hT are computed by fW from the inputs x1, x2, x3, ..., and each output y1, ..., yT has a loss L1, ..., LT.]
Image Captioning using spatial features
Input: Image I
Output: Sequence y = y1, y2, ..., yT
Encoder: h0 = fW(z)
- where z is spatial CNN features and fW(.) is an MLP
Decoder: yt = gV(yt-1, ht-1, c)
- where the context vector c is often c = h0
Image Captioning using spatial features
Problem: the input is "bottlenecked" through c
- The model needs to encode everything it wants to say within c.
- This is a problem if we want to generate really long descriptions, 100s of words long.
[Figure: decoder unrolled over y1 ... y4, generating "person wearing hat [END]".]
Image Captioning with RNNs and Attention
- Compute alignment scores (H x W scalars): et,i,j = fatt(ht-1, zi,j)
- Attention: normalize the scores with a softmax to get attention weights at,i,j
- Compute the context vector: ct = ∑i,j at,i,j zi,j
Decoder: yt = gV(yt-1, ht-1, ct)
- New context vector at every time step
- Each timestep of the decoder uses a different context vector that looks at different parts of the input image
This entire process is differentiable:
- the model chooses its own attention weights
- no attention supervision is required
[Figure: alignment scores et,i,j and attention weights at,i,j over the H x W grid at each step while generating "person wearing hat [END]".]
Image Captioning with Attention
- Soft attention
- Hard attention (requires reinforcement learning)
Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Attention we just saw in image captioning
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)
Operations:
- Alignment: ei,j = fatt(h, zi,j)
- Attention: a = softmax(e)
- Output: c = ∑i,j ai,j zi,j
Outputs:
- Context vector: c (shape: D)
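To make the shapes concrete, here is a minimal NumPy sketch of this attention step. It assumes, for simplicity, that fatt is a plain dot product (the slides use a small learned alignment function); the function name and sizes are illustrative only.

```python
import numpy as np

def spatial_attention(h, z):
    """One attention step over spatial CNN features (minimal sketch).

    h: query vector, shape (D,)  -- e.g. the previous decoder hidden state
    z: spatial features, shape (H, W, D)
    Returns the context vector c (D,) and the attention map (H, W).
    """
    H, W, D = z.shape
    flat = z.reshape(H * W, D)
    # Alignment scores e[i, j] = fatt(h, z[i, j]); here fatt is a dot product
    e = flat @ h                          # (H*W,)
    # Attention weights: softmax over all H*W locations
    a = np.exp(e - e.max())
    a = a / a.sum()
    # Context vector: attention-weighted sum of the features
    c = a @ flat                          # (D,)
    return c, a.reshape(H, W)

# Example: a 7x7 grid of 512-d features and a 512-d decoder state
c, attn = spatial_attention(np.random.randn(512), np.random.randn(7, 7, 512))
```

Each decoder timestep would call this with its current hidden state to get a fresh context vector ct.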
General attention layer
Inputs:
- Input vectors: x (shape: N x D)
- Query: h (shape: D)
Operations:
- Alignment: ei = fatt(h, xi)
- Attention: a = softmax(e)
- Output: c = ∑i ai xi
Outputs:
- Context vector: c (shape: D)
The attention operation is permutation invariant:
- Doesn't care about the ordering of the features
- Stretch H x W = N into N vectors
General attention layer
Change fatt(.) to a simple dot product:
- Alignment: ei = h · xi
- only works well with the key & value transformation trick (will mention in a few slides)
(Attention and output are unchanged: a = softmax(e), c = ∑i ai xi)
General attention layer
Change fatt(.) to a scaled simple dot product:
- Alignment: ei = h · xi / √D
- Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher: large-magnitude vectors produce much higher logits.
- So the post-softmax distribution has lower entropy (assuming the logits are IID), i.e. it collapses onto a few inputs. Dividing by √D keeps the logit variance roughly constant.
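A quick NumPy check of this argument (the sizes below are arbitrary): with unit-variance random vectors, the unscaled logits have variance roughly D, so their softmax collapses onto a few entries (low entropy) as D grows, while the scaled logits keep roughly unit variance and a well-spread distribution.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

rng = np.random.default_rng(0)
N = 16                                   # number of input vectors
for D in (32, 512):                      # feature dimension
    h = rng.standard_normal(D)           # query
    x = rng.standard_normal((N, D))      # inputs
    raw = x @ h                          # unscaled logits: variance grows with D
    scaled = raw / np.sqrt(D)            # scaled logits: variance stays ~1
    print(D, entropy(softmax(raw)), entropy(softmax(scaled)))
```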
General attention layer
Multiple query vectors:
- each query creates a new output context vector
Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x D)
Operations:
- Alignment: ei,j = qj · xi / √D
- Attention: a = softmax(e) (normalized over the inputs, separately for each query)
- Output: yj = ∑i ai,j xi
Outputs:
- Context vectors: y (one D-dimensional vector per query)
General attention layer
Notice that the input vectors are used for both the alignment and the attention calculations.
- We can add more expressivity to the layer by adding a different FC layer before each of the two steps:
- Key vectors: k = xWk
- Value vectors: v = xWv
General attention layer
Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x Dk)
Operations:
- Key vectors: k = xWk (shape: N x Dk)
- Value vectors: v = xWv (shape: N x Dv)
- Alignment: ei,j = qj · ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
Outputs:
- Context vectors: y (shape: M x Dv)
The input and output dimensions can now change depending on the key and value FC layers.
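A minimal NumPy sketch of this layer. The weight names Wk and Wv follow the slide; the helper names are illustrative, and the softmax direction (over the N inputs, separately per query) is spelled out explicitly.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def general_attention(x, q, Wk, Wv):
    """General attention layer (minimal sketch).

    x:  input vectors, shape (N, D)
    q:  query vectors, shape (M, Dk)
    Wk: key projection,   shape (D, Dk)
    Wv: value projection, shape (D, Dv)
    Returns context vectors y, shape (M, Dv).
    """
    k = x @ Wk                              # keys,   (N, Dk)
    v = x @ Wv                              # values, (N, Dv)
    e = q @ k.T / np.sqrt(k.shape[1])       # alignment scores, (M, N)
    a = softmax(e, axis=1)                  # attention over the N inputs, per query
    return a @ v                            # weighted sum of values, (M, Dv)
```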
General attention layer
Recap of the captioning encoder: h0 = fW(z), where z is spatial CNN features and fW(.) is an MLP.
[Figure: the CNN features z0,0 ... z2,2 feed the attention layer as the input vectors, while h0 provides the query.]
Self-attention layer
- No input query vectors anymore: the queries are computed from the input vectors themselves.
Inputs:
- Input vectors: x (shape: N x D)
Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj · ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
Outputs:
- Context vectors: y (shape: Dv)
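The same layer with queries derived from the inputs, as a minimal NumPy sketch (the weight names follow the slide; everything else is illustrative):

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Self-attention layer (minimal sketch).

    x: input vectors, shape (N, D)
    Wq, Wk: projections to the query/key dimension, shape (D, Dk)
    Wv:     projection to the value dimension,      shape (D, Dv)
    Returns one output vector per input, shape (N, Dv).
    """
    q = x @ Wq                              # queries come from the inputs themselves
    k = x @ Wk
    v = x @ Wv
    e = q @ k.T / np.sqrt(k.shape[1])       # (N, N) alignment scores
    a = softmax(e, axis=1)                  # each row: attention over all inputs
    return a @ v                            # (N, Dv)
```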
Self-attention layer: attends over sets of inputs
- One output vector yi is produced for each input vector xi, using the operations above.
- Self-attention is permutation equivariant: permuting the inputs (x0, x1, x2) permutes the outputs (y0, y1, y2) in exactly the same way.
- Problem: How can we encode ordered sequences like language or spatially ordered image features?
Positional encoding
Concatenate/add a special positional encoding pj to each input vector xj.
We use a function pos: N → Rd to map the position j of the vector to a d-dimensional vector, so pj = pos(j).
Desiderata of pos(.):
1. It should output a unique encoding for each time-step (word's position in a sentence).
2. The distance between any two time-steps should be consistent across sentences with different lengths.
3. Our model should generalize to longer sentences without any effort. Its values should be bounded.
4. It must be deterministic.
Positional encoding
Options for pos(.):
1. Learn a lookup table:
○ Learn parameters to use for pos(t) for t ∈ [0, T)
○ The lookup table contains T x d parameters.
Vaswani et al, “Attention is all you need”, NeurIPS 2017
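A minimal sketch of option 1, the learned lookup table, here added to the inputs (the slide above allows either concatenating or adding). The table is randomly initialized purely for illustration; in a real model it is a trainable parameter updated by backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 100, 64                                    # max sequence length, feature dim
pos_table = 0.02 * rng.standard_normal((T, d))    # lookup table with T x d parameters

def add_positional_encoding(x):
    """Add a (learned) positional encoding to a sequence of input vectors.

    x: shape (N, d) with N <= T; row j receives pj = pos_table[j].
    Here the table is random; in a real model it is trained end to end.
    """
    N = x.shape[0]
    return x + pos_table[:N]
```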
Masked self-attention layer
- Prevent vectors from looking at future vectors.
- Manually set those alignment scores to -∞, so the corresponding attention weights become 0 after the softmax.
Inputs:
- Input vectors: x (shape: N x D)
Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj · ki / √D, with ei,j = -∞ wherever input i comes after query position j
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
Outputs:
- Context vectors: y (shape: Dv)
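A minimal NumPy sketch of the masked version; the only change from plain self-attention is the upper-triangular mask that sets future alignment scores to -∞ before the softmax.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    """Masked (causal) self-attention (minimal sketch).

    Position j may only attend to positions i <= j; alignment scores for
    future positions are set to -inf so their attention weights are exactly 0.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    N = x.shape[0]
    e = q @ k.T / np.sqrt(k.shape[1])                    # rows: queries, cols: keys
    future = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where key comes after query
    e = np.where(future, -np.inf, e)
    a = softmax(e, axis=1)
    return a @ v
```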
Multi-head self-attention layer
- Multiple self-attention heads in parallel
- Split each input vector x0, x1, x2 across the heads, run self-attention independently within each head, then concatenate the head outputs to form y0, y1, y2.
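A minimal NumPy sketch of the split / per-head attention / concatenate pattern. Real implementations typically also apply a final output projection after the concatenation; that is omitted here for brevity.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, num_heads):
    """Multi-head self-attention (minimal sketch, no output projection).

    x: (N, D); Wq, Wk, Wv: (D, D). The projected queries/keys/values are split
    into num_heads chunks of size D // num_heads, attention runs independently
    in each head, and the head outputs are concatenated back to shape (N, D).
    """
    N, D = x.shape
    Dh = D // num_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv                     # each (N, D)
    split = lambda t: t.reshape(N, num_heads, Dh).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)               # each (num_heads, N, Dh)
    e = q @ k.transpose(0, 2, 1) / np.sqrt(Dh)           # (num_heads, N, N)
    a = softmax(e, axis=-1)                              # attention within each head
    y = a @ v                                            # (num_heads, N, Dh)
    return y.transpose(1, 0, 2).reshape(N, D)            # concatenate the heads
```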
General attention versus self-attention
- General attention layer: the keys k0, k1, k2, values v0, v1, v2, and queries q0, q1, q2 are provided as separate inputs to the layer.
- Self-attention layer: all of them are computed from a single set of input vectors x0, x1, x2.
Example: CNN with Self-Attention
- Input image → CNN → features of shape C x H x W
- Three 1x1 convolutions map the features to queries, keys, and values, each of shape C' x H x W
- Transpose and multiply the queries with the keys, then normalize with a softmax, to get attention weights of shape (H x W) x (H x W)
- Multiply the attention weights with the values (C' x H x W), then apply one more 1x1 conv
- A residual connection adds the result back to the input features; together this forms the self-attention module
Zhang et al, “Self-Attention Generative Adversarial Networks”, ICML 2018. Slide credit: Justin Johnson
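A minimal NumPy sketch in the spirit of this module (not the exact SAGAN implementation, which also learns a gating scalar on the residual branch). Since a 1x1 convolution is just a per-location linear map over channels, it is written as a matrix multiply.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def cnn_self_attention(feat, Wq, Wk, Wv, Wo):
    """Self-attention over CNN features (minimal sketch, SAGAN-style).

    feat: CNN features, shape (C, H, W)
    Wq, Wk, Wv: 1x1-conv weights, shape (C, C')
    Wo:         1x1-conv weights mapping back, shape (C', C)
    Returns features of shape (C, H, W), with a residual connection.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W).T                         # (H*W, C): one vector per location
    q, k, v = x @ Wq, x @ Wk, x @ Wv                     # each (H*W, C')
    a = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=1)   # (H*W) x (H*W) attention weights
    out = (a @ v) @ Wo                                   # attend over values, 1x1 conv back to C
    return feat + out.T.reshape(C, H, W)                 # residual connection
```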
Comparing RNNs to Transformers
RNNs:
(+) LSTMs work reasonably well for long sequences.
(-) Expect an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.
Transformers:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Require a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).
“ImageNet Moment for Natural Language Processing”
Pretraining:
- Download a lot of text from the internet
Finetuning:
- Fine-tune the Transformer on your own NLP task
Image Captioning using Transformers
Input: Image I
Output: Sequence y = y1, y2, ..., yT
Encoder: c = TW(z)
- where z is spatial CNN features and TW(.) is the transformer encoder
Decoder: yt = TD(y0:t-1, c)
- where TD(.) is the transformer decoder
The Transformer encoder block
- Made up of N encoder blocks.
- Let's dive into one encoder block: the input vectors (with positional encoding added) pass through multi-head self-attention, a residual connection, layer norm (applied over each vector individually), an MLP, another residual connection, and another layer norm.
Transformer Encoder Block:
- Inputs: Set of vectors x
- Outputs: Set of vectors y
- Layer norm and MLP operate independently per vector.
- Highly scalable, highly parallelizable, but high memory usage.
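A minimal single-head NumPy sketch of one encoder block and the stack of N blocks. Layer norm here has no learnable scale/shift, and multi-head attention is reduced to one head to keep it short; the parameter names are illustrative.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each vector (row) independently, as on the slide.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=1)
    return a @ v

def encoder_block(x, p):
    """One Transformer encoder block (single-head sketch).

    x: (N, D) set of input vectors, positional encoding already added.
    p: dict with D x D projections Wq, Wk, Wv and MLP weights W1 (D x Dff),
       W2 (Dff x D), so every residual addition keeps shape (N, D).
    """
    # Self-attention + residual connection + layer norm
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    # MLP applied independently per vector + residual connection + layer norm
    h = np.maximum(0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + h)

def encoder(x, blocks):
    # The full encoder stacks N such blocks ("xN" on the slide).
    for p in blocks:
        x = encoder_block(x, p)
    return x
```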
The Transformer decoder block
- Made up of N decoder blocks.
- The decoder consumes the previous tokens y0, y1, y2, y3 ([START] person wearing hat) together with the encoder's context vectors c0,0 ... c2,2, and a final FC layer predicts the next tokens (person wearing hat [END]).
- Let's dive into the transformer decoder block.
Vaswani et al, “Attention is all you need”, NeurIPS 2017
The Transformer decoder block
- Most of the network is the same as the transformer encoder.
- The first stage is masked multi-head self-attention over the previous outputs, followed by a residual connection and layer norm.
The Transformer decoder block
- The multi-head attention block attends over the transformer encoder outputs: its keys k and values v come from the encoder outputs c, while its queries q come from the decoder.
- For image captioning, this is how we inject image features into the decoder.
Transformer Decoder Block:
- Inputs: a set of vectors y (the outputs so far) and a set of encoder context vectors c. Outputs: a set of vectors y.
- Masked self-attention only interacts with past inputs.
- The multi-head attention block is NOT self-attention. It attends over the encoder outputs.
- Highly scalable, highly parallelizable, but high memory usage.
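A minimal single-head NumPy sketch of one decoder block: masked self-attention over the outputs so far, cross-attention whose keys/values come from the encoder outputs c, then a per-vector MLP, each wrapped with a residual connection and layer norm. All parameter names are illustrative, and all projections keep dimension D so the residual additions line up.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(q_in, kv_in, Wq, Wk, Wv, mask=None):
    # Scaled dot-product attention: queries from q_in, keys/values from kv_in.
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    e = q @ k.T / np.sqrt(k.shape[1])
    if mask is not None:
        e = np.where(mask, -np.inf, e)
    return softmax(e, axis=1) @ v

def decoder_block(y, c, p):
    """One Transformer decoder block (single-head sketch).

    y: (T, D) decoder inputs (shifted output tokens + positional encoding)
    c: (N, D) encoder outputs (for captioning: the encoded image features)
    p: dict of D x D projections and MLP weights W1 (D x Dff), W2 (Dff x D).
    """
    T = y.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # hide future positions
    # 1. Masked self-attention over previous outputs + residual + layer norm
    y = layer_norm(y + attention(y, y, p["Wq1"], p["Wk1"], p["Wv1"], mask=future))
    # 2. Cross-attention: queries from the decoder, keys/values from the encoder outputs
    y = layer_norm(y + attention(y, c, p["Wq2"], p["Wk2"], p["Wv2"]))
    # 3. MLP applied per vector + residual + layer norm
    h = np.maximum(0, y @ p["W1"]) @ p["W2"]
    return layer_norm(y + h)
```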
Image Captioning using transformers
- No recurrence at all
- Perhaps we don't need convolutions at all?
Image Captioning using ONLY transformers
- Transformers from pixels to language
Dosovitskiy et al, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, arXiv 2020
Colab link to an implementation of vision transformers
Vision Transformers vs. ResNets
Vision Transformers
ConvNets strike back!
Summary
- Adding attention to RNNs allows them to "attend" to different parts of the input at every time step
- The general attention layer is a new type of layer that can be used to design new neural network architectures
- Transformers are a type of layer that uses self-attention and layer norm.
○ They are highly scalable and highly parallelizable
○ Faster training, larger models, better performance across vision and language tasks
○ They are quickly replacing RNNs, LSTMs, and may(?) even replace convolutions.
Next time: Video Understanding