Recurrent Neural Networks
Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 10, May 2, 2019

Last Time: CNN Architectures
- AlexNet
- GoogLeNet
- ResNet
- SENet
Comparing complexity...
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
Efficient networks...
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [Howard et al. 2017]
Meta-learning: Learning to learn network architectures...
Learning Transferable Architectures for Scalable Image Recognition [Zoph et al. 2017]
- Applying neural architecture search (NAS) to a large dataset like ImageNet is expensive
- Design a search space of building blocks (“cells”) that can be flexibly stacked
- NASNet: Use NAS to find the best cell structure on the smaller CIFAR-10 dataset, then transfer the architecture to ImageNet
- Many follow-up works in this space, e.g. AmoebaNet (Real et al. 2019) and ENAS (Pham, Guan et al. 2018)
Today: Recurrent Neural Networks
“Vanilla” Neural Network
Recurrent Neural Networks: Process Sequences
[Figure: one-to-one, one-to-many, many-to-one, and many-to-many input/output patterns, e.g. image captioning (one-to-many), sentiment classification (many-to-one), machine translation and per-frame video classification (many-to-many)]
Sequential Processing of Non-Sequence Data
Classify images by taking a series of “glimpses”:
Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015
Generate images one piece at a time!
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Recurrent Neural Network

Key idea: RNNs have an “internal state” that is updated as a sequence is processed. We can process a sequence of vectors x by applying a recurrence formula at every time step:

    h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W. The same function and the same set of parameters are used at every time step.

(Vanilla) Recurrent Neural Network: the state is a single “hidden” vector h, with

    h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y_t = W_hy h_t

Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman.
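A minimal sketch of this step in NumPy (illustrative only, not the course's reference code; the weight names mirror the formulas above and the sizes are made up):

import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    # New state from old state and current input, squashed by tanh
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)
    # Output read out from the new state
    y = Why @ h + by
    return h, y

# Toy sizes: 3-dim input, 5-dim hidden state, 4-dim output
rng = np.random.default_rng(0)
Wxh = 0.1 * rng.standard_normal((5, 3))
Whh = 0.1 * rng.standard_normal((5, 5))
Why = 0.1 * rng.standard_normal((4, 5))
h, y = rnn_step(rng.standard_normal(3), np.zeros(5), Wxh, Whh, Why, np.zeros(5), np.zeros(4))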
RNN: Computational Graph
[Figure: the recurrence unrolled over time: an initial state h0 and inputs x1, x2, x3, ..., with the same f_W and the same weight matrix W re-used at every step; each step produces a hidden state h_t and an output y_t, and during training a per-step loss L_t is computed from each y_t and summed into the total loss L.]
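A sketch of what the unrolled graph computes in a many-to-many setup, assuming integer class targets and a softmax cross-entropy loss at every step (all names here are illustrative):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def unrolled_forward(xs, targets, h0, Wxh, Whh, Why, bh, by):
    # The SAME weights are reused at every time step
    h, total_loss = h0, 0.0
    for x, t in zip(xs, targets):
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # hidden state h_t
        p = softmax(Why @ h + by)             # output distribution at step t
        total_loss += -np.log(p[t])           # per-step loss L_t, summed into L
    return total_loss, h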
Sequence to Sequence: Many-to-one + One-to-many
[Figure: an encoder RNN with weights W_1 reads the input sequence x_1 ... x_T and summarizes it in a single hidden vector; a decoder RNN with weights W_2 then unrolls from that vector to produce the output sequence.]
Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014
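A rough sketch of the idea with separate encoder and decoder weights (all names are illustrative, not Sutskever et al.'s implementation):

import numpy as np

def encode(xs, Wxh1, Whh1, b1):
    # Many-to-one: compress the whole input sequence into one vector
    h = np.zeros(Whh1.shape[0])
    for x in xs:
        h = np.tanh(Wxh1 @ x + Whh1 @ h + b1)
    return h

def decode(h_enc, steps, Whh2, Why2, b2, by2):
    # One-to-many: unroll a second RNN from the encoder's summary vector
    # (simplified: no feedback of the previous output in this sketch)
    h, ys = h_enc, []
    for _ in range(steps):
        h = np.tanh(Whh2 @ h + b2)
        ys.append(Why2 @ h + by2)
    return ys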
Example: Character-level Language Model
Vocabulary: [h, e, l, o]
Example training sequence: “hello”
[Figure: each character is fed in as a one-hot vector; at every step the model outputs scores for the next character, and training pushes up the score of the correct next character.]
Example: Character-level Language Model Sampling
[Figure: at test time the output scores are pushed through a softmax to give a probability distribution over {h, e, l, o}; a character is sampled from that distribution and fed back in as the next input.]
Vocabulary: [h, e, l, o]
At test time, sample characters one at a time and feed each sampled character back to the model as the next input.
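A sampling loop in the spirit of min-char-rnn (a sketch, not a copy; the weights and the seed character index are placeholders):

import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by):
    vocab_size = Why.shape[0]
    x = np.zeros(vocab_size)
    x[seed_ix] = 1                                # one-hot encoding of the seed character
    sampled = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        scores = Why @ h + by
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # softmax over the vocabulary
        ix = np.random.choice(vocab_size, p=p)    # sample one character
        x = np.zeros(vocab_size)
        x[ix] = 1                                 # feed the sampled character back in
        sampled.append(ix)
    return sampled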
min-char-rnn (https://gist.github.com/karpathy/d4dee566867f8291f086)
[Figure: training loss curve, and text sampled from an RNN trained on Shakespeare; with more training (“train more”) the samples become increasingly plausible, e.g. “VIOLA: I'll drink it.”]
Searching for interpretable cells
[Figure: the activation of individual hidden cells visualized over input text; some cells track interpretable features, e.g. a quote/comment cell and an if-statement cell.]
Karpathy, Johnson, and Fei-Fei, “Visualizing and Understanding Recurrent Networks”, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission
Image Captioning
[Figure: a test image is passed through a CNN; the resulting feature vector v conditions an RNN language model that generates the caption one word at a time.]

The image information enters the recurrence as an extra additive term:

    before: h = tanh(Wxh * x + Whh * h)
    now:    h = tanh(Wxh * x + Whh * h + Wih * v)

Generation starts from a special <START> token x0: the first output distribution y0 is sampled to get the first word (e.g. “straw”), that word is fed back in as the next input, the next distribution y1 is sampled (e.g. “hat”), and so on through hidden states h0, h1, h2, ..., until the model samples an <END> token, which finishes the caption.
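A sketch of the conditioned step (the weight names follow the slide; everything else, including where v comes from, is illustrative):

import numpy as np

def caption_step(x, h_prev, v, Wxh, Whh, Wih, b):
    # before: h = tanh(Wxh @ x + Whh @ h_prev)
    # now:    the image feature vector v adds one extra term at every step
    return np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v + b)

At generation time this step would be called inside the same sample-and-feed-back loop as the character-level model above, stopping when <END> is sampled.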
Image Captioning: Example Results
- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track

Image Captioning: Failure Cases
- A bird is perched on a tree branch
- A man in a baseball uniform throwing a ball
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk
Image Captioning with Attention
Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

[Figure: the CNN turns the H x W x 3 image into an L x D grid of features (one D-dimensional vector per spatial location). From h0 the model computes an attention distribution a1 over the L locations; the attention-weighted combination of features gives a D-dimensional vector z1; z1 and the first word y1 feed the next step h1, which outputs the next attention distribution a2 and a distribution d1 over words, and so on.]
Soft attention: take a weighted combination over all image locations; this is differentiable, so the model trains with standard backpropagation (see the sketch below).
Hard attention: attend to a single sampled location at each step; the discrete choice is not differentiable, so training needs something like reinforcement learning.
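A minimal sketch of the soft-attention computation (the score function and the names here are simplified and illustrative relative to Xu et al.):

import numpy as np

def soft_attention(features, h, Wa):
    # features: (L, D) grid of CNN features; h: current RNN hidden state
    scores = features @ (Wa @ h)        # one scalar score per location
    a = np.exp(scores - scores.max())
    a /= a.sum()                        # attention distribution over the L locations
    z = a @ features                    # weighted combination of features, shape (D,)
    return a, z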
Multilayer RNNs
[Figure: hidden states arranged on a grid over depth and time; the same stacking applies to LSTMs.]
Vanilla RNN Gradient Flow
[Figure: a single vanilla RNN cell, where h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to produce h_t, followed by an unrolled chain h0 → h1 → h2 → h3 → h4.]

Backpropagation from h_t to h_{t-1} multiplies by W^T (more precisely, by W_hh^T). Computing the gradient of h0 therefore involves many repeated factors of W (and repeated tanh): if the largest singular value of W_hh is greater than 1 the gradients tend to explode, and if it is less than 1 they tend to vanish. Exploding gradients can be controlled with gradient clipping; for vanishing gradients we change the RNN architecture (see the LSTM below).
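A small numerical illustration of this effect (made-up numbers, not from the lecture): repeatedly multiplying a gradient by W_hh^T blows it up or shrinks it toward zero depending on the largest singular value.

import numpy as np

rng = np.random.default_rng(0)
upstream = rng.standard_normal(64)          # gradient arriving at the last time step

for scale in (1.1, 0.9):                    # largest singular value of Whh
    Whh = scale * np.eye(64)                # toy weight matrix with known singular values
    g = upstream.copy()
    for _ in range(50):                     # backprop through 50 time steps
        g = Whh.T @ g                       # each step multiplies by Whh^T (tanh ignored)
    print(f"largest singular value {scale}: gradient norm after 50 steps = {np.linalg.norm(g):.4g}")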
Long Short Term Memory (LSTM)
[Hochreiter and Schmidhuber, 1997]

From the stacked vector (h_{t-1}; x_t) of size 2h, a single weight matrix W of shape 4h x 2h produces four h-dimensional gate vectors:

    i = sigmoid(...)   input gate: whether to write to the cell
    f = sigmoid(...)   forget gate: whether to erase the cell
    o = sigmoid(...)   output gate: how much to reveal the cell
    g = tanh(...)      “gate gate”: how much to write to the cell

    c_t = f ⊙ c_{t-1} + i ⊙ g
    h_t = o ⊙ tanh(c_t)

[Figure: the LSTM cell diagram, with c_{t-1} flowing through an elementwise multiply by f and an elementwise add of i ⊙ g to give c_t, which then produces h_t.]
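A sketch of a single LSTM step following these definitions (a toy NumPy version, assuming x has the same dimension h as the hidden state so that W is 4h x 2h):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x, h_prev, c_prev, W, b):
    hdim = h_prev.shape[0]
    gates = W @ np.concatenate([h_prev, x]) + b   # shape (4h,): chunks for i, f, o, g
    i = sigmoid(gates[0*hdim:1*hdim])             # input gate: write to the cell?
    f = sigmoid(gates[1*hdim:2*hdim])             # forget gate: erase the cell?
    o = sigmoid(gates[2*hdim:3*hdim])             # output gate: reveal the cell?
    g = np.tanh(gates[3*hdim:4*hdim])             # candidate values to write
    c = f * c_prev + i * g                        # cell update: elementwise ops only
    h = o * np.tanh(c)                            # hidden state read out from the cell
    return h, c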
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter and Schmidhuber, 1997]

Backpropagation from c_t to c_{t-1} involves only an elementwise multiplication by f; there is no matrix multiply by W on this path.

[Figure: chaining LSTM cells gives an uninterrupted gradient path along the cell states c0 → c1 → c2 → c3 → ...]
Uninterrupted gradient flow!
Similar to ResNet!
[Figure: the ResNet stack (Input, 7x7 conv 64 / 2, Pool, many 3x3 conv blocks, ..., Pool, FC 1000, Softmax) with identity skip connections: the additive shortcuts give the same kind of uninterrupted gradient flow through depth that the LSTM cell state gives through time.]

In between: Highway Networks, which gate between a transformed input and the identity:

    g = T(x, W_T)
    y = g ⊙ H(x, W_H) + (1 - g) ⊙ x

Srivastava et al, “Highway Networks”, ICML DL Workshop 2015
Other RNN Variants
GRU [Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”, 2014]
[Jozefowicz et al., “An Empirical Exploration of Recurrent Network Architectures”, 2015]
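For reference, a sketch of one GRU step (two gates and no separate cell state; gate conventions vary between papers, and all names here are illustrative):

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gru_step(x, h_prev, Wr, Wz, Wh):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ hx)                                      # reset gate
    z = sigmoid(Wz @ hx)                                      # update gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # interpolating (additive) update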
Recently in Natural Language Processing…
New paradigms for reasoning over sequences
[“Attention Is All You Need”, Vaswani et al., 2017]
- The new “Transformer” architecture no longer processes inputs sequentially; instead it operates over all positions of a sequence in parallel through an attention mechanism, sketched below.
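The core operation can be sketched as scaled dot-product attention, where every position attends to every other position at once (a generic illustration, not the full Transformer of Vaswani et al.):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) query, key, and value matrices for the whole sequence
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (seq_len, seq_len) pairwise scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over key positions
    return w @ V                                       # each output mixes all values in parallel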
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in an RNN can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research, as well as new paradigms for reasoning over sequences
- Better understanding (both theoretical and empirical) is needed
Next time: Midterm!