Lecture 6 Smaller Network: RNN: One X at a Time, Re-Use the Same Edge Weights
This is our fully connected network. If the inputs are x1, ..., xn and n is very large and growing,
this network becomes too large. Instead, we will feed in one xi at a time
and re-use the same edge weights.
Recurrent Neural Network
How does RNN reduce complexity?
(Diagram: the RNN unrolled in time, h0 → f → h1 → f → h2 → f → h3 → …, taking input xt and producing output yt at each step.)
No matter how long the input/output sequence is, we only need
one function f. If the f's were different, it would become a
feedforward NN. This may be treated as another form of compression
relative to the fully connected network.
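To make the weight re-use concrete, here is a minimal sketch (not from the lecture; the tanh cell and the weight shapes are illustrative assumptions) of a single function f, with one shared weight set, applied at every time step:

```python
import numpy as np

def init_params(input_dim, hidden_dim, seed=0):
    rng = np.random.default_rng(seed)
    return {
        "Wi": rng.normal(scale=0.1, size=(hidden_dim, input_dim)),   # input weights
        "Wh": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),  # recurrent weights
        "b":  np.zeros(hidden_dim),
    }

def f(params, h, x):
    # One step: the same weights are re-used at every time step.
    return np.tanh(params["Wh"] @ h + params["Wi"] @ x + params["b"])

def run_rnn(params, xs, hidden_dim):
    h = np.zeros(hidden_dim)          # h0
    hs = []
    for x in xs:                      # feed one x_t at a time
        h = f(params, h, x)           # same f, same weights, every step
        hs.append(h)
    return hs

# Usage: a sequence of 5 inputs of dimension 3, hidden size 4.
params = init_params(input_dim=3, hidden_dim=4)
xs = [np.ones(3) for _ in range(5)]
hs = run_rnn(params, xs, hidden_dim=4)
```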
Deep RNN: h’, y = f1(h, x); g’, z = f2(g, y)
(Diagram: a deep RNN unrolled in time. The first layer h0 → f1 → h1 → f1 → h2 → f1 → h3 → … reads x1, x2, x3, … and produces y1, y2, y3, …, which feed the second layer g0 → f2 → g1 → f2 → g2 → f2 → g3 → … producing z1, z2, z3, ….)
Bidirectional RNN: h’, y = f1(h, x); g’, z = f2(g, x); p = f3(y, z)
(Diagram: a forward chain h0 → f1 → h1 → f1 → h2 → f1 → h3 over x1, x2, x3 produces y1, y2, y3; a backward chain g0 → f2 → g1 → f2 → g2 → f2 → g3 reads the same inputs in the reverse direction and produces z1, z2, z3; f3 combines yt and zt into pt at each position.)
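A minimal sketch of the bidirectional idea, assuming simple tanh cells and a linear f3 (all names, shapes, and the choice of combiner are illustrative, not from the lecture):

```python
import numpy as np

def step(W, h, x):
    # One recurrent step: new hidden state from the previous state and the input.
    return np.tanh(W @ np.concatenate([h, x]))

def bidirectional_rnn(xs, W_fwd, W_bwd, W_out, hidden_dim):
    h = np.zeros(hidden_dim)
    ys = []
    for x in xs:                        # forward chain (f1): y_t
        h = step(W_fwd, h, x)
        ys.append(h)
    g = np.zeros(hidden_dim)
    zs = []
    for x in reversed(xs):              # backward chain (f2): z_t
        g = step(W_bwd, g, x)
        zs.append(g)
    zs.reverse()
    # p_t = f3(y_t, z_t): here simply a linear map of the concatenation.
    return [W_out @ np.concatenate([y, z]) for y, z in zip(ys, zs)]

# Usage: input dim 3, hidden dim 4, output dim 2.
rng = np.random.default_rng(0)
W_fwd = rng.normal(scale=0.1, size=(4, 7))   # acts on [h; x]
W_bwd = rng.normal(scale=0.1, size=(4, 7))
W_out = rng.normal(scale=0.1, size=(2, 8))   # acts on [y; z]
xs = [np.ones(3) for _ in range(5)]
ps = bidirectional_rnn(xs, W_fwd, W_bwd, W_out, hidden_dim=4)
```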
Pyramid RNN: reducing the number of time steps significantly speeds up training.
(Diagram: stacked bidirectional RNN layers forming a pyramid.)
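One common way to reduce the number of time steps, sketched below under the assumption that consecutive hidden states are concatenated between layers (the slide does not spell out the mechanism):

```python
import numpy as np

def halve_time_steps(hs):
    """hs: list of hidden-state vectors from one layer; returns a list half as long."""
    if len(hs) % 2 == 1:                 # pad with zeros if the length is odd
        hs = hs + [np.zeros_like(hs[0])]
    # Concatenate each pair of consecutive states into one input for the next layer.
    return [np.concatenate([hs[i], hs[i + 1]]) for i in range(0, len(hs), 2)]

# Usage: 8 hidden states of dimension 4 become 4 inputs of dimension 8.
hs = [np.ones(4) for _ in range(8)]
next_layer_inputs = halve_time_steps(hs)
```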
A single RNN cell f maps (h, x) to (h’, y):
h’ = σ(Wh h + Wi x)
y = softmax(Wo h’)
Note: y is computed from h’.
(Diagram: the LSTM cell; the cell state Ct-1 → Ct flows along the top, controlled by the forget gate and the input gate.)
The core idea is this cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.
Why sigmoid or tanh? Sigmoid outputs in (0, 1) and acts as a gating switch. The vanishing gradient problem is handled already in LSTM, so is it ok for ReLU to replace tanh?
The input gate decides which components are to be updated; C’t provides the change contents.
(Diagram: Naïve RNN vs LSTM. The naïve RNN maps (ht-1, xt) to (ht, yt); the LSTM additionally carries a cell state, mapping (ct-1, ht-1, xt) to (ct, ht, yt).)
The four gate signals are computed from ht-1 and xt:
zf = σ(Wf [ht-1, xt])    controls the forget gate
zi = σ(Wi [ht-1, xt])    controls the input gate
z = tanh(W [ht-1, xt])    the updating information
zo = σ(Wo [ht-1, xt])    controls the output gate
"Peephole" variant: ct-1 is appended to the gate inputs as well, with diagonal weight matrices; zo, zf, zi are obtained in the same way.
Information flow of LSTM (⊙ denotes element-wise multiplication):
ct = zf ⊙ ct-1 + zi ⊙ z
ht = zo ⊙ tanh(ct)
yt = σ(W’ ht)
(Diagram: the same LSTM cell unrolled over consecutive time steps t and t+1, with ct and ht passed from one step to the next.)
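A minimal sketch of one LSTM step following the equations above; the concatenated [ht-1, xt] gate input, the weight shapes, and the parameter names are assumptions for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(params, c_prev, h_prev, x):
    hx = np.concatenate([h_prev, x])      # all gates read [h_{t-1}, x_t]
    zf = sigmoid(params["Wf"] @ hx)       # forget gate
    zi = sigmoid(params["Wi"] @ hx)       # input gate
    zo = sigmoid(params["Wo"] @ hx)       # output gate
    z = np.tanh(params["W"] @ hx)         # updating information
    c = zf * c_prev + zi * z              # c_t = zf ⊙ c_{t-1} + zi ⊙ z
    h = zo * np.tanh(c)                   # h_t = zo ⊙ tanh(c_t)
    y = sigmoid(params["Wy"] @ h)         # y_t = σ(W' h_t)
    return c, h, y

# Usage: cell/hidden dim 4, input dim 3, output dim 2.
rng = np.random.default_rng(0)
params = {name: rng.normal(scale=0.1, size=(4, 7)) for name in ("Wf", "Wi", "Wo", "W")}
params["Wy"] = rng.normal(scale=0.1, size=(2, 4))
c, h = np.zeros(4), np.zeros(4)
for x in [np.ones(3)] * 5:
    c, h, y = lstm_step(params, c, h, x)
```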
Feedforward network: x → f1 → a1 → f2 → a2 → f3 → a3 → f4 → y, where t is the layer index:
at = ft(at-1) = σ(Wt at-1 + bt)
RNN: h0 → f → h1 → f → h2 → f → h3 → f, then g produces y4, over inputs x1, x2, x3, x4, where t is the time step.
(Diagram: a gated recurrent cell with reset gate r, update gate z, and candidate h’, computed from at-1 and xt.)
Turning this recurrent cell into a feedforward layer:
No input xt at each step.
No output yt at each step.
at-1 is the output of the (t-1)-th layer; at is the output of the t-th layer.
No reset gate.
Highway Network:
h’ = σ(W at-1)
z = σ(W’ at-1)
at = z ⊙ at-1 + (1 - z) ⊙ h’
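A minimal sketch of a highway layer following the update rule above (W and W’ are assumed square so that at-1 and h’ have the same dimension):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_layer(a_prev, W, W_gate):
    h = sigmoid(W @ a_prev)              # h' = σ(W a_{t-1})
    z = sigmoid(W_gate @ a_prev)         # z  = σ(W' a_{t-1})
    return z * a_prev + (1.0 - z) * h    # a_t = z ⊙ a_{t-1} + (1 - z) ⊙ h'

# Usage: stacking several highway layers of width 4.
rng = np.random.default_rng(0)
a = np.ones(4)
for _ in range(3):
    W = rng.normal(scale=0.1, size=(4, 4))
    W_gate = rng.normal(scale=0.1, size=(4, 4))
    a = highway_layer(a, W, W_gate)
```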
Grid LSTM
(Diagram: a standard LSTM maps (c, h) to (c’, h’) given input x, moving along the time dimension; a Grid LSTM maps both (c, h) to (c’, h’) and (a, b) to (a’, b’), adding memory along the depth dimension, with the same gate structure zf, zi, z, zo inside.)
You can generalize this to 3D, and more.
Applications of LSTM / RNN
Neural machine translation with LSTM
Sequence to sequence chat model
Chat with context
M: Hi
M: Hello
U: Hi
M: Hi
Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models."
Baidu’s speech recognition using RNN
Attention
Image caption generation using attention
(From CY Lee lecture)
z0 is an initial parameter; it is also learned.
(Diagram: a CNN produces a vector for each image region; z0 is matched against each region vector, giving a score such as 0.7; the normalized scores, e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0, are used to form a weighted sum of the region vectors.)
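A minimal sketch of this attention step, assuming a dot-product match function and softmax normalization (both are assumptions; the slide only shows per-region scores and a weighted sum):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(z, regions):
    """z: query vector of shape (dim,); regions: region vectors of shape (num_regions, dim)."""
    scores = regions @ z               # one match score per region (dot product assumed)
    alpha = softmax(scores)            # normalized weights, e.g. [0.0, 0.8, 0.2, ...]
    context = alpha @ regions          # weighted sum of the region vectors
    return context, alpha

# Usage: 6 region vectors of dimension 4 from a CNN, query z0.
regions = np.random.default_rng(0).normal(size=(6, 4))
z0 = np.zeros(4)
context, alpha = attend(z0, regions)
```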