ch10 Sequence Modelling: Recurrent and Recursive Nets
Jen-Tzung Chien
Outline
• Unfolding Computational Graphs
• Recurrent Neural Networks
• Bidirectional RNNs
• Encoder-Decoder Sequence-to-Sequence Architectures
• Deep Recurrent Networks
• Recursive Neural Networks
• The Challenge of Long-Term Dependencies
• Echo State Networks
• Leaky Units and Other Strategies for Multiple Time Scales
• The Long Short-Term Memory and Other Gated RNNs
• Optimization for Long-Term Dependencies
• Explicit Memory
Unfolding Computational Graphs
• An unfolded flow graph is a flow graph that has a repetitive structure, typically corresponding to a chain of events
• Classical form of a dynamical system: $s^{(t)} = f(s^{(t-1)}; \theta)$
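As a minimal sketch of this unfolding, the snippet below applies the same function $f$ with the same parameters $\theta$ at every step, turning the recursive definition into a chain of states; the choice of $f$, the state size, and the random parameters are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of unfolding the recurrence s(t) = f(s(t-1); theta).
# f, theta, and the state size are illustrative choices, not from the slides.

def f(s, theta):
    # one step of a simple dynamical system (tanh keeps the state bounded)
    return np.tanh(theta @ s)

theta = np.random.randn(3, 3) * 0.1   # shared parameters, reused at every step
s = np.zeros(3)                        # initial state s(0)

# Unfolding: applying the same function with the same parameters tau times
# turns the recursion into a chain s(1), s(2), ..., s(tau).
states = []
for t in range(5):
    s = f(s, theta)
    states.append(s)
```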
• Recurrent networks with recurrent connections between hidden units can read an entire sequence and then produce a single output
Recurrent Neural Networks
• Total loss is the sum of the losses over all the time steps
$$L\big(\{x^{(1)}, \ldots, x^{(\tau)}\}, \{y^{(1)}, \ldots, y^{(\tau)}\}\big) = \sum_t L^{(t)} = -\sum_t \log p_{\text{model}}\big(y^{(t)} \mid \{x^{(1)}, \ldots, x^{(t)}\}\big)$$
• The network with recurrent connections only from the output at one time step
to the hidden units at the next time step is less powerful because it lacks
hidden-to-hidden recurrent connections
• Models that have recurrent connections from their outputs leading back into
the model may be trained with teacher forcing
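A hedged sketch of teacher forcing for such an output-to-hidden model: during training the ground-truth $y^{(t-1)}$ is fed back into the network, while at test time the model's own previous output is fed back. The weight names, sizes, and random data below are illustrative assumptions, not the slide's specification.

```python
import numpy as np

# Teacher forcing vs. free-running generation for an output-to-hidden RNN.
# Names, shapes, and data are illustrative assumptions.

rng = np.random.default_rng(3)
n_h, n_x, n_y, tau = 4, 3, 5, 6
W = rng.normal(0, 0.1, (n_h, n_y))   # output-to-hidden recurrence
U = rng.normal(0, 0.1, (n_h, n_x))   # input-to-hidden
V = rng.normal(0, 0.1, (n_y, n_h))   # hidden-to-output
xs = rng.normal(size=(tau, n_x))
ys = np.eye(n_y)[rng.integers(0, n_y, size=tau)]   # one-hot targets

def step(prev_out, x):
    h = np.tanh(W @ prev_out + U @ x)   # hidden state sees the previous *output*
    o = V @ h
    p = np.exp(o - o.max())
    return p / p.sum()

# Training time (teacher forcing): condition on the true y(t-1)
prev = np.zeros(n_y)
for t in range(tau):
    p = step(prev, xs[t])               # the loss -log p[y(t)] would be computed here
    prev = ys[t]                        # feed back the ground truth

# Test time (free running): condition on the model's own prediction
prev = np.zeros(n_y)
for t in range(tau):
    p = step(prev, xs[t])
    prev = np.eye(n_y)[p.argmax()]      # feed back the model's prediction
```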
• Back-propagation through time (BPTT) for the RNN with softmax outputs $\hat{y}^{(t)}$:
$$\frac{\partial L}{\partial L^{(t)}} = 1$$
$$(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \frac{\partial L}{\partial L^{(t)}}\,\frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{i,y^{(t)}}$$
$$\nabla_{h^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L$$
• Iterating backwards from $t = \tau - 1$ down to $t = 1$:
$$\nabla_{h^{(t)}} L = \Big(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\Big)^{\!\top} (\nabla_{h^{(t+1)}} L) + \Big(\frac{\partial o^{(t)}}{\partial h^{(t)}}\Big)^{\!\top} (\nabla_{o^{(t)}} L) = W^\top \mathrm{diag}\big(1 - (h^{(t+1)})^2\big)(\nabla_{h^{(t+1)}} L) + V^\top (\nabla_{o^{(t)}} L)$$
$$\nabla_c L = \sum_t \Big(\frac{\partial o^{(t)}}{\partial c}\Big)^{\!\top} \nabla_{o^{(t)}} L = \sum_t \nabla_{o^{(t)}} L$$
$$\nabla_b L = \sum_t \Big(\frac{\partial h^{(t)}}{\partial b^{(t)}}\Big)^{\!\top} \nabla_{h^{(t)}} L = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^2\big)\, \nabla_{h^{(t)}} L$$
$$\nabla_V L = \sum_t \sum_i \Big(\frac{\partial L}{\partial o_i^{(t)}}\Big) \nabla_V\, o_i^{(t)} = \sum_t (\nabla_{o^{(t)}} L)\, h^{(t)\top}$$
$$\nabla_W L = \sum_t \sum_i \Big(\frac{\partial L}{\partial h_i^{(t)}}\Big) \nabla_{W^{(t)}}\, h_i^{(t)} = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^2\big)(\nabla_{h^{(t)}} L)\, h^{(t-1)\top}$$
$$\nabla_U L = \sum_t \sum_i \Big(\frac{\partial L}{\partial h_i^{(t)}}\Big) \nabla_{U^{(t)}}\, h_i^{(t)} = \sum_t \mathrm{diag}\big(1 - (h^{(t)})^2\big)(\nabla_{h^{(t)}} L)\, x^{(t)\top}$$
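The BPTT equations on the last two slides can be checked with a short NumPy sketch; the shapes, random parameters, and data below are illustrative assumptions.

```python
import numpy as np

# BPTT for the vanilla RNN of the previous slides:
#   h(t) = tanh(b + W h(t-1) + U x(t)),  o(t) = c + V h(t),  yhat(t) = softmax(o(t))

rng = np.random.default_rng(0)
n_h, n_x, n_y, tau = 4, 3, 5, 6
W = rng.normal(0, 0.1, (n_h, n_h))
U = rng.normal(0, 0.1, (n_h, n_x))
V = rng.normal(0, 0.1, (n_y, n_h))
b, c = np.zeros(n_h), np.zeros(n_y)
xs = rng.normal(size=(tau, n_x))                 # input sequence x(1..tau)
ys = rng.integers(0, n_y, size=tau)              # target classes y(1..tau)

# Forward pass, storing h(t) and softmax outputs yhat(t)
h = np.zeros((tau + 1, n_h))                     # h[0] is the initial state h(0)
yhat = np.zeros((tau, n_y))
for t in range(tau):
    h[t + 1] = np.tanh(b + W @ h[t] + U @ xs[t])
    o = c + V @ h[t + 1]
    e = np.exp(o - o.max())
    yhat[t] = e / e.sum()

# Backward pass (BPTT), following the slide equations
grad_o = yhat.copy()
grad_o[np.arange(tau), ys] -= 1                  # (grad o(t))_i = yhat_i - 1_{i,y(t)}
grad_h = np.zeros((tau, n_h))
grad_h[tau - 1] = V.T @ grad_o[tau - 1]          # grad h(tau) = V^T grad o(tau)
for t in range(tau - 2, -1, -1):                 # iterate backwards for t < tau
    grad_h[t] = W.T @ ((1 - h[t + 2] ** 2) * grad_h[t + 1]) + V.T @ grad_o[t]

grad_c = grad_o.sum(axis=0)
grad_b = ((1 - h[1:] ** 2) * grad_h).sum(axis=0)
grad_V = grad_o.T @ h[1:]                        # sum_t grad_o(t) h(t)^T
grad_W = ((1 - h[1:] ** 2) * grad_h).T @ h[:-1]  # sum_t diag(1-h(t)^2) grad_h(t) h(t-1)^T
grad_U = ((1 - h[1:] ** 2) * grad_h).T @ xs      # sum_t diag(1-h(t)^2) grad_h(t) x(t)^T
```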
• An RNN can model a sequence $\mathbb{Y} = \{y^{(1)}, \ldots, y^{(\tau)}\}$ by decomposing its joint probability with the chain rule:
$$P(\mathbb{Y}) = P(y^{(1)}, \ldots, y^{(\tau)}) = \prod_{t=1}^{\tau} P(y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)})$$
$$L = \sum_t L^{(t)}$$
Bidirectional RNNs
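As a rough sketch of the construction, a bidirectional RNN runs one recurrence forward and one backward over the input and combines the two hidden states at each step, so every output can depend on the whole sequence; the weight names and sizes below are illustrative assumptions.

```python
import numpy as np

# Bidirectional RNN layer: one recurrence runs forward in time, another runs
# backward, and the two hidden states are concatenated per time step.

rng = np.random.default_rng(2)
n_h, n_x, tau = 4, 3, 6
Wf, Uf = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x))
Wb, Ub = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x))
xs = rng.normal(size=(tau, n_x))

h_fwd, h_bwd = np.zeros((tau, n_h)), np.zeros((tau, n_h))
h = np.zeros(n_h)
for t in range(tau):                     # forward pass over x(1..tau)
    h = np.tanh(Wf @ h + Uf @ xs[t])
    h_fwd[t] = h
g = np.zeros(n_h)
for t in reversed(range(tau)):           # backward pass over x(tau..1)
    g = np.tanh(Wb @ g + Ub @ xs[t])
    h_bwd[t] = g

# Each output at time t can then depend on the whole sequence:
bidir_states = np.concatenate([h_fwd, h_bwd], axis=1)   # shape (tau, 2*n_h)
```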
Encoder-Decoder Sequence-to-Sequence Architectures
• An encoder-decoder or sequence-to-sequence architecture has two parts:
1. an encoder or reader or input RNN processes the input sequence and emits a fixed-length context vector, usually a simple function of its final hidden state
2. a decoder or writer or output RNN is conditioned on that fixed-length vector to generate the output sequence
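A hedged sketch of the two pieces, assuming the encoder's final hidden state is used directly as the context and the decoder generates greedily; the weight names, sizes, and decoding loop are illustrative assumptions rather than the slide's specification.

```python
import numpy as np

# Encoder-decoder (sequence-to-sequence) pair of RNNs, minimal sketch.

rng = np.random.default_rng(1)
n_h, n_x, n_y = 8, 5, 7
enc_W, enc_U = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x))
dec_W, dec_U = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_y))
dec_V = rng.normal(0, 0.1, (n_y, n_h))

def encode(xs):
    # The encoder (reader) RNN consumes the whole input sequence and
    # emits a fixed-length context, here simply its final hidden state.
    h = np.zeros(n_h)
    for x in xs:
        h = np.tanh(enc_W @ h + enc_U @ x)
    return h

def decode(context, steps):
    # The decoder (writer) RNN is initialized from the context and
    # generates the output sequence one symbol at a time (greedy here).
    h, y = context, np.zeros(n_y)       # all-zero y acts as a start token
    outputs = []
    for _ in range(steps):
        h = np.tanh(dec_W @ h + dec_U @ y)
        logits = dec_V @ h
        y = np.eye(n_y)[logits.argmax()]   # feed the previous output back in
        outputs.append(int(logits.argmax()))
    return outputs

print(decode(encode(rng.normal(size=(4, n_x))), steps=3))
```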
Deep Recurrent Networks
• The lengthened shortest path between different time steps in a deep recurrent network can be mitigated by introducing skip connections
Recursive Neural Networks
The Challenge of Long-Term Dependencies
• The basic problem is that gradients propagated over many stages tend to
either vanish or explode
• Even if we assume that the parameters are such that the recurrent network is
stable, the difficulty with long-term dependencies arises from the exponentially
smaller weights given to long-term interactions compared to short-term ones
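A tiny numerical illustration of why this happens, assuming the simplest possible linear recurrence: composing the same map many times scales a vector by powers of the eigenvalues of the recurrent matrix, so contributions across many time steps shrink or grow exponentially. The matrices and step count are illustrative.

```python
import numpy as np

# Repeated multiplication by the same matrix W: eigenvalues below 1 make the
# long-term contribution vanish, eigenvalues above 1 make it explode.

for spectral_radius in (0.9, 1.1):
    W = np.diag([spectral_radius] * 3)   # simplest case: eigenvalues on the diagonal
    v = np.ones(3)
    for _ in range(100):                 # 100 "time steps" of repeated multiplication
        v = W @ v
    print(spectral_radius, np.linalg.norm(v))
# 0.9 -> ~4.6e-05 (vanishes), 1.1 -> ~2.4e+04 (explodes)
```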
• Very deep feed-forward networks, which do not reuse the same weights at every stage, can thus avoid the vanishing and exploding gradient problem with carefully chosen scaling
Echo State Networks
• One approach to avoiding this difficulty is to set the recurrent weights such
that the recurrent hidden units can capture the history of past inputs, and
learn only the output weights
• This can be done by viewing the recurrent net as a dynamical system and setting the input and recurrent weights so that the dynamical system is near the edge of stability (see the sketch after this list)
• The strategy of echo state networks is simply to fix the weights to have some
spectral radius such as 3, where information is carried forward through time
but does not explode due to the stabilizing effect of saturating non-linearities
like tanh
• The techniques used to set the weights in ESNs could be used to initialize the
weights in a recurrent network, helping to learn long-term dependencies
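A minimal echo-state sketch of the strategy above, assuming a tanh reservoir whose recurrent matrix is rescaled to a chosen spectral radius and a ridge-regression readout; the reservoir size, toy task, and solver are illustrative assumptions.

```python
import numpy as np

# Echo state network sketch: input and recurrent weights are fixed (recurrent
# matrix rescaled to a chosen spectral radius), only the linear readout is learned.

rng = np.random.default_rng(4)
n_res, n_in, T = 50, 1, 300
W = rng.normal(size=(n_res, n_res))
W *= 3.0 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius such as 3, as on the slide
U = rng.normal(size=(n_res, n_in))

xs = np.sin(0.1 * np.arange(T))[:, None]          # toy input sequence
targets = np.roll(xs[:, 0], -1)                   # toy task: predict the next input

# Run the fixed reservoir; saturating tanh keeps the states bounded
h, states = np.zeros(n_res), np.zeros((T, n_res))
for t in range(T):
    h = np.tanh(W @ h + U @ xs[t])
    states[t] = h

# Learn only the output weights, here with ridge-regularized least squares
ridge = 1e-6 * np.eye(n_res)
W_out = np.linalg.solve(states.T @ states + ridge, states.T @ targets)
pred = states @ W_out
```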
Leaky Units and Other Strategies for Multiple Time Scales
• One way to obtain coarse time scales is to add direct connections from variables
in the distant past to variables in the present
• There are two basic strategies for setting the time constants used by leaky
units
1. manually fix them to values that remain constant
2. make the time constants free parameters and learn them
• Another approach is the idea of organizing the state of the RNN at multiple
time-scales, with information flowing more easily through long distances at the
slower time scales
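A hedged sketch of a leaky unit as a running average with a linear self-connection, where the coefficient α sets the time constant and can either be fixed by hand or learned (the two strategies above); the signal and the α values are illustrative assumptions.

```python
import numpy as np

# Leaky unit: mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t).
# alpha close to 1 gives a long time constant (slow time scale).

rng = np.random.default_rng(6)
signal = rng.normal(size=200)

def leaky_trace(v, alpha):
    mu, trace = 0.0, []
    for v_t in v:
        mu = alpha * mu + (1.0 - alpha) * v_t
        trace.append(mu)
    return np.array(trace)

fast = leaky_trace(signal, alpha=0.5)    # short memory of the recent past
slow = leaky_trace(signal, alpha=0.99)   # information persists across many steps
```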
The Long Short-Term Memory and Other Gated RNNs
• The most effective sequence models used in practical applications are called
gated RNNs, including the long short-term memory and networks based on
the gated recurrent unit
• Like leaky units, gated RNNs are based on the idea of creating paths through
time that have derivatives that neither vanish nor explode
• Unlike leaky units, gated RNNs generalize this to connection weights that may change at each time step
• The clever idea of introducing self-loops to produce paths where the gradient
can flow for long durations is a core contribution of the initial long short-term
memory (LSTM) model
• Each cell has the same inputs and outputs as a vanilla recurrent network, but
has more parameters and a system of gating units that controls the flow of
information
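A hedged sketch of a single LSTM cell in the standard formulation, with forget, input, and output gates around a self-looping cell state; the exact variant, weight names, sizes, and the forget-gate bias of 1 are illustrative assumptions rather than the slide's specification.

```python
import numpy as np

# One LSTM cell step: gated self-loop on the cell state c.

rng = np.random.default_rng(8)
n_h, n_x = 4, 3
def init(shape): return rng.normal(0, 0.1, shape)
Wf, Uf, bf = init((n_h, n_h)), init((n_h, n_x)), np.ones(n_h)    # forget gate (bias near 1 is a common choice)
Wi, Ui, bi = init((n_h, n_h)), init((n_h, n_x)), np.zeros(n_h)   # input gate
Wo, Uo, bo = init((n_h, n_h)), init((n_h, n_x)), np.zeros(n_h)   # output gate
Wc, Uc, bc = init((n_h, n_h)), init((n_h, n_x)), np.zeros(n_h)   # cell candidate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h, c, x):
    f = sigmoid(bf + Uf @ x + Wf @ h)                 # how much of the old cell state to keep
    i = sigmoid(bi + Ui @ x + Wi @ h)                 # how much of the new candidate to write
    o = sigmoid(bo + Uo @ x + Wo @ h)                 # how much of the cell state to expose
    c = f * c + i * np.tanh(bc + Uc @ x + Wc @ h)     # gated self-loop on the cell state
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(6, n_x)):
    h, c = lstm_step(h, c, x)
```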
• The main difference with the LSTM is that a single gating unit simultaneously
controls the forgetting factor and the decision to update the state unit
• Update equations (a code sketch follows below):
$$h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big(1 - u_i^{(t-1)}\big)\, \sigma\Big(b_i + \sum_j U_{i,j}\, x_j^{(t-1)} + \sum_j W_{i,j}\, r_j^{(t-1)} h_j^{(t-1)}\Big)$$
$$u_i^{(t)} = \sigma\Big(b_i^u + \sum_j U_{i,j}^u\, x_j^{(t)} + \sum_j W_{i,j}^u\, h_j^{(t)}\Big)$$
$$r_i^{(t)} = \sigma\Big(b_i^r + \sum_j U_{i,j}^r\, x_j^{(t)} + \sum_j W_{i,j}^r\, h_j^{(t)}\Big)$$
• $u$: the "update" gates act like conditional leaky integrators that can linearly gate any dimension
• $r$: the "reset" gates control which parts of the state get used to compute the next target state
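A hedged sketch of one GRU step following the update equations above (the candidate uses σ as written on the slide, though many implementations use tanh); the sizes, random weights, and the time indexing of the inputs in the loop are illustrative assumptions.

```python
import numpy as np

# One GRU step: a single update gate u interpolates between keeping the old
# state and writing a reset-gated candidate.

rng = np.random.default_rng(5)
n_h, n_x = 4, 3
W,  U,  b  = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x)), np.zeros(n_h)
Wu, Uu, bu = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x)), np.zeros(n_h)
Wr, Ur, br = rng.normal(0, 0.1, (n_h, n_h)), rng.normal(0, 0.1, (n_h, n_x)), np.zeros(n_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_prev):
    u = sigmoid(bu + Uu @ x_prev + Wu @ h_prev)        # update gate
    r = sigmoid(br + Ur @ x_prev + Wr @ h_prev)        # reset gate
    candidate = sigmoid(b + U @ x_prev + W @ (r * h_prev))   # reset-gated candidate state
    # the single gate u interpolates between copying h_prev and writing the candidate
    return u * h_prev + (1.0 - u) * candidate

h = np.zeros(n_h)
for x in rng.normal(size=(6, n_x)):
    h = gru_step(h, x)
```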
Optimization for Long-Term Dependencies
• An interesting idea is to use second-order optimization methods, since second derivatives may vanish at the same time that first derivatives vanish
• If the second derivative shrinks at a rate similar to the first derivative, then the ratio of first and second derivatives may remain relatively constant
• Such approaches have largely been replaced by simply applying SGD to LSTMs, which achieves promising results
Clipping Gradients
• The difficulty that arises is that when the parameter gradient is very large, a
gradient descent parameter update could throw the parameters very far, into
a region where the objective function is larger, undoing much of the work that
had been done to reach the current solution
• A simple type of solution has been in use by practitioners for many years: clipping the gradient, for example element-wise, just before the parameter update
• Another option is to clip the norm of the gradient just before the parameter update
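A hedged sketch of both clipping options; the threshold value and the example gradient are illustrative.

```python
import numpy as np

def clip_elementwise(grad, threshold):
    # option 1: clip each component of the gradient independently
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold):
    # option 2: if ||g|| > v, rescale g to g * v / ||g|| just before the update
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

g = np.array([0.1, -8.0, 3.0])
print(clip_elementwise(g, 1.0), clip_by_norm(g, 1.0))
```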
• Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients
• One approach is the LSTM with its self-loops and gating mechanisms; another idea is to regularize or constrain the parameters so as to encourage "information flow"
• We want the back-propagated gradient $(\nabla_{h^{(t)}} L)\, \frac{\partial h^{(t)}}{\partial h^{(t-1)}}$ to be as large as $\nabla_{h^{(t)}} L$, which leads to the regularizer (a code sketch follows below)
$$\Omega = \sum_t \Bigg( \frac{\Big\| (\nabla_{h^{(t)}} L)\, \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \Big\|}{\big\| \nabla_{h^{(t)}} L \big\|} - 1 \Bigg)^{\!2}$$
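A hedged sketch of computing this penalty for the tanh RNN used earlier, where the Jacobian is $\partial h^{(t)}/\partial h^{(t-1)} = \mathrm{diag}(1-(h^{(t)})^2)\,W$, so the row-vector product $(\nabla_{h^{(t)}} L)^\top \partial h^{(t)}/\partial h^{(t-1)}$ equals $W^\top\big((1-(h^{(t)})^2) \odot \nabla_{h^{(t)}} L\big)$; the inputs `h`, `grad_h`, and `W` are assumed to come from a forward/BPTT pass like the earlier sketch.

```python
import numpy as np

def information_flow_penalty(h, grad_h, W, eps=1e-8):
    """Omega for a tanh RNN.  h: (tau, n_h) hidden states, grad_h: (tau, n_h) dL/dh(t)."""
    omega = 0.0
    for h_t, g_t in zip(h, grad_h):
        # (grad h(t)) * dh(t)/dh(t-1) for the tanh recurrence
        back = W.T @ ((1.0 - h_t ** 2) * g_t)
        ratio = np.linalg.norm(back) / (np.linalg.norm(g_t) + eps)
        omega += (ratio - 1.0) ** 2
    return omega
```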
Explicit Memory
• Neural networks excel at storing implicit knowledge but struggle to memorize
facts
• Neural networks lack the equivalent of the working memory system that allows
human beings to explicitly hold and manipulate pieces of information that are
relevant to achieving some goal
• To resolve this difficulty, memory networks that include a set of memory cells were introduced
• These memory cells are typically augmented to contain a vector, rather than
the single scalar stored by an LSTM or GRU memory cell
• If the content of a memory cell is copied (not forgotten) at most time steps,
then the information it contains can be propagated forward in time and the
gradients propagated backward in time without either vanishing or exploding
• The task network can choose to read from or write to specific memory
addresses
• Explicit memory seems to allow models to learn tasks that ordinary RNNs or
LSTM RNNs cannot learn
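A hedged sketch of content-based soft addressing over an explicit memory, in the spirit of memory networks and neural Turing machines: attention weights come from a softmax over key/cell similarity, and reads and writes are weighted over all cells so the whole mechanism stays differentiable. The memory size, similarity measure, and write rule are illustrative assumptions.

```python
import numpy as np

# Soft (differentiable) read and write over a memory of vector-valued cells.

rng = np.random.default_rng(7)
n_cells, cell_dim = 8, 5
memory = rng.normal(size=(n_cells, cell_dim))      # each cell stores a vector

def attention(memory, key):
    scores = memory @ key                          # similarity of the key to every cell
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()                 # softmax attention over addresses

def soft_read(memory, key):
    return attention(memory, key) @ memory         # weighted sum of memory vectors

def soft_write(memory, key, new_value):
    w = attention(memory, key)[:, None]
    # each cell is blended toward the new value in proportion to its weight
    return memory * (1 - w) + w * new_value

r = soft_read(memory, key=rng.normal(size=cell_dim))
memory = soft_write(memory, key=rng.normal(size=cell_dim), new_value=rng.normal(size=cell_dim))
```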