Chapter 5 - Recurrent Neural Networks (RNN)
5.3. Backpropagation Through Time (BPTT)
• BPTT is a two-pass algorithm for training the weights in RNNs.
– In the first pass, we perform forward inference, computing h(t) and ŷ(t) and accumulating the loss at each time step.
– In the second pass, we process the sequence in reverse, computing the required error terms (gradients) as we go and saving the error term of the hidden layer at each step backward in time.
• This general approach is commonly referred to as
Backpropagation Through Time (BPTT).
• BPTT is a generalization of the backpropagation algorithm used in feedforward neural networks.
• The BPTT training algorithm aims to adjust the weights of an RNN so as to minimize the error between the network's outputs and the expected outputs for the corresponding inputs (a minimal sketch is given below).
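• As a rough illustration of the two passes, the following sketch implements BPTT for a single-layer tanh RNN with a squared-error loss. The weight names (Wxh, Whh, Why) and the choice of loss are illustrative assumptions, not taken from the text.

```python
import numpy as np

def bptt(xs, ys, Wxh, Whh, Why, bh, by):
    """Two-pass BPTT for a minimal tanh RNN with a squared-error loss.

    xs, ys : lists of input / target column vectors, one per time step.
    Returns the total loss and the gradients of the weights and biases.
    """
    T = len(xs)
    hs = {-1: np.zeros((Whh.shape[0], 1))}       # initial hidden state
    yhat, loss = {}, 0.0

    # Pass 1: forward inference, computing h(t), yhat(t) and accumulating the loss.
    for t in range(T):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        yhat[t] = Why @ hs[t] + by
        loss += 0.5 * np.sum((yhat[t] - ys[t]) ** 2)

    # Pass 2: process the sequence in reverse, computing and saving error terms.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[-1])              # error flowing back from step t + 1
    for t in reversed(range(T)):
        dy = yhat[t] - ys[t]                     # output-layer error at step t
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dh_next                # hidden-layer error term at step t
        dh_raw = (1.0 - hs[t] ** 2) * dh         # back through the tanh nonlinearity
        dWxh += dh_raw @ xs[t].T
        dWhh += dh_raw @ hs[t - 1].T
        dbh += dh_raw
        dh_next = Whh.T @ dh_raw                 # save the error for step t - 1
    return loss, (dWxh, dWhh, dWhy, dbh, dby)
```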
Figure: Unrolled RNN for 5 time steps, with inputs x1 … x5, hidden states h0 … h5 computed using the shared weights W, and per-step losses L1 … L5.
5.3. Backpropagation Through Time…
Figure: Unrolled recurrent neural network for 5 time steps with gradients (inputs x1 … x5, hidden states h1 … h5, losses L1 … L5).
5.3. Challenges of Training RNN
• RNNs struggle to model long-term dependencies.
• Let us take an example to fully understand what we mean by long-term dependencies. Consider the two sentences "The dog … has come back." and "The dogs … have come back."
• If you observe these two sentences, you will notice how the word "dog" at the very beginning of the sentence influences the word "has", which is at the very end.
• If we change the singular word to the plural "dogs", there is a direct effect on the word "have", which is very far from the influencing word "dogs".
5.3. Challenges of Training RNN…
• Now, the part of the sentence in between can get longer than we would like, and its length is not under our control.
• Something like, say,
“The dog, which ran out of the door and sat on the neighbor’s porch
for three hours and roamed the street for a couple of hours more, has
come back.”
• So, a word almost at the end of a very long sentence being influenced by a word almost at the beginning of that sentence is what we call a "long-term dependency".
• The basic RNNs that we have seen so far are not very good at handling such long-term dependencies, mainly because of the vanishing gradient problem.
5.3. Challenges of Training RNN…
Figure: How an RNN can forget earlier information when the sequence is long.
5.3. Challenges of Training RNN…
• Shortly after the first RNNs were trained using backpropagation,
the problems of learning long-term dependencies became salient.
• This is due to vanishing and exploding gradients.
• There are two major obstacles RNNs have had to deal with:
– exploding gradients, and
– vanishing gradients
• A gradient is a vector of partial derivatives of a function with respect to its inputs.
• Just think of it like this:
– a gradient measures how much the output of a function changes
if you change the inputs a little bit.
• You can also think of a gradient as the slope of a function (a small numeric example is given after this list).
• The higher the gradient, the steeper the slope, and the faster a model can learn.
• But if the slope is zero, the model stops learning.
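• As a tiny numeric illustration of "how much the output changes if you change the inputs a little bit", the snippet below estimates a gradient with a central finite difference; the function f is just an arbitrary example.

```python
def f(w):
    return w ** 2                 # a simple function of a single weight

w, eps = 3.0, 1e-6
# Change the input a little bit in both directions and measure the change in output.
numeric_grad = (f(w + eps) - f(w - eps)) / (2 * eps)
print(numeric_grad)               # ~6.0, the slope of f at w = 3 (analytically 2 * w)
```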
5.3. Challenges of Training RNN…
• A gradient simply measures how much the error changes with regard to a change in the weights.
• Exploding gradients happen when the algorithm assigns a very high importance (gradient) to the weights.
• Fortunately, this problem can be easily solved by truncating or squashing (clipping) the gradients (see the sketch after this list).
• Vanishing gradients occur when the values of a gradient are too small, so the model stops learning or takes far too long to train.
• After many iterations, the weight updates become negligibly small, and the weights effectively stop being updated.
• This was a major problem in the 1990s and much harder to solve than exploding gradients.
• Fortunately, it was largely solved by the LSTM architecture.
• What causes the vanishing gradient problem?
• The vanishing gradient problem is caused by the fact that, as backpropagation proceeds, the gradients of the early layers (the layers nearest to the input) are obtained by multiplying together the gradients of the later layers (the layers near the output).
• Therefore, if the gradients of the later layers are less than one, their product shrinks toward zero at a particularly rapid pace.
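• The toy computation below (purely illustrative) shows how repeated multiplication across many time steps makes gradients vanish or explode, and how rescaling the gradient norm (gradient clipping) tames the exploding case; the threshold max_norm = 5.0 is an arbitrary choice.

```python
import numpy as np

# Backpropagating through T steps multiplies the error term by roughly the same
# factor at every step; here that factor is a single scalar for simplicity.
T = 50
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= factor
    print(f"factor {factor}: gradient after {T} steps = {grad:.3e}")
# factor 0.9 -> ~5.2e-03 (vanishing), factor 1.1 -> ~1.2e+02 (exploding)

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient vector if its norm exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

print(clip_gradient(np.array([30.0, 40.0])))   # norm 50 is rescaled down to norm 5
```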
5.5. Long Short-Term Memory…
Figure: Computing the input gate, the forget gate, and the output gate in an LSTM.
• A quick illustration of the input node (the candidate memory content) and how it combines with the gates is given in the sketch below.
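• This is a minimal sketch of one LSTM time step, showing how the input gate, the forget gate, the output gate, and the input node update the cell state and the hidden state. The weight names (W_i, W_f, W_o, W_g) and the concatenated-input formulation are illustrative assumptions, not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_o, W_g, b_i, b_f, b_o, b_g):
    """One LSTM time step: gates plus input node produce the new cell and hidden states."""
    z = np.concatenate([h_prev, x_t])     # previous hidden state and current input
    i_t = sigmoid(W_i @ z + b_i)          # input gate
    f_t = sigmoid(W_f @ z + b_f)          # forget gate
    o_t = sigmoid(W_o @ z + b_o)          # output gate
    g_t = np.tanh(W_g @ z + b_g)          # input node (candidate cell content)
    c_t = f_t * c_prev + i_t * g_t        # keep part of the old memory, add new content
    h_t = o_t * np.tanh(c_t)              # expose part of the memory as the output
    return h_t, c_t
```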
5.8. Sequence to Sequence Models
Encoder
Figure: The encoder RNN processes the inputs x1 … xn and produces the hidden states h1 … hn.
Decoder
• The decoder generates the output sequence by predicting the next output yt given the context vector C.
• The input to the decoder is the context vector obtained at the end of the encoder.
• Each time step has three inputs: the hidden vector from the previous time step ht-1, the previous output yt-1, and the original context vector C.
• At the first time step, the context vector produced by the encoder, the special start symbol <START>, and an empty hidden state h0 are given as inputs.
• The outputs obtained are y1 and the updated hidden state h1.
• The second time step takes the updated hidden state h1, the previous output y1, and the original context vector C as its inputs, and produces the hidden vector h2 and the output y2.
• The outputs produced at each time step of the decoder together form the actual output sequence.
• The model keeps predicting outputs until the <END> symbol occurs (a short sketch of this loop is given below).
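• The decoding loop described above can be sketched as follows. Here decoder_step is a hypothetical function standing in for one decoder RNN step, taking (y_prev, h_prev, C) and returning (y_t, h_t); start_token and end_token stand for the <START> and <END> symbols, and max_len is an arbitrary safety limit.

```python
def greedy_decode(C, decoder_step, start_token, end_token, max_len=50):
    """Run the decoder until the <END> symbol occurs (or max_len is reached).

    C            : context vector produced by the encoder.
    decoder_step : hypothetical function (y_prev, h_prev, C) -> (y_t, h_t).
    """
    h_prev = None                    # empty initial hidden state h0
    y_prev = start_token             # the <START> symbol at the first time step
    outputs = []
    for _ in range(max_len):
        y_t, h_t = decoder_step(y_prev, h_prev, C)   # three inputs: y(t-1), h(t-1), C
        if y_t == end_token:         # stop once the <END> symbol is predicted
            break
        outputs.append(y_t)
        y_prev, h_prev = y_t, h_t    # feed the outputs back in at the next step
    return outputs
```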
5.8. Sequence to Sequence Models…
• The idea of the encoder-decoder architecture is as follows (see the sketch below):
– An encoder, or input RNN, processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state.
– A decoder is conditioned on that fixed-length vector C to generate the output sequence Y = (y1, …, yn).
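• Tying the two parts together, a minimal sketch under the same assumptions as above: encoder_step is a hypothetical single-step encoder RNN function, and the context C is taken to be simply the final hidden state.

```python
def encode(xs, encoder_step, h0):
    """Run the encoder RNN over the input sequence and emit the context C.

    encoder_step : hypothetical function (x_t, h_prev) -> h_t.
    Here C is a simple function of the final hidden state: the state itself.
    """
    h = h0
    for x_t in xs:                   # process the input sequence step by step
        h = encoder_step(x_t, h)
    return h                         # C = h_n

# Usage together with the greedy_decode sketch above:
# C = encode(xs, encoder_step, h0)
# Y = greedy_decode(C, decoder_step, start_token, end_token)
```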