Understanding LSTM Networks
Posted on August 27, 2015
In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of
RNN, capable of learning long-term dependencies. They were introduced by Hochreiter &
Schmidhuber (1997), and were refined and popularized by many people in following
work. They work tremendously well on a large variety of problems, and are now widely
used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behavior, not something they
struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural
network. In standard RNNs, this repeating module will have a very simple structure, such
as a single tanh layer.
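Concretely, the repeating module of a standard RNN computes something like

$$h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)$$

where $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input.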
In the above diagram, each line carries an entire vector, from the output of one node to
the inputs of others. The pink circles represent pointwise operations, like vector addition,
while the yellow boxes are learned neural network layers. Lines merging denote
concatenation, while a line forking denotes its content being copied and the copies going to
different locations.
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the
diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with
only some minor linear interactions. It’s very easy for information to just flow along it
unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully
regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a
sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each
component should be let through. A value of zero means “let nothing through,” while a
value of one means “let everything through!”
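Schematically, a gate computes

$$g = \sigma\left(W \cdot x + b\right), \qquad \text{filtered output} = g * v$$

where $g$ is the sigmoid layer’s output, $v$ is the vector being filtered, and $*$ is pointwise multiplication. (The symbols here are generic placeholders, not notation from any particular paper.)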
An LSTM has three of these gates, to protect and control the cell state.
Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this,” while a $0$ represents “completely get rid of this.”
Let’s go back to our example of a language model trying to predict the next word based
on all the previous ones. In such a problem, the cell state might include the gender of the
present subject, so that the correct pronouns can be used. When we see a new subject,
we want to forget the gender of the old subject.
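In the usual notation, the forget gate is

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$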
The next step is to decide what new information we’re going to store in the cell state. This
has two parts. First, a sigmoid layer called the “input gate layer” decides which values
we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that
could be added to the state. In the next step, we’ll combine these two to create an update
to the state.
In the example of our language model, we’d want to add the gender of the new subject to
the cell state, to replace the old one we’re forgetting.
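In the usual notation, the input gate and the candidate values are

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$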
It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.
We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.
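Written as an equation, the update is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$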
In the case of the language model, this is where we’d actually drop the information about
the old subject’s gender and add the new information, as we decided in the previous
steps.
Finally, we need to decide what we’re going to output. This output will be based on our
cell state, but will be a filtered version. First, we run a sigmoid layer which decides what
parts of the cell state we’re going to output. Then, we put the cell state
through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the
output of the sigmoid gate, so that we only output the parts we decided to.
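In the usual notation:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t * \tanh(C_t)$$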
For the language model example, since it just saw a subject, it might want to output
information relevant to a verb, in case that’s what is coming next. For example, it might
output whether the subject is singular or plural, so that we know what form a verb should
be conjugated into if that’s what follows next.
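To tie the four steps together, here is a minimal NumPy sketch of a single LSTM time step. The parameter names ($W_f$, $b_f$, and so on) mirror the notation above; this is an illustration under those naming assumptions, not any particular library’s implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)        # input gate: which values to update
    C_tilde = np.tanh(W_C @ z + b_C)    # candidate values to add to the state
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to reveal from C_t
    h_t = o_t * np.tanh(C_t)            # new hidden state / output
    return h_t, C_t

# Example usage with random weights (input size 3, hidden size 4):
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W = lambda: 0.1 * rng.standard_normal((n_h, n_h + n_x))
b = lambda: np.zeros(n_h)
h_t, C_t = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h),
                     W(), b(), W(), b(), W(), b(), W(), b())
```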
Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above; almost every paper involving LSTMs uses a slightly different version. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections,” which let the gate layers look at the cell state. The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
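In one common formulation, the forget and input gates peek at $C_{t-1}$ and the output gate peeks at $C_t$, for example

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)$$

and similarly for $i_t$ and $o_t$ (with $C_t$ in place of $C_{t-1}$ for the output gate).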
Another variation is to use coupled forget and input gates. Instead of separately deciding
what to forget and what we should add new information to, we make those decisions
together. We only forget when we’re going to input something in its place. We only input
new values to the state when we forget something older.
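With coupled gates, the cell state update becomes

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$

so the amount we write in is exactly the complement of the amount we forget.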
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU,
introduced by Cho, et al. (2014). It combines the forget and input gates into a single
“update gate.” It also merges the cell state and hidden state, and makes some other
changes. The resulting model is simpler than standard LSTM models, and has been
growing increasingly popular.
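In the usual formulation, a GRU computes an update gate $z_t$, a reset gate $r_t$, a candidate $\tilde{h}_t$, and the new hidden state:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$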
These are only a few of the most notable LSTM variants. There are lots of others, like
Depth Gated RNNs by Yao, et al. (2015). There are also some completely different
approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al.
(2014).
Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice
comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al.
(2015) tested more than ten thousand RNN architectures, finding some that worked better
than LSTMs on certain tasks.
Conclusion
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially
all of these are achieved using LSTMs. They really work a lot better for most tasks!
Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking
through them step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is
there another big step? A common opinion among researchers is: “Yes! There is a next
step and it’s attention!” The idea is to let every step of an RNN pick information to look at
from some larger collection of information. For example, if you are using an RNN to
create a caption describing an image, it might pick a part of the image to look at for every
word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if
you want to explore attention! There have been a number of really exciting results using
attention, and it seems like a lot more are around the corner…
Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs
by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in
generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer &
Osendorfer (2015) – also seems very interesting. The last few years have been an
exciting time for recurrent neural networks, and the coming ones promise to only be more
so!
Acknowledgments
I’m grateful to a number of people for helping me better understand LSTMs, commenting
on the visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol
Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to
many other friends and colleagues for taking the time to help me, including Dario Amodei,
and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful
correspondence about my diagrams.
Before this post, I practiced explaining LSTMs during two seminar series I taught on
neural networks. Thanks to everyone who participated in those for their patience with me,
and for their feedback.