Understanding LSTM Networks
Posted on August 27, 2015
In the above diagram, a chunk of neural network, $A$, looks at some input $x_t$ and outputs a value $h_t$. A loop allows information to be passed from one step of the network to the next.
LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of
RNN, capable of learning long-term dependencies. They were introduced by Hochreiter &
Schmidhuber (1997), and were refined and popularized by many people in following
work. They work tremendously well on a large variety of problems, and are now widely
used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behavior, not something they
struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural
network. In standard RNNs, this repeating module will have a very simple structure, such
as a single tanh layer.
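Concretely, the repeating module of a standard RNN computes something like

$$h_t = \tanh\left(W \cdot [h_{t-1}, x_t] + b\right)$$

where $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input.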
In the above diagram, each line carries an entire vector, from the output of one node to
the inputs of others. The pink circles represent pointwise operations, like vector addition,
while the yellow boxes are learned neural network layers. Lines merging denote
concatenation, while a line forking denotes its content being copied and the copies going to
different locations.
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running through the top of the
diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with
only some minor linear interactions. It’s very easy for information to just flow along it
unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully
regulated by structures called gates.
Gates are a way to optionally let information through. They are composed out of a
sigmoid neural net layer and a pointwise multiplication operation.
The sigmoid layer outputs numbers between zero and one, describing how much of each
component should be let through. A value of zero means “let nothing through,” while a
value of one means “let everything through!”
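Schematically, a gate computes

$$g = \sigma\left(W \cdot x + b\right), \qquad \text{filtered output} = g * v$$

where $g$ is the sigmoid layer’s output, $v$ is the vector being filtered, and $*$ is pointwise multiplication. (The symbols here are generic placeholders, not notation from any particular paper.)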
An LSTM has three of these gates, to protect and control the cell state.
Step-by-Step LSTM Walk Through
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$. A $1$ represents “completely keep this,” while a $0$ represents “completely get rid of this.”
Let’s go back to our example of a language model trying to predict the next word based
on all the previous ones. In such a problem, the cell state might include the gender of the
present subject, so that the correct pronouns can be used. When we see a new subject,
we want to forget the gender of the old subject.
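In the usual notation, the forget gate is

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$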
The next step is to decide what new information we’re going to store in the cell state. This
has two parts. First, a sigmoid layer called the “input gate layer” decides which values
we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that
could be added to the state. In the next step, we’ll combine these two to create an update
to the state.
In the example of our language model, we’d want to add the gender of the new subject to
the cell state, to replace the old one we’re forgetting.
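In the usual notation, the input gate and the candidate values are

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$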
It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do; we just need to actually do it.
We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.
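Written as an equation, the update is

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$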
In the case of the language model, this is where we’d actually drop the information about
the old subject’s gender and add the new information, as we decided in the previous
steps.
Finally, we need to decide what we’re going to output. This output will be based on our
cell state, but will be a filtered version. First, we run a sigmoid layer which decides what
parts of the cell state we’re going to output. Then, we put the cell state
through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the
output of the sigmoid gate, so that we only output the parts we decided to.
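In the usual notation:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t * \tanh(C_t)$$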
For the language model example, since it just saw a subject, it might want to output
information relevant to a verb, in case that’s what is coming next. For example, it might
output whether the subject is singular or plural, so that we know what form a verb should
be conjugated into if that’s what follows next.
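To tie the four steps together, here is a minimal NumPy sketch of a single LSTM time step. The parameter names ($W_f$, $b_f$, and so on) mirror the notation above; this is an illustration under those naming assumptions, not any particular library’s implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: what to keep from C_{t-1}
    i_t = sigmoid(W_i @ z + b_i)        # input gate: which values to update
    C_tilde = np.tanh(W_C @ z + b_C)    # candidate values to add to the state
    C_t = f_t * C_prev + i_t * C_tilde  # new cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate: what to reveal from C_t
    h_t = o_t * np.tanh(C_t)            # new hidden state / output
    return h_t, C_t

# Example usage with random weights (input size 3, hidden size 4):
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W = lambda: 0.1 * rng.standard_normal((n_h, n_h + n_x))
b = lambda: np.zeros(n_h)
h_t, C_t = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h),
                     W(), b(), W(), b(), W(), b(), W(), b())
```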
Variants on Long Short Term Memory
What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above; almost every paper involving LSTMs uses a slightly different version. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds “peephole connections,” which let the gate layers look at the cell state. The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
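In one common formulation, the forget and input gates peek at $C_{t-1}$ and the output gate peeks at $C_t$, for example

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)$$

and similarly for $i_t$ and $o_t$ (with $C_t$ in place of $C_{t-1}$ for the output gate).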
Another variation is to use coupled forget and input gates. Instead of separately deciding
what to forget and what we should add new information to, we make those decisions
together. We only forget when we’re going to input something in its place. We only input
new values to the state when we forget something older.
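With coupled gates, the cell state update becomes

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$

so the amount we write in is exactly the complement of the amount we forget.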
A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU,
introduced by Cho, et al. (2014). It combines the forget and input gates into a single
“update gate.” It also merges the cell state and hidden state, and makes some other
changes. The resulting model is simpler than standard LSTM models, and has been
growing increasingly popular.
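In the usual formulation, a GRU computes an update gate $z_t$, a reset gate $r_t$, a candidate $\tilde{h}_t$, and the new hidden state:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$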
These are only a few of the most notable LSTM variants. There are lots of others, like
Depth Gated RNNs by Yao, et al. (2015). There are also some completely different
approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al.
(2014).
Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice
comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al.
(2015) tested more than ten thousand RNN architectures, finding some that worked better
than LSTMs on certain tasks.
Conclusion
Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially
all of these are achieved using LSTMs. They really work a lot better for most tasks!
Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking
through them step by step in this essay has made them a bit more approachable.
LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is
there another big step? A common opinion among researchers is: “Yes! There is a next
step and it’s attention!” The idea is to let every step of an RNN pick information to look at
from some larger collection of information. For example, if you are using an RNN to
create a caption describing an image, it might pick a part of the image to look at for every
word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if
you want to explore attention! There have been a number of really exciting results using
attention, and it seems like a lot more are around the corner…
Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs
by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in
generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer &
Osendorfer (2015) – also seems very interesting. The last few years have been an
exciting time for recurrent neural networks, and the coming ones promise to only be more
so!
Acknowledgments
I’m grateful to a number of people for helping me better understand LSTMs, commenting
on the visualizations, and providing feedback on this post.
I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol
Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to
many other friends and colleagues for taking the time to help me, including Dario Amodei,
and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful
correspondence about my diagrams.
Before this post, I practiced explaining LSTMs during two seminar series I taught on
neural networks. Thanks to everyone who participated in those for their patience with me,
and for their feedback.