
Chapter 5

Recurrent Neural Networks (RNN)


5.1. Introduction to RNN
• All the neural architectures discussed in the previous chapters are
designed for input data in which the attributes are largely
independent of one another.
• However, certain data types such as time-series, text, and biological
data contain sequential dependencies among the attributes.
• Examples of such dependencies are as follows:
• 1. In a time-series data set, the values on successive time-stamps are
closely related to one another.
• If one uses the values of these time-stamps as independent features,
then key information about the relationships among the values of
these time-stamps is lost.
• For example, the value of a time-series at time t is closely related to
its values at the preceding time-stamps (t−1, t−2, and so on).
• However, this information is lost when the values at individual
time-stamps are treated independently of one another.
• Examples of time-series data sets include the following:
– Rainfall over the years
– Automated stock trading
– Temperature data of a month, a year, a decade, etc.
– Sales data over time
– Industry production data
• 2. Although text is often processed as a bag of words, one can
obtain better semantic insights when the ordering of the words is
used.
• In such cases, it is important to construct models that take the
sequential information into account.
• Text data is the most common use case of recurrent neural
networks.
• 3. Biological data often contains sequences, in which the symbols
might correspond to amino acids or one of the nucleobases that
form the building blocks of DNA.
• DNA sequencing is an important tool in biology and medicine.
• It can be used to:
– Identify genes and their mutations
– Diagnose diseases
– Develop new drugs and therapies
– Determine the genetic makeup of individuals and populations
– Understand evolution
• Many sequence-centric applications like text are often processed
as bags of words.
• Such an approach ignores the ordering of words in the document,
and works well for documents of reasonable size.
• However, in applications where the semantic interpretation of the
sentence is important, or in which the size of the text segment is
relatively small (e.g., a single sentence), such an approach is
simply inadequate.
• In order to understand this point, consider the following pair of
sentences:
The cat chased the mouse.
The mouse chased the cat.
• The two sentences are clearly very different.
• However, the bag-of-words representation would deem them identical.
• Hence, this type of representation works well for simpler applications
such as classification.
• But a greater degree of linguistic intelligence is required for more
sophisticated applications in difficult settings such as sentiment
analysis, machine translation, or information extraction.
• RNN enables us to solve such problems.
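• To make this concrete, the short Python sketch below (an illustrative toy,
not part of the original chapter) builds a simple word-count representation
of the two sentences and shows that they come out identical, even though
their meanings differ.

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, drop the period, and split on whitespace;
    # the counts ignore word order entirely.
    return Counter(sentence.lower().replace(".", "").split())

s1 = "The cat chased the mouse."
s2 = "The mouse chased the cat."

print(bag_of_words(s1))                       # Counter({'the': 2, 'cat': 1, 'chased': 1, 'mouse': 1})
print(bag_of_words(s1) == bag_of_words(s2))   # True: both sentences get the same representation
```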
• Recurrent Neural Networks add an interesting twist to basic neural
networks.
• A vanilla neural network takes in a fixed size vector as input which
limits its usage in situations that involve a ‘series’ type input with no
predetermined size.
• An RNN is a network that contains a cycle within its network
connections.
• This means that the value of some unit is directly or indirectly
dependent on its own earlier outputs.
• Recurrent neural networks are a family of neural networks for
processing sequential data.
• Much as a convolutional network is a neural network that is
specialized for processing an image, a recurrent neural network is
a neural network that is specialized for processing a sequence of
values x(1), …, x(T).
• Recurrent networks can scale to much longer sequences than
would be practical for networks without sequence-based
specialization.
• Recurrent networks can process sequences of variable length.
5.1. Introduction to RNN…

Figure rolled RNN


5.1. Introduction to RNN…
• For the simplicity of exposition, we refer to RNNs as operating on
a sequence that contains vectors x(t) for the time step t ranging
from 1 to T.
• In practice, RNNs usually operate on minibatches of such
sequences, with a different sequence length T for each member of
the minibatch.
• The time step index need not literally refer to the passage of time
in the real world.
• Sometimes it refers only to the position in the sequence.
• RNNs may also be applied in two dimensions across spatial data
such as images.
• Even when applied to data involving time, the network may have
connections that go back in time, provided that the entire
sequence is observed before it is provided to the network.
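• As an illustration of how variable-length minibatches are handled in
practice, the following PyTorch sketch (standard library calls; the sizes
are invented for illustration) pads three sequences of different lengths
and packs them so that the recurrent layer skips the padded positions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three sequences of different lengths (T = 5, 3, 2), each with 4 features per step.
seqs = [torch.randn(5, 4), torch.randn(3, 4), torch.randn(2, 4)]
lengths = torch.tensor([5, 3, 2])

# Pad to a common length so the sequences fit in one (batch, T_max, features) tensor.
padded = pad_sequence(seqs, batch_first=True)            # shape: (3, 5, 4)

# Pack so the RNN ignores the padded positions during the forward pass.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
output, h_n = rnn(packed)    # h_n holds the final hidden state of each sequence
print(h_n.shape)             # torch.Size([1, 3, 8])
```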
5.1. Introduction to RNN…
• An RNN takes each vector from a sequence of input vectors and models
them one at a time.
• This allows the network to retain state while modeling each input
vector across the window of input vectors.
• Modeling the time dimension is the hallmark of RNN.
• RNNs process data step by step, with each step's output influenced
by both the current input and the network's internal state.
• This design allows RNNs to model temporal dynamics and
dependencies within sequences.
• Historically, these networks have been difficult to train.
• But more recently, advances in optimization, network architectures,
parallelism, and graphics processing units have made them more
approachable for the practitioner.
5.1. Introduction to RNN…

Figure An unrolled recurrent neural network


5.1. Introduction to RNN…
Types of RNN
• A feed-forward neural network maps one input to one output.
• However, RNNs have no such limitation.
• RNNs can map one input to many outputs, many inputs to many
outputs (e.g. translation) and many inputs to one output (e.g.
classifying sentiment of a sentence).

Figure Types of RNN


5.1. Introduction to RNN…
i. One-to-one
• One-to-one is a simple neural network.
• It is commonly used for machine learning problems that have a
single input and output.
• Traditional neural networks employ a one-to-one architecture.
ii. One-to-many
• One-to-many is where we have one input and a variable number
of outputs.
• One example application is image captioning.
• When a single image is provided as input, a variable number of
words which caption the image is returned as output.
iii. Many-to-One
• Many-to-one RNNs have a variable number of inputs and a single
output.
• One example application is document sentiment classification.
• It is where a variable number of words in a document are
presented as input, and a single output predicts whether the
document has a positive or negative sentiment regarding a topic.
• Applications of many-to-one:
– Spam classification
– Sentiment classification
– Language classification
– Audio classification
iv. Many-to-many
• There are two types of many-to-many RNN.
– One in which the number of inputs and outputs match. For
example, in labeling the video frames, the number of frames
matches the number of labels.
– The other is in which the number of inputs and outputs do not
match.
• For example, in language translation we pass in n words in
English and get m words in Italian.
• Applications of many-to-many:
– Machine translation
– Document summarization
– Video frame classification
5.1. Introduction to RNN…
• Many-to-many RNNs are used, for example, for machine translation.
5.1. Introduction to RNN…
• There are different types of many-to-many RNNs.
• The one shown in the diagram below produces an output for each
input it accepts.
• A good example of this is frame-level video classification.
• Another example is part-of-speech (POS) tagging.
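• To see how these configurations differ in practice, the sketch below (a
PyTorch illustration with invented sizes, not code from the chapter)
contrasts many-to-many, where every time step produces an output, with
many-to-one, where only the final hidden state is used.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True)

x = torch.randn(1, 7, 10)         # batch of 1 sequence with 7 time steps, 10 features each
outputs, h_n = rnn(x)             # outputs: (1, 7, 16), one hidden vector per time step

# Many-to-many (e.g., POS tagging, frame-level video classification):
# apply a classifier to every time step's output.
tag_head = nn.Linear(16, 5)       # 5 hypothetical tag classes
per_step_logits = tag_head(outputs)      # (1, 7, 5)

# Many-to-one (e.g., sentiment classification):
# use only the final hidden state as a summary of the whole sequence.
sent_head = nn.Linear(16, 2)      # positive / negative
sentence_logits = sent_head(h_n[-1])     # (1, 2)
```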
5.1. Introduction to RNN…
• RNNs have various advantages and disadvantages.
• The advantages are:
– Ability to handle sequence data
– Ability to handle inputs of varying lengths
– Ability to store or “memorize” historical information
• The disadvantages are:
– The computation can be very slow.
– The network does not take into account future inputs to make
decisions.
– Vanishing gradient problem, where the gradients used to
compute the weight update may get very close to zero,
preventing the network from learning new weights. The deeper
the network, the more pronounced this problem is.
5.2. Architecture of RNN
• In RNN, the hidden layer from the previous time step provides a
form of memory that encodes earlier processing and informs the
decisions to be made at later points in time.
• Critically, this approach does not impose a fixed-length limit on
this prior context; the context embodied in the previous hidden
layer can include information extending back to the beginning of
the sequence.
• Adding this temporal dimension makes RNNs appear to be more
complex than non-recurrent architectures.
• But in reality, they are not all that different.
• Given an input vector and the values for the hidden layer from the
previous time step, we are still performing the standard
feedforward calculation as before.
5.2. Architecture of RNN…
• To see this, consider the figure in the next slide which clarifies the
nature of the recurrence and how it factors into the computation at
the hidden layer.
• The most significant change lies in the new set of weights, W, that
connect the hidden layer from the previous time step to the current
hidden layer.
• These weights determine how the network makes use of past
context in calculating the output for the current input.
• RNNs share the same weights (W, U, V) across all time steps,
making them efficient for sequential tasks.
• As with the other weights in the network, these connections are
trained using backpropagation.
• They are updated during training using backpropagation through
time (BPTT).
5.2. Architecture of RNN…

Figure Simple recurrent neural network


5.2. Architecture of RNN…
• The following figure shows a simple recurrent neural network
unrolled in time.
• Network layers are recalculated for each time step, while the
weights U, V and W are shared in common across all time steps.

Figure RNN unrolled in time



5.2. Architecture of RNN…
• As a concrete example, consider an RNN with 3 inputs, 2 hidden neurons,
and 3 outputs; a numeric sketch of one forward pass through such a
network is given below.
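• The NumPy sketch below uses the U (input-to-hidden), W (hidden-to-hidden),
and V (hidden-to-output) naming from the figures; the random weights and
the omission of bias terms are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the example: 3 inputs, 2 hidden neurons, 3 outputs.
U = rng.normal(size=(2, 3))   # input-to-hidden weights
W = rng.normal(size=(2, 2))   # hidden-to-hidden (recurrent) weights, shared across time
V = rng.normal(size=(3, 2))   # hidden-to-output weights

def rnn_forward(xs):
    """Run the recurrence h_t = tanh(U x_t + W h_{t-1}); y_t = V h_t."""
    h = np.zeros(2)
    outputs = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        outputs.append(V @ h)
    return outputs, h

# A toy sequence of four 3-dimensional input vectors.
sequence = [rng.normal(size=3) for _ in range(4)]
ys, h_final = rnn_forward(sequence)
print(len(ys), ys[0].shape)   # 4 outputs, each of shape (3,)
```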

5.3. Backpropagation Through Time (BPTT)
• BPTT is a two-pass algorithm for training the weights in RNNs.
– In the first pass, we perform forward inference, computing h(t),
ŷ(t), and accumulating the loss at each time step.
– In the second phase, we process the sequence in reverse,
computing the required error terms (gradients) as we go,
computing and saving the error term for use in the hidden layer
for each step backward in time.
• This general approach is commonly referred to as
Backpropagation Through Time (BPTT).
• BPTT is a generalization of the backpropagation algorithm used in
feed-forward neural networks.
• The BPTT training algorithm aims to modify the weights of an
RNN network to minimize the error of the network results
compared to some expected output in response to corresponding
inputs.
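• In frameworks with automatic differentiation, BPTT amounts to unrolling
the forward pass over the time steps, summing the per-step losses, and
calling backward once so that gradients flow back through every step.
The PyTorch sketch below is an illustration with placeholder data and
sizes, not code from the chapter.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
readout = nn.Linear(8, 3)                     # 3 output classes per time step
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(2, 5, 4)                      # 2 sequences, 5 time steps, 4 features
targets = torch.randint(0, 3, (2, 5))         # a target class at every time step

# Forward pass: compute h(t) and y_hat(t) for every step, accumulating the loss.
outputs, _ = rnn(x)                           # (2, 5, 8)
logits = readout(outputs)                     # (2, 5, 3)
loss = loss_fn(logits.reshape(-1, 3), targets.reshape(-1))

# Backward pass: gradients are propagated back through all 5 time steps (BPTT).
optimizer.zero_grad()
loss.backward()
optimizer.step()
```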
5.3. Backpropagation Through Time…

Figure RNN unrolled for 5 time steps, with inputs x1 … x5, hidden states
h0 … h5 connected by the shared weight matrix W, and per-step losses L1 … L5
5.3. Backpropagation Through Time…
• The loss for the whole sequence is the sum of the per-step losses,
L = L1 + L2 + … + LT, so the gradient with respect to the shared
weights is the sum of the per-step gradients.
• When differentiating a hidden state ht with respect to the recurrent
weight matrix W, there are two kinds of dependence:
– an explicit dependence, since W appears directly in the computation
of ht from ht-1, and
– an implicit dependence, since W also shaped the earlier hidden states
ht-1, ht-2, …, which in turn feed into ht.
• Applying the chain rule to the implicit part introduces products of
terms ∂hj/∂hj-1 reaching back through time; these long products are
what later cause the gradients to vanish or explode.
5.3. Backpropagation Through Time…

Figure Unrolled recurrent neural network for 5 time steps with gradients
(inputs x1 … x5, hidden states h1 … h5, losses L1 … L5)
5.3. Backpropagation Through Time…
• RNNs struggle to model long-term dependencies.
• Let us take an example to fully understand what we mean by
long-term dependencies. Consider the following pair of sentences:
The dog, which ran away, has come back.
The dogs, which ran away, have come back.
• If you observe the two sentences above, you will notice how the
word “Dog” at the very beginning of the sentence influences the
word “has” which is at the very end.
• If we change the singular word to a plural word “Dogs”, there is a
direct effect on the word “have” which is very far from the
influencer word “Dogs”.
5.4. Challenges of Training RNN
• Now, the sentence in between can get longer than our liking and is
not under our control.
• Something like, say,
“The dog, which ran out of the door and sat on the neighbor’s porch
for three hours and roamed the street for a couple of hours more, has
come back.”
• When a word near the end of a very long sentence is influenced by a
word near the beginning of that sentence, we call this a “long-term
dependency”.
• The basic RNNs that we have seen so far are not very good at
handling such long-term dependencies, mainly due to the
vanishing gradient problem.
5.4. Challenges of Training RNN…

Figure how RNN can forget earlier information when the sequence is long
5.4. Challenges of Training RNN…
• Shortly after the first RNNs were trained using backpropagation,
the problems of learning long-term dependencies became salient.
• This is due to vanishing and exploding gradients.
• There are two major obstacles RNNs have had to deal with:
– exploding gradients, and
– vanishing gradients
• A gradient is the vector of partial derivatives of a function with
respect to its inputs.
• Just think of it like this:
– a gradient measures how much the output of a function changes
if you change the inputs a little bit.
• You can also think of a gradient as the slope of a function.
• The higher the gradient, the steeper the slope and the faster a
model can learn.
• But if the slope is zero, the model stops learning.
5.4. Challenges of Training RNN…
• A gradient measures how much the error changes in response to a
change in the weights.
• Exploding gradients happen when the gradients with respect to the
weights grow very large.
• Fortunately, this problem can be addressed relatively easily by
clipping (truncating or squashing) the gradients.
• Vanishing gradients occur when the values of the gradient become too
small, so the model stops learning or takes far too long to train.
• Over many iterations the weight updates become negligible, which
means the earlier layers are effectively no longer updated.
• This was a major problem in the 1990s and much harder to solve
than the exploding gradients.
• Fortunately, it was solved through the concept of LSTM.
• What causes the vanishing gradient problem?
• The vanishing gradient problem arises because, during backpropagation,
the gradient of the early layers (the layers nearest to the input
layer) is obtained by multiplying together the gradients of the later
layers (the layers near the output layer).
• Therefore, if the gradients of the later layers are less than one,
their product shrinks to zero at a particularly rapid pace.

Figure vanishing gradient
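• A back-of-the-envelope illustration: backpropagating through many time
steps multiplies many chain-rule factors, and the product shrinks or
grows exponentially depending on whether those factors are smaller or
larger than one. The toy scalar NumPy sketch below (not from the
chapter) makes this visible.

```python
import numpy as np

def backprop_factor_product(w_rec, num_steps, x=0.0):
    """Product of the chain-rule factors dh_t/dh_{t-1} = w_rec * (1 - tanh(a_t)^2)
    for a toy scalar RNN h_t = tanh(w_rec * h_{t-1} + x)."""
    h, product = 0.0, 1.0
    for _ in range(num_steps):
        a = w_rec * h + x
        product *= w_rec * (1.0 - np.tanh(a) ** 2)  # one factor per step backwards in time
        h = np.tanh(a)
    return product

# With x = 0 the hidden state stays at 0, tanh' = 1, and the product is simply w_rec**T:
print(backprop_factor_product(w_rec=0.5, num_steps=50))  # ~8.9e-16 -> vanishing gradient
print(backprop_factor_product(w_rec=1.5, num_steps=50))  # ~6.4e+08 -> exploding gradient
# Non-zero inputs push tanh into its saturated region, which makes the
# factors even smaller and the vanishing problem worse.
```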


5.5. Long Short-Term Memory (LSTM)
• One of the first and most successful techniques for addressing
vanishing gradients is the LSTM model.
• LSTMs resemble standard RNNs but each ordinary recurrent node
is replaced by a memory cell.
• Each memory cell contains an internal state, i.e., a node with a
self-connected recurrent edge of fixed weight 1, ensuring that the
gradient can pass across many time steps without vanishing or
exploding.
• The term “long short-term memory” comes from the following
intuition.
– Simple RNNs have long-term memory in the form of weights. The
weights change slowly during training, encoding general knowledge
about the data.
– They also have short-term memory in the form of hidden outputs
which pass from each node to successive nodes.
• The LSTM model introduces an intermediate type of storage via
the memory cell.
• LSTM networks are a type of RNN that use a special type of
memory cell to store and output information.
• These memory cells are designed to remember information for long
periods of time.
• They do this by using a set of “gates” that control the flow of
information into and out of the cell.
• Each memory cell in LSTM is equipped with an internal state and a
number of gates that determine whether:
– i. a given input should impact the internal state (the input gate)
– ii. the internal state should be flushed (the forget gate), and
– iii. the internal state should be allowed to impact the cell’s
output (the output gate).
• The key distinction between vanilla RNNs and LSTMs is that
LSTMs introduce a cell state alongside the hidden state, allowing
them to better manage long-term dependencies.
• LSTMs have dedicated mechanisms that determine when the cell state
should be updated and when it should be reset.
• These mechanisms are learned and they address the concerns
listed above.
• For instance, if the first token is of great importance, it will learn
not to update the cell state after the first observation.
• Likewise, it will learn to skip irrelevant temporary observations.
• Last, it will learn to reset the latent state whenever needed.
• Hence, compared with the basic RNN network which has only
one state vector ht,
– LSTM adds a new state vector Ct called cell state, and
– It introduces a gate control mechanism, which controls the
forgetting and updating of information on the cell state Ct.
Input Gate, Forget Gate, and Output Gate
• The data feeding into the LSTM gates are the input and the hidden
state of the previous time step.
• Three fully connected layers with sigmoid activation functions
compute the values of
– the input, forget, and output gates.
• Because of the sigmoid activation, all values of the three gates are
in the range of (0, 1).
• Additionally, we require an input node which is computed with a
tanh activation function.
• Generally:
– the forget gate determines whether to keep the current value of the
memory cell or flush it. The forget gate controls which parts of the
long-term state should be erased.
– the input gate determines how much of the input value should be
added to the memory cell internal state. It controls which parts of
input should be added to the long-term state.
– the output gate determines whether the memory cell should influence
the hidden output at the current time step.
• Mathematically, suppose that there are h hidden units, and the
number of inputs is d.
• Thus, the input is Xt ∈ Rd and the hidden state of the previous
time step is Ht-1 ∈ Rh.
• The gates at time step t are defined as follows: the input gate is
It ∈ Rh, the forget gate is Ft ∈ Rh, and the output gate is Ot ∈ Rh.

Figure Computing the input gate, the forget gate, and the output gate in LSTM
5.5. Long Short-Term Memory…
• With σ denoting the sigmoid function and the W and b terms denoting
learned weights and biases, the three gates are computed as:
– It = σ(Xt Wxi + Ht-1 Whi + bi)
– Ft = σ(Xt Wxf + Ht-1 Whf + bf)
– Ot = σ(Xt Wxo + Ht-1 Who + bo)
5.5. Long Short-Term Memory…
• The input node (candidate memory) is computed with a tanh activation:
C̃t = tanh(Xt Wxc + Ht-1 Whc + bc).
• A quick illustration of the input node is shown in the figure below.

Figure Computing the input node in an LSTM model.



5.5. Long Short-Term Memory…
• Combining the gates with the input node, the memory cell internal
state is updated as Ct = Ft ⊙ Ct-1 + It ⊙ C̃t.
• We thus arrive at the flow diagram below.

Figure Computing the memory cell internal state in an LSTM model



5.5. Long Short-Term Memory…
• Note that a memory cell can accrue information across many time
steps without impacting the rest of the network as long as the
output gate takes values close to 0.
• It can then suddenly impact the network at a subsequent time step as
soon as the output gate flips from values close to 0 to values close
to 1.
• The hidden state itself is computed from the cell state as
Ht = Ot ⊙ tanh(Ct).

Figure Computing the hidden state in an LSTM model.



5.5. Long Short-Term Memory…

Figure LSTM cell
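• The gate computations described in this section can be collected into a
single cell-update function. The NumPy sketch below follows the Xt,
Ht-1, Ct-1 notation used above; the weight shapes and random
initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state H_t and cell state C_t."""
    Wxi, Whi, bi = params["i"]   # input gate parameters
    Wxf, Whf, bf = params["f"]   # forget gate parameters
    Wxo, Who, bo = params["o"]   # output gate parameters
    Wxc, Whc, bc = params["c"]   # input node (candidate) parameters

    i_t = sigmoid(x_t @ Wxi + h_prev @ Whi + bi)      # input gate
    f_t = sigmoid(x_t @ Wxf + h_prev @ Whf + bf)      # forget gate
    o_t = sigmoid(x_t @ Wxo + h_prev @ Who + bo)      # output gate
    c_tilde = np.tanh(x_t @ Wxc + h_prev @ Whc + bc)  # input node / candidate memory

    c_t = f_t * c_prev + i_t * c_tilde                # cell state update
    h_t = o_t * np.tanh(c_t)                          # hidden state (cell output)
    return h_t, c_t

# Toy dimensions: d = 4 input features, h = 3 hidden units.
rng = np.random.default_rng(0)
d, h = 4, 3
params = {k: (rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros(h)) for k in "ifoc"}
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h), np.zeros(h), params)
print(h_t.shape, c_t.shape)   # (3,) (3,)
```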


5.6. Gated Recurrent Units (GRU)
• The GRU was introduced by Cho et al. in 2014, and it aims to mitigate
the vanishing gradient problem that comes with a standard RNN.
• GRUs are a newer generation of recurrent unit and are similar to
LSTMs; in effect, they are a simplified version of LSTMs.
• A GRU gets rid of the separate cell state and uses only the hidden
state to transfer information.
• It also only has two gates:
– reset gate and update gate
• Basically, these are two vectors which decide what information
should be passed to the output.
• The special thing about them is that they can be trained to keep
information from long ago without washing it out over time, and to
remove information that is irrelevant to the prediction.
5.6. Gated Recurrent Units (GRU)
• During BPTT, the long products of matrices can lead to vanishing or
exploding gradients.
• Let us think about what such gradient anomalies mean in practice:
– We might encounter a situation where an early observation is
highly significant for predicting all future observations. We would
like to have some mechanisms for storing vital early information
in a memory cell. Without such a mechanism, we will have to
assign a very large gradient to this observation, since it affects all
the subsequent observations.
5.6. Gated Recurrent Units…
– We might encounter situations where some tokens carry no
pertinent observation. For instance, an HTML tag that is
irrelevant for assessing the sentiment conveyed on the page.
We would like to have some mechanism for skipping such
tokens in the latent state representation.
– We might encounter situations where there is a logical break
between parts of a sequence. For instance, there might be a
transition between chapters in a book, or a transition between
a bear and a bull market for securities. In this case it would be
nice to have a means of resetting our internal state
representation.
5.6. Gated Recurrent Units…
• A number of methods have been proposed to address this.
• The GRU is a slightly more streamlined variant that often offers
comparable performance and is significantly faster to compute.

Gated Hidden State


• The key distinction between vanilla RNNs and GRUs is that the
latter support gating of the hidden state.
• We have dedicated mechanisms for when a hidden state should be
updated and also when it should be reset.
• These mechanisms are learned and they address the concerns
listed above.
• It allows the network to control the flow of information and
manage long-term dependencies in sequential data.
• Unlike LSTMs, GRUs simplify the architecture by using fewer
gates.
5.6. Gated Recurrent Units…
• For instance, if the first token is of great importance it will learn
not to update the hidden state after the first observation.
• Likewise, it will learn to skip irrelevant temporary observations.
• Last, it will learn to reset the latent state whenever needed.

Reset Gate and Update Gate


• We engineer the gates to be vectors with entries in (0, 1) such that
we can perform convex combinations.
• For instance, a reset gate would allow us to control how much of
the previous state we might still want to remember.
• Likewise, an update gate would allow us to control how much of
the new state is just a copy of the old state.
• The update gate controls how much of the past hidden state
should be carried forward to the next time step.
• It determines whether the unit should update its hidden state with
new information or retain the existing state.
• The reset gate controls how much of the past hidden state should
be forgotten before incorporating new input.
• It allows the model to reset parts of the hidden state when
processing new inputs, helping it capture short-term dependencies
effectively.
• The outputs of the two gates are given by two fully connected layers
with a sigmoid activation function.
5.6. Gated Recurrent Units…
• For input Xt and previous hidden state Ht-1, the reset gate Rt and the
update gate Zt are computed with sigmoid activations:
– Rt = σ(Xt Wxr + Ht-1 Whr + br)
– Zt = σ(Xt Wxz + Ht-1 Whz + bz)
• The candidate hidden state uses the reset gate to decide how much of
the previous state to keep:
– H̃t = tanh(Xt Wxh + (Rt ⊙ Ht-1) Whh + bh)
5.6. Gated Recurrent Units…
• The result is a candidate since we still need to incorporate the
action of the update gate.
• Comparing with vanilla RNN, now the influence of the previous
states can be reduced with the elementwise multiplication of Rt
and Ht−1.
• Whenever the entries in the reset gate Rt are close to 1, we
recover a vanilla RNN.
• For all entries of the reset gate Rt that are close to 0, the
candidate hidden state is the result of an MLP with Xt as the
input.
• Any pre-existing hidden state is thus reset to defaults.
5.6. Gated Recurrent Units…
• Finally, the update gate blends the old state with the candidate:
Ht = Zt ⊙ Ht-1 + (1 − Zt) ⊙ H̃t.
• These designs can help us cope with the vanishing gradient
problem and better capture dependencies for sequences with large
time step distances.
• For instance, if the update gate has been close to 1 for all the time
steps of an entire subsequence, the old hidden state at the time
step of its beginning will be easily retained and passed to its end,
regardless of the length of the subsequence.
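• The retention behavior described above can be checked numerically. In
the small NumPy sketch below (toy values rather than trained
parameters), an update gate close to 1 keeps the old hidden state
almost unchanged, while a gate close to 0 replaces it with the
candidate.

```python
import numpy as np

h_prev = np.array([0.9, -0.4, 0.2])    # old hidden state
h_cand = np.array([-0.8, 0.6, -0.1])   # candidate hidden state computed from X_t

def gru_combine(z, h_prev, h_cand):
    # H_t = Z_t * H_{t-1} + (1 - Z_t) * H~_t
    return z * h_prev + (1.0 - z) * h_cand

print(gru_combine(np.full(3, 0.99), h_prev, h_cand))  # ~h_prev: old state retained
print(gru_combine(np.full(3, 0.01), h_prev, h_cand))  # ~h_cand: state overwritten by new info
```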

Comparing LSTM and GRU
• GRUs are a simplified version of LSTMs.
• They use update and reset gates to control the flow of information,
rather than the three gates and separate cell state used in LSTMs.
• This makes GRUs easier to train and faster to run than LSTMs,
but they may not be as effective at storing and accessing
long-term dependencies.
• Both LSTMs and GRUs are used in a wide range of tasks, including
language translation, speech recognition, and time series
forecasting.
• In general:
– LSTMs tend to be more effective at tasks that require the
network to store and access long-term dependencies.
– GRUs are more effective at tasks that require the network to
learn quickly and adapt to new inputs.
• LSTM and GRU both use gates to control the flow of information.
• However, LSTM has three gates, while GRU has only two gates.
• This makes LSTM more expressive than GRU, but also more
complex.
• Moreover, LSTM has a separate memory cell, while GRU
combines the memory cell and the hidden state into a single
vector.
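• The difference in the number of gates shows up directly in the number
of trainable parameters. A quick PyTorch comparison (the layer sizes
are arbitrary choices for illustration):

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM parameters:", count(lstm))   # 4 weight/bias groups (3 gates + input node)
print("GRU parameters:", count(gru))     # 3 weight/bias groups (2 gates + candidate)
```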
• LSTM strengths:
– First, it can effectively learn long-term dependencies, thanks to
its memory cell and forget gate.
– Second, it can handle variable-length sequences, which is
important in many real-world applications.
– Third, it can prevent overfitting by using dropout or recurrent
dropout.
• LSTM weaknesses:
– First, it is more complex than the standard RNN and requires
more computational resources.
– Second, it is prone to overfitting if the dataset is small or noisy.
– Third, it may suffer from gradient exploding if the weights are
not properly initialized or if the learning rate is too high.
• GRU also has several strengths that make it a competitive
alternative to LSTM.
– First, it is simpler and more computationally efficient than
LSTM. This makes it faster to train and easier to deploy.
– Second, it requires less data to train and can handle noisy
datasets better.
– Third, it can learn complex patterns in the data without
overfitting, thanks to its update and reset gates.
• GRU weaknesses:
– First, it may not be as effective as LSTM in learning long-term
dependencies, especially in complex tasks.
– Second, it may suffer from vanishing gradients if the sequences are
very long or the weights are not properly initialized.
5.6. Gated Recurrent Units…
• In terms of performance comparison, several studies have
compared the performance of LSTM and GRU on various tasks,
such as
– speech recognition,
– language modeling, and
– sentiment analysis.
• The results are mixed, with some studies showing that LSTM
outperforms GRU and others showing the opposite.
• However, most studies agree that their performance depends on
the specific task and dataset.
5.7. Bidirectional RNN
• In a standard RNN, the hidden state at time t only summarizes the
inputs seen so far, i.e., the left context of the sequence.
• A bidirectional RNN (BRNN) combines two independent RNNs: one
processes the sequence from left to right (the forward pass), and the
other processes it from right to left (the backward pass).

• The outputs of the forward and backward pass are concatenated.
• Other simple ways to combine the forward and backward
contexts include element-wise addition or multiplication.
• The output at each step in time thus captures information to the
left and to the right of the current input.
• In sequence labeling applications, these concatenated outputs can
serve as the basis for a local labeling decision.

Figure A bidirectional RNN


5.7. Bidirectional RNN…
• BRNNs have also proven to be quite effective for sequence
classification.
• With a simple RNN, for sequence classification, we use the final
hidden state of the RNN as the input to a subsequent feedforward
classifier.
• A difficulty with this approach is that the final state naturally
reflects more information about the end of the sentence than its
beginning.
• BRNNs provide a simple solution to this problem: combine the final
hidden states from the forward and backward passes and use that
combination as input for follow-on processing.
• Again, concatenation is a common approach to combining the two
outputs but element-wise summation, multiplication or averaging
are also used.
5.7. Bidirectional RNN…

Figure A bidirectional RNN for sequence classification
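• A bidirectional recurrent layer for sequence classification can be
sketched as follows in PyTorch (the sizes are invented), concatenating
the final forward and backward hidden states as described above.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, input_size=16, hidden_size=32, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        # Forward and backward final states are concatenated -> 2 * hidden_size features.
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.rnn(x)          # h_n: (2, batch, hidden) for one bidirectional layer
        h_fwd, h_bwd = h_n[0], h_n[1]      # final forward and backward hidden states
        return self.classifier(torch.cat([h_fwd, h_bwd], dim=-1))

model = BiLSTMClassifier()
logits = model(torch.randn(4, 20, 16))     # 4 sequences of 20 time steps
print(logits.shape)                        # torch.Size([4, 2])
```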


5.8. Sequence to Sequence Models
• An RNN can be trained to map an input sequence to an output
sequence which is not necessarily of the same length.
• This comes up in many applications, such as machine translation,
question answering, and speech recognition where the input and
output sequences in the training set are generally not of the same
length.
• Sequence to Sequence (often abbreviated as seq2seq) models are a
special class of RNN models that we typically use to solve complex
language problems like
– Machine Translation
– Question Answering
– Creating Chatbots
– Text Summarization, etc.
5.8. Sequence to Sequence Models…
• A typical sequence to sequence model has two parts:
– an encoder and
– a decoder
• Both parts are effectively two different neural network models
combined into one larger network.

Figure high level architecture of encoder-decoder model for NMT


5.8. Sequence to Sequence Models…
Encoder
• Multiple RNN cells can be stacked together to form an encoder.
• The encoder RNN reads each input sequentially.
• For every timestep t, the hidden state ht is updated according to
the input at that timestep Xt.
• After all the inputs are read by encoder model, the final hidden
state of the model represents the context of the whole input
sequence.
• For example, consider the input sequence “I am a Student” to be
encoded.
• There will be a total of 4 timesteps for the encoder model.
• At each time step, the hidden state h will be updated using the
previous hidden state and the current input.
Context Vector
• The context is the final hidden state produced from the encoder
model.
• This vector aims to encapsulate the information for all input
elements in order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
• We often call the input to the decoder RNN the “context.”
• In encoder, we want to produce a representation of this context, C.
• The context C might be a vector or sequence of vectors that
summarize the input sequence X = (x1,…, xn).

Figure Encoder: hidden states h1, h2, …, hn computed from the inputs
x1, x2, …, xn
Decoder
• The decoder generates the output sequence by predicting the next
output Yt given the context vector C.
• The input for the decoder is the context vector obtained at the end
of encoder model.
• Each time step will have three inputs: hidden vector from previous
time step ht-1, the previous time output yt-1, and original context
vector C.
• At the first time step, the decoder is given the context vector
produced by the encoder, the special <START> symbol as its first input
token, and an initial (empty) hidden state h0.
• The second time step will have the updated hidden state h1 and the
previous output y1 and original context vector C as current inputs,
and produces the hidden vector h2 and output y2.
• The outputs produced at each timestep of the decoder form the actual
output sequence.
• The model keeps predicting outputs until the <END> symbol is produced.
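• A minimal encoder-decoder sketch is shown below (PyTorch, with invented
vocabulary sizes and <START>/<END> token ids). For brevity the context
vector is used only as the decoder's initial hidden state rather than
being fed in again at every step.

```python
import torch
import torch.nn as nn

# Vocabulary sizes, embedding sizes, and the <START>/<END> token ids are
# illustrative assumptions, not values from the chapter.
SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128
START, END = 1, 2

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                       # src: (batch, T_src) token ids
        _, h_n = self.rnn(self.embed(src))
        return h_n                                # final hidden state = context C

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, y_prev, h_prev):            # one decoding step at a time
        output, h = self.rnn(self.embed(y_prev), h_prev)
        return self.out(output[:, -1]), h

def greedy_translate(encoder, decoder, src, max_len=20):
    h = encoder(src)                                           # context C initialises the decoder state
    y = torch.full((src.size(0), 1), START, dtype=torch.long)  # start with the <START> symbol
    result = []
    for _ in range(max_len):
        logits, h = decoder(y, h)
        y = logits.argmax(dim=-1, keepdim=True)    # previous output becomes the next input
        result.append(y)
        if (y == END).all():                       # stop once every sequence emits <END>
            break
    return torch.cat(result, dim=1)

enc, dec = Encoder(), Decoder()
print(greedy_translate(enc, dec, torch.randint(3, SRC_VOCAB, (2, 7))).shape)
```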
5.8. Sequence to Sequence Models…
• The idea of encoder-decoder is as follows:
– An encoder or input RNN processes the input sequence. The
encoder emits the context C, usually as a simple function of its
final hidden state.
– A decoder is conditioned on that fixed-length vector C to
generate the output sequence Y = (y1, … ,yn).

Figure sequence to sequence model for machine translation


5.8. Sequence to Sequence Models…
• In a sequence-to-sequence architecture, the two RNNs are trained
jointly to maximize the average of log P(y1,…,yn|x1, …, xn) over
all the pairs of x and y sequences in the training set.
• The last state hn of the encoder RNN is typically used as a
representation C of the input sequence that is provided as input to
the decoder RNN.
• Sequence-to-sequence models have a lot of applications some of
which are:
– Machine translation
– Text summarization
– Sentiment analysis
– Speech recognition
– Video captioning
5.8. Sequence to Sequence Models…
• One limitation of this architecture arises when the context C output
by the encoder RNN has a dimension that is too small to properly
summarize a long sequence.
• This phenomenon was observed by Bahdanau et al. (2015) in the
context of machine translation.
• They proposed to make C a variable-length sequence rather than a
fixed-size vector.
• Additionally, they introduced an attention mechanism that learns
to associate elements of the sequence C to elements of the output
sequence.
5.9. Applications of RNN
• RNNs have a vast range of application areas, some of which are
discussed below.
i. Image Captioning
• Let’s say we have an image for which we need a textual description.
• So, we have a single input – the image, and a series or sequence of
words as output.
• Here the image might be of a fixed size, but the output is a
description of varying lengths.
• Image captioning is a task in computer vision and NLP that involves
generating a textual description of an image.
• It is teaching a computer to "look" at an image and then describe
what it sees in natural language, much like a human would.
• The goal is to create a meaningful and contextually accurate caption
that describes the content, objects, actions, and relationships depicted
in the image.
5.9. Applications of RNN…
ii. Sentiment Classification
• This can be a task of simply classifying opinions into positive and
negative sentiment.
• So here the input would be a tweet of varying lengths, while
output is of a fixed type and size.
5.9. Applications of RNN…
iii. Machine Translation
• Given an input in one language, RNNs can be used to translate the
input into different languages as output.
iv. Text Summarization
• Text summarization is a process of creating a subset that
represents the most important and relevant information of the
original content.
• For example, text summarization can be useful for someone who
wants to read the summary instead of the whole content.
• It will save time if the original content was not useful for the
reader.
v. Chatbots
• A chatbot is a computer program that simulates and processes human
conversation.
• Chatbots can be as simple as rudimentary programs that answer an easy
query with a single-line response, or as complex as digital assistants
that learn and evolve from their surroundings and gather and process
information.
• For example, most online customer services have a chatbot that responds
to queries in a question-answer format.
5.9. Applications of RNN…
vi. Speech Recognition
• This is also known as Automatic Speech Recognition (ASR), and
it is capable of converting human speech into text format.
• Speech recognition primarily focuses on converting voice data
into text.
• Speech recognition technologies that are used on a daily basis by
various users include Alexa, Cortana, Google Assistant, and Siri.
5.9. Applications of RNN…
vii. Video Tagging
• The contents and actions of a video can be analyzed, and the video
can automatically be assigned tags.
• These tags describe the video.
5.9. Applications of RNN…
viii. Prediction problems
• After being trained on historical time-stamped data, an RNN can
be used to create a time series prediction model that predicts the
future outcome.
• The stock market is a good example.
• You can use stock market data to build a machine learning model
that can forecast future stock prices based on what the model
learns from historical data.
• This can assist investors in making data-driven investment
decisions.
• Weather Forecasting is another example of prediction problem.
• It is predicting future weather conditions based on past data.
5.9. Applications of RNN…
• Energy Consumption Prediction: Estimating future energy usage
for efficient resource management.
• Anomaly Detection: Identifying unusual patterns in time series
data, such as fraud detection in financial transactions.

Fig Stock Price Prediction


5.9. Applications of RNN…
ix. Named Entity Recognition (NER)
• NER is an NLP task that identifies and classifies named entities in a
text, such as people, organizations, locations, dates, etc.
• RNNs are particularly well-suited for NER because they can
capture contextual information from the surrounding words,
which is crucial for accurately identifying and classifying entities.
• RNN-based models, especially Bi-LSTM + CRF architectures,
have significantly improved NER performance.
• Example: Barack Obama visited Berlin last week to attend a
climate change conference.
• Output with NER Annotations:
– Barack Obama: PERSON; Berlin: LOCATION; last week: DATE;
climate change conference: EVENT
5.9. Applications of RNN…
x. Music Generation
• Composing music or generating melodies based on patterns in
existing music.
• In music generation, the model learns patterns from existing
music and generates new compositions.
• RNNs are particularly suited for this task because they can
process and generate sequential data, such as musical notes,
chords, and rhythms.
• Music often has long-term structures (e.g., motifs, themes) that
are difficult for basic RNNs to capture. LSTMs and GRUs help
mitigate this issue.
