Unit-5 Deep Learning (1)
Several neural networks can help solve different business problems. Let’s look at a few of
them.
Convolutional Neural Network: Used for object detection and image classification.
RNN: Used for speech recognition, voice recognition, time series prediction, and
natural language processing.
A recurrent network can be described by a state update of the form s_t = f(s_{t-1}, x_t; θ), where we see that the state now contains information about the whole past sequence.
Recurrent neural networks can be built in many different ways. Much as almost any function
can be considered a feedforward neural network, essentially any function involving
recurrence can be considered a recurrent neural network.
Many recurrent neural networks use this equation, or a similar one, to define the values of their hidden units. To indicate that the state is the hidden units of the network, we now rewrite the equation using the variable h to represent the state:
h_t = f(h_{t-1}, x_t; θ)
Typical RNNs will add extra architectural features such as output layers that read information out of the state to make predictions.
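To make the recurrence concrete, here is a minimal NumPy sketch (the dimensions, the tanh non-linearity, and the linear output layer are illustrative assumptions, not taken from the text): the same parameters θ are applied at every time step, and an output layer reads predictions out of the state h_t.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch only)
input_size, hidden_size, output_size, seq_len = 4, 8, 3, 5
rng = np.random.default_rng(0)

# The shared parameters theta, reused at every time step
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

def rnn_forward(xs):
    """Apply h_t = f(h_{t-1}, x_t; theta) at every step and read an output from the state."""
    h = np.zeros(hidden_size)            # initial state h_0
    outputs = []
    for x_t in xs:                       # the same f with the same parameters at each step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        outputs.append(W_hy @ h + b_y)   # output layer reads information out of the state
    return np.stack(outputs), h

xs = rng.normal(size=(seq_len, input_size))   # a toy input sequence
ys, h_final = rnn_forward(xs)
print(ys.shape)                               # (5, 3): one output per time step
```

Because the loop reuses the same weight matrices at every step, the model size does not depend on the sequence length, which is exactly the parameter-sharing advantage discussed below.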
When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h_t as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to time t.
The most demanding situation is when we ask h_t to be rich enough to allow one to
approximately recover the input sequence, as in auto-encoder frameworks.
This recurrence can be drawn in two different ways. One way to draw the RNN is with a diagram
containing one node for every component that might exist in a physical
implementation of the model, such as a biological neural network.
In this view, the network defines a circuit that operates in real time, with physical
parts whose current state can influence their future state.
The other way to draw the RNN is as an unfolded computational graph, in which each
component is represented by many different variables, with one variable per time step,
representing the state of the component at that point in time.
Each variable for each time step is drawn as a separate node of the computational
graph, as in the right side of the figure above.
What we call unfolding is the operation that maps a circuit as in the left side of the
figure to a computational graph with repeated pieces as in the right side. The
unfolded graph now has a size that depends on the sequence length.
The function g_t takes the whole past sequence (x_t, x_{t-1}, ..., x_1) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g_t into repeated application of a function f. The unfolding process thus introduces two major
advantages:
Regardless of the sequence length, the learned model always has the same input size,
because it is specified in terms of transition from one state to another state, rather than
specified in terms of a variable-length history of states.
It is possible to use the same transition function with the same parameters at every
time step.
These two factors make it possible to learn a single model that operates on all time
steps and all sequence lengths, rather than needing to learn a separate model for
all possible time steps. Learning a single, shared model allows generalization to
sequence lengths that did not appear in the training set, and allows the model to be
estimated with far fewer training examples than would be required without parameter
sharing.
Both the recurrent graph and the unrolled graph have their uses. The recurrent graph
is succinct. The unfolded graph provides an explicit description of which
computations to perform. The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and losses) and backward in
time (computing gradients) by explicitly showing the path along which this
information flows.
Hence these three layers can be joined together such that the weights and biases of all the hidden layers are the same, forming a single recurrent layer. The network is then trained through the following steps; a short training-loop sketch follows the list.
1. A single time step of the input is provided to the network.
2. Its current state is calculated using the current input and the previous state.
3. The current state h_t becomes h_{t-1} for the next time step.
4. One can go through as many time steps as the problem requires, joining the information from all the previous states.
5. Once all the time steps are completed the final current state is used to calculate the
output.
6. The output is then compared to the actual output, i.e., the target output, and the error is generated.
7. The error is then back-propagated to the network to update the weights and hence the
network (RNN) is trained.
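The seven steps above can be condensed into a short training loop. The sketch below uses PyTorch as one possible implementation; the toy regression task, the MSE loss, the SGD optimizer, and all sizes are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Toy setup (assumed for illustration): predict one value from a whole sequence
torch.manual_seed(0)
seq_len, input_size, hidden_size = 10, 3, 16
x = torch.randn(1, seq_len, input_size)   # one toy input sequence (steps 1-4: time steps fed to the network)
target = torch.randn(1, 1)                # the actual (target) output

rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # recurrent state updates
readout = nn.Linear(hidden_size, 1)                      # step 5: output from the final state
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    states, h_final = rnn(x)          # h_final: final current state after all time steps
    y_pred = readout(h_final[-1])     # step 5: output computed from the final state
    loss = loss_fn(y_pred, target)    # step 6: compare with the target, generate the error
    loss.backward()                   # step 7: back-propagate the error through time
    optimizer.step()                  #         and update the weights
```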
Why RNN?
RNNs were created because the feed-forward neural network has a few limitations: it cannot handle sequential data, it considers only the current input, and it cannot memorize previous inputs. An RNN addresses these issues: it can handle sequential data, accepting both the current input and previously received inputs, and it can memorize previous inputs thanks to its internal memory.
Advantages of RNN
1. An RNN remembers each piece of information through time. This ability to remember previous inputs is what makes it useful for time series prediction; it is taken further in Long Short-Term Memory (LSTM) networks.
2. Recurrent neural networks are even used with convolutional layers to extend the
effective pixel neighborhood.
Disadvantages of RNN
1. RNNs suffer from vanishing and exploding gradient problems.
2. They are slow because they process the sequence step by step: to calculate the current state, the previous state must already be known. Training an RNN is therefore a difficult task.
3. They cannot process very long sequences when tanh or ReLU is used as the activation function.
Applications of RNN
1. Language Modelling and Generating Text
2. Speech Recognition
3. Machine Translation
Bidirectional RNN
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in
the forward direction, while the other processes it in the reverse direction. The outputs of
these two RNNs are then combined in some way to produce the final output.
One common way to combine the outputs of the forward and reverse RNNs is to concatenate
them. Still, other methods, such as element-wise addition or multiplication, can also be used.
The choice of combination method can depend on the specific task and the desired properties
of the final output.
A standard uni-directional RNN, by contrast, can only use information from earlier time steps when making predictions at later time steps.
This can be limiting, as the network may not capture important contextual information
relevant to the output prediction.
For example, in natural language processing tasks, a uni-directional RNN may not
accurately predict the next word in a sentence if the previous words provide important
context for the current word.
Consider an example where we could use the recurrent network to predict the masked word
in a sentence.
In the first sentence, the answer could be fruit, company, or phone. But it cannot be a fruit in the second and third sentences.
A Recurrent Neural Network that can only process the inputs from left to right may not accurately predict the right answer for the sentences discussed above.
Processing the sequence in both directions, as a bidirectional RNN does, can be useful for tasks such as language processing, where understanding the context of a word or phrase can be important for making accurate predictions.
In general, bidirectional RNNs can help improve a model's performance on various sequence-
based tasks.
Concretely, a bidirectional RNN consists of two RNNs:
1. One that processes the input sequence from left to right.
2. Another one that processes the input sequence from right to left.
These two RNNs are typically called the forward and backward RNNs, respectively. A small sketch of this forward/backward combination follows.
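Here is a minimal NumPy sketch of that combination; the sizes, the tanh non-linearity, and concatenation as the combination method are illustrative assumptions.

```python
import numpy as np

# Minimal bidirectional RNN layer sketch; sizes and tanh are illustrative assumptions.
rng = np.random.default_rng(1)
input_size, hidden_size, seq_len = 4, 6, 5

def make_params():
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

fwd_params, bwd_params = make_params(), make_params()   # two separate RNNs

def run_rnn(xs, params):
    W_xh, W_hh, b = params
    h, states = np.zeros(hidden_size), []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        states.append(h)
    return states

xs = rng.normal(size=(seq_len, input_size))
h_fwd = run_rnn(xs, fwd_params)              # forward RNN: left to right
h_bwd = run_rnn(xs[::-1], bwd_params)[::-1]  # backward RNN: right to left, re-aligned

# Combine the two directions, here by concatenation (addition would also work)
h_bi = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(len(h_bi), h_bi[0].shape)   # 5 time steps, each with 2 * hidden_size features
```

Concatenation doubles the feature size per time step, whereas element-wise addition keeps it unchanged, which is why the choice of combination method depends on the downstream task.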
The Encoder-Decoder architecture with recurrent neural networks has become an effective
and standard approach for both neural machine translation (NMT) and sequence-to-sequence
(seq2seq) prediction in general.
The key benefits of the approach are the ability to train a single end-to-end model directly on
source and target sentences and the ability to handle variable length input and output
sequences of text.
An encoder network first reads the source sentence and encodes it into a fixed-length internal representation; a decoder network then uses this internal representation to output words until the end-of-sequence token is reached. LSTM networks were used for both the encoder and the decoder.
The most fundamental building block used to construct the encoder-decoder architecture is the neural network. Different kinds of neural networks, including RNNs, LSTMs, and CNNs, can be used within the encoder-decoder architecture.
In this architecture, the input data is first fed through what is called an encoder network. The encoder network maps the input data into a numerical representation that captures the important information from the input. This numerical representation of the input data is also called the hidden state. The numerical representation (hidden state) is then fed into what is called the decoder network. The decoder network generates the output by producing one element of the output sequence at a time. Note that both the input and the output sequence of data can be of varying length.
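The sketch below illustrates this encoder/decoder flow with two LSTMs in PyTorch. The vocabulary sizes, embedding sizes, the greedy decoding loop, and the end-of-sequence token id are all assumptions for illustration, not a full translation system.

```python
import torch
import torch.nn as nn

# Minimal encoder-decoder sketch with LSTMs; sizes and token ids are assumptions.
torch.manual_seed(0)
src_vocab, tgt_vocab, emb, hidden = 50, 60, 32, 64
EOS = 0   # assumed end-of-sequence (and start) token id

src_embed = nn.Embedding(src_vocab, emb)
tgt_embed = nn.Embedding(tgt_vocab, emb)
encoder = nn.LSTM(emb, hidden, batch_first=True)
decoder = nn.LSTM(emb, hidden, batch_first=True)
readout = nn.Linear(hidden, tgt_vocab)

def translate(src_ids, max_len=10):
    # Encoder: map the variable-length input to a hidden state (the internal representation)
    _, state = encoder(src_embed(src_ids))
    # Decoder: generate one output element at a time, starting from the encoder state
    token = torch.tensor([[EOS]])
    out_ids = []
    for _ in range(max_len):
        step, state = decoder(tgt_embed(token), state)
        token = readout(step).argmax(dim=-1)   # greedy choice of the next word
        if token.item() == EOS:                # stop at the end-of-sequence token
            break
        out_ids.append(token.item())
    return out_ids

print(translate(torch.randint(1, src_vocab, (1, 7))))   # a random 7-token "sentence"
```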
A popular form of neural network architecture called the autoencoder is a type of encoder-decoder architecture. An autoencoder is a neural network that uses an
encoder to compress an input into a lower-dimensional representation, and a decoder to
reconstruct the original input from the compressed representation. It is primarily used for
unsupervised learning and data compression. The other types of encoder-decoder architecture
can be used for supervised learning tasks, such as machine translation, image captioning, and
speech recognition. In this architecture, the encoder maps the input to a fixed-length
representation, which is then passed to the decoder to generate the output. So while the
encoder-decoder architecture and autoencoder have similar components, their main purposes
and applications differ.
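For contrast, here is a minimal autoencoder sketch in PyTorch, with assumed layer sizes for flattened 28x28 inputs: the same encoder/decoder components appear, but the target is the input itself.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: encoder compresses, decoder reconstructs. Sizes are assumptions.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(8, 784)                    # a batch of flattened inputs
code = encoder(x)                         # compressed, lower-dimensional representation
x_hat = decoder(code)                     # reconstruction of the original input
loss = nn.functional.mse_loss(x_hat, x)   # unsupervised: the input is its own target
```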
The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
With the RNN architecture of above fig., each of these three blocks is associated with a single
weight matrix. In other words, when the network is unfolded, each of these corresponds to a
shallow transformation.
Graves et al. were the first to show a significant benefit of decomposing the state of an RNN
into multiple layers as in below Fig.(left).
We can think of the lower layers in the hierarchy depicted in above Fig (a) as playing a role
in transforming the raw input into a representation that is more appropriate, at the higher
levels of the hidden state.
In general, it is easier to optimize shallower architectures, and adding the extra depth of above Fig. (b) makes the shortest path from a variable in time step t to a variable in time step t + 1 become longer.
However, this can be mitigated by introducing skip connections in the hidden-to-hidden path,
as illustrated in above Fig.(c).
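A deep (stacked) RNN of this kind can be sketched with PyTorch's built-in stacked RNN; the sizes and number of layers here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A deep RNN: stacking recurrent layers adds depth to the hidden-to-hidden path.
deep_rnn = nn.RNN(input_size=8, hidden_size=32, num_layers=3, batch_first=True)
x = torch.randn(4, 20, 8)              # batch of 4 sequences, 20 time steps each
states, h_n = deep_rnn(x)
print(states.shape)                    # (4, 20, 32): top-layer state at each time step
print(h_n.shape)                       # (3, 4, 32): final state of each of the 3 layers
```

The lower layers transform the raw input into intermediate representations that are fed, time step by time step, to the layers above; only the top layer's states are returned in `states`.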
Recursive Neural Networks (RvNNs) are deep neural networks used for natural language
processing. We get a Recursive Neural Network when the same weights are applied
recursively on a structured input to obtain a structured prediction.
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn
detailed and structured information. With RvNN, you can get a structured prediction by
recursively applying the same set of weights on structured inputs. The word recursive indicates that the neural network is applied to its own output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data.
The tree structure means combining child nodes and producing parent nodes. Each child-
parent bond has a weight matrix, and similar children have the same weights. The number of
children for every node in the tree is fixed to enable it to perform recursive operations and
use the same weights. RvNNs are used when there's a need to parse an entire sentence.
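The sketch below shows the core recursive idea in NumPy: one shared weight matrix composes two child vectors into a parent vector, applied bottom-up over a hand-built binary parse tree. The tree, the dimensions, and the tanh non-linearity are illustrative assumptions.

```python
import numpy as np

# Recursive network sketch: the same weights combine child vectors into parent vectors.
rng = np.random.default_rng(0)
dim = 8
W = rng.normal(scale=0.1, size=(dim, 2 * dim))   # shared child -> parent weights
b = np.zeros(dim)

def compose(node):
    """A leaf is a word vector; an internal node is a pair (left, right)."""
    if isinstance(node, np.ndarray):
        return node
    left, right = node
    children = np.concatenate([compose(left), compose(right)])
    return np.tanh(W @ children + b)             # the same W is reused at every node

# Toy parse tree for a 4-word sentence: ((w1 w2) (w3 w4))
words = [rng.normal(size=dim) for _ in range(4)]
tree = ((words[0], words[1]), (words[2], words[3]))
sentence_vector = compose(tree)
print(sentence_vector.shape)   # (8,): a structured representation of the whole sentence
```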
Neural network optimization faces a difficulty when computational graphs become deep, e.g., in RNNs that repeatedly apply the same operation at each time step of a long temporal sequence.
Gradients propagated over many stages tend to either vanish (most of the time) or explode
(damaging optimization)
The difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (which involve the multiplication of many Jacobians).
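The following toy NumPy computation illustrates the point numerically: back-propagating through t steps of a linear recurrence multiplies the gradient by the same Jacobian-like matrix W at each step, so its norm shrinks or grows exponentially. The specific matrices and scales are assumptions chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=16)                   # some gradient arriving at the last time step

for scale in (0.9, 1.1):                     # contracting vs. expanding recurrent weights
    W = scale * np.linalg.qr(rng.normal(size=(16, 16)))[0]   # scaled orthogonal matrix
    for t in (10, 50, 100):
        g = grad.copy()
        for _ in range(t):
            g = W.T @ g                      # one Jacobian multiplication per time step
        print(f"scale={scale}, steps={t}, gradient norm={np.linalg.norm(g):.3e}")
# With scale 0.9 the norm vanishes with depth; with scale 1.1 it explodes.
```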
Echo state network is a type of Recurrent Neural Network, part of the reservoir computing
framework, which has the following particularities:
the weights between the input and the hidden layer (the 'reservoir'), Win, and the weights within the reservoir, Wr, are randomly assigned and not trainable;
the weights of the output neurons (the 'readout' layer) are trainable and can be learned so that the network can reproduce specific temporal patterns;
the hidden layer (the reservoir) is very sparsely connected (typically < 10% connectivity);
the reservoir architecture creates a recurrent non-linear embedding (denoted H) of the input, which can then be connected to the desired output; only these final readout weights are trainable. A small sketch of this setup follows.
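The NumPy sketch below follows that recipe: Win and a sparse reservoir Wr are fixed and random, the reservoir states H form the non-linear embedding, and only the readout weights are fitted (here with ridge regression). The toy sine task, the spectral-radius rescaling to 0.9, the 10% sparsity, and the ridge parameter are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 1, 200, 1000

Win = rng.uniform(-0.5, 0.5, size=(n_res, n_in))           # input weights: random, not trained
Wr = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
Wr *= rng.random((n_res, n_res)) < 0.1                      # keep ~10% of connections (sparse)
Wr *= 0.9 / max(abs(np.linalg.eigvals(Wr)))                 # scale spectral radius below 1

u = np.sin(np.arange(T) * 0.1)[:, None]                     # toy input signal
y = np.sin(np.arange(T) * 0.1 + 0.5)                        # toy target: a shifted version

# Run the reservoir to collect the non-linear embedding H of the input
h = np.zeros(n_res)
H = np.zeros((T, n_res))
for t in range(T):
    h = np.tanh(Win @ u[t] + Wr @ h)
    H[t] = h

# Train only the readout weights, here with ridge regression
ridge = 1e-6
Wout = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ y)
print("train MSE:", np.mean((H @ Wout - y) ** 2))
```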
LSTM
Bidirectional LSTM
Need of LSTM
LSTM networks are an extension of recurrent neural networks (RNNs) mainly introduced to
handle situations where RNNs fail.
An RNN fails to store information for a longer period of time. At times, a reference to information stored quite a long time ago is required to predict the current output, but RNNs are incapable of handling such "long-term dependencies".
There is no finer control over which part of the context needs to be carried forward
and how much of the past needs to be ‘forgotten’.
Other issues with RNNs are exploding and vanishing gradients (discussed earlier), which occur during the training of the network through backpropagation.
Thus, Long Short-Term Memory (LSTM) was brought into the picture. It has been so
designed that the vanishing gradient problem is almost completely removed, while the
training model is left unaltered. Long-time lags in certain problems are bridged using LSTMs
which also handle noise, distributed representations, and continuous values. With LSTMs,
there is no need to keep a finite number of states from beforehand as required in the hidden
Markov model (HMM). LSTMs provide us with a large range of parameters such as learning
rates, and input and output biases.
Structure of LSTM
The basic difference between the architectures of RNNs and LSTMs is that the hidden layer
of LSTM is a gated unit or gated cell. It consists of four layers that interact with one another
in a way to produce the output of that cell along with the cell state. These two things are then
passed on to the next hidden layer. Unlike RNNs, which have only a single neural network layer of tanh, LSTMs comprise three logistic sigmoid gates and one tanh layer. Gates have been
introduced in order to limit the information that is passed through the cell. They determine
which part of the information will be needed by the next cell and which part is to be
discarded. The gate outputs lie in the range 0-1, where '0' means 'reject all' and '1' means 'include all'.
Information is retained by the cells and the memory manipulations are done by the gates.
There are three gates which are explained below:
Forget Gate The information that is no longer useful in the cell state is removed with the
forget gate. Two inputs x_t (input at the particular time) and h_t-1 (previous cell output) are
fed to the gate and multiplied with weight matrices followed by the addition of bias. The
resultant is passed through a sigmoid activation function, which gives an output between 0 and 1. If, for a particular cell state, the output is close to 0, the piece of information is forgotten; for an output close to 1, the information is retained for future use.
Input gate The addition of useful information to the cell state is done by the input gate. First,
the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs h_t-1 and x_t. Then, a vector is created
using the tanh function that gives an output from -1 to +1, which contains all the possible
values from h_t-1 and x_t. At last, the values of the vector and the regulated values are
multiplied to obtain useful information.
Output gate The task of extracting useful information from the current cell state to be
presented as output is done by the output gate. First, a vector is generated by applying the
tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs h_t-1 and x_t. At last, the values of the vector and the regulated values are multiplied and sent as the output of the cell and as input to the next
cell.
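The three gates described above can be written out directly. Below is a single LSTM cell step in NumPy; the dimensions, the random parameters, and the use of one weight matrix per gate acting on the concatenation of h_t-1 and x_t are illustrative assumptions.

```python
import numpy as np

# One step of an LSTM cell, mirroring the forget/input/output gate descriptions above.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, each acting on the concatenation [h_{t-1}, x_t]
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hid)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])    # both inputs are fed to every gate
    f = sigmoid(Wf @ z + bf)             # forget gate: what to drop from the cell state
    i = sigmoid(Wi @ z + bi)             # input gate: what new information to admit
    c_tilde = np.tanh(Wc @ z + bc)       # candidate values in the range -1 to +1
    c = f * c_prev + i * c_tilde         # updated cell state
    o = sigmoid(Wo @ z + bo)             # output gate: what to expose from the cell state
    h = o * np.tanh(c)                   # new hidden state, passed to the next cell
    return h, c

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid))
print(h.shape, c.shape)   # (8,) (8,)
```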
Bidirectional LSTM
Bidirectional LSTM or BiLSTM is a term used for a sequence model which contains two
LSTM layers, one for processing input in the forward direction and the other for processing
in the backward direction. It is usually used in NLP-related tasks. The intuition behind this
approach is that by processing data in both directions, the model is able to better understand
the relationship between sequences (e.g. knowing the following and preceding words in a
sentence).
To better understand this let us see an example. The first statement is “Server can you bring
me this dish” and the second statement is “He crashed the server”. In both these statements,
the word server has different meanings and this relationship depends on the following and
preceding words in the statement. A bidirectional LSTM helps the model understand this relationship better than a unidirectional LSTM does. This ability of BiLSTM
makes it a suitable architecture for tasks like sentiment analysis, text classification, and
machine translation.
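A BiLSTM sentence classifier of the kind described above can be sketched in a few lines of PyTorch; the vocabulary size, embedding size, mean pooling over time, and the two-class (sentiment-style) readout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Minimal BiLSTM sketch for a sequence classification task; all sizes are assumptions.
vocab, emb, hidden = 1000, 50, 64
embed = nn.Embedding(vocab, emb)
bilstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden, 2)      # forward + backward features per time step

tokens = torch.randint(0, vocab, (1, 12))  # e.g. "He crashed the server ..." as token ids
states, _ = bilstm(embed(tokens))          # (1, 12, 2 * hidden): both directions combined
logits = classifier(states.mean(dim=1))    # pool over time, then classify the sentence
print(logits.shape)                        # (1, 2)
```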