
NB-SEAGI DL(R20)-Unit-5

DEEP LEARNING (20A05703c)


UNIT V
Sequence Modeling: Recurrent and Recursive Nets: Unfolding Computational Graphs,
Recurrent Neural Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence
Architectures, Deep Recurrent Networks, Recursive Neural Networks, Echo State Networks,
LSTM, Gated RNNs, Optimization for Long-Term Dependencies, Auto encoders, Deep
Generative Models.

What Is a Neural Network? A Neural Network consists of different layers connected to
each other, modeled on the structure and function of the human brain. It learns from huge
volumes of data and uses complex algorithms to train the network.

Several types of neural networks can help solve different business problems. Let’s look at a few of
them.

 Feed-Forward Neural Network: Used for general regression and classification problems.

 Convolutional Neural Network: Used for object detection and image classification.

 Deep Belief Network: Used in healthcare sectors for cancer detection.

 RNN: Used for speech recognition, voice recognition, time series prediction, and
natural language processing.

Unfolding Computational Graphs

A computational graph is a way to formalize the structure of a set of computations, such as
those involved in mapping inputs and parameters to outputs and loss. In this section we
explain the idea of unfolding a recursive or recurrent computation into a computational graph
that has a repetitive structure, typically corresponding to a chain of events. Unfolding this
graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

s^(t) = f(s^(t-1); θ)

where s^(t) is called the state of the system. The equation is recurrent because the definition
of s at time t refers back to the same definition at time t-1.

As another example, consider a dynamical system driven by an external signal x^(t):

s^(t) = f(s^(t-1), x^(t); θ)

where we see that the state now contains information about the whole past sequence.
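As a minimal illustration (not from the original text), the sketch below unfolds this recurrence for a few steps in Python with NumPy; the particular choice of f as a tanh of a linear map and the values of theta are assumptions made only for illustration.

```python
# A minimal sketch of unfolding the recurrence s(t) = f(s(t-1); theta), assuming NumPy.
# The function f and the parameters theta are illustrative, not from the notes.
import numpy as np

def f(s, theta):
    # One step of the dynamical system: an affine map followed by tanh.
    return np.tanh(theta @ s)

theta = np.array([[0.9, -0.2],
                  [0.1,  0.8]])   # shared parameters, reused at every step
s = np.array([1.0, 0.0])          # initial state s(0)

for t in range(1, 4):             # unfold the graph for 3 steps
    s = f(s, theta)               # s(t) = f(s(t-1); theta)
    print(f"s({t}) =", s)
```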


Recurrent neural networks can be built in many different ways. Much as almost any function
can be considered a feedforward neural network, essentially any function involving
recurrence can be considered a recurrent neural network.

Many recurrent neural networks use the equation below, or a similar equation, to define the
values of their hidden units. To indicate that the state is the hidden units of the network, we
now rewrite the equation using the variable h to represent the state:

h^(t) = f(h^(t-1), x^(t); θ)

Typical RNNs will add extra architectural features, such as output layers that read information
out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future
from the past, the network typically learns to use h^(t) as a kind of lossy summary of the task-
relevant aspects of the past sequence of inputs up to time t.

 This summary is in general necessarily lossy, since it maps an arbitrary-length sequence
(x^(t), x^(t-1), ..., x^(1)) to a fixed-length vector h^(t).
 Depending on the training criterion, this summary might selectively keep some
aspects of the past sequence with more precision than other aspects.


The most demanding situation is when we ask h^(t) to be rich enough to allow one to
approximately recover the input sequence, as in autoencoder frameworks.

The recurrence h^(t) = f(h^(t-1), x^(t); θ) can be drawn in two different ways. One way to
draw the RNN is with a diagram containing one node for every component that might exist in
a physical implementation of the model, such as a biological neural network.

 In this view, the network defines a circuit that operates in real time, with physical
parts whose current state can influence their future state.

The other way to draw the RNN is as an unfolded computational graph, in which each
component is represented by many different variables, with one variable per time step,
representing the state of the component at that point in time.

 Each variable for each time step is drawn as a separate node of the computational
graph, as on the right side of the figure.

What we call unfolding is the operation that maps a circuit as in the left side of the
figure to a computational graph with repeated pieces as in the right side. The
unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after t steps with a function g^(t):

h^(t) = g^(t)(x^(t), x^(t-1), ..., x^(2), x^(1)) = f(h^(t-1), x^(t); θ)

The function g^(t) takes the whole past sequence as input and produces the current
state, but the unfolded recurrent structure allows us to factorize g^(t) into repeated
application of a function f. The unfolding process thus introduces two major
advantages:

 Regardless of the sequence length, the learned model always has the same input size,
because it is specified in terms of a transition from one state to another, rather than in terms
of a variable-length history of states.
 It is possible to use the same transition function f with the same parameters at every
time step.

These two factors make it possible to learn a single model f that operates on all time
steps and all sequence lengths, rather than needing to learn a separate model g^(t) for
each possible time step. Learning a single, shared model allows generalization to
sequence lengths that did not appear in the training set, and allows the model to be
estimated with far fewer training examples than would be required without parameter
sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph
is succinct. The unfolded graph provides an explicit description of which

computations to perform. The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and losses) and backward in
time (computing gradients) by explicitly showing the path along which this
information flows.

Recurrent Neural Network

 Recurrent Neural Network (RNN) is a type of Neural Network where the output
from the previous step is fed as input to the current step. In traditional neural networks, all
the inputs and outputs are independent of each other, but in cases where it is required to
predict the next word of a sentence, the previous words are required, and hence there is a
need to remember the previous words. Thus RNNs came into existence, which solved this
issue with the help of a hidden layer.

 The main and most important feature of an RNN is its hidden state, which remembers
some information about a sequence.

 RNNs have a “memory” which remembers all information about what has been
calculated. An RNN uses the same parameters for each input, as it performs the same task on
all the inputs or hidden layers to produce the output. This reduces the number of parameters,
unlike other neural networks.

Now the RNN works as follows:

An RNN converts independent activations into dependent activations by providing the same
weights and biases to all the layers, thus reducing the growth in the number of parameters and
memorizing each previous output by giving it as input to the next hidden layer.

Hence these layers can be joined together, such that the weights and biases of all the hidden
layers are the same, into a single recurrent layer.

The formula for calculating the current state is:

h_t = f(h_{t-1}, x_t)

where h_t is the current state, h_{t-1} is the previous state, and x_t is the input at the current
step. With a tanh activation this becomes h_t = tanh(W_hh h_{t-1} + W_xh x_t), and the
output is computed as y_t = W_hy h_t.
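Below is a minimal NumPy sketch of these formulas; the weight names (W_xh, W_hh, W_hy), sizes, and toy data are illustrative assumptions, not taken from a specific library.

```python
# A minimal sketch of the RNN state-update and output formulas above, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input  -> hidden
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden -> hidden (shared over time)
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden -> output

h = np.zeros(hidden_size)                       # h_0
inputs = rng.standard_normal((5, input_size))   # a toy sequence of 5 time steps

for t, x_t in enumerate(inputs, start=1):
    h = np.tanh(W_xh @ x_t + W_hh @ h)          # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    y = W_hy @ h                                # y_t = W_hy h_t
    print(f"step {t}: y_t =", y)
```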


Training through RNN

1. A single time step of the input is provided to the network.

2. Then calculate its current state using a set of current input and the previous state.

3. The current ht becomes ht-1 for the next time step.

4. One can go through as many time steps as the problem requires and combine the
information from all the previous states.

5. Once all the time steps are completed, the final current state is used to calculate the
output.

6. The output is then compared to the actual output, i.e., the target output, and the error is
generated.

7. The error is then back-propagated to the network to update the weights and hence the
network (RNN) is trained.
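The following is a hedged sketch of these training steps, assuming PyTorch; the toy data, dimensions, and the choice of reading the output only from the final state are illustrative assumptions.

```python
# A sketch of the training loop above using PyTorch's nn.RNN; loss.backward()
# performs backpropagation through time over the unrolled sequence.
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size, output_size = 6, 8, 3, 16, 2

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, output_size)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(batch, seq_len, input_size)   # input sequence (toy data)
target = torch.randn(batch, output_size)      # target for the final state (toy data)

for epoch in range(5):
    optimizer.zero_grad()
    outputs, h_n = rnn(x)                     # steps 1-4: states computed over all time steps
    y = readout(h_n.squeeze(0))               # step 5: final state -> output
    loss = loss_fn(y, target)                 # step 6: compare with the target output
    loss.backward()                           # step 7: backpropagate the error
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```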

Why RNN?

RNNs were created because there were a few issues with the feed-forward neural network:

 Cannot handle sequential data

 Considers only the current input

 Cannot memorize previous inputs

The solution to these issues is the RNN.

 An RNN can handle sequential data, accepting both the current input and previously
received inputs. RNNs can memorize previous inputs due to their internal memory.

Advantages of RNN


1. An RNN remembers each and every piece of information through time. This is useful in
time-series prediction because of its ability to remember previous inputs, a capability
extended by Long Short-Term Memory (LSTM) networks.

2. Recurrent neural networks are even used with convolutional layers to extend the
effective pixel neighborhood.

3. Model size does not grow with input size.

Disadvantages of RNN

1. Gradient vanishing and exploding problems.

2. They are slow because they process the input sequentially: to calculate the current state,
the previous state must be known. This makes training an RNN a difficult task.

3. It cannot process very long sequences when tanh or ReLU is used as the activation function.

Applications of RNN

1. Language Modelling and Generating Text

2. Speech Recognition

3. Machine Translation

4. Image Recognition, Face detection

5. Time series Forecasting

Applications of RNN in Detail

1. Machine Translation: We make use of Recurrent Neural Networks in translation engines to
translate text from one language to another. They do this in combination with other models
such as LSTMs (Long Short-Term Memory networks).

2. Speech Recognition: Recurrent Neural Networks have replaced the traditional speech
recognition models that made use of Hidden Markov Models. These Recurrent Neural
Networks, along with LSTMs, are better poised at classifying speech and converting it into
text without loss of context.


3. Sentiment Analysis: We make use of sentiment analysis to determine the positivity,
negativity, or neutrality of a sentence. RNNs are well suited to handling data sequentially in
order to find the sentiment of a sentence.

4. Automatic Image Tagger: RNNs, in conjunction with convolutional neural networks, can
detect objects in images and provide descriptions in the form of tags. For example, a picture
of a fox jumping over a fence is described more appropriately using RNNs.

Bidirectional RNN

A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural network
(RNN) that processes input data in both forward and backward directions. The goal of a Bi-
RNN is to capture the contextual dependencies in the input data by processing it in both
directions, which can be useful in various natural language processing (NLP) tasks.

In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in
the forward direction, while the other processes it in the reverse direction. The outputs of
these two RNNs are then combined in some way to produce the final output.

One common way to combine the outputs of the forward and reverse RNNs is to concatenate
them. Still, other methods, such as element-wise addition or multiplication, can also be used.
The choice of combination method can depend on the specific task and the desired properties
of the final output.
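As a minimal sketch, assuming PyTorch: with bidirectional=True the framework runs a forward and a backward RNN and concatenates their outputs along the feature dimension. All sizes below are illustrative.

```python
# A minimal bidirectional RNN sketch; the forward and backward outputs are
# concatenated along the feature dimension.
import torch
import torch.nn as nn

birnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)          # (batch, seq_len, input_size), toy data

outputs, h_n = birnn(x)
print(outputs.shape)               # torch.Size([4, 7, 40]) -> forward (20) + backward (20) concatenated
print(h_n.shape)                   # torch.Size([2, 4, 20]) -> final states of the two directions
```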

Need of Bidirectional RNN

 A uni-directional recurrent neural network (RNN) processes input sequences in a
single direction, either from left to right or from right to left.

 This means the network can only use information from earlier time steps when
making predictions at later time steps.

 This can be limiting, as the network may not capture important contextual information
relevant to the output prediction.

 For example, in natural language processing tasks, a uni-directional RNN may not
accurately predict the next word in a sentence if the previous words provide important
context for the current word.

Consider an example where we could use the recurrent network to predict the masked word
in a sentence.

1. Apple is my favorite _____.

2. Apple is my favorite _____, and I work there.

3. Apple is my favorite _____, and I am going to buy one.

In the first sentence, the answer could be fruit, company, or phone. But it cannot be a fruit in
the second and third sentences.

A Recurrent Neural Network that can only process the inputs from left to right may not
accurately predict the right answer for the sentences discussed above.

Processing the sequence in both directions can be useful for tasks such as language
processing, where understanding the context of a word or phrase is important for making
accurate predictions.

In general, bidirectional RNNs can help improve a model's performance on various sequence-
based tasks.

This means that the network has two separate RNNs:

1. One that processes the input sequence from left to right

2. Another one that processes the input sequence from right to left.

These two RNNs are typically called forward and backward RNNs, respectively.


Difference between Bidirectional RNN and RNN

Aspect            | RNN (uni-directional)                | Bidirectional RNN
Processing        | One direction (e.g., left to right)  | Two RNNs, one forward and one backward
Context available | Only earlier time steps              | Both earlier and later time steps
Typical use       | Online / streaming prediction        | Tasks where the whole sequence is available (many NLP tasks)

Encoder-Decoder Sequence-to-Sequence Architectures

The Encoder-Decoder architecture with recurrent neural networks has become an effective
and standard approach for both neural machine translation (NMT) and sequence-to-sequence
(seq2seq) prediction in general.

The key benefits of the approach are the ability to train a single end-to-end model directly on
source and target sentences and the ability to handle variable length input and output
sequences of text.

An Encoder-Decoder architecture was developed in which an input sequence is read in its
entirety and encoded into a fixed-length internal representation.

A decoder network then used this internal representation to output words until the end-of-
sequence token was reached. LSTM networks were used for both the encoder and the decoder.

The encoder-decoder architecture is a deep learning architecture used in many natural
language processing and computer vision applications. It consists of two main components:
an encoder and a decoder.

The most fundamental building block used to construct the encoder-decoder architecture is
the neural network. Different kinds of neural networks, including RNNs, LSTMs, and CNNs,
can be used within the encoder-decoder architecture.


In this architecture, the input data is first fed through what is called the encoder network.
The encoder network maps the input data into a numerical representation that captures the
important information from the input. This numerical representation of the input data is also
called the hidden state. The numerical representation (hidden state) is then fed into what is
called the decoder network. The decoder network generates the output by producing one
element of the output sequence at a time. Note that both the input and output sequences of
data can be of varying length.

A popular form of neural network architecture called an autoencoder is a type of encoder-
decoder architecture. An autoencoder is a neural network architecture that uses an encoder to
compress an input into a lower-dimensional representation, and a decoder to reconstruct the
original input from the compressed representation. It is primarily used for unsupervised
learning and data compression. Other types of encoder-decoder architecture can be used for
supervised learning tasks, such as machine translation, image captioning, and speech
recognition. In this architecture, the encoder maps the input to a fixed-length representation,
which is then passed to the decoder to generate the output. So while the encoder-decoder
architecture and the autoencoder have similar components, their main purposes and
applications differ.
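The sketch below illustrates the encoder-decoder idea with LSTMs, assuming PyTorch; the vocabulary sizes, dimensions, class name, and teacher-forced decoding are assumptions made for illustration, not a reference implementation.

```python
# A hedged encoder-decoder (seq2seq) sketch: the encoder compresses the input
# sequence into its final (hidden, cell) state, which initializes the decoder.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=100, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))            # encode the whole input into (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)   # decode, conditioned on that state
        return self.out(dec_out)                              # scores over the target vocabulary

model = Seq2Seq()
src = torch.randint(0, 100, (2, 9))    # a batch of 2 source sequences of length 9 (toy data)
tgt = torch.randint(0, 100, (2, 7))    # target sequences of length 7 (teacher forcing)
print(model(src, tgt).shape)           # torch.Size([2, 7, 100])
```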

Deep Recurrent Networks

The computation in most RNNs can be decomposed into three blocks of parameters and
associated transformations:

1. from the input to the hidden state,

2. from the previous hidden state to the next hidden state, and

3. from the hidden state to the output.


With the RNN architecture in the figure above, each of these three blocks is associated with a
single weight matrix. In other words, when the network is unfolded, each of these corresponds
to a shallow transformation.

By a shallow transformation, we mean a transformation that would be represented by a single
layer within a deep MLP.

Typically this is a transformation represented by a learned affine transformation followed by
a fixed nonlinearity.

Graves et al. were the first to show a significant benefit of decomposing the state of an RNN
into multiple layers, as in the figure below (left).

We can think of the lower layers in the hierarchy depicted in the figure above (a) as playing a
role in transforming the raw input into a representation that is more appropriate for the higher
levels of the hidden state.


Considerations of representational capacity suggest allocating enough capacity in each of
these three steps, but doing so by adding depth may hurt learning by making optimization
difficult.

In general, it is easier to optimize shallower architectures, and adding the extra depth of the
figure above (b) makes the shortest path from a variable in time step t to a variable in time
step t + 1 longer.

However, this can be mitigated by introducing skip connections in the hidden-to-hidden path,
as illustrated in the figure above (c).
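As a minimal sketch of a deep (stacked) recurrent network, assuming PyTorch: the num_layers argument stacks the hidden-to-hidden transformation so that lower layers can build representations for the layers above. The sizes are illustrative.

```python
# A minimal deep (stacked) RNN sketch with three recurrent layers.
import torch
import torch.nn as nn

deep_rnn = nn.LSTM(input_size=10, hidden_size=32, num_layers=3, batch_first=True)
x = torch.randn(4, 15, 10)          # (batch, seq_len, input_size), toy data

outputs, (h_n, c_n) = deep_rnn(x)
print(outputs.shape)                # torch.Size([4, 15, 32]) -> states of the top layer
print(h_n.shape)                    # torch.Size([3, 4, 32])  -> final state of each of the 3 layers
```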

Recursive Neural Network

Recursive Neural Networks (RvNNs) are deep neural networks used for natural language
processing. We get a Recursive Neural Network when the same weights are applied
recursively on a structured input to obtain a structured prediction.

Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn
detailed and structured information. With RvNN, you can get a structured prediction by
recursively applying the same set of weights on structured inputs. The word recursive
indicates that the neural network is applied to its output.

Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data.
The tree structure means combining child nodes and producing parent nodes. Each child-
parent bond has a weight matrix, and similar children have the same weights. The number of
children for every node in the tree is fixed to enable it to perform recursive operations and
use the same weights. RvNNs are used when there's a need to parse an entire sentence.
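The sketch below shows the core idea of an RvNN, assuming PyTorch and a toy binary tree encoded as nested tuples of leaf vectors; the dimensions and the compose helper are illustrative assumptions.

```python
# A hedged recursive neural network sketch: the same weight matrix W is applied at
# every child-parent combination of a binary tree.
import torch
import torch.nn as nn

dim = 8
W = nn.Linear(2 * dim, dim)          # shared weights for combining two children

def compose(node):
    # A leaf is a vector; an internal node is a pair (left, right).
    if isinstance(node, torch.Tensor):
        return node
    left, right = node
    children = torch.cat([compose(left), compose(right)], dim=-1)
    return torch.tanh(W(children))   # parent representation built from its two children

leaf = lambda: torch.randn(dim)
tree = ((leaf(), leaf()), leaf())    # a tiny tree: ((w1, w2), w3)
print(compose(tree).shape)           # torch.Size([8]) -> representation of the whole tree
```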

Challenges of Long Term Dependencies

Neural network optimization faces a difficulty when computational graphs become deep, e.g.,

 Feedforward networks with many layers

 RNNs that repeatedly apply the same operation at each time step of a long
temporal sequence

Gradients propagated over many stages tend to either vanish (most of the time) or explode
(rarely, but with much damage to optimization).

The difficulty with long-term dependencies arises from the exponentially smaller weights
given to long-term interactions (involving the multiplication of many Jacobians) compared to
short-term ones.
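A small NumPy illustration of this point: back-propagating through many time steps multiplies many Jacobians together, so gradient magnitudes shrink or grow exponentially depending on the recurrent weight matrix. The diagonal matrices below are illustrative assumptions.

```python
# Repeated multiplication by the same recurrent Jacobian makes gradients vanish
# (singular values < 1) or explode (singular values > 1).
import numpy as np

def gradient_norm_after(W, steps):
    grad = np.eye(W.shape[0])
    for _ in range(steps):
        grad = grad @ W              # one Jacobian factor per time step
    return np.linalg.norm(grad)

W_small = 0.5 * np.eye(4)            # largest singular value < 1 -> vanishing
W_large = 1.5 * np.eye(4)            # largest singular value > 1 -> exploding

for steps in (10, 50):
    print(steps, gradient_norm_after(W_small, steps), gradient_norm_after(W_large, steps))
```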

Echo State Network

An echo state network is a type of Recurrent Neural Network, part of the reservoir computing
framework, which has the following particularities:


 the weights between the input and the hidden layer (the ‘reservoir’), W_in, and also the
weights within the reservoir, W_r, are randomly assigned and not trainable

 the weights of the output neurons (the ‘readout’ layer) are trainable and can be learned
so that the network can reproduce specific temporal patterns

 the hidden layer (or the ‘reservoir’) is very sparsely connected (typically < 10%
connectivity)

 the reservoir architecture creates a recurrent nonlinear embedding (often denoted H) of
the input, which can then be connected to the desired output, and these final weights
are trainable

 it is possible to connect the embedding to a different predictive model (a trainable NN,
or a ridge regressor/SVM for classification problems)
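A hedged sketch of an echo state network, assuming NumPy: random fixed input and reservoir weights, roughly 10% reservoir connectivity, and a readout trained by ridge regression on a toy next-value prediction task. All sizes, scalings, and the task are illustrative assumptions.

```python
# Echo state network sketch: only the readout (W_out) is trained.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, washout, T = 1, 200, 50, 500

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))                 # fixed, not trained
W_r = rng.uniform(-0.5, 0.5, (n_res, n_res))
W_r[rng.random((n_res, n_res)) > 0.1] = 0.0                  # ~10% connectivity (sparse reservoir)
W_r *= 0.9 / max(abs(np.linalg.eigvals(W_r)))                # scale spectral radius below 1

u = np.sin(np.arange(T) * 0.2).reshape(-1, 1)                # toy input signal
target = np.roll(u, -1, axis=0)                              # predict the next value

H = np.zeros((T, n_res))                                     # reservoir states (the embedding H)
h = np.zeros(n_res)
for t in range(T):
    h = np.tanh(W_in @ u[t] + W_r @ h)
    H[t] = h

# Trainable readout via ridge regression on states after the washout period.
A = H[washout:]
b = target[washout:]
W_out = np.linalg.solve(A.T @ A + 1e-6 * np.eye(n_res), A.T @ b)
print("readout weights shape:", W_out.shape)                 # (200, 1)
```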

Long Short-Term Memory (LSTM)

 LSTM

 Bidirectional LSTM

Need of LSTM

LSTM networks are an extension of recurrent neural networks (RNNs) mainly introduced to
handle situations where RNNs fail.

 RNNs fail to store information for a long period of time. At times, a reference to certain
information stored quite a long time ago is required to predict the current output, but
RNNs are incapable of handling such “long-term dependencies”.

 There is no finer control over which part of the context needs to be carried forward
and how much of the past needs to be ‘forgotten’.

 Other issues with RNNs are exploding and vanishing gradients (discussed above),
which occur during the training of a network through backpropagation.

Thus, Long Short-Term Memory (LSTM) was brought into the picture. It has been so
designed that the vanishing gradient problem is almost completely removed, while the
training model is left unaltered. Long-time lags in certain problems are bridged using LSTMs
which also handle noise, distributed representations, and continuous values. With LSTMs,
there is no need to keep a finite number of states from beforehand as required in the hidden
Markov model (HMM). LSTMs provide us with a large range of parameters such as learning
rates, and input and output biases.


Structure of LSTM

The basic difference between the architectures of RNNs and LSTMs is that the hidden layer
of an LSTM is a gated unit or gated cell. It consists of four layers that interact with one
another to produce the output of that cell along with the cell state. These two things are then
passed on to the next hidden layer. Unlike RNNs, which have only a single neural network
layer of tanh, LSTMs comprise three logistic sigmoid gates and one tanh layer. Gates have
been introduced in order to limit the information that is passed through the cell. They
determine which part of the information will be needed by the next cell and which part is to
be discarded. The output is usually in the range 0-1, where ‘0’ means ‘reject all’ and ‘1’
means ‘include all’.

Information is retained by the cells and the memory manipulations are done by the gates.
There are three gates which are explained below:

Forget Gate: The information that is no longer useful in the cell state is removed with the
forget gate. Two inputs, x_t (the input at the particular time) and h_t-1 (the previous cell
output), are fed to the gate and multiplied with weight matrices, followed by the addition of a
bias. The result is passed through an activation function which gives an output between 0 and
1. If, for a particular cell state, the output is 0, the piece of information is forgotten; for an
output of 1, the information is retained for future use.

Input Gate: The addition of useful information to the cell state is done by the input gate.
First, the information is regulated using the sigmoid function, which filters the values to be
remembered (similar to the forget gate) using inputs h_t-1 and x_t. Then, a vector is created
using the tanh function, which gives an output from -1 to +1 and contains all the possible
values from h_t-1 and x_t. Finally, the values of the vector and the regulated values are
multiplied to obtain the useful information.

Output Gate: The task of extracting useful information from the current cell state to be
presented as the output is done by the output gate. First, a vector is generated by applying the
tanh function to the cell state. Then, the information is regulated using the sigmoid function
and filtered by the values to be remembered, using inputs h_t-1 and x_t. Finally, the values of
the vector and the regulated values are multiplied and sent as the output and as the input to
the next cell.
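The gate computations described above can be written compactly as a single LSTM cell step. The NumPy sketch below is illustrative; the weight names and sizes are assumptions, and real libraries typically fuse these matrices for efficiency.

```python
# A minimal single-step LSTM cell with forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate values in (-1, +1)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = f * c_prev + i * g                               # update the cell state
    h_t = o * np.tanh(c_t)                                 # expose a filtered view of the cell
    return h_t, c_t

rng = np.random.default_rng(0)
n_x, n_h = 3, 5
W = {k: rng.standard_normal((n_h, n_x)) * 0.1 for k in "figo"}
U = {k: rng.standard_normal((n_h, n_h)) * 0.1 for k in "figo"}
b = {k: np.zeros(n_h) for k in "figo"}

h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_x), h, c, W, U, b)
print("h_t:", h)
```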

Bidirectional LSTM

Bidirectional LSTM or BiLSTM is a term used for a sequence model which contains two
LSTM layers, one for processing input in the forward direction and the other for processing
in the backward direction. It is usually used in NLP-related tasks. The intuition behind this
approach is that by processing data in both directions, the model is able to better understand
the relationship between sequences (e.g. knowing the following and preceding words in a
sentence).

To better understand this, let us see an example. The first statement is “Server, can you bring
me this dish?” and the second statement is “He crashed the server.” In both these statements,
the word “server” has different meanings, and this relationship depends on the words that
follow and precede it. A bidirectional LSTM helps the machine understand this relationship
better than a unidirectional LSTM can. This ability of BiLSTM makes it a suitable
architecture for tasks like sentiment analysis, text classification, and machine translation.
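As a minimal sketch of a BiLSTM classifier, assuming PyTorch: the input is a batch of already-embedded tokens, and the sizes, the final-position sentence representation, and the two-class readout are illustrative assumptions.

```python
# A minimal bidirectional LSTM sketch for a sentence-level classification task.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=50, hidden_size=64, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 64, 2)            # e.g. positive/negative sentiment

tokens = torch.randn(3, 12, 50)              # (batch, sentence length, embedding size), toy data
outputs, (h_n, c_n) = bilstm(tokens)

sentence_repr = outputs[:, -1, :]            # one simple choice of sentence representation
print(classifier(sentence_repr).shape)       # torch.Size([3, 2])
```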
