UNIT-III
RECURRENT NEURAL NETWORKS
A Recurrent Network with No Outputs
This network just processes information from input x by incorporating it into state h that is
passed forward through time.
Typical RNNs will add extra architectural features such as output layers to read
information out of the state h to make predictions.
Unfolding in Recurrent Neural Networks (RNNs)
In RNNs, the network can be thought of as a graph where nodes represent neurons and edges
represent connections with weights. These connections form cycles, representing the recurrent
flow of information across time steps. Unfolding replaces this cyclic graph with an equivalent
acyclic chain: one copy of the network per time step, with all copies sharing the same parameters.
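A minimal NumPy sketch of this unfolding (assuming a plain tanh cell; the dimensions and sequence length are hypothetical). The same parameters U, W, and b are reused by every one of the T unrolled steps:

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 6

# One set of parameters, shared by every copy of the cell in the unfolded graph
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
b = np.zeros(hidden_dim)

x = rng.normal(size=(T, input_dim))   # an input sequence x_1 ... x_T
h = np.zeros(hidden_dim)              # initial state h_0

# Unfolding: the cyclic graph h -> h becomes a chain of T identical steps
states = []
for t in range(T):
    h = np.tanh(U @ x[t] + W @ h + b)   # h_t = tanh(U x_t + W h_{t-1} + b)
    states.append(h)

print(len(states), states[-1].shape)    # T hidden states, each of size hidden_dim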
Applications
Sequence Prediction: In RNNs, unfolding helps in tasks like language modeling,
time series forecasting, and more.
Node Classification: In GNNs, unfolding is useful for node classification tasks where
node embeddings are iteratively refined.
Graph Classification: GNNs can be used to classify entire graphs by unfolding and
aggregating information from all nodes.
Disadvantages
1. Vanishing Gradients: As the network is unfolded over many time steps, the gradients can
become very small, leading to very slow learning or an inability to learn long-term
dependencies.
2. Exploding Gradients: Conversely, the gradients can also grow exponentially, causing the
model parameters to become unstable and leading to numerical overflow. A short numerical
illustration of both effects is given below.
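A small numerical illustration (assuming, for simplicity, a scalar recurrent weight w, so the per-step Jacobian is just w). Backpropagation through the unfolded graph multiplies the gradient by this Jacobian once per time step, i.e. by w**T over T steps:

T = 50                       # number of unfolded time steps
for w in (0.9, 1.1):
    factor = w ** T          # gradient contribution reaching the earliest step
    print(f"w = {w}: gradient factor after {T} steps = {factor:.4f}")

# w = 0.9 -> factor ~ 0.005  (vanishing gradient)
# w = 1.1 -> factor ~ 117.4  (exploding gradient)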
c. Transducer RNN
Input Sequence (x) -> Acceptor -> Encoder -> Decoder -> Output Sequence (y)
a) Input Sequence (x): This is the sequence of input data that the model processes.
b) Output Sequence (y): The desired output sequence that the model aims to generate or
predict.
c) Acceptor: Initial processing step that prepares the input sequence for further
encoding.
d) Encoder: Converts the processed input sequence into a fixed-size context vector.
e) Decoder: Utilizes the context vector from the encoder to generate an output sequence
step-by-step.
f) Decoder hidden state: Maintains the state of the decoder RNN across time steps,
influencing subsequent outputs. A minimal sketch of this pipeline is given below.
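A minimal NumPy sketch of the pipeline above; the vocabulary, dimensions, and the treatment of the acceptor as a simple embedding lookup are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(1)
vocab, emb_dim, hid_dim, out_vocab, T_out = 20, 8, 16, 20, 5

x = rng.integers(0, vocab, size=7)                 # a) input sequence (token ids)

# c) Acceptor: prepare the input for encoding (here, an embedding lookup)
E = rng.normal(scale=0.1, size=(vocab, emb_dim))
inputs = E[x]                                      # shape (7, emb_dim)

# d) Encoder: fold the whole sequence into a fixed-size context vector
W_enc = rng.normal(scale=0.1, size=(hid_dim, emb_dim + hid_dim))
h = np.zeros(hid_dim)
for e in inputs:
    h = np.tanh(W_enc @ np.concatenate([e, h]))
context = h                                        # fixed-size context vector

# e) Decoder: generate the output sequence step by step from the context
W_dec = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
W_out = rng.normal(scale=0.1, size=(out_vocab, hid_dim))
s = context                                        # f) decoder hidden state
y = []
for _ in range(T_out):
    s = np.tanh(W_dec @ s)                         # update the decoder state
    y.append(int(np.argmax(W_out @ s)))            # b) output token for this step

print(y)                                           # generated output sequence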
Applications
It has many applications, such as:
• Google's machine translation
• Question-answering chatbots
• Speech recognition
• Time-series applications
Bidirectional RNNs (Bi-RNNs)
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the
data in the forward direction, while the other processes it in the reverse direction. The
outputs of these two RNNs are then combined (for example, by concatenation) and fed into a
final prediction layer. The goal of a Bi-RNN is to capture the contextual dependencies in the
input data by processing it in both directions, which is useful in various natural language
processing (NLP) tasks. A minimal sketch is given below.
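A minimal NumPy sketch of the bidirectional idea (dimensions are hypothetical): one pass reads the sequence left to right, another reads it right to left, and the two hidden states at each time step are concatenated before any output layer:

import numpy as np

rng = np.random.default_rng(2)
T, in_dim, hid = 6, 3, 5
x = rng.normal(size=(T, in_dim))                  # input sequence

def run_rnn(seq, U, W):
    h, states = np.zeros(hid), []
    for x_t in seq:
        h = np.tanh(U @ x_t + W @ h)              # plain tanh RNN step
        states.append(h)
    return np.stack(states)

# separate parameters for the forward and backward passes
U_f, W_f = rng.normal(scale=0.1, size=(hid, in_dim)), rng.normal(scale=0.1, size=(hid, hid))
U_b, W_b = rng.normal(scale=0.1, size=(hid, in_dim)), rng.normal(scale=0.1, size=(hid, hid))

h_forward = run_rnn(x, U_f, W_f)                  # processes x_1 ... x_T
h_backward = run_rnn(x[::-1], U_b, W_b)[::-1]     # processes x_T ... x_1, then re-aligned

h_bi = np.concatenate([h_forward, h_backward], axis=1)   # shape (T, 2 * hid)
print(h_bi.shape)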
Encoder-Decoder Structure:
Encoder: Takes an input sequence (e.g., a sentence in one language) and processes it
into a fixed-size context vector, which represents the input sequence in a semantic
space.
Decoder: Takes this context vector and generates an output sequence (e.g., a
translated sentence in another language) one token at a time.
Recurrent Neural Networks can easily map sequences to sequences when the alignment
between the inputs and the outputs is known in advance. The vanilla RNN is rarely used for
this, because it suffers from the vanishing gradient problem; its gated variants, LSTM and
GRU, are used instead. An LSTM builds up the context of a word by taking two inputs at each
time step: the current input and its own previous output, hence the name recurrent (the output
is fed back in as input).
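A minimal PyTorch sketch of the encoder-decoder idea using a GRU (one of the gated variants mentioned above). The class name, sizes, and the use of teacher forcing in the decoder are illustrative assumptions, not a reference implementation:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=100, tgt_vocab=100, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: the final hidden state acts as the fixed-size context vector
        _, context = self.encoder(self.src_emb(src))
        # Decoder: starts from the context and predicts one token per target step
        # (teacher forcing: the ground-truth target tokens are fed in as decoder inputs)
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq()
src = torch.randint(0, 100, (2, 7))              # batch of 2 source sequences, length 7
tgt = torch.randint(0, 100, (2, 5))              # batch of 2 target sequences, length 5
print(model(src, tgt).shape)                     # torch.Size([2, 5, 100])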
Deep Recurrent Networks
Advantages:
Improved Representation Learning: Each layer in a deep RNN learns increasingly
abstract representations of the input sequence, potentially leading to better
performance on tasks that require understanding hierarchical structures.
Enhanced Modeling Capabilities: Deeper architectures can capture more intricate
dependencies and patterns in sequential data than shallow RNNs or feedforward
networks can (see the sketch below).
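A minimal PyTorch sketch of a deep (stacked) RNN, in which the hidden-state sequence produced by one recurrent layer becomes the input sequence of the next; the sizes and the choice of three layers are illustrative:

import torch
import torch.nn as nn

deep_rnn = nn.RNN(input_size=16, hidden_size=32, num_layers=3, batch_first=True)

x = torch.randn(4, 10, 16)     # batch of 4 sequences, 10 time steps, 16 features
outputs, h_n = deep_rnn(x)
print(outputs.shape)           # (4, 10, 32): top-layer hidden states for every step
print(h_n.shape)               # (3, 4, 32):  final hidden state of each of the 3 layers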
Recursive Neural Networks (RvNNs)
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical
data: child nodes are combined to produce parent nodes. Each child-parent bond has a weight
matrix, and similar children share the same weights. The number of children for every node in
the tree is fixed so that the network can apply the same weights recursively. RvNNs are used
when there is a need to parse an entire sentence.
A recursive network represents yet another generalization of recurrent networks, with
a different kind of computational graph, which is structured as a deep tree rather than the
chain-like structure of RNNs. The typical computational graph for a recursive network is
illustrated in the figure below. Recursive neural networks were introduced by Pollack (1990).
Recursive networks have been successfully applied to processing data structures as input to
neural nets in natural language processing as well as in computer vision. One clear advantage
of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth
(measured as the number of compositions of nonlinear operations) can be drastically reduced
from τ to O(log τ ), which might help deal with long-term dependencies. An open question is
how to best structure the tree. One option is to have a tree structure which does not depend on
the data, such as a balanced binary tree.
In some application domains, external methods can suggest the appropriate tree
structure. For example, when processing natural language sentences, the tree structure for the
recursive network can be fixed to the structure of the parse tree of the sentence provided by a
natural language parser. Ideally, one would like the learner itself to discover and infer the tree
structure that is appropriate for any given input, as suggested by Bottou (2011). Many
variants of the recursive net idea are possible.
Pros: Compared with an RNN, for a sequence of the same length τ, the depth (measured
as the number of compositions of nonlinear operations) can be drastically reduced from τ to
O(log τ).
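A minimal NumPy sketch of the recursive idea (the tree shape and sizes are hypothetical): a single shared weight matrix W combines two child vectors into a parent vector, applied bottom-up over a binary tree:

import numpy as np

rng = np.random.default_rng(3)
dim = 4
W = rng.normal(scale=0.1, size=(dim, 2 * dim))   # shared child-to-parent weights
b = np.zeros(dim)

def compose(node):
    # node is either a leaf vector or a (left, right) pair of sub-trees
    if isinstance(node, np.ndarray):
        return node                              # leaf: a word/leaf embedding
    left, right = node
    children = np.concatenate([compose(left), compose(right)])
    return np.tanh(W @ children + b)             # the same W at every internal node

# A 4-leaf balanced binary tree: depth O(log τ) instead of τ for a chain RNN
leaves = [rng.normal(size=dim) for _ in range(4)]
root = compose(((leaves[0], leaves[1]), (leaves[2], leaves[3])))
print(root.shape)                                # fixed-size representation of the whole tree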
Applications of Recursive Neural Networks:
Natural Language Processing (NLP): RecNNs are used for tasks such as parsing,
sentiment analysis, and text classification where sentences or phrases can be
represented hierarchically.
Image Processing: They can be adapted to process hierarchical representations in
images, such as object recognition in scenes.
Bioinformatics: Analyzing biological data, such as protein structure prediction,
where molecular structures are naturally hierarchical.
Advantages:
o Hierarchical Representation: RecNNs naturally capture hierarchical
relationships in data.
o Flexibility: They can handle variable-sized input structures and adapt to
different domains with hierarchical data.
Challenges:
o Complexity: Designing and training RecNNs can be more complex compared
to simpler neural network architectures.
o Scalability: Handling large and deep hierarchical structures may require
careful design to avoid computational bottlenecks.
Disadvantages of RvNNs
The main disadvantage of recursive neural networks is the tree structure itself. Using a
tree structure introduces a particular inductive bias into the model: the assumption that the
data follow a tree hierarchy. When that assumption does not hold, the network may not be
able to learn the existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be
slow and ambiguous. Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for
recursive neural networks than to construct recurrent neural networks. Manually
parsing a sentence into short components is more time-consuming and tedious than
assigning a label to a sentence.
Leaky units, particularly Leaky ReLU activations, are used to address the vanishing gradient
problem by allowing a small, non-zero gradient for negative inputs:
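For instance, a minimal NumPy sketch (the 0.01 slope for negative inputs is a common default, assumed here); unlike a plain ReLU, the gradient for negative inputs is small but never exactly zero, so gradients keep flowing:

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)   # never exactly zero for negative inputs

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))          # [-0.02  -0.005  0.5  2. ]
print(leaky_relu_grad(x))     # [ 0.01   0.01   1.   1. ]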
3. Residual Connections
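As a minimal sketch (the dense transformation F and its size are assumed purely for illustration), a residual or skip connection adds a layer's input directly to its output, giving gradients a short path around the transformation:

import numpy as np

rng = np.random.default_rng(4)
dim = 8
W = rng.normal(scale=0.1, size=(dim, dim))
b = np.zeros(dim)

def residual_block(x):
    fx = np.tanh(W @ x + b)   # the learned transformation F(x)
    return x + fx             # output = x + F(x): the skip connection

x = rng.normal(size=dim)
print(residual_block(x).shape)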
Dropout
An advantage of dropout is that the ensemble prediction can be approximated in a single pass
of the complete model by multiplying each unit's outgoing weights by that unit's keep
probability (the weight-scaling inference rule). The motivation is to preserve the expected
values: the total expected input to a unit at test time should equal the total expected input at
training time. A further advantage of dropout is that it places no restriction on the type of
model or training procedure used.
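A minimal NumPy sketch of the weight-scaling inference rule for a single linear unit, with an assumed keep probability of 0.8: averaging many randomly dropped-out passes is closely approximated by one full pass with the weights scaled by the keep probability.

import numpy as np

rng = np.random.default_rng(5)
p = 0.8                                   # probability of keeping each input unit
w = rng.normal(size=10)                   # weights of one linear unit
x = rng.normal(size=10)                   # its inputs

# Monte-Carlo estimate of the dropout ensemble's expected pre-activation
samples = [(w * (rng.random(10) < p)) @ x for _ in range(20000)]
print(np.mean(samples))                   # average over many random dropout masks

# Weight-scaling inference rule: one pass with the weights scaled by p
print((p * w) @ x)                        # approximately the same value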
Points to note:
Dropout reduces the representational capacity of the model, so the model should be
large enough to begin with.
Dropout works better with more data.
For linear regression, dropout is equivalent to L² regularization with a different weight
decay coefficient for each input feature.
Dropout in RNNs
In RNNs, dropout can be applied in several ways:
1. Input Dropout: Dropout is applied to the input features.
2. Output Dropout: Dropout is applied to the output of the RNN layer.
3. Recurrent Dropout: Dropout is applied to the recurrent (hidden-to-hidden) connections
between time steps, as sketched below.
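A minimal NumPy sketch (inverted-dropout masks with an assumed keep probability of 0.8) showing where each of the three variants acts inside a simple RNN loop:

import numpy as np

rng = np.random.default_rng(6)
T, in_dim, hid = 5, 3, 4
p = 0.8                                      # keep probability
U = rng.normal(scale=0.1, size=(hid, in_dim))
W = rng.normal(scale=0.1, size=(hid, hid))

def mask(size):
    return (rng.random(size) < p) / p        # inverted-dropout mask

x = rng.normal(size=(T, in_dim))
h, outputs = np.zeros(hid), []
for t in range(T):
    x_t = x[t] * mask(in_dim)                # 1. input dropout
    h_rec = h * mask(hid)                    # 3. recurrent dropout
    h = np.tanh(U @ x_t + W @ h_rec)
    outputs.append(h * mask(hid))            # 2. output dropout

print(np.stack(outputs).shape)               # (T, hid)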
Long Short-Term Memory (LSTM)
Key Components
Cell State (C_t): Acts as a memory that carries information across different time
steps.
Hidden State (h_t): The output at the current time step, used for predictions and fed
into the next cell.
Equations
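For reference, the standard LSTM update equations, in the C_t / h_t notation used above (σ is the logistic sigmoid, ⊙ is element-wise multiplication, and [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input), are:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)        (forget gate)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)        (input gate)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)     (candidate cell state)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t            (cell state update)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)        (output gate)
h_t = o_t ⊙ tanh(C_t)                       (hidden state)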
Advantages of LSTM
Long-Term Dependency Learning: Unlike traditional RNNs, LSTMs can learn
dependencies over long sequences, which is crucial for tasks like language modeling
and time-series forecasting.
Avoids Vanishing Gradient Problem: The gating mechanism helps mitigate the
vanishing gradient problem, allowing the model to retain important information over
extended periods.
Applications
Natural Language Processing (NLP): Language translation, sentiment analysis, and
speech recognition.
Time-Series Analysis: Stock price prediction, weather forecasting, and anomaly
detection.
Sequential Data Processing: Video analysis, music composition, and handwriting
recognition.
LSTMs have been widely adopted in various fields due to their ability to handle complex
sequential data effectively.
Advantages of RNNs
Parameter Sharing: RNNs share the same set of parameters across all time steps, which
reduces the number of parameters that need to be learned and can lead to better
generalization.
Non-Linear Mapping: RNNs use non-linear activation functions, which allow them to
learn complex, non-linear mappings between inputs and outputs.
Sequential Processing: RNNs process input sequences one step at a time, which allows them
to handle sequences of arbitrary length and to model the order of elements in the sequence.
Components of an Encoder-Decoder (sequence-to-sequence) architecture:
a. Encoder
b. Hidden Vector
c. Decoder
PART-C
1. Develop an example for Unfolding Computational Graphs and describe the major
advantages of the unfolding process.
2. Explain how to compute the gradient in a Recurrent Neural Network.
3. Explain modeling sequences conditioned on context with RNNs.
4. Prepare an example of an Encoder-Decoder or sequence-to-sequence RNN architecture.
5. Explain various Gated RNNs.
6. Explain the steps in developing the necessary assumption structure in Deep Learning.
7. Explain the difference between a Shallow Network and a Deep Network.
8. For the application of Face Detection, which deep learning algorithm would you use?