Recurrent and Recursive Neural Networks
Recurrent and Recursive Neural Networks: Unfolding Computational Graphs, Recurrent Neural Networks,
Bidirectional RNNs, Deep Recurrent Networks, Recursive Neural Networks, The Long Short-Term Memory
and Other Gated RNNs. Applications: Large-Scale Deep Learning, Computer Vision, Speech Recognition, Natural
Language Processing and Other Applications.
The state update can be written as h(t) = f(h(t−1), x(t); θ), which unfolds to h(t) = g(t)(x(t), x(t−1), ..., x(1)); here we see that the state now contains information about the whole past sequence.
Recurrent neural networks can be built in many different ways. Much as almost any function
can be considered a feedforward neural network, essentially any function involving recurrence
can be considered a recurrent neural network.
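As a minimal illustration (a NumPy sketch with invented toy dimensions, not any particular model from the text), the recurrence h(t) = f(h(t−1), x(t); θ) can be unfolded by repeatedly applying the same function f to the running state:

```python
import numpy as np

def f(h_prev, x, theta):
    """One generic recurrent step h(t) = f(h(t-1), x(t); theta); the tanh
    affine map here is just an illustrative choice of f."""
    W, U, b = theta
    return np.tanh(W @ h_prev + U @ x + b)

# Unfolding the recurrence over a toy sequence of four input vectors: after the
# loop, h depends on every x(t) seen so far through repeated application of f.
rng = np.random.default_rng(0)
theta = (rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), np.zeros(3))
h = np.zeros(3)                          # initial state h(0)
for x_t in rng.normal(size=(4, 2)):      # x(1), ..., x(4)
    h = f(h, x_t, theta)
```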
Recurrent neural networks (RNNs) are distinguished from other neural networks by their unique capabilities:
Internal Memory: This is the key feature of RNNs. It allows them to remember past
inputs and use that context when processing new information.
Sequential Data Processing: Because of their memory, RNNs are exceptional at
handling sequential data where the order of elements matters. This makes them ideal
for speech recognition, machine translation, natural language processing
(NLP) and text generation.
Contextual Understanding: RNNs can analyze the current input in relation to what
they’ve “seen” before. This contextual understanding is crucial for tasks where
meaning depends on prior information.
Dynamic Processing: RNNs can continuously update their internal memory as they
process new data, allowing them to adapt to changing patterns within a sequence.
Important design patterns for recurrent neural networks include the following (a short sketch contrasting the first and third patterns follows the list and figure caption):
Recurrent networks that produce an output at each time step and have recurrent
connections between hidden units, illustrated in figure 10.3.
Recurrent networks that produce an output at each time step and have recurrent
connections only from the output at one time step to the hidden units at the next time
step, illustrated in figure 10.4
Recurrent networks with recurrent connections between hidden units that read an
entire sequence and then produce a single output.
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer
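The following NumPy sketch (toy sizes, arbitrary weights) contrasts the first design pattern (an output at every time step) with the third (a single output after reading the entire sequence); the second pattern differs only in feeding the previous output, rather than the hidden state, back into the recurrence:

```python
import numpy as np

rng = np.random.default_rng(1)
W, U, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
xs = rng.normal(size=(5, 2))                 # a toy input sequence x(1), ..., x(5)

# First pattern: hidden-to-hidden recurrence with an output o(t) at every step.
h, per_step_outputs = np.zeros(3), []
for x_t in xs:
    h = np.tanh(W @ h + U @ x_t)
    per_step_outputs.append(V @ h)           # one output per time step

# Third pattern: read the entire sequence, then emit a single summary output.
h = np.zeros(3)
for x_t in xs:
    h = np.tanh(W @ h + U @ x_t)
single_output = V @ h                        # produced only after the last step
```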
Forward propagation begins with a specification of the initial state h(0). Then, for each time step
from t = 1 to t = τ, we apply the following update equations:
a(t) = b + W h(t−1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
ŷ(t) = softmax(o(t))
where the parameters are the bias vectors b and c along with the weight matrices U, V and W,
respectively, for input-to-hidden, hidden-to-output and hidden-to-hidden connections.
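A minimal NumPy sketch of these update equations, with invented toy dimensions, might look as follows:

```python
import numpy as np

def rnn_forward(xs, h0, U, V, W, b, c):
    """Forward propagation with the update equations above:
    a(t) = b + W h(t-1) + U x(t), h(t) = tanh(a(t)),
    o(t) = c + V h(t), yhat(t) = softmax(o(t))."""
    h, yhats = h0, []
    for x_t in xs:
        a = b + W @ h + U @ x_t
        h = np.tanh(a)
        o = c + V @ h
        e = np.exp(o - o.max())
        yhats.append(e / e.sum())        # softmax over the output units
    return yhats

rng = np.random.default_rng(2)
n_in, n_h, n_out, T = 4, 8, 3, 6         # toy dimensions
yhats = rnn_forward(rng.normal(size=(T, n_in)), np.zeros(n_h),
                    U=rng.normal(size=(n_h, n_in)), V=rng.normal(size=(n_out, n_h)),
                    W=rng.normal(size=(n_h, n_h)), b=np.zeros(n_h), c=np.zeros(n_out))
```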
Computing the gradient through a recurrent neural network is straightforward. One simply
applies the generalized back-propagation algorithm to the unrolled computational graph; this is known as back-propagation through time (BPTT).
The gradients for the shared parameters W, U, V, b and c are obtained by summing each parameter's per-time-step contributions over all time steps t = 1, ..., τ; for example, the gradient on the bias c is the sum over t of the gradients on the outputs o(t).
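As a hedged illustration of this summation (assuming softmax outputs trained with cross-entropy loss, so the gradient on o(t) is ŷ(t) − y(t)), the gradients for the output parameters V and c can be accumulated like this:

```python
import numpy as np

def output_gradients(hs, yhats, ys):
    """Sum the per-time-step contributions to the gradients of the shared output
    parameters V and c, assuming softmax outputs with cross-entropy loss so that
    the gradient on o(t) is yhat(t) - y(t) (y(t) is a one-hot target)."""
    grad_V = np.zeros((yhats[0].size, hs[0].size))
    grad_c = np.zeros(yhats[0].size)
    for h_t, yhat_t, y_t in zip(hs, yhats, ys):
        d_o = yhat_t - y_t               # gradient on o(t) at this time step
        grad_V += np.outer(d_o, h_t)     # time step t's contribution to dL/dV
        grad_c += d_o                    # time step t's contribution to dL/dc
    return grad_V, grad_c
```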
One way to interpret an RNN as a graphical model is to view the RNN as defining a graphical model
whose structure is the complete graph, able to represent direct dependencies between any pair of y
values. The complete graph structure is shown in figure 10.7. This graph interpretation of the RNN
is based on ignoring the hidden units h(t) by marginalizing them out of the model.
It is more interesting to consider the graphical model structure of RNNs that results from regarding the
hidden units h(t) as random variables. Including the hidden units in the graphical model reveals that the
RNN provides a very efficient parametrization of the joint distribution over the observations.
The edges in a graphical model indicate which variables depend directly on other variables.
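A small sketch of this efficient parametrization: the log-probability of an observed sequence is just the sum of the per-step conditional log-probabilities produced by the RNN (the yhats here are hypothetical softmax outputs):

```python
import numpy as np

def sequence_log_likelihood(yhats, ys):
    """log P(y(1), ..., y(T)) = sum_t log P(y(t) | y(1), ..., y(t-1)); each
    yhat(t) is the RNN's softmax output, conditioned on the past through h(t)."""
    return sum(np.log(yhat_t[y_t]) for yhat_t, y_t in zip(yhats, ys))

# toy usage: three time steps over a four-symbol alphabet, targets 2, 1, 0
yhats = [np.full(4, 0.25),
         np.array([0.1, 0.6, 0.2, 0.1]),
         np.array([0.7, 0.1, 0.1, 0.1])]
print(sequence_log_likelihood(yhats, ys=[2, 1, 0]))
```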
Q. Explain how the recurrent neural network (RNN) processes data sequences. (Model QP QNo 9)
Modeling Sequences Conditioned on Context with RNNs focuses on how recurrent neural networks
(RNNs) can generate or predict sequential data while taking additional contextual information into
account. This approach is critical in tasks such as conditional language modeling, video captioning, or
time-series prediction conditioned on external variables
Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences Y
The first and most common approach is illustrated in figure 10.9. The interaction between the input x
and each hidden unit vector h(t) is parametrized by a newly introduced weight matrix R that was absent
from the model of only the sequence of y values.
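A minimal sketch of this conditioning (names and the tanh nonlinearity are illustrative choices, not prescribed by the text) shows the fixed-length context x entering every hidden-state update through R:

```python
import numpy as np

def conditioned_rnn_step(h_prev, y_prev, x, W, U, R, b):
    """One hidden-state update of an RNN conditioned on a fixed-length vector x:
    the term R @ x injects the context into every time step, alongside the
    recurrent term W @ h(t-1) and the previous output y(t-1)."""
    return np.tanh(b + W @ h_prev + U @ y_prev + R @ x)
```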
Bidirectional RNNs:
Bidirectional RNNs combine an RNN that moves forward through time, beginning from the start of the sequence, with another RNN that moves backward through time, beginning from the end of the sequence.
Q. Encoder-Decoder Architecture
The encoder-decoder architecture is a type of RNN design that processes an input sequence of
variable length and generates an output sequence of potentially different length. This is
particularly useful in tasks such as machine translation, speech recognition and question answering, where the input and output sequences generally have different lengths.
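A rough NumPy sketch of the idea (toy, untrained parameters; a real decoder would also feed back its previous output and stop at an end-of-sequence symbol):

```python
import numpy as np

def encode(xs, W_enc, U_enc):
    """Encoder RNN: read a variable-length input sequence and return its final
    hidden state as a fixed-size context vector c."""
    h = np.zeros(W_enc.shape[0])
    for x_t in xs:
        h = np.tanh(W_enc @ h + U_enc @ x_t)
    return h

def decode(c, n_steps, W_dec, C_dec, V_dec):
    """Decoder RNN: generate an output sequence of a chosen length from the
    context c; output length need not match the input length."""
    h, outputs = np.zeros(W_dec.shape[0]), []
    for _ in range(n_steps):
        h = np.tanh(W_dec @ h + C_dec @ c)   # context conditions every step
        o = V_dec @ h
        e = np.exp(o - o.max())
        outputs.append(e / e.sum())          # softmax over output symbols
    return outputs
```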
The computation in most RNNs can be decomposed into three blocks of parameters and associated
transformations:
1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.
Each of these three blocks is associated with a single weight matrix. In other words, when the network is
unfolded, each of these corresponds to a shallow transformation.
(a) The hidden recurrent state can be broken down into groups organized hierarchically (fig. a).
(b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and
hidden-to-output parts. This may lengthen the shortest path linking different time steps (fig. b).
(c) The path-lengthening effect can be mitigated by introducing skip connections (fig. c); a minimal sketch combining (b) and (c) appears below.
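A minimal sketch combining variants (b) and (c): one state-to-state transition made deeper with a small MLP, plus a skip connection from the previous state (all names and the tanh choice are illustrative):

```python
import numpy as np

def deep_rnn_step(h_prev, x, U, W1, W2, W_skip, b1, b2):
    """One deep hidden-to-hidden transition: a small MLP makes the state-to-state
    map deeper (variant (b)), while a skip connection from h(t-1) keeps the
    shortest path between time steps short (variant (c))."""
    z = np.tanh(b1 + W1 @ h_prev + U @ x)          # first layer of the deep transition
    return np.tanh(b2 + W2 @ z + W_skip @ h_prev)  # skip connection from h(t-1)
```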
Recursive neural networks represent yet another generalization of recurrent networks, with a different
kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of
RNNs. The typical computational graph for a recursive network is illustrated in figure 10.14.
Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent
network from a chain to a tree.
Recursive networks have been successfully applied to processing data structures as input to neural nets,
in natural language processing and in computer vision.
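A toy sketch of a recursive network over a binary parse tree (the word list, embeddings and weights are invented for illustration):

```python
import numpy as np

def recursive_net(node, W_left, W_right, b, embed):
    """Recursive network over a binary tree: leaves are word embeddings and every
    internal node combines its children's vectors with the same shared weights."""
    if isinstance(node, str):                    # leaf: look up its embedding
        return embed[node]
    left = recursive_net(node[0], W_left, W_right, b, embed)
    right = recursive_net(node[1], W_left, W_right, b, embed)
    return np.tanh(W_left @ left + W_right @ right + b)

# toy usage on the parse tree ((the, cat), sat)
rng = np.random.default_rng(3)
d = 4
embed = {w: rng.normal(size=d) for w in ("the", "cat", "sat")}
root = recursive_net((("the", "cat"), "sat"),
                     rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d),
                     embed)
```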
10.10 The Long Short-Term Memory and Other Gated RNNs (Model QP imp)
The most effective sequence models used in practical applications are called gated RNNs. These
include the long short-term memory and networks based on the gated recurrent unit.
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN)
specifically designed to handle long-term dependencies in sequential data. They address the
vanishing gradient problem of standard RNNs, enabling the effective modeling of longer
sequences.
An LSTM unit consists of a memory cell c(t) and three gating mechanisms (the input, forget and
output gates) that regulate the flow of information.
In an LSTM, old information is discarded using the forget gate, new information is added using the
input gate, and the output is computed using the output gate.
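A minimal NumPy sketch of one LSTM step along these lines (the parameter names Wf, Uf, bf, etc. are illustrative; gated variants differ in details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update: the forget gate discards old cell content, the input gate
    writes new content, and the output gate decides what the cell exposes."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x + p["Ug"] @ h_prev + p["bg"])   # candidate update
    c = f * c_prev + i * g                                  # new memory cell c(t)
    h = o * np.tanh(c)                                      # new hidden state h(t)
    return h, c
```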
Advantages of LSTMs
Chapter 12: Applications
Deep learning is based on the philosophy of connectionism. Deep learning applications require high-performance hardware and software infrastructure. The requirements are the following:
1. Fast CPU Implementations: Traditionally, neural networks were trained using the CPU of a single
machine. Today, this approach is generally considered insufficient. Instead, we use GPU
computing or the CPUs of many machines networked together.
2. GPU Implementations:
Origins in Graphics: GPUs were originally developed for rendering 3D graphics in
video games, requiring fast, parallel computation of simple operations.
Parallelism: GPUs excel at handling parallel operations, such as matrix
multiplications and pixel color calculations, which are foundational to both graphics and
neural networks.
High Memory Bandwidth: GPUs can efficiently handle large data buffers, making
them ideal for neural network training, where memory bandwidth is a bottleneck on
traditional CPUs.
Hardware Flexibility: The evolution from specialized graphics hardware to general-
purpose GPUs (GP-GPUs) enabled broader scientific and machine learning applications.
3. Large-Scale Distributed Implementations:
In many cases, the computational resources available on a single machine are insufficient, so the
workload of training and inference is distributed across many machines. Distributing inference
is simple, because each input example we want to process can be run by a separate machine.
This is known as data parallelism.
Another type is model parallelism, where multiple machines work together on a single
data point, with each machine running a different part of the model. This is feasible for both
inference and training.
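A toy, single-machine simulation of data parallelism (a linear least-squares model stands in for the network): each simulated worker computes a gradient on its own shard of the batch and the gradients are averaged for a synchronous update:

```python
import numpy as np

def shard_gradient(w, X_shard, y_shard):
    """Gradient of a squared-error linear model on one machine's shard of data."""
    return 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

# Data parallelism, simulated on one machine: each "worker" computes a gradient
# on its own shard of the batch, and the gradients are averaged before the update.
rng = np.random.default_rng(4)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.zeros(3)
shards = zip(np.array_split(X, 4), np.array_split(y, 4))    # four simulated workers
grads = [shard_gradient(w, Xs, ys) for Xs, ys in shards]
w = w - 0.1 * np.mean(grads, axis=0)                        # synchronous update
```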
4. Model Compression: The process of replacing a large, resource-intensive model with a
smaller, more efficient model that approximates the original model's performance is
called model compression. It reduces memory and runtime costs during inference and
helps to deploy lightweight models that deliver high performance while operating within
the constraints of limited hardware. Efficiency gains: minimized runtime and memory
usage, enabling real-time applications and lower energy consumption.
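A hedged, toy illustration of model compression: a random two-layer network plays the role of the large model, and a small linear model is fit to imitate its outputs on unlabeled inputs:

```python
import numpy as np

# Model compression, simulated: a small linear "student" is fit to reproduce the
# outputs of a larger, more expensive "teacher" network, so only the cheap model
# is needed at inference time.
rng = np.random.default_rng(5)
W1, W2 = rng.normal(size=(64, 10)), rng.normal(size=(64, 1))
teacher = lambda X: np.tanh(X @ W1.T) @ W2                  # expensive original model
X_unlabeled = rng.normal(size=(500, 10))
soft_targets = teacher(X_unlabeled)                         # teacher's predictions
w_student, *_ = np.linalg.lstsq(X_unlabeled, soft_targets, rcond=None)
student = lambda X: X @ w_student                           # compact approximation
```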
5. Dynamic Structure: Data processing systems can dynamically determine which subset of many
neural networks should be run on a given input. Individual neural networks can also exhibit
dynamic structure internally by determining which subset of features (hidden units) to compute
given information from the input. This form of dynamic structure inside neural networks is
sometimes called conditional computation.
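A minimal sketch of conditional computation (the gater, the cheap model and the threshold are all invented for illustration): a small gater looks at the input and decides whether the expensive sub-network needs to run at all:

```python
import numpy as np

def predict_with_conditional_computation(x, w_gater, w_cheap, expensive_model):
    """Conditional computation: a small gater decides, per input, whether the
    cheap sub-network suffices or the expensive sub-network must be run."""
    if w_gater @ x < 0:                  # gater's decision for this input
        return w_cheap @ x               # only a small subset of units is computed
    return expensive_model(x)            # the full model runs for "hard" inputs
```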
6. Specialized Hardware Implementations of Deep Networks:
Over the years, Application-Specific Integrated Circuits (ASICs), Field Programmable Gate
Arrays (FPGAs), and hybrid hardware solutions combining both digital and analog components
have been developed to optimize the performance of deep networks.
The progress in the raw speed of CPUs or GPUs has slowed down due to physical
limitations. As a result, most performance improvements have come from parallelization
rather than single-core speed increases.
Specialized hardware, like ASICs and FPGAs, can continue to advance performance
by optimizing specific computations in neural networks, pushing the envelope beyond
what general-purpose hardware can achieve.
The task of speech recognition is to map an acoustic signal containing a spoken natural language
utterance into the corresponding sequence of words intended by the speaker. Let X = (x(1), x(2), ..., x(T))
denote the sequence of acoustic input vectors.
In the early era, HMMs (Hidden Markov Models) and GMMs (Gaussian Mixture Models) were used for
speech recognition. In the deep learning era, CNNs, RNNs and other deep learning models are used for
this purpose.
Natural language processing (NLP) is the use of human languages, such as English or French, by a
computer. Natural language processing includes applications such as machine translation, in which the
learner must read a sentence in one human language and emit an equivalent sentence in another
human language. Many NLP applications are based on language models that define a probability
distribution over sequences of words, characters or bytes in a natural language
N-gram models and neural language models have complementary strengths; it is thus common to combine both approaches in an ensemble consisting of a neural language model and an n-gram language model.
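A minimal sketch of such an ensemble, assuming hypothetical callables neural_lm and ngram_lm that each return the conditional probability of the next word:

```python
def ensemble_prob(word, context, neural_lm, ngram_lm, lam=0.5):
    """Interpolated ensemble of a neural language model and an n-gram model:
    P_ens(word | context) = lam * P_neural + (1 - lam) * P_ngram."""
    return lam * neural_lm(word, context) + (1 - lam) * ngram_lm(word, context)
```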
v) Machine translation is the task of reading a sentence in one natural language and emitting
a sentence with the equivalent meaning in another language.
Other types of applications of deep learning that are different from the standard object recognition,
speech recognition and natural language processing tasks are discussed below:
i) Recommender Systems: One of the major families of applications of machine learning in the
information technology sector is the ability to make recommendations of items to potential
users or customers. Two major types of applications can be distinguished: online advertising
and item recommendations. Companies including Amazon and eBay use machine learning,
including deep learning, for their product recommendations.
ii) Knowledge Representation and Reasoning: Deep learning approaches have been very
successful in language modeling, machine translation and natural language processing due
to the use of embeddings for symbols and words. These embeddings represent semantic
knowledge about individual words and concepts.
iii) Knowledge, Relations and Question Answering: One interesting research direction is
determining how distributed representations can be trained to capture the relations
between two entities. These relations allow us to formalize facts about objects and how
objects interact with each other.
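One well-known way to capture a relation between two entities in embedding space is a translation-style score (in the spirit of models such as TransE, offered here only as an illustrative sketch, not as the specific method referred to above):

```python
import numpy as np

def relation_score(subject_vec, relation_vec, object_vec):
    """Translation-style score for the fact (subject, relation, object): treating
    the relation as a vector offset, a true fact should give a small distance
    between subject + relation and object, i.e. a higher (less negative) score."""
    return -np.linalg.norm(subject_vec + relation_vec - object_vec)
```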