
III Year I Sem Deep Learning

Unit III
Topics to be Discussed:
CNN Architecture: Convolutions, Convolutional Layers, Pooling Strategies, LeNet, AlexNet,
ZF-Net, VGGNet, GoogLeNet, ResNet, Visualizing Convolutional Neural Networks, Guided
Backpropagation, Deep Dream, Fooling Convolutional Neural Networks.

Architecture of ResNet (along With its Pooling Strategies)

Basic Components of ResNet Architecture


Initial Convolution and Pooling:
The network starts with a single standard convolutional layer, typically with a 7x7 filter (or 3x3 for smaller inputs) and a stride of 2.
This is followed by batch normalization and a ReLU activation.
Then a max pooling layer with a filter size of 3x3 and a stride of 2 is typically applied.
Residual Blocks:
The core idea of ResNet is a residual block with skip connections.
Each residual block has two or three convolutional layers with batch normalization and ReLU
activation after each convolution.
The output of the last convolution in the block is added to the block's input, effectively forming
a shortcut or skip connection. This is where the term "residual" comes from, as these
connections allow the training of the residual functions with reference to the layer inputs.
Skip Connections:
Skip connections help mitigate the vanishing gradient problem by allowing gradients to flow
through a shortcut path.
In blocks where the input and output dimensions do not match, either a convolutional layer
or zero-padding is used on the shortcut path to match the dimensions.
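
As an illustration of the residual block and skip connection described above, here is a minimal sketch of a basic block written in PyTorch. It is a simplified, hedged example: the class name BasicBlock and the use of a 1x1 projection on the shortcut mirror common practice but are not the exact code of any particular library.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BasicBlock(nn.Module):
        """Two 3x3 convolutions plus a skip connection: out = F(x) + x."""
        def __init__(self, in_channels, out_channels, stride=1):
            super().__init__()
            self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_channels)
            self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_channels)
            # If the spatial size or channel count changes, project the shortcut
            # with a 1x1 convolution so the addition is dimensionally valid.
            self.shortcut = nn.Identity()
            if stride != 1 or in_channels != out_channels:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(out_channels))

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out = out + self.shortcut(x)   # the residual (skip) connection
            return F.relu(out)

    x = torch.randn(1, 64, 56, 56)
    print(BasicBlock(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 28, 28])
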
Pooling Strategies in ResNet
Max Pooling:
Used after the initial convolutional layer to reduce the spatial dimensions.


Within the residual stages themselves, the feature map size is reduced and the receptive field increased mainly by stride-2 convolutions at the start of each stage, rather than by additional pooling layers.
Average Pooling:
Towards the end of the network, an average pooling layer is used to reduce each feature
map to a single number.
The average pooling is typically global, meaning it pools across the entire spatial dimension
of the feature map.

ResNet Variants
ResNet-18 and ResNet-34:
• These smaller ResNets use a basic residual block structure with two 3x3 convolutional
layers per block.

ResNet-50, ResNet-101, and ResNet-152:


• These larger versions use a "bottleneck" design to make the network deeper while
controlling the number of parameters.
• Each bottleneck block contains three layers: a 1x1 convolution that reduces the
dimension, a 3x3 convolution, and another 1x1 convolution that increases the
dimension.

Final Layers
• After the final residual block, ResNet uses a global average pooling layer to reduce
each feature map to a single value, producing one feature vector for the whole image.
• This is followed by a fully connected layer with softmax activation to produce the
output probabilities for classification tasks.

Special Features
• Identity Mappings: The skip connections perform identity mappings, where the input
is added directly to the output of the residual block.
• Deep Batch Normalization: Batch normalization is heavily used throughout ResNet,
which helps in regularizing and speeding up training.

Applications:
• Image Classification: ResNet's initial claim to fame came from its performance on
image classification tasks. It is often used in scenarios where the recognition of objects
within images is required, such as identifying categories of images in a large dataset.
• Object Detection: In combination with systems like R-CNN (Region-based
Convolutional Neural Networks), ResNet can be used to detect objects within an image
and classify them. This has applications in surveillance, autonomous vehicles, and
many areas of research.
• Image Segmentation: ResNet can be used for semantic segmentation, where the goal
is to classify each pixel of the image as belonging to a particular class. This is useful
in medical imaging, satellite image analysis, and autonomous driving for
understanding the scene at a pixel level.


• Video Analysis: When combined with RNNs or 3D convolutional networks, ResNet can
process video data to understand and classify actions, detect anomalies, or even
generate descriptions of video content.
• Facial Recognition: The depth and complexity of features learned by ResNet make it
well-suited for facial recognition tasks, which require distinguishing subtle details to
differentiate between different faces.
• Medical Imaging: ResNet architectures are used to detect and diagnose diseases from
medical images like X-rays, MRIs, or CT scans, by identifying patterns associated with
various medical conditions.
• Transfer Learning: Due to its depth and robust feature representation, ResNet
models pre-trained on large datasets like ImageNet are often used as feature
extractors for various tasks. This transfer learning can significantly improve
performance even with limited data.
• Content-based Image Retrieval: ResNet can be used to extract features from images,
which can then be used to find similar images within a large database, useful in
digital image libraries, e-commerce, and more.
• Augmented Reality: In AR applications, ResNet can be used for real-time image
classification and object detection, which are essential for overlaying digital
information on the real world.
• Agriculture: For precision farming, ResNet can help in analyzing crop health through
aerial images, detecting pests, and predicting yield.

Architecture of VGGNet (along With its Pooling Strategies)

Architecture of VGGNet
Uniform Convolutional Layers:
• The defining characteristic of VGGNet is the use of a series of convolutional layers
with 3×3 filters and a stride of 1, which are stacked on top of each other.


• The use of smaller filters but more depth allows VGGNet to capture complex features
from the input image while keeping computational requirements in check.

Increasing Depth:
• The depth of the network increases from the input layer to the output layer, with
configurations typically having 16 (VGG16) or 19 (VGG19) weighted layers.
• The number of filters in the convolutional layers starts at 64 and increases by a factor
of 2 after each max pooling layer, up to 512.

Pooling Strategies in VGGNet


Max Pooling:
• VGGNet uses max pooling to reduce the spatial dimensions of the output from the
convolutional layers.
• After several convolutional layers, a max pooling layer with a 2×2 filter and a stride of
2 is applied. This reduces the width and height of the output by half, while the depth
increases as you move through the network.

Fixed Pooling Strategy:


• The pooling strategy in VGGNet is fixed. After every few convolutional layers (typically
two or three, depending on the specific model), a max pooling layer is used to halve
the dimensions of the feature maps.
• This pooling strategy is consistently applied throughout the network, contributing to
the model's simplicity.

Fully Connected Layers


• Following several blocks of convolutional and max pooling layers, VGGNet concludes
with three fully connected layers.
• The first two fully connected layers have 4096 channels each, and the third
performs classification, typically having as many channels as the number of classes
in the dataset (e.g., 1000 for ImageNet).
• The first two fully connected layers are each followed by a ReLU activation function.
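
The repeated conv-conv-pool pattern and the fully connected head described above can be sketched as follows. This is a simplified, hedged illustration of a VGG-16-style network, not the exact torchvision implementation; the cfg list is an assumption matching the standard VGG-16 layout.

    import torch
    import torch.nn as nn

    # Each number is a 3x3 conv layer's output channels; 'M' is a 2x2 max pool with stride 2.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']

    def make_vgg_features(cfg, in_channels=3):
        layers = []
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))   # halves height and width
            else:
                layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
                in_channels = v
        return nn.Sequential(*layers)

    features = make_vgg_features(cfg)
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000))              # 1000 classes, as for ImageNet

    x = torch.randn(1, 3, 224, 224)
    print(classifier(features(x)).shape)    # torch.Size([1, 1000])
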

Additional Features
• ReLU Activation: Throughout the network, the ReLU (Rectified Linear Unit) activation
function is used after each convolutional layer.
• Softmax Classifier: The final fully connected layer is followed by a softmax activation
function that outputs a probability distribution over the classes.

Applications:
• Image Classification: VGGNet, with its deep architecture, is excellent for image
classification tasks, where the goal is to categorize images into predefined classes.
• Object Detection: By using it as a feature extractor, VGGNet can be integrated into
object detection frameworks like R-CNN (Region-based Convolutional Neural
Networks) to locate and classify objects within images.
• Image Segmentation: VGGNet's features can be used in segmentation tasks to
classify each pixel of an image into a category, useful in medical imaging and
autonomous driving.


• Transfer Learning: Due to its performance on the ImageNet dataset, VGGNet is often
used as a pre-trained model for transfer learning on other visual recognition tasks.
The learned filters can be effective feature descriptors even for datasets quite different
from the data on which the network was originally trained.
• Content-Based Image Retrieval: The deep features extracted by VGGNet can be used
to find similar images in a database by comparing feature vectors.
• Facial Recognition: The network can be trained to recognize and verify faces, as its
deep layers capture the complex structures of faces.
• Video Analysis: VGGNet can be used to extract features from video frames for tasks
such as activity recognition or motion analysis.
• Style Transfer: The feature space representations found in VGGNet are used to
transfer the style of one image to the content of another, as in creating artistic versions
of photographs.
• Medical Diagnosis: VGGNet's architecture is powerful for medical image analysis,
helping in diagnosing diseases from images like X-rays, MRIs, and CT scans.
• Augmented Reality: VGGNet can be used to understand scenes in augmented reality
applications for enhancing real-world environments with digital overlays.

Role of Batch Normalization in Convolutional Neural Networks


Training Deep Neural Networks is a difficult task that involves several problems to tackle. Despite their huge potential, deep networks can be slow to train and prone to overfitting. Thus, methods to solve these problems are a constant subject of Deep Learning research.
Batch Normalization – commonly abbreviated as Batch Norm – is one of these methods. Currently, it is a widely used technique in the field of Deep Learning. It improves the learning speed of Neural Networks and provides regularization, helping to avoid overfitting.

Normalization
Normalization is a pre-processing technique used to standardize data; in other words, it brings data from different sources into the same range. Not normalizing the data before training can cause problems in our network, making it drastically harder to train and decreasing its learning speed.
For example, imagine we have a car rental service. Firstly, we want to predict a fair price for
each car based on competitors’ data. We have two features per car: the age in years and the
total number of kilometers it has been driven. These two features have very different ranges: age goes from 0 to 30 years, while distance can go from 0 up to hundreds of thousands of kilometers. We don't want features to have such differences in range, as the feature with the larger range might bias our model into giving it inflated importance.
There are two main methods to normalize our data. The most straightforward method is to
scale it to a range from 0 to 1:


x_normalized = (x - m) / (x_max - x_min)

where x is the data point to normalize, m the mean of the data set, x_max the highest value, and x_min the lowest value. This technique is generally applied to the inputs of the network. Non-normalized data points with wide ranges can cause instability in Neural Networks: relatively large inputs can cascade down through the layers, causing problems such as exploding gradients.

The other technique used to normalize data forces the data points to have a mean of 0 and a standard deviation of 1, using the following formula:

x_normalized = (x - m) / s

where x is the data point to normalize, m the mean of the data set, and s the standard deviation of the data set. Each data point then mimics a standard normal distribution. With all the features on this scale, none of them is biased towards larger importance, and therefore our models will learn better.
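
Both pre-processing techniques above can be written in a couple of lines of NumPy. This is a generic sketch on made-up numbers, not tied to real car-rental data.

    import numpy as np

    x = np.array([2.0, 5.0, 11.0, 29.0])            # e.g. car ages in years

    # Range scaling using the mean, the maximum and the minimum
    x_range = (x - x.mean()) / (x.max() - x.min())

    # Standardization: zero mean, unit standard deviation
    x_std = (x - x.mean()) / x.std()

    print(x_range)
    print(x_std, round(x_std.mean(), 6), x_std.std())   # mean ~0, std 1
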

Batch Normalization:
Batch Norm is a normalization technique applied between the layers of a Neural Network instead of on the raw data. It is computed over mini-batches instead of the full data set. It serves to speed up training and allows the use of higher learning rates, making learning easier.

Following the technique explained in the previous section, we can define the normalization formula of Batch Norm as:

z^N = (z - m_z) / s_z

where z is the output of the neurons, m_z the mean of the neurons' output, and s_z the standard deviation of the neurons' output.
In a regular feed-forward Neural Network, x_i are the inputs, z the output of the neurons, a the output of the activation functions, and y the output of the network.

Batch Norm is applied to the neurons' output just before applying the activation function. Usually, a neuron without Batch Norm would be computed as follows:


z = g(w, x) + b;   a = f(z)

where g() is the linear transformation of the neuron, w the weights of the neuron, b the bias of the neuron, and f() the activation function. The model learns the parameters w and b. Adding Batch Norm, it looks like:

z = g(w, x);   z^N = gamma * (z - m_z) / s_z + beta;   a = f(z^N)

where z^N is the normalized output of the neuron, and gamma and beta are learnable parameters that re-scale and shift the normalized value. Note that the bias b can be omitted, since its role is taken over by beta.
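
A minimal NumPy sketch of what Batch Norm computes for one layer's pre-activations. The learnable parameters gamma and beta are initialized to 1 and 0 here purely for illustration; a real layer would also keep running statistics for use at inference time.

    import numpy as np

    def batch_norm(z, gamma, beta, eps=1e-5):
        """Normalize a mini-batch of pre-activations z (shape: batch x features),
        then re-scale and shift with the learnable parameters gamma and beta."""
        m_z = z.mean(axis=0)                 # per-feature mean over the mini-batch
        s_z = z.std(axis=0)                  # per-feature standard deviation
        z_norm = (z - m_z) / (s_z + eps)     # zero mean, unit variance
        return gamma * z_norm + beta

    z = np.random.randn(32, 4) * 10 + 3      # a badly-scaled mini-batch
    gamma, beta = np.ones(4), np.zeros(4)
    z_bn = batch_norm(z, gamma, beta)
    print(z_bn.mean(axis=0).round(3), z_bn.std(axis=0).round(3))
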


Unit IV
Topics to be Discussed:
Sequence Learning: Recurrent Neural Networks, Backpropagation through Time (BPTT),
Vanishing and Exploding Gradients, Truncated BPTT, Gated Recurrent Unit (GRU), Long
Short-Term Memory (LSTMs), Encoder-Decoder Models, Attention Mechanism, Attention
Over Images.

Sequence Learning
Sequence Learning refers to the process where algorithms are trained to identify and interpret
patterns in sequential data. This data is characterized by an inherent order - a sequence that
is significant for analysis and prediction. The order could be temporal (as in time series data)
or spatial (as in a sequence of images or text).
Sequential Data:
Sequential data refers to any kind of data where the order of the elements is significant and
carries important information. In sequential data, each element is not just an independent
data point but is often related to its preceding and/or succeeding elements. This order can
represent a progression over time (temporal sequences) or a specific arrangement in space
(spatial sequences).
Key Characteristics of Sequential Data:
• Order Matters: The sequence of the data points is crucial. Changing the order can
alter the meaning or the insights derived from the data.
• Contextual/Temporal Dependency: Each data point in a sequence typically depends
on previous and/or future data points. For example, each word in a sentence gets its
meaning in relation to the preceding words.
• Variable Length: Unlike regular datasets where each instance has a fixed number of
features, sequential datasets can vary in length. For instance, sentences can have
differing numbers of words.

Illustrative Example: Language Processing


Consider a sentence: "The quick brown fox jumps over the lazy dog." This sentence is a classic
example of sequential data. Here’s why:
• Order is Key: The meaning of the sentence is dependent on the order of the words. If
you shuffle the words, the sentence loses its original meaning.
• Contextual Relationship: Each word in the sentence gains meaning in relation to
the words around it. For instance, "quick" is understood as describing the "fox"
because of its position in the sequence.
• Temporal Sequence: Although not always explicit in text, language often follows a
temporal sequence, representing the flow of ideas or events over time.


Applications:
• Natural Language Processing (NLP): Involves tasks like speech recognition,
language translation, and sentiment analysis.
• Time Series Analysis: Used in financial forecasting, weather prediction, and stock
market analysis.
• Biological Sequence Analysis: Includes DNA sequencing and protein structure
prediction.
• Sensor Data Analysis: Utilized in IoT applications, health monitoring systems, and
activity recognition.

Importance in AI / ML / DL:
• Complex Pattern Recognition: Sequence learning algorithms can uncover complex
patterns that simpler, non-sequential models might miss.
• Predictive Analytics: They are crucial in fields where future predictions based on
past data are essential, like in stock market analysis or weather forecasting.
• Enhanced Interaction and Personalization: In AI applications, understanding
sequential user behavior data leads to more personalized and interactive user
experiences, such as in recommender systems.

Challenges in Sequence Learning:


• Handling Long Dependencies: Traditional models struggle to remember information
over long sequences, a challenge that newer models like LSTMs and GRUs aim to
address.
• Computational Complexity: Processing sequences, especially long ones, can be
computationally intensive.
• Overfitting and Generalization: Ensuring that a model trained on sequential data
generalizes well to unseen sequences is a common challenge.

Consider, as an example, an image showing the sequence of moves in a video of someone performing Surya Namaskar, where the length of each move (in terms of action or movement) is not the same.

Sequence learning is a type of machine learning that involves understanding sequences of data and making predictions based on that sequential information. This is particularly useful
for data where the order and context of individual elements are important for understanding

the whole. Sequence learning algorithms can process input data in the form of sequences to
predict future elements of the sequence or to classify the sequence into categories.

To perform sequence learning on the image provided, we would go through the following steps:
• Data Preparation: For the image of the Surya Namaskar sequence, the data
preparation would involve dividing the image into individual frames where each frame
represents a different pose or movement. Each frame would need to be labeled with
the appropriate step in the Surya Namaskar sequence.
• Feature Extraction: Extract features from each image that could be used to identify
the pose. In a real-world scenario, this might involve using a pre-trained convolutional
neural network to extract high-level features from each image.
• Sequence Modeling: Arrange the extracted features in the order of the sequence. This
ordered data can then be fed into a sequence learning model such as an RNN, LSTM,
or GRU. Each of these models has an inherent ability to remember the past states (or
poses, in this case) and learn the transition dynamics between them.
• Training the Model: The sequence model would be trained on the ordered data. It
would learn the typical transitions between poses, including the duration and
progression from one pose to another.
• Prediction/Classification: After training, the model could be used to predict the next
pose in the sequence, given the previous poses. It could also classify a new sequence
of poses as being a correct or incorrect sequence of Surya Namaskar.
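
A hedged sketch of the Sequence Modeling, Training and Prediction steps above, assuming the per-frame features have already been extracted (for example by a pre-trained CNN) into a tensor of shape (frames, feature_dim). The class count, dimensions, and the name PoseSequenceClassifier are made up for illustration.

    import torch
    import torch.nn as nn

    class PoseSequenceClassifier(nn.Module):
        """Classify an ordered sequence of per-frame feature vectors."""
        def __init__(self, feature_dim=512, hidden_dim=128, num_classes=12):
            super().__init__()
            self.rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_classes)

        def forward(self, frame_features):           # (batch, frames, feature_dim)
            _, (h_n, _) = self.rnn(frame_features)   # h_n: final hidden state of the sequence
            return self.head(h_n[-1])                # logits over pose/sequence classes

    features = torch.randn(1, 10, 512)               # 10 frames of CNN features
    print(PoseSequenceClassifier()(features).shape)  # torch.Size([1, 12])
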

Recurrent Neural Networks


Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed for
processing sequential data. Unlike traditional feedforward neural networks, RNNs have loops
in them, allowing information to persist. This architecture makes them inherently suited for
sequential tasks where the current output depends not just on the current input but also on
the previous outputs or states.
Core Features:
• Memory Capability: RNNs can remember previous inputs due to their internal loop,
which helps in maintaining a sort of 'memory' over the inputs they have processed.
• Sequential Data Processing: They are ideal for time-series data, text, audio, and
other forms of sequential data.
• Flexible Input and Output Lengths: RNNs can handle inputs and outputs of varying
lengths, which is essential for many types of sequential data.

RNNs are specifically designed to recognize patterns in sequences of data, making them highly
effective for tasks like language modeling, speech recognition, and time series forecasting.
The key is their ability to maintain a state or memory of previous inputs while processing new
ones. This characteristic allows them to capture temporal dependencies and contextual
relationships in the data, essential for sequence learning.
Recurrent Neural Networks (RNNs) are a type of artificial neural network designed to process
sequences of data. They work especially well for jobs requiring sequences, such as time series
data, voice, natural language, and other activities.


RNN works on the principle of saving the output of a particular layer and feeding this back
to the input in order to predict the output of the layer.

In the usual diagram of an RNN, the nodes in the different layers of the network are compressed to form a single recurrent layer, with A, B, and C as the parameters of the network.

Here, "x" is the input layer, "h" is the hidden layer, and "y" is the output layer. A, B, and C are the network parameters used to improve the output of the model. At any given time t, the current state is computed from the input x(t) together with the previous state h(t-1). The output at each time step is fed back into the network to inform the next step's output.


How Does Recurrent Neural Networks Work?


Input Sequence: The RNN takes a sequence of vectors (x_1, x_2, ..., x_t) as input. Each vector
x_t corresponds to one time step or one element in the sequence.
Hidden State: At each time step t, the hidden state h_t of the RNN is updated as a function of the current input x_t and the previous hidden state h_{t-1}. This function typically involves a tanh or ReLU activation function:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)

Output Sequence: The output y_t at time step t (for some RNNs) is then calculated from the current hidden state:

y_t = W_hy * h_t + b_y

where W_hh are the weights connecting the hidden state from the previous time step to the current hidden state, W_xh are the weights connecting the current input to the hidden state, W_hy are the weights connecting the hidden state to the output, b_h is the hidden bias term, and b_y is the output bias term.
Backpropagation Through Time (BPTT): To train an RNN, BPTT is used where gradients
are calculated by backpropagating errors through each time step. However, this can be
computationally expensive and difficult to manage for long sequences due to the vanishing
and exploding gradient problems.
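
A minimal NumPy sketch of the recurrence above; the dimensions and random initialization are chosen arbitrarily for illustration.

    import numpy as np

    input_dim, hidden_dim, output_dim = 3, 5, 2
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
    W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
    W_hy = rng.normal(size=(output_dim, hidden_dim))   # hidden-to-output weights
    b_h = np.zeros(hidden_dim)
    b_y = np.zeros(output_dim)

    def rnn_step(x_t, h_prev):
        """One time step: new hidden state from the current input and previous state."""
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
        y_t = W_hy @ h_t + b_y
        return h_t, y_t

    h = np.zeros(hidden_dim)
    for x_t in rng.normal(size=(4, input_dim)):        # a sequence of 4 input vectors
        h, y = rnn_step(x_t, h)
    print(h, y)
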

Features of RNNs
• Sequential Data Processing: Unlike feedforward neural networks, RNNs can process
data of varying lengths due to their recurrent structure.
• Memory of Past Inputs: The hidden state acts as the network's memory which
influences the network output and the next hidden state.

Applications of RNNs
• Language Modeling and Generation: RNNs can predict the probability of the next word
in a sentence based on the previous words, which is useful for generating text.
• Time Series Prediction: They can predict future stock prices or weather patterns based
on past data.
• Speech Recognition: RNNs can transcribe spoken words into text by processing the
audio sequence over time.


• Machine Translation: They can translate a sentence from one language to another,
processing the input sequence and generating a sequence in the target language.

Backpropagation through Time (BPTT)


Backpropagation through time (BPTT) is a method used in recurrent neural networks (RNNs)
to train the network by backpropagating errors through time. In a traditional feedforward
neural network, the data flows through the network in one direction, from the input layer
through the hidden layers to the output layer. However, in RNNs, there are connections
between nodes in different time steps, which means that the output of the network at one
time step depends on the input at that time step as well as the previous time steps.
BPTT works by unfolding the RNN over time, creating a series of interconnected feedforward
networks. Each time step corresponds to one layer in this unfolded network, and the weights
between layers are shared across time steps. The unfolded network can be thought of as a
very deep feedforward network, where the weights are shared across layers.
During training, the error is backpropagated through the unfolded network, and the weights
are updated using gradient descent. This allows the network to learn to predict the output at
each time step based on the input at that time step as well as the previous time steps.
However, BPTT has some challenges, such as the vanishing gradient problem, where the
gradients become very small as they propagate back in time, making it difficult to learn long-
term dependencies. To address this issue, various modifications of BPTT have been proposed,
such as truncated backpropagation through time and gradient clipping.
Backpropagation Through Time (BPTT) is a method used for training Recurrent Neural
Networks (RNNs). It's an extension of the standard backpropagation algorithm, modified to
handle the sequential nature of RNNs.
Sequential Nature of RNNs: Unlike standard neural networks, RNNs have connections that
loop back, meaning the output from one layer feeds back as input to the same layer. This
creates a sequence of operations over time.
Challenge in Training RNNs: Due to these recurrent connections, the gradient of the loss
function not only propagates backward through the layers of the network but also
backward through time.

Steps in BPTT
Forward Pass:
Just like in standard neural networks, the first step involves a forward pass where inputs
are fed into the network, and the network generates outputs at each time step.


In an RNN, this output depends on the current input and the previous hidden state (which
itself is a function of previous inputs).
Calculating the Loss:
• A loss function measures how far the network's output at each time step is from the
expected output.
• The total loss is often the sum of the losses at each time step.

Backward Pass:
• In the backward pass, gradients of the loss function are calculated with respect to
the weights.
• Since the output at each time step depends on computations from previous time steps,
the gradient at each time step must account for the entire history of inputs up to that
point.
• The gradients at each time step are backpropagated through the network, and due to
the recurrent nature, they are also propagated back in time.

Accumulating Gradients:
• Gradients from each time step are accumulated over the sequence, summing the
gradients across time steps for each weight.

Weight Update:
• After backpropagation is complete, the weights are updated using gradient descent or
other optimization techniques, taking into account the accumulated gradients.

BPTT is a widely used technique for training recurrent neural networks (RNNs) that can be
used for various applications such as speech recognition, language modeling, and time series
prediction. Here are some specific use cases for BPTT:
• Speech Recognition: BPTT can be used to train RNNs for speech recognition tasks,
where the network takes in a sequence of audio samples and predicts the
corresponding text. BPTT allows the network to learn the temporal dependencies in
the audio signal and use them to make accurate predictions.
• Language Modeling: BPTT can also be used to train RNNs for language modeling tasks,
where the network predicts the probability distribution of the next word in a sequence
given the previous words. This can be useful for applications such as text generation
and machine translation.
• Time Series Prediction: BPTT can be used to train RNNs for time series prediction
tasks, where the network takes in a sequence of data points and predicts the next
value in the sequence. BPTT allows the network to learn the temporal dependencies
in the data and use them to make accurate predictions.

Example of BPTT:
Let’s consider a simple example of using BPTT to train a recurrent neural network (RNN)
for time series prediction. Suppose we have a time series dataset that consists of a sequence
of data points: {x1, x2, x3, …, xn}. The goal is to train an RNN to predict the next value in the
sequence, xn+1, given the previous values in the sequence.


To do this, we can use BPTT to backpropagate errors through time and update the weights of
the RNN. Here’s how the BPTT algorithm might work:
1. Initialize the weights of the RNN randomly.
2. Feed the first input x1 into the RNN and compute the output y1.
3. Compute the loss between the predicted output y1 and the actual output x2.
4. Backpropagate the error through the network using the chain rule, updating the weights at each time step.
5. Feed the second input x2 into the RNN and compute the output y2.
6. Compute the loss between the predicted output y2 and the actual output x3.
7. Backpropagate the error through the network again, updating the weights at each time step.
8. Repeat steps 5–7 for the entire sequence of inputs {x1, x2, x3, …, xn}.
9. Test the RNN on a separate validation set and adjust the hyperparameters as necessary.
During training, the weights of the RNN are updated based on the gradients computed by
backpropagating the errors through time. This allows the RNN to learn the temporal
dependencies in the data and make accurate predictions for the next value in the sequence.
Overall, BPTT is a powerful technique for training RNNs to model sequential data, and it has
been successfully applied to a wide range of applications in various fields.
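
A hedged sketch of the training loop above using PyTorch, whose autograd performs backpropagation through time automatically when loss.backward() is called on a loss accumulated over the whole sequence. The sine-wave data and hyperparameters are illustrative assumptions, not a prescribed setup.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    series = torch.sin(torch.linspace(0, 20, 101))           # toy time series
    inputs = series[:-1].view(1, -1, 1)                      # x1 ... xn
    targets = series[1:].view(1, -1, 1)                      # x2 ... xn+1 (next values)

    rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
    head = nn.Linear(16, 1)
    optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(200):
        optimizer.zero_grad()
        hidden_states, _ = rnn(inputs)        # forward pass over every time step
        predictions = head(hidden_states)     # prediction y_t for each step
        loss = loss_fn(predictions, targets)  # total loss over the sequence
        loss.backward()                       # BPTT: gradients flow back through time
        optimizer.step()
    print(loss.item())
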

Limitation of BPTT:
While backpropagation through time (BPTT) is a powerful technique for training recurrent
neural networks (RNNs), it has some limitations:
• Computational complexity: BPTT requires computing the gradient at each time step,
which can be computationally expensive for long sequences. This can lead to slow
training times and may require specialized hardware to train large-scale models.
• Vanishing gradients: BPTT is prone to the problem of vanishing gradients, where the
gradients become very small as they propagate back in time. This can make it
difficult to learn long-term dependencies, which are important for many sequential
data modeling tasks.
• Exploding gradients: On the other hand, BPTT is also prone to the problem of
exploding gradients, where the gradients become very large as they propagate back in
time. This can lead to unstable training and can cause the weights of the network to
become unbounded, resulting in NaN values.
• Memory limitations: BPTT requires storing the activations of each time step, which
can be memory-intensive for long sequences. This can limit the size of the sequence
that can be processed by the network.
• Difficulty in parallelization: BPTT is inherently sequential, which makes it difficult to
parallelize across multiple GPUs or machines. This can limit the scalability of the
training process.

Vanishing and Exploding Gradients


The phenomena of vanishing and exploding gradients are significant challenges encountered
in the training of deep neural networks, especially in architectures like Recurrent Neural
Networks (RNNs). They occur during the backpropagation process and can drastically affect
the learning efficiency of the network. Let's delve into how each of these phenomena works:


Vanishing Gradients
• Cause: Vanishing gradients occur when the gradients (derivatives of the loss function
with respect to the weights) become very small, exponentially decreasing as they are
propagated back through the layers during training.
• Mathematics: This often happens due to the use of certain activation functions like
the sigmoid or hyperbolic tangent (tanh), which squish input values into a very small
output range. Their derivatives are small, and when multiplied through many layers,
the gradients can diminish to the point of being insignificant (approaching zero).
• Effect: When gradients vanish, the weights of the network, especially in the earlier
layers, receive very little update, and the network stops learning or learns very slowly.
This is particularly problematic in deep networks with many layers.

Exploding Gradients
Cause: Exploding gradients happen when the gradients become excessively large. This can
occur when the values of the weights, the derivatives of the activation functions, or the input
data are large.
Mathematics: If the gradients grow exponentially as they are propagated back through the
network's layers, it results in very large gradient values.
Effect: Large gradients can cause the weights to be updated in too large of steps. This often
leads to an unstable network where the model's weights can oscillate, diverge, and fail to
converge to a solution.

Common in RNNs
• Both phenomena are particularly common in RNNs due to their sequential nature and
the repeated use of the same weights at each time step. This repeated multiplication
can exacerbate the vanishing or exploding of gradients.

Solutions
For Vanishing Gradients:
• Use of ReLU Activation Function: ReLU (Rectified Linear Unit) and its variants (like
Leaky ReLU) help in mitigating vanishing gradients as they do not saturate in the
positive domain.
• LSTM and GRU Networks: These architectures introduce gating mechanisms to
control the flow of information and gradients, mitigating the vanishing gradient
problem.
• Residual Connections: Used in Convolutional Neural Networks (CNNs), these
connections allow gradients to skip layers, reducing the vanishing effect.

For Exploding Gradients:


• Gradient Clipping: This technique involves scaling down gradients when they exceed
a certain threshold, preventing them from becoming too large.
• Weight Regularization: Applying regularization techniques to the weights can prevent
them from growing too large.
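
Gradient clipping is typically a one-line addition to the training loop. A minimal sketch with PyTorch follows; the tiny linear model, random data, and the threshold of 1.0 are arbitrary illustrative choices.

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 4), torch.randn(8, 1)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Rescale the gradient vector if its norm exceeds the threshold (here 1.0),
    # preventing a single large update from destabilizing training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
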


Truncated BPTT
Truncated Backpropagation Through Time (TBPTT) is a variation of the standard
Backpropagation Through Time (BPTT) algorithm, specifically designed to tackle some of the
practical challenges associated with training Recurrent Neural Networks (RNNs).
Challenges Addressed by TBPTT
Vanishing and Exploding Gradients: In standard BPTT, when gradients are propagated back
through many time steps, they can either vanish or explode, leading to training difficulties.
Computational Efficiency: Propagating gradients over long sequences in standard BPTT is
computationally demanding and memory-intensive.
How TBPTT Works
Sequence Segmentation: Instead of processing the entire input sequence in one go, TBPTT
divides the long input sequence into smaller subsequences.
Forward Pass:
• The network processes the input sequence, but only for a fixed number of time steps
(the length of the subsequence).
• It retains the hidden states at each time step within this subsequence.

Backward Pass:
• Gradients are calculated and backpropagated, but only for the same fixed number of
steps as in the forward pass.
• This means the network does not backpropagate through the entire sequence, hence
the term "truncated".

Weight Updates:
• After backpropagating through each subsequence, the weights of the network are
updated.
• The network then processes the next subsequence, using the final hidden state of the
previous subsequence as the initial hidden state for the new subsequence.

Key Points of TBPTT
• Two Hyperparameters: TBPTT has two crucial hyperparameters: the number of time
steps for the forward pass and the number of time steps for the backward pass. These
can be the same or different.
• Trade-Offs: The choice of these hyperparameters involves a trade-off between
computational efficiency and the ability of the network to capture long-term
dependencies.
• Preserving Long-Term Dependencies: While TBPTT addresses the computational
issues of BPTT, it can still struggle to capture dependencies that span beyond the
length of the truncated sequences.
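
A hedged sketch of the truncation described above: the sequence is processed in fixed-length chunks, and the hidden state is detached between chunks so gradients stop at the chunk boundary. The chunk length and model sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
    head = nn.Linear(8, 1)
    optimizer = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
    loss_fn = nn.MSELoss()

    sequence = torch.randn(1, 1000, 1)        # one long input sequence
    targets = torch.randn(1, 1000, 1)
    k = 50                                    # truncation length (forward and backward)

    hidden = torch.zeros(1, 1, 8)
    for start in range(0, 1000, k):
        chunk = sequence[:, start:start + k]
        target = targets[:, start:start + k]
        optimizer.zero_grad()
        out, hidden = rnn(chunk, hidden)
        loss = loss_fn(head(out), target)
        loss.backward()                       # gradients flow back only within this chunk
        optimizer.step()
        hidden = hidden.detach()              # carry the state forward, but cut the graph
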

Applications
• TBPTT is widely used in training RNNs for tasks like language modeling, text
generation, and other scenarios where input sequences are lengthy.


Gated Recurrent Unit (GRU)


A Gated Recurrent Unit (GRU) is a Recurrent Neural Network (RNN) architecture type. Like
other RNNs, a GRU can process sequential data such as time series, natural language, and
speech.
Take a look at the following sentence:

"My mom gave me a bicycle on my birthday because she knew that I wanted to go biking with
my friends."

As we can see from the above sentence, words that affect each other can be further apart. For
example, "bicycle" and "go biking" are closely related but are placed further apart in the
sentence.
A plain RNN finds it difficult to track the state across such a long context; it needs to work out which information is important. A GRU cell greatly alleviates this problem.
The GRU was introduced in 2014 by Cho et al. It solves problems involving long sequences whose related contexts are placed far apart, like the biking example above. This is possible because of how the GRU cell in the GRU architecture is built. Let us now delve deeper into the understanding and working of the GRU network.

The Gated Recurrent Unit (GRU) cell is the basic building block of a GRU network. It
comprises three main components: an update gate, a reset gate, and a candidate hidden
state.

One of the key advantages of the GRU cell is its simplicity. Since it has fewer parameters than
a long short-term memory (LSTM) cell, it is faster to train and run and less prone to
overfitting.
Additionally, one thing to remember is that although the GRU cell's architecture is simple, the cell itself acts as a black box: the final decision on how much of the past state to consider and how much to forget is taken inside the GRU cell. To understand what the cell is doing, we need to look inside it.
Architecture:
A GRU cell keeps track of the important information maintained throughout the network. A
GRU network achieves this with the following two gates:
• Reset Gate
• Update Gate.

Given below is the simplest architectural form of a GRU cell.


As shown below, a GRU cell takes two inputs:


• The previous hidden state
• The input in the current timestamp.

The cell combines these and passes them through the update and reset gates to obtain a new hidden state, which is then passed on to the next time step. To get the output at the current time step, we pass this hidden state through a dense layer with softmax activation to predict the output.

Update Gate
The update gate determines how much information the current GRU cell passes on to the next GRU cell. It helps in keeping track of the most important information.
Reset Gate
The reset gate identifies the unnecessary information and decides what information should be dropped from the GRU's memory. Simply put, it decides what information to delete at a specific timestamp.

Long Short-Term Memory (LSTM) is a special kind of RNN capable of learning long-term dependencies in sequences. LSTMs were introduced by Hochreiter and Schmidhuber in 1997 and are explicitly designed to avoid the long-term dependency problem; remembering information over long periods of time is essentially their default behaviour.

The popularity of the LSTM is due to the gating mechanism built into each LSTM cell. In a normal RNN cell, the input at the current time step and the hidden state from the previous time step are passed through an activation layer to obtain a new state. In an LSTM the process is slightly more complex: at each time step the cell takes input from three different sources, namely the current input, the short-term memory from the previous cell (the hidden state), and the long-term memory (the cell state).
These cells use gates to regulate which information is kept or discarded at each step before passing the long-term and short-term information on to the next cell. We can imagine these gates as filters that remove unwanted and irrelevant information. There are a total of three gates in an LSTM: the Input Gate, the Forget Gate, and the Output Gate.

Input Gate
The input gate decides what information will be stored in long term memory. It only works
with the information from the current input and short term memory from the previous step.
At this gate, it filters out the information from variables that are not useful.

Forget Gate
The forget gate decides which information from long-term memory should be kept or discarded. This is done by multiplying the incoming long-term memory by a forget vector generated from the current input and the incoming short-term memory.

Output Gate
The output gate takes the current input, the previous short-term memory, and the newly computed long-term memory to produce a new short-term memory (hidden state), which is passed on to the cell at the next time step. The output of the current time step can also be drawn from this hidden state.

Working of GRU
What is Gated Recurrent Unit or GRU?
The workflow of the Gated Recurrent Unit (GRU) is the same as that of an RNN; the difference lies in the operations and gates associated with each GRU unit. To solve the problems faced by a standard RNN, the GRU incorporates two gating mechanisms, called the update gate and the reset gate.

Update gate
The update gate is responsible for determining the amount of previous information that needs
to pass along the next state. This is really powerful because the model can decide to copy all
the information from the past and eliminate the risk of vanishing gradient.

Reset gate
The reset gate is used by the model to decide how much of the past information to neglect; in short, it decides whether the previous cell state is important or not.
The reset gate acts first, determining which relevant information from the past time step goes into the new memory content. The cell multiplies the input vector and the previous hidden state by their weights, computes an element-wise multiplication between the reset gate and the previous hidden state, sums these terms, and then applies a non-linear activation function (tanh) to generate the candidate hidden state for the next step.
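
The update and reset gates described above can be written out explicitly. Below is a minimal NumPy sketch of one GRU step; the weight shapes and random initialization are purely illustrative.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
        z = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
        r = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
        h_cand = np.tanh(W_h @ x_t + U_h @ (r * h_prev))    # candidate hidden state
        return (1 - z) * h_prev + z * h_cand                # blend old state and candidate

    rng = np.random.default_rng(0)
    input_dim, hidden_dim = 3, 4
    weights = [rng.normal(scale=0.1, size=s)
               for s in [(hidden_dim, input_dim), (hidden_dim, hidden_dim)] * 3]
    h = np.zeros(hidden_dim)
    for x_t in rng.normal(size=(5, input_dim)):             # a short input sequence
        h = gru_step(x_t, h, *weights)
    print(h)
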

Long Short-Term Memory (LSTMs)


Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that can capture long-term dependencies in sequential data. LSTMs are able to process and analyze sequential data such as time series, text, and speech. They use a memory cell and gates to control the flow of information, allowing them to selectively retain or discard information as needed and thus avoid the vanishing gradient problem that plagues traditional RNNs. LSTMs are widely used in applications such as natural language processing, speech recognition, and time series forecasting.
There are three types of gates in an LSTM: the input gate, the forget gate, and the output
gate.
The input gate controls the flow of information into the memory cell. The forget gate controls
the flow of information out of the memory cell. The output gate controls the flow of information
out of the LSTM and into the output.
The three gates (input gate, forget gate, and output gate) are all implemented using sigmoid functions, which produce an output between 0 and 1. These gates are trained using the backpropagation algorithm through the network.
The input gate decides which information to store in the memory cell. It is trained to open
when the input is important and close when it is not.
The forget gate decides which information to discard from the memory cell. It is trained to
open when the information is no longer important and close when it is.
The output gate is responsible for deciding which information to use for the output of the
LSTM. It is trained to open when the information is important and close when it is not.

The gates in an LSTM are trained to open and close based on the input and the previous
hidden state. This allows the LSTM to selectively retain or discard information, making it
more effective at capturing long-term dependencies.

Architecture of LSTM:
An LSTM (Long Short-Term Memory) network is a type of recurrent neural network (RNN) that is capable of handling and processing sequential data. The structure of an LSTM network
consists of a series of LSTM cells, each of which has a set of gates (input, output, and forget
gates) that control the flow of information into and out of the cell. The gates are used to
selectively forget or retain information from the previous time steps, allowing the LSTM to
maintain long-term dependencies in the input data.

The LSTM cell also has a memory cell that stores information from previous time steps and
uses it to influence the output of the cell at the current time step. The output of each LSTM
cell is passed to the next cell in the network, allowing the LSTM to process and analyze
sequential data over multiple time steps.
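
A minimal NumPy sketch of one LSTM step, showing the three gates acting on the long-term memory (cell state) and short-term memory (hidden state). The shapes and initialization are illustrative only.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """W, U, b hold the parameters for the forget, input, candidate and output paths."""
        f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
        i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
        g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate memory
        o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
        c_t = f * c_prev + i * g             # update long-term memory (cell state)
        h_t = o * np.tanh(c_t)               # new short-term memory (hidden state)
        return h_t, c_t

    rng = np.random.default_rng(0)
    input_dim, hidden_dim = 3, 4
    W = {k: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for k in 'figo'}
    U = {k: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for k in 'figo'}
    b = {k: np.zeros(hidden_dim) for k in 'figo'}
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    for x_t in rng.normal(size=(5, input_dim)):
        h, c = lstm_step(x_t, h, c, W, U, b)
    print(h, c)
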


Applications:
Long Short-Term Memory (LSTM) is a highly effective Recurrent Neural Network (RNN) that
has been utilized in various applications. Here are a few well-known LSTM applications:
• Language Modeling: Long Short-Term Memory networks (LSTMs) have been utilized
for natural language processing tasks such as machine translation, language
modeling, and text summarization. By understanding the relationships between
words in a sentence, they can be trained to construct meaningful and grammatically
correct sentences.
• Voice Recognition: LSTMs have been utilized for speech recognition tasks such as
speech-to-text transcription and command recognition. They can be trained to
recognize patterns in speech and match them to the appropriate text.
• Sentiment Analysis: LSTMs can be used to classify text sentiment as positive,
negative, or neutral by learning the relationships between words and their associated
sentiments.
• Time Series Prediction: LSTMs can be used to predict future values in a time series
by learning the relationships between past values and future values.
• Video Analysis: LSTMs can be used to analyze video by learning the relationships
between frames and their associated actions, objects, and scenes.
• Handwriting Recognition: LSTMs can be used to recognize handwriting by learning
the relationships between images of handwriting and the corresponding text.

Encoder-Decoder Models
Encoder-Decoder models are a framework in machine learning, particularly in the field of
neural networks, designed to handle sequence-to-sequence tasks, where the input and
output are both sequences that may differ in length. This framework is widely used in
applications like machine translation, text summarization, and speech recognition.
The encoder-decoder architecture for recurrent neural networks is the standard neural
machine translation method that rivals and in some cases outperforms classical statistical
machine translation methods.
Encoder:
• Function: The encoder processes the input sequence and compresses the information
into a context vector, a fixed-length representation of the input sequence.
• Architecture: Typically consists of a stack of recurrent layers (like LSTM or GRU) in
complex tasks. In simpler tasks, it can be a single recurrent layer.
• Context Vector: This vector aims to encapsulate the essence of the input sequence
for the decoder to use.
• A stack of several recurrent units (LSTM or GRU cells for better performance) where
each accepts a single element of the input sequence, collects information for that
element and propagates it forward.
• In a question-answering problem, the input sequence is a collection of all the words
from the question. Each word is represented as x_i, where i is the order of that word.


• The hidden states h_i are computed using the formula:

  h_i = f(W^(hh) * h_{i-1} + W^(hx) * x_i)

Decoder:
• Function: The decoder takes the context vector and generates the output sequence
from it. The generation of the output sequence is typically done one element at a time.
• Architecture: Also typically a recurrent neural network, it is designed to produce
sequences and can be of the same or different architecture as the encoder.
• Sequential Output Generation: The decoder often uses its previous output as part
of the input for generating the next element in the sequence.
• A stack of several recurrent units where each predicts an output y_t at a time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an
output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of all words
from the answer. Each word is represented as y_i where i is the order of that word.
• Any hidden state h_i is computed using the formula:

  h_i = f(W^(hh) * h_{i-1})

• The output y_t at time step t is computed using the formula:

  y_t = softmax(W^S * h_t)

  where W^S are the weights mapping the hidden state to the output vocabulary, and the softmax produces a probability distribution over the possible output words.

Encoder-Decoder Communication
• Context Vector: This is the crucial link between the encoder and decoder. It’s the
encoder’s final hidden state and is used as the initial hidden state of the decoder.
• Attention Mechanism: To overcome the limitation of encoding the entire input
sequence into a fixed-length context vector, the attention mechanism allows the
decoder to focus on different parts of the input sequence at each step of the output
generation.
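
A hedged skeleton of the encoder-decoder pattern in PyTorch, with the encoder's final hidden state used as the context vector that initializes the decoder. The vocabulary sizes, greedy decoding, and the start-token index are illustrative assumptions, not a complete translation system.

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hid=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.GRU(emb, hid, batch_first=True)
            self.decoder = nn.GRU(emb, hid, batch_first=True)
            self.out = nn.Linear(hid, tgt_vocab)

        def forward(self, src, max_len=10, start_token=1):
            _, context = self.encoder(self.src_emb(src))    # context vector = final hidden state
            token = torch.full((src.size(0), 1), start_token, dtype=torch.long)
            hidden, outputs = context, []
            for _ in range(max_len):                        # generate one token at a time
                dec_out, hidden = self.decoder(self.tgt_emb(token), hidden)
                logits = self.out(dec_out[:, -1])
                token = logits.argmax(dim=-1, keepdim=True) # feed prediction back in (greedy)
                outputs.append(token)
            return torch.cat(outputs, dim=1)

    src = torch.randint(0, 1000, (2, 7))                    # a batch of 2 source sequences
    print(Seq2Seq()(src).shape)                             # torch.Size([2, 10])
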


Attention Mechanism
The attention mechanism in deep learning was initially developed to enhance the encoder-decoder model's efficiency in machine translation. This mechanism operates by selectively focusing on the most pertinent elements of the input sequence, similar to how we might concentrate on a single conversation amidst the noise of a crowded room.
Fundamentally, the attention mechanism is akin to our brain's neurological system, which
emphasizes relevant sounds while filtering out background distractions. In the realm of deep
learning, it allows neural networks to attribute varying levels of importance to different input
segments, significantly boosting their capability to capture essential information. This
process is crucial in tasks such as natural language processing (NLP), where attention aids
in aligning relevant parts of a source sentence during translation or question-answering
activities.

Attention mechanisms in deep learning are used to help the model focus on the most
relevant parts of the input when making a prediction. In many problems, the input data may
be very large and complex, and it can be difficult for the model to process all of it. Attention
mechanisms allow the model to selectively focus on the parts of the input that are most
important for making a prediction, and to ignore the less relevant parts.
Attention mechanisms were introduced as a way to address this limitation in these models.
In attention-based models, the model can selectively focus on certain parts of the input when
making a prediction.
Consider machine translation as an example, where a traditional seq2seq model would be
used. Seq2seq models are typically composed of two main components: an encoder and a
decoder.

• The encoder processes the input sequence and represents it as a fixed-length vector
(context vector), which is then passed to the decoder.
• The decoder uses this fixed-length context vector to generate the output sequence.


The encoder and decoder networks are recurrent neural networks like GRUs and LSTMs.
The attention mechanism allows the model to "pay attention" to certain parts of the data
and to give them more weight when making predictions.

In a nutshell, the attention mechanism helps preserve the context of every word in a sentence
by assigning an attention weight relative to all other words. This way, even if the sentence is
large, the model can preserve the contextual importance of each word.

For example, in natural language processing tasks such as language translation, the attention mechanism can help the model to understand the meaning of words in context. Instead of just processing each word individually, the attention mechanism allows the model to consider each word in relation to the other words in the sentence, which helps it understand the sentence's overall meaning.

We can implement the attention mechanism in many different ways, but one common
approach is to use a neural network to learn which parts of the data are the most relevant.
This network is trained to pay attention to the data's most important parts and give them
more weight when making predictions.

Overall, the attention mechanism is a powerful tool for improving the performance of
sequence models. By allowing the model to focus on the most relevant information, the
attention mechanism can help to improve the accuracy of predictions and to make the model
more efficient by only processing the most important data. As deep learning advances, we
might see even more sophisticated applications of the attention mechanism.
How Does the Attention Mechanism Work?
Let's consider a machine translation example where x denotes the source sentence, with a length of n, and y denotes the target sequence, with a length of m.

For a bi-directional sequence model, which could be used for this task, there will be two hidden states: the forward and the backward hidden states. In Bahdanau et al., 2015, a simple concatenation of these two hidden states represents the encoder state. That way, both preceding and following words can be used to compute the attention for any word in the input.


The decoder network's hidden state is

s_t = f(s_{t-1}, y_{t-1}, c_t)

where t denotes the decoding time step and c_t is the context vector (for each output y_t), which is nothing but a sum of the hidden states of the input sequence h_i, weighted by alignment scores:

c_t = sum over i of (alpha_{t,i} * h_i)

Now how is this alignment score that acts as the weight calculated? The alignment score is
parameterized by a single feed-forward neural network which is trained along with other parts
of the model.
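
A NumPy sketch of how the alignment scores and the context vector can be computed with a small feed-forward scoring network, in the spirit of Bahdanau-style (additive) attention. The dimensions and the random parameters W_a, U_a, v_a are illustrative assumptions; in a real model they are trained jointly with the rest of the network.

    import numpy as np

    rng = np.random.default_rng(0)
    n, enc_dim, dec_dim, att_dim = 6, 8, 8, 10

    h = rng.normal(size=(n, enc_dim))           # encoder hidden states h_1 ... h_n
    s_prev = rng.normal(size=dec_dim)           # previous decoder state s_{t-1}

    # Small feed-forward scoring network
    W_a = rng.normal(scale=0.1, size=(att_dim, dec_dim))
    U_a = rng.normal(scale=0.1, size=(att_dim, enc_dim))
    v_a = rng.normal(scale=0.1, size=att_dim)

    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in h])   # alignment scores
    alpha = np.exp(scores) / np.exp(scores).sum()      # softmax -> attention weights
    c_t = (alpha[:, None] * h).sum(axis=0)             # context vector: weighted sum of h_i

    print(alpha.round(3), c_t.shape)                   # weights sum to 1; c_t has enc_dim entries
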

Attention Over Images


Attention Mechanism: A process in neural networks that allows the model to focus on
specific parts of the input data that are more relevant to the task at hand, similar to human
attention.
Attention Over Images: This is the application of attention mechanisms specifically to image
data, enabling the model to selectively concentrate on certain regions of an image for
processing.

Key Concepts and Operations


• Selective Focus: The model dynamically assigns higher importance to certain regions
of an image while processing, based on the context of the task.
• Contextual Relevance: It understands the relevance of different parts of an image in
relation to the entire image or the task being performed.

• Weight Assignment: The attention model assigns weights to different regions of the
image, indicating the level of focus each region should receive.

Applications
• Image Captioning: In generating captions for images, the model focuses on specific
objects or areas in the image that are relevant to the text being generated.
• Medical Imaging: Attention mechanisms help in identifying specific areas in medical
images, such as tumors in MRI scans, by focusing on relevant sections.
• Object Detection and Recognition: Enhances the accuracy of models by
concentrating on regions where objects are located.
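
A toy NumPy sketch of attention over images: a CNN feature map is treated as a grid of region vectors, each region receives a weight, and the weighted sum becomes the attended image representation. The query vector here is random purely for illustration; in image captioning it would come from the decoder state.

    import numpy as np

    rng = np.random.default_rng(0)
    H, W, C = 7, 7, 512
    feature_map = rng.normal(size=(H, W, C))        # CNN output: a grid of region vectors
    regions = feature_map.reshape(-1, C)            # 49 regions, each a C-dimensional vector

    query = rng.normal(size=C)                      # e.g. derived from the caption decoder state
    scores = regions @ query                        # one relevance score per region
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                     # softmax -> attention map over regions

    attended = (alpha[:, None] * regions).sum(axis=0)   # weighted image representation
    print(alpha.reshape(H, W).round(3).max(), attended.shape)
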


Unit V
Topics to be Discussed:
Deep Generative Models: Deep Belief Networks, Restricted Boltzmann Machines (RBMs),
Generative Adversarial Networks (GANs), Autoencoders.
Transfer Learning: Approaches in Transfer Learning, Transfer Learning with Inception
Model.

Deep Generative Models

Deep Belief Networks


Deep belief networks (DBNs) are a type of deep learning algorithm that addresses the
problems associated with classic neural networks. They do this by using layers of stochastic
latent variables, which make up the network. These latent variables, also called feature
detectors or hidden units, are typically binary, and they are known as stochastic because
they can take on any value within a specific range with some probability.
The connections between the top two layers of a DBN are undirected, while the layers below them
have directed links pointing towards the lower layers. DBNs differ from traditional neural networks
because they can act as both generative and discriminative models; a conventional neural network,
for example, can only be trained to classify images.
DBNs also differ from other deep learning algorithms like restricted Boltzmann machines
(RBMs) or autoencoders because they do not work directly with raw inputs the way a single RBM does.
Instead, they rely on an input layer with one neuron per input feature and then proceed through many
layers until reaching a final layer where the outputs are generated using probabilities derived
from the previous layers' activations.
Deep Belief Networks (DBNs) are a class of deep neural networks with multiple layers of
hidden units. They are composed of simpler, unsupervised networks such as Restricted
Boltzmann Machines (RBMs). DBNs are probabilistic generative models, meaning they can
generate new data points when trained on a set of data.
DBNs are significant for their ability to perform unsupervised learning efficiently, making
them useful for feature extraction in large and complex datasets. They can be trained one
layer at a time, which was a breakthrough in training deep architectures effectively. DBNs
are adaptable to both generative and discriminative tasks, making them versatile in various
applications.

Consider stacking several RBMs so that the outputs of the first RBM serve as the input
for the second RBM, and so forth. Networks built this way are called Deep Belief Networks.
The connections within each layer's RBM are undirected (as each layer is an RBM), while those
between the layers are directed (except for the top two layers, whose connections are
undirected). The DBNs can be trained in two different ways:
• Greedy Layer-wise Training Algorithm: The RBMs are trained with a greedy, layer-by-layer
training algorithm. The orientation between the DBN layers is established as soon as the
individual RBMs have been trained (i.e., once their parameters, the weights and biases,
have been defined).
• Wake-sleep Algorithm: The DBN is trained from the bottom up in the wake phase (the upward
connections indicate wake) and then from the top down in the sleep phase (the downward
connections indicate sleep).

We stack the RBMs, train them, and then orient the connections so that they only point
downwards (except for the top two layers, whose connections remain undirected).
Emergence of DBNs:
Perceptrons, the first generation of neural networks, are incredibly powerful. You can use
them to identify an object in an image or tell you how much you like a particular food based
on your reaction. But they are limited: they typically only consider one piece of information
at a time and cannot take into account the context of what is happening around them.
To address these problems, second-generation neural networks were introduced. They rely on
backpropagation, a method that compares the obtained output with the desired outcome and keeps
reducing the error value until each perceptron eventually reaches its optimal state.
The next step is directed acyclic graphs (DAGs), also known as belief networks, which aid in
solving inference and learning problems and give us more control over our data than before.
Finally, we can use deep belief networks (DBNs) to help construct fair values that we can
store in the leaf nodes, meaning that no matter what happens along the way, we always have
an accurate answer at hand.

Architecture of DBN:
In a DBN, we have a hierarchy of layers. The top two layers form the associative memory,
and the bottom layer holds the visible units. The arrows point towards the layer that is
closest to the data, indicating the relationships between all the lower layers.
Directed acyclic connections in the lower layers translate the associative memory into
observable variables.
The lowest layer of visible units receives the input data, either as binary or as real-valued
data. As in an RBM, there are no intra-layer connections in a DBN. The hidden units represent
features that encapsulate the correlations in the data.
A matrix of weights W connects two adjacent layers, and every unit in each layer is linked to
every unit in the layer above it.

Working Methodology of DBN:


First, we train a layer of features that receives input directly from the pixels. Then we learn
the features of these previously obtained features by treating the activations of the trained layer
as if they were pixels. The lower bound on the log-likelihood of the training data set improves
every time a fresh layer of features is added to the network.
The deep belief network's operational pipeline is as follows:
• First, we run numerous steps of Gibbs sampling in the top two hidden layers. The
top two hidden layers define the RBM. Thus, this stage effectively extracts a sample
from it.
• Then generate a sample from the visible units using a single pass of ancestral
sampling through the rest of the model.
• Finally, we use a single bottom-up pass to infer the values of the latent variables in
each layer. Greedy pretraining begins with an observed data vector in the bottom layer;
the generative weights are then fine-tuned in the reverse direction.
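As a rough sketch of the greedy layer-wise idea described above, the snippet below stacks RBMs so that each RBM is trained on the hidden activations of the one below it. The helper train_rbm is hypothetical and only assumed for this example; a single contrastive-divergence update that such a helper could be built from (cd1_step) is sketched in the RBM section below.

def greedy_layerwise_pretrain(data, layer_sizes, train_rbm):
    """Stack RBMs: each RBM is trained on the hidden activations of the one below it.

    data        : array of shape (num_samples, num_visible)
    layer_sizes : hidden-layer sizes, e.g. [512, 256, 128]
    train_rbm   : hypothetical helper; train_rbm(v, n_hidden) -> (weights, hidden_fn)
    """
    stack = []
    layer_input = data
    for n_hidden in layer_sizes:
        weights, hidden_fn = train_rbm(layer_input, n_hidden)
        stack.append(weights)
        # The hidden activations of this layer become the "visible" data for the next RBM.
        layer_input = hidden_fn(layer_input)
    return stack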

Applications:
• Image Recognition and Processing: DBNs are particularly effective in recognizing
patterns and features in images. They are used for tasks like object recognition, facial
recognition, and handwriting recognition. For instance, they can be trained to identify
specific objects within images or to recognize different styles of handwritten digits.
• Speech Recognition: In the field of audio processing, DBNs are used to identify
patterns in audio signals, making them suitable for speech recognition tasks. They
can learn features from raw audio data and are used in systems that convert spoken
language into text or perform speaker identification.
• Video Recognition: DBNs can process video data to recognize patterns over time,
making them useful in motion capture data analysis and video classification. They
can be used to track movements in videos or to recognize specific actions.
• Recommender Systems: DBNs can be utilized in collaborative filtering to recommend
products to users based on their past preferences. They are capable of learning
complex user-item interactions from large datasets, making them effective for
personalized recommendation systems.

• Natural Language Processing (NLP): In NLP, DBNs are used for various tasks like
topic modeling, sentiment analysis, and language modeling. They can learn to
understand the structure and nuances of language, aiding in the processing of large
text datasets.
• Bioinformatics: DBNs find applications in bioinformatics for tasks such as gene
expression pattern recognition, protein structure prediction, and DNA sequence
analysis. Their ability to learn complex patterns in biological data makes them
valuable in this field.
• Financial Modeling: In the finance sector, DBNs are used for predicting stock market
trends, analyzing risk, and detecting fraudulent activities. They can process large
volumes of financial data to uncover underlying patterns and trends.
• Healthcare and Medical Diagnosis: DBNs are used in medical image analysis for
tasks like disease detection, medical image classification, and patient diagnosis.
They can analyze medical images like X-rays, MRIs, and CT scans to assist in
diagnosis and treatment planning.

Restricted Boltzmann Machines (RBMs)


Deep Boltzmann Machines (DBMs)
DBMs have undirected connections between the layers, in addition to the connections within the
individual blocks (unlike DBNs, in which the connections between layers are directed). DBMs can be
utilized for more challenging tasks since they can extract more sophisticated or complex features.

The Restricted Boltzmann Machine is an undirected graphical model that has played a major role in
the deep learning framework in recent times. It was initially introduced as the Harmonium by Paul
Smolensky in 1986, and it gained great popularity in recent years in the context of the Netflix
Prize, where Restricted Boltzmann Machines achieved state-of-the-art performance in
collaborative filtering and beat most of the competition.
It is an algorithm which is useful for dimensionality reduction, classification, regression,
collaborative filtering, feature learning, and topic modeling.
An RBM shares a similar idea to an autoencoder, but it uses stochastic units with a particular
probability distribution instead of deterministic units. The task of training is to find out how
these two sets of variables (visible and hidden) are actually connected to each other.
One aspect that distinguishes an RBM from other autoencoders is that it has two biases.
The hidden bias helps the RBM produce the activations on the forward pass, while
the visible layer's biases help the RBM learn the reconstructions on the backward pass.

Restricted Boltzmann Machines are shallow, two-layer neural nets that constitute the
building blocks of deep-belief networks. The first layer of the RBM is called the visible, or
input layer, and the second is the hidden layer. In the usual diagram, each circle represents a
neuron-like unit called a node. The nodes are connected to each other across layers, but no two
nodes of the same layer are linked.

The restriction in a Restricted Boltzmann Machine is that there is no intra-layer
communication. Each node is a locus of computation that processes input and begins by
making stochastic decisions about whether to transmit that input or not.

Working of Restricted Boltzmann Machine


Each visible node takes a low-level feature from an item in the dataset to be learned. At node
1 of the hidden layer, x is multiplied by a weight and added to a bias. The result of those two
operations is fed into an activation function, which produces the node’s output, or the
strength of the signal passing through it, given input x.

Next, let’s look at how several inputs would combine at one hidden node. Each x is multiplied
by a separate weight, the products are summed, added to a bias, and again the result is
passed through an activation function to produce the node’s output.

At each hidden node, each input x is multiplied by its respective weight w. That is, a single
input x would have three weights here, making 12 weights altogether (4 input nodes x 3
hidden nodes). The weights between the two layers will always form a matrix where the rows
are equal to the input nodes, and the columns are equal to the output nodes.

Each hidden node receives the four inputs multiplied by their respective weights. The sum of
those products is again added to a bias (which forces at least some activations to happen),
and the result is passed through the activation algorithm producing one output for each
hidden node.
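The computation just described can be sketched in a few lines of NumPy. The sizes (4 visible nodes, 3 hidden nodes) follow the example in the text, while the random weights and zero biases are illustrative stand-ins for learned parameters.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=4).astype(float)   # 4 visible (input) nodes
W = rng.normal(size=(4, 3))                    # 4 x 3 = 12 weights altogether
b_hidden = np.zeros(3)                         # one bias per hidden node

# Each hidden node sums its weighted inputs, adds its bias,
# and passes the result through the activation function.
hidden_activations = sigmoid(v @ W + b_hidden)
print(hidden_activations)                      # one output per hidden node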
Training of Restricted Boltzmann Machine
The training of the Restricted Boltzmann Machine differs from the training of regular neural
networks via stochastic gradient descent.
The two main training steps are:
Gibbs Sampling
The first part of the training is called Gibbs sampling. Given an input vector v, we use p(h|v)
to predict the hidden values h. Knowing the hidden values, we then use p(v|h) to predict new
input values v. This process is repeated k times. After k iterations, we obtain another input
vector v_k, which was recreated from the original input values v_0.
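A minimal NumPy sketch of this procedure with k = 1 (often called CD-1) is shown below. The contrastive-divergence weight update that follows the Gibbs sampling step is included for completeness, using the standard rule that subtracts the reconstruction statistics from the data statistics; the learning rate and array sizes in the usage lines are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_visible, b_hidden, lr=0.1, rng=None):
    """One contrastive-divergence (k = 1) update for a binary RBM."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: p(h|v0) and a binary sample of the hidden units.
    ph0 = sigmoid(v0 @ W + b_hidden)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visible units, then recompute p(h|v).
    pv1 = sigmoid(h0 @ W.T + b_visible)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_hidden)
    # Contrastive-divergence update: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_visible += lr * (v0 - v1)
    b_hidden += lr * (ph0 - ph1)
    return v1

# Toy usage: 6 visible units, 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))
bv, bh = np.zeros(6), np.zeros(3)
v0 = rng.integers(0, 2, size=6).astype(float)
cd1_step(v0, W, bv, bh, rng=rng)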

The analysis of hidden factors is performed in a binary way, i.e., the user only states whether
they liked a specific movie (rating 1) or not (rating 0), and this represents the inputs for the
input/visible layer. Given the inputs, the RBM then tries to discover the latent factors in the
data that can explain the movie choices, and each hidden neuron represents one of those latent
factors.

Let us consider the following example where a user likes Lord of the Rings and Harry Potter
but does not like The Matrix, Fight Club and Titanic. The Hobbit has not been seen yet so it
gets a -1 rating. Given these inputs, the Boltzmann Machine may identify three hidden factors
Drama, Fantasy and Science Fiction which correspond to the movie genres.

After the training phase, the goal is to predict a binary rating for the movies that have not
been seen yet. Given the training data of a specific user, the network is able to identify the
latent factors based on the user's preferences, and a sample from a Bernoulli distribution can be
used to find out which of the visible neurons now become active.

Generative Adversarial Networks (GANs)


Generative Adversarial Networks (GANs) were introduced in 2014 by Ian J. Goodfellow and
co-authors. GANs perform unsupervised learning tasks in machine learning. A GAN consists of two
models that automatically discover and learn the patterns in the input data.
The two models are known as Generator and Discriminator.
They compete with each other to scrutinize, capture, and replicate the variations within a
dataset. GANs can be used to generate new examples that plausibly could have been drawn
from the original dataset.

The figure shown below is an example of a GAN. There is a database that holds real 100-rupee
notes. The generator neural network generates fake 100-rupee notes, and the discriminator
network helps identify the real and fake notes.

Generator:
A Generator in a GAN is a neural network that creates the fake data on which the discriminator
is trained. It learns to generate plausible data. The generated examples/instances become
negative training examples for the discriminator. The Generator takes a fixed-length random
vector carrying noise as input and generates a sample from it.

The main aim of the Generator is to make the discriminator classify its output as real. The
part of the GAN that trains the Generator includes:
• Noisy Input Vector
• Generator network, which transforms the random input into a data instance
• Discriminator network, which classifies the generated data
• Generator loss, which penalizes the Generator for failing to fool the discriminator

The backpropagation method is used to adjust each weight in the right direction by
calculating the weight's impact on the output. It is also used to obtain gradients and these
gradients can help change the generator weights.

Discriminator:
The Discriminator is a neural network that distinguishes real data from the fake data created by
the Generator. The discriminator's training data comes from two different sources:
The real data instances, such as real pictures of birds, humans, currency notes, etc., are
used by the Discriminator as positive samples during training.
The fake data instances created by the Generator are used as negative examples during the
training process.

While training the discriminator, it connects to two loss functions. During discriminator
training, the discriminator ignores the generator loss and just uses the discriminator loss.
In the process of training the discriminator, the discriminator classifies both real data and
fake data from the generator. The discriminator loss penalizes the discriminator for
misclassifying a real data instance as fake or a fake data instance as real.
The discriminator updates its weights through backpropagation from the discriminator loss
through the discriminator network.

GANs consists of two neural networks. There is a Generator G(x) and a Discriminator D(x).
Both of them play an adversarial game. The generator's aim is to fool the discriminator by
producing data that are similar to those in the training set. The discriminator will try not to
be fooled by identifying fake data from real data. Both of them work simultaneously to learn
and train complex data like audio, video, or image files.

The Generator network takes a sample and generates a fake sample of data. The Generator
is trained to increase the Discriminator network's probability of making mistakes.

Below is an example of a GAN trying to identify if the 100 rupee notes are real or fake. So,
first, a noise vector or the input vector is fed to the Generator network. The generator creates
fake 100 rupee notes. The real images of 100 rupee notes stored in a database are passed to
the discriminator along with the fake notes. The Discriminator then evaluates the notes,
classifying them as real or fake.
We train the model, calculate the loss function at the end of the discriminator network, and
backpropagate the loss into both discriminator and generator models.

The mathematical objective for training a GAN can be represented as a minimax game between G and D:

min_G max_D V(D, G) = E_{x ~ p_data(x)}[ log D(x) ] + E_{z ~ p_z(z)}[ log(1 - D(G(z))) ]

Here,
G = Generator
D = Discriminator
p_data(x) = distribution of the real data
p_z(z) = distribution of the noise input to the Generator
x = sample drawn from p_data(x)
z = sample drawn from p_z(z)
D(x) = output of the Discriminator network (the probability that x is real)
G(z) = output of the Generator network for the noise sample z
With this understanding of the objective, the next step is to look at how a GAN is actually trained.
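A minimal PyTorch sketch of this adversarial training loop is shown below. The network architectures, latent dimension, learning rates, and the 784-dimensional (flattened 28x28 image) data format are illustrative assumptions, not part of the notes; the point is the alternation between a discriminator update and a generator update.

import torch
import torch.nn as nn

latent_dim = 64
# Toy generator and discriminator for flattened 28x28 images (784 values).
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch):
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # 1) Train the discriminator on real data (positive samples)
    #    and on generator output (negative samples).
    z = torch.randn(batch_size, latent_dim)
    fake_batch = G(z).detach()          # detach: do not update G in this step
    d_loss = loss_fn(D(real_batch), real_labels) + loss_fn(D(fake_batch), fake_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # 2) Train the generator: it is rewarded when D classifies its output as real.
    z = torch.randn(batch_size, latent_dim)
    g_loss = loss_fn(D(G(z)), real_labels)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()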

Applications
• Image Generation and Synthesis: One of the most popular applications of GANs is in
generating photorealistic images. This includes creating artwork, generating faces of non-
existent people, and synthesizing scenes for movies or video games.
• Data Augmentation: In machine learning, having a large and varied dataset is crucial
for training robust models. GANs can augment existing datasets by generating new,
synthetic samples, which is especially useful when data collection is challenging or
expensive.
• Style Transfer: GANs can modify the style of an image while retaining its content, such
as turning a daytime photo into a nighttime one, or converting photographs into the
style of famous paintings.
• Super-Resolution: GANs are used to increase the resolution of images (known as
super-resolution), enhancing the quality of low-resolution images by filling in missing
details in a realistic manner.
• Medical Image Analysis: In healthcare, GANs are used for generating medical images
for training and research purposes, improving the quality of medical scans, and even
assisting in the creation of 3D models of organs.
• Drug Discovery: GANs can be used to generate molecular structures for new potential
drugs, accelerating the drug discovery process by predicting properties of drug-like
compounds.
• Text-to-Image Synthesis: GANs can convert textual descriptions into corresponding
images, which has applications in areas like content creation and assisting artists.
• Face Aging and De-aging: They are capable of realistically altering facial features to
show how a person might look at different ages, which has applications in fields like
entertainment and finding missing persons.
• Voice Generation: Although primarily visual, GANs can also be adapted for audio
purposes, such as generating realistic human speech or music composition.
• Fashion and Design: In the fashion industry, GANs are used for designing clothing,
generating new fashion styles, and virtual try-ons.
• Video Generation and Editing: GANs can create realistic video footage and are used in
video editing to alter or generate video scenes, which has implications in filmmaking
and content creation.
• Anomaly Detection: In sectors like cybersecurity and quality control, GANs are
employed for anomaly detection by learning to generate normal operation data and
identifying deviations from this norm.

Autoencoders
Autoencoders are a type of deep learning algorithm designed to receive an input and transform
it into a different representation. They play an important part in image reconstruction.
Autoencoders are very useful in the field of unsupervised machine learning. You can use
them to compress the data and reduce its dimensionality.

The main difference between Autoencoders and Principal Component Analysis (PCA) is that
while PCA finds the directions along which you can project the data with maximum variance,
Autoencoders reconstruct the original input given just a compressed version of it.
Anyone who needs the original data can reconstruct an approximation of it from the compressed
data using an autoencoder.

Architecture
An Autoencoder is a type of neural network that can learn to reconstruct images, text, and
other data from compressed versions of themselves.
An Autoencoder consists of three layers:
• Encoder
• Code
• Decoder

The Encoder layer compresses the input image into a latent space representation. It
encodes the input image as a compressed representation in a reduced dimension.
The compressed image is a distorted version of the original image.
The Code layer represents the compressed input fed to the decoder layer.
The decoder layer decodes the encoded image back to the original dimension. The decoded
image is reconstructed from the latent space representation and is a lossy reconstruction of
the original image.
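A minimal sketch of this encoder-code-decoder structure in Keras is given below, assuming flattened 784-dimensional inputs (e.g. 28x28 images) and a 32-dimensional code; the layer sizes are illustrative choices, not fixed by the notes.

from tensorflow.keras import layers, models

input_dim, code_dim = 784, 32          # e.g. flattened 28x28 images, 32-d bottleneck

inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(128, activation="relu")(inputs)      # encoder
code = layers.Dense(code_dim, activation="relu")(encoded)   # code / bottleneck
decoded = layers.Dense(128, activation="relu")(code)        # decoder
outputs = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")            # MSE reconstruction loss
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)  # input is also the target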

Training Autoencoders
First, the code or bottleneck size is the most critical hyperparameter to tune the autoencoder.
It decides how much data has to be compressed. It can also act as a regularisation term.
Secondly, it's important to remember that the number of layers is critical when tuning
autoencoders. A higher depth increases model complexity, but a lower depth is faster to
process.
Thirdly, you should pay attention to how many nodes you use per layer. The number of nodes
decreases with each subsequent layer in the autoencoder as the input to each layer becomes
smaller across the layers.
Finally, it's worth noting that two losses are commonly used for reconstruction: MSE loss and
L1 loss.
Types of Autoencoders
Undercomplete Autoencoders
An undercomplete autoencoder is an unsupervised neural network that you can use to
generate a compressed version of the input data.
This is done by taking in an image and trying to predict the same image as the output, thus
reconstructing the image from its compressed bottleneck region.
The primary use for autoencoders like these is generating a latent space or bottleneck, which
forms a compressed substitute of the input data and can be easily decompressed back with
the help of the network when needed.

Sparse Autoencoders
Sparse autoencoders are controlled by changing the number of nodes at each hidden layer.
Since it is impossible to design a neural network with a flexible number of nodes at its hidden
layers, sparse autoencoders work by penalizing the activation of some neurons in hidden
layers.
It means that a penalty directly proportional to the number of neurons activated is applied to
the loss function.
As a means of regularizing the neural network, the sparsity function prevents more neurons
from being activated.

There are two types of regularizers used:


• The L1 loss method is a general regularizer we can use to penalize the absolute magnitude of the activations.
• The KL-divergence method considers the activations over a collection of samples at
once rather than summing them as in the L1 Loss method. We constrain the average
activation of each neuron over this collection.
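As a small illustration of the L1 option, Keras lets you attach the penalty directly to the bottleneck layer as an activity regularizer; the layer size and penalty strength below are illustrative assumptions. The KL-divergence variant has no built-in Keras regularizer and would require a custom regularizer that compares each neuron's average activation over a batch to a target sparsity level.

from tensorflow.keras import layers, regularizers

# L1 sparsity penalty: the loss grows with the absolute magnitude of the activations,
# so only a few hidden neurons remain strongly active for any given input.
sparse_code = layers.Dense(
    64,
    activation="relu",
    activity_regularizer=regularizers.l1(1e-5),
)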

Contractive Autoencoders
The input is passed through a bottleneck in a contractive autoencoder and then
reconstructed in the decoder. The bottleneck function is used to learn a representation of the
image while passing it through.
The contractive autoencoder also has a regularization term to prevent the network from
learning the identity function and mapping input into output.
To train a model that works under this constraint, we need to ensure that the derivatives
of the hidden layer activations are small with respect to the input.

Denoising Autoencoders
Denoising autoencoders are similar to regular autoencoders in that they take an input and
produce an output. However, they differ because they don't have the input image as their
ground truth. Instead, they use a noisy version.
This is because removing image noise is difficult when working with images directly; you would
have to do it manually. With a denoising autoencoder, we instead feed the noisy image into the
network and let it map the image onto a lower-dimensional manifold, where filtering out the
noise becomes much more manageable.
The loss function usually used with these networks is L2 or L1 loss.
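As a sketch of the training setup (reusing the autoencoder model from the architecture example above), the only change for denoising is that the network receives a corrupted copy of the data while the clean data remains the training target; x_train is an assumed array of clean inputs scaled to [0, 1].

import numpy as np

# Corrupt the clean inputs with Gaussian noise, keeping values in [0, 1].
noise = np.random.normal(loc=0.0, scale=0.3, size=x_train.shape)
x_noisy = np.clip(x_train + noise, 0.0, 1.0)

# Noisy data in, clean data as the ground-truth target.
autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=128)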

Variational Autoencoders
Variational autoencoders (VAEs) are models that address a specific problem with standard
autoencoders. When you train an autoencoder, it learns to represent the input just in a
compressed form called the latent space or the bottleneck. However, this latent space formed
after training is not necessarily continuous and, in effect, might not be easy to interpolate.

Variational autoencoders deal with this specific topic and express their latent attributes as
a probability distribution, forming a continuous latent space that can be easily sampled and
interpolated.

Applications of Autoencoders:
Autoencoders, a type of neural network architecture, are widely used for their ability to learn
efficient representations of data. Here are some of their key applications:
• Dimensionality Reduction: Autoencoders can reduce the dimensionality of data by
learning a compressed representation in the hidden layers. This is similar to PCA but
with non-linear transformations, making it more powerful for complex datasets.
• Feature Learning: They can learn to encode the input into a set of features in an
unsupervised manner. This feature learning is useful in pretraining neural networks,
especially when labeled data is scarce.
• Denoising: Denoising autoencoders are designed to remove noise from data. They are
trained to reconstruct a clean version from a noisy input, learning to capture the
important features while ignoring the noise.
• Anomaly Detection: Autoencoders can be used to detect anomalies in data. They learn
to represent normal data well, but they will struggle to reconstruct anomalies, which
can then be detected by measuring the reconstruction error.
• Image Processing: In image processing, autoencoders are used for tasks like image
denoising, super-resolution, and colorization. They can, for instance, take a black and
white image and produce a colorized version of it.
• Data Generation: Variational autoencoders (VAEs), a variant, can generate new data
similar to the input data. This is used in creating new images, music, or text that
resemble the training dataset.
• Sequence-to-sequence Modeling: Autoencoders can be adapted for sequence-to-
sequence tasks, such as in natural language processing for tasks like sentence
encoding and machine translation.
• Representation Learning for Text: They can learn efficient representations of text,
which can be used in various NLP tasks such as sentiment analysis, topic modeling,
or document clustering.
• Drug Discovery: In bioinformatics, autoencoders can be used for encoding molecular
structures and predicting the properties of new molecules, aiding in the drug
discovery process.
• Recommender Systems: Autoencoders can be used to learn user preferences and
make recommendations. They are particularly good at handling sparse data, which
is common in recommendation systems.

Transfer Learning
In transfer learning, the knowledge of an already trained machine learning model is applied
to a different but related problem. For example, if you trained a simple classifier to predict
whether an image contains a backpack, you could use the knowledge that the model gained
during its training to recognize other objects like sunglasses.
With transfer learning, we basically try to exploit what has been learned in one task to
improve generalization in another. We transfer the weights that a network has learned at
“task A” to a new “task B.”
The general idea is to use the knowledge a model has learned from a task with a lot of available
labeled training data in a new task that doesn't have much data. Instead of starting the
learning process from scratch, we start with patterns learned from solving a related task.
Transfer learning is mostly used in computer vision and natural language processing tasks
like sentiment analysis due to the huge amount of computational power required.
Transfer learning isn't really a machine learning technique on its own, but can be seen as a "design
methodology" within the field, much like active learning. It is also not an exclusive part or
study area of machine learning. Nevertheless, it has become quite popular in combination
with neural networks that require huge amounts of data and computational power.

How Transfer Learning Works


In computer vision, for example, neural networks usually try to detect edges in the earlier
layers, shapes in the middle layer and some task-specific features in the later layers. In
transfer learning, the early and middle layers are used and we only retrain the latter layers.
It helps leverage the labeled data of the task it was initially trained on.
Let’s go back to the example of a model trained for recognizing a backpack on an image, which
will be used to identify sunglasses. In the earlier layers, the model has learned to recognize
objects, because of that we will only retrain the latter layers so it will learn what separates
sunglasses from other objects.

In transfer learning, we try to transfer as much knowledge as possible from the previous task
the model was trained on to the new task at hand. This knowledge can be in various forms
depending on the problem and the data. For example, it could be how models are composed,
which allows us to more easily identify novel objects.

Why Transfer Learning is Used


Transfer learning has several benefits, but the main advantages are saving training time,
better performance of neural networks (in most cases), and not needing a lot of data.
Usually, a lot of data is needed to train a neural network from scratch, but access to that data
isn't always available; this is where transfer learning comes in handy. With transfer learning,
a solid machine learning model can be built with comparatively little training data because
the model is already pre-trained. This is especially valuable in natural language processing,
where expert knowledge is mostly required to create large labeled data sets. Additionally,
training time is reduced because it can sometimes take days or even weeks to train a deep
neural network from scratch on a complex task.

Approaches in Transfer Learning


Training A Model To Reuse It
Imagine you want to solve task A but don’t have enough data to train a deep neural
network. One way around this is to find a related task B with an abundance of data. Train
the deep neural network on task B and use the model as a starting point for solving task A.
Whether you'll need to use the whole model or only a few layers depends heavily on the
problem you're trying to solve.
If you have the same input in both tasks, possibly reusing the model and making predictions
for your new input is an option. Alternatively, changing and retraining different task-specific
layers and the output layer is a method to explore.

Using A Pre-Trained Model


The second approach is to use an already pre-trained model. There are a lot of these models
out there, so make sure to do a little research. How many layers to reuse and how many to
retrain depends on the problem.
Keras, for example, provides numerous pre-trained models that can be used for transfer
learning, prediction, feature extraction and fine-tuning. You can find these models, along with
brief tutorials on how to use them, in the Keras applications documentation. Many research
institutions also release trained models.
This type of transfer learning is most commonly used throughout deep learning.

Feature Extraction
Another approach is to use deep learning to discover the best representation of your problem,
which means finding the most important features. This approach is also known as
representation learning, and can often result in a much better performance than can be
obtained with hand-designed representation.

In traditional machine learning, features are usually hand-crafted manually by researchers and domain
experts. Fortunately, deep learning can extract features automatically. Of course, this doesn't
mean feature engineering and domain knowledge aren't important anymore; you still have
to decide which features you put into your network. That said, neural networks have the
ability to learn which features are really important and which ones aren’t. A representation
learning algorithm can discover a good combination of features within a very short timeframe,
even for complex tasks which would otherwise require a lot of human effort.
The learned representation can then be used for other problems as well. Simply use the first
layers to spot the right representation of features, but don’t use the output of the network
because it is too task-specific. Instead, feed data into your network and use one of the
intermediate layers as the output layer. This layer can then be interpreted as a representation
of the raw data.
This approach is mostly used in computer vision because it can reduce the size of the data
representation, which decreases computation time and also makes the data more suitable for
traditional algorithms.
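A small sketch of this idea is given below, assuming a pre-trained VGG16 from Keras as the feature extractor and scikit-learn's logistic regression as the "traditional algorithm"; the arrays images and labels are hypothetical stand-ins for your own dataset.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.linear_model import LogisticRegression

# Pre-trained VGG16 without its task-specific classification head; global average
# pooling turns the last convolutional feature maps into one 512-d vector per image.
base = VGG16(weights="imagenet", include_top=False, pooling="avg",
             input_shape=(224, 224, 3))
base.trainable = False

# images: assumed array of shape (num_samples, 224, 224, 3); labels: assumed class ids.
features = base.predict(preprocess_input(images), batch_size=32)

# The learned representation now feeds a traditional algorithm.
clf = LogisticRegression(max_iter=1000).fit(features, labels)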

Transfer Learning with Inception Model


The Inception Model, also known as GoogLeNet, is a deep convolutional neural network
architecture that was introduced by Google. It's known for its efficiency and high accuracy,
particularly in image classification tasks.
Design Philosophy: The Inception Model employs a 'network within a network' design with
multiple convolutional layers and pooling layers within each block, leading to improved
computational efficiency and reduced overfitting.
Key Features of the Inception Model
• Multiple Convolutional Filters: Uses various-sized filters (like 1x1, 3x3, 5x5) in the
same layer to capture features at different scales.
• Dimensionality Reduction: Employs 1x1 convolutions to reduce dimensionality,
helping in managing computational resources.
• Inception Modules: Contains modules that perform several convolutions in parallel
and concatenate the outputs.

Transfer Learning with Inception


• Pre-Trained Models: Inception models pre-trained on large datasets such as
ImageNet are readily available. These models have learned robust feature
representations for a wide range of images.
• Adaptation for New Tasks:
o Feature Extraction: The lower and mid-level layers of the Inception Model,
which have learned to identify general features, can be reused for new tasks.
These layers are typically frozen during training.
o Fine-Tuning: The higher layers, closer to the output, can be fine-tuned for the
specific new task by continuing the training process. This may involve
replacing the top layers of the network with new layers tailored to the new
task.

Process of Transfer Learning with Inception


• Select a Pre-Trained Inception Model: Obtain a version of the Inception model that
has been pre-trained on a comprehensive dataset.
• Customize for Target Task: Depending on the similarity of the new task to the
original one, adjust the architecture as needed. This often involves modifying or
replacing the final fully connected layers.
• Freeze Early Layers: Prevent the weights of the early layers from being updated
during training. This preserves the generic features learned.
• Fine-Tune Later Layers: Train the later layers on the new dataset, allowing them to
learn features specific to the new task.
• Optimize and Train: Optimize hyperparameters for the new task and train the
modified model on the new dataset.
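The process above can be sketched in Keras roughly as follows, using an InceptionV3 base pre-trained on ImageNet. The number of classes, the size of the new head, the learning rates, and the choice of how many layers to unfreeze are illustrative assumptions, and train_ds/val_ds stand for datasets you would supply.

from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import InceptionV3

num_classes = 5   # hypothetical number of classes in the new task

# 1) Select a pre-trained Inception model (ImageNet weights), without its top layers.
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

# 2) Customize for the target task: add a new classification head.
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = models.Model(base.input, outputs)

# 3) Freeze the early (pre-trained) layers so their generic features are preserved.
base.trainable = False
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)   # train the new head

# 4) Fine-tune later layers: unfreeze the last blocks and retrain with a small learning rate.
for layer in base.layers[-30:]:
    layer.trainable = True
model.compile(optimizer=optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)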

Applications
• Custom Image Classification: Adapting the Inception model for specialized image
classification tasks, like identifying specific types of objects or diseases in medical
images.
• Object Localization and Detection: Modifying the model for detecting the location
and type of multiple objects within an image.
• Image Segmentation: Adapting for segmentation tasks, where the goal is to partition
the image into segments representing different objects or regions.
