BY: SURBHI SAROHA
 DIMENSIONALITY REDUCTION:
 Linear (PCA, LDA) and manifolds,
 metric learning – Autoencoders and dimensionality reduction in networks
 - Introduction to ConvNets - Architectures: AlexNet, VGG, Inception, ResNet
 - Training a ConvNet: weight initialization,
 batch normalization,
 hyperparameter optimization
 Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible.
 This can be done for a variety of reasons, such as to reduce the complexity of a
model, to improve the performance of a learning algorithm, or to make it easier to
visualize the data.
 There are several techniques for dimensionality reduction, including principal
component analysis (PCA), singular value decomposition (SVD), and linear
discriminant analysis (LDA).
 Each technique uses a different method to project the data onto a lower-
dimensional space while preserving important information.
 Dimensionality reduction is a technique used to reduce the number of features in
a dataset while retaining as much of the important information as possible.
 In other words, it is a process of transforming high-dimensional data into a lower-
dimensional space that still preserves the essence of the original data.
 In machine learning, high-dimensional data refers to data with a large number of
features or variables.
 The curse of dimensionality is a common problem in machine learning, where the
performance of the model deteriorates as the number of features increases.
 This is because the complexity of the model increases with the number of features,
and it becomes more difficult to find a good solution.
 In addition, high-dimensional data can also lead to overfitting, where the model
fits the training data too closely and does not generalize well to new data.
 Dimensionality reduction can help to mitigate these problems by reducing the
complexity of the model and improving its generalization performance. There are
two main approaches to dimensionality reduction: feature selection and feature
extraction.
 Feature Selection:
Feature selection involves selecting a subset of the original features that are most
relevant to the problem at hand. The goal is to reduce the dimensionality of the
dataset while retaining the most important features. There are several methods
for feature selection, including filter methods, wrapper methods, and embedded
methods. Filter methods rank the features based on their relevance to the target
variable, wrapper methods use the model's performance as the criterion for selecting
features, and embedded methods combine feature selection with the model
training process.
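As a concrete illustration of a filter method, here is a minimal sketch (assuming scikit-learn is available; the dataset and the choice of k are arbitrary) that ranks features by their ANOVA F-score against the target and keeps the top two:

```python
# Filter-method feature selection: rank features, keep the best k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 4 original features

selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)        # (150, 4) -> (150, 2)
print("Per-feature scores:", selector.scores_)
```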
 Feature Extraction:
Feature extraction involves creating new features by combining or transforming
the original features.
 The goal is to create a set of features that captures the essence of the original
data in a lower-dimensional space.
 There are several methods for feature extraction, including principal component
analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic
neighbor embedding (t-SNE). PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving as much of the variance
as possible.
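The PCA projection described above can be sketched in a few lines (assuming scikit-learn; the dataset and the number of components are arbitrary choices):

```python
# Feature extraction with PCA: project 64-D digits onto 2 components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per sample

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)             # (1797, 64) -> (1797, 2)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```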
 There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset that can be used to model the problem. It
usually involves three approaches:
 Filter
 Wrapper
 Embedded
 Feature extraction: This reduces the data in a high-dimensional space to a lower-
dimensional space, i.e., a space with fewer dimensions.
 Principal Component Analysis (PCA) was introduced by Karl Pearson.
 It works on the principle that when data in a higher-dimensional space is mapped
to a lower-dimensional space, the variance of the data in the lower-dimensional
space should be maximized.
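A from-scratch sketch of this maximum-variance idea (NumPy only; the data here is synthetic): center the data, eigendecompose the covariance matrix, and project onto the directions with the largest eigenvalues.

```python
# PCA from scratch: the kept directions maximize the projected variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # 200 samples, 5 features

Xc = X - X.mean(axis=0)                      # 1. center the data
C = np.cov(Xc, rowvar=False)                 # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # 3. eigendecomposition

order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
W = eigvecs[:, order[:2]]                    # top-2 principal directions
Z = Xc @ W                                   # 4. project to 2-D

# Variance along each kept direction equals the corresponding eigenvalue.
print(Z.var(axis=0, ddof=1))
print(eigvals[order[:2]])
```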
 Dimensionality reduction helps in data compression, and hence reduces storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
 Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or 3D,
which can help in better understanding and analysis.
 Overfitting Prevention: High dimensional data may lead to overfitting in machine
learning models, which can lead to poor generalization performance.
Dimensionality reduction can help in reducing the complexity of the data, and
hence prevent overfitting.
 Feature Extraction: Dimensionality reduction can help in extracting important
features from high dimensional data, which can be useful in feature selection for
machine learning models.
 Data Preprocessing: Dimensionality reduction can be used as a preprocessing step
before applying machine learning algorithms to reduce the dimensionality of the
data and hence improve the performance of the model.
 Improved Performance: Dimensionality reduction can help in improving the
performance of machine learning models by reducing the complexity of the data,
and hence reducing the noise and irrelevant information in the data.
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes undesirable.
 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of
thumb are applied.
 Interpretability: The reduced dimensions may not be easily interpretable, and it may be difficult
to understand the relationship between the original features and the reduced dimensions.
 Overfitting: In some cases, dimensionality reduction may lead to overfitting, especially when the
number of components is chosen based on the training data.
 Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to outliers, which
can result in a biased representation of the data.
 Computational complexity: Some dimensionality reduction techniques, such as manifold
learning, can be computationally intensive, especially when dealing with large datasets.
 Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning for solving classification problems with
more than two classes. It is also known as Normal Discriminant Analysis (NDA)
or Discriminant Function Analysis (DFA).
 It can be used to project features from a higher-dimensional space into a lower-
dimensional space, reducing resource and computational costs.
 Although the logistic regression algorithm is limited to two-class problems, Linear
Discriminant Analysis is applicable to classification problems with more than two
classes.
 Linear Discriminant Analysis is one of the most popular dimensionality reduction
techniques used for supervised classification problems in machine learning.
 It is also used as a pre-processing step in machine learning and pattern-
classification applications.
 Whenever two or more classes with multiple features need to be separated
efficiently, Linear Discriminant Analysis is the most common technique for solving
such classification problems. For example, if we have two classes with multiple
features and classify them using a single feature, the classes may overlap; LDA
finds a projection that separates them better.
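A minimal LDA sketch (assuming scikit-learn; the three-class iris dataset is an arbitrary choice) that projects multi-feature data onto the discriminant axes:

```python
# LDA: supervised projection onto at most (n_classes - 1) axes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 3 classes, 4 features

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)              # uses the labels, unlike PCA

print(X.shape, "->", X_lda.shape)            # (150, 4) -> (150, 2)
print("Training accuracy:", lda.score(X, y))
```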
 At the heart of deep learning lies the neural network, an intricate interconnected
system of nodes that mimics the human brain’s neural architecture.
 Neural networks excel at discerning intricate patterns and representations within
vast datasets, allowing them to make predictions, classify information, and
generate novel insights.
 Autoencoders emerge as a fascinating subset of neural networks, offering a unique
approach to unsupervised learning.
 Autoencoders are an adaptable and powerful class of architectures in the dynamic
field of deep learning, where neural networks constantly evolve to identify
complicated patterns and representations.
 With their ability to learn effective representations of data, these unsupervised
learning models have received considerable attention and are useful in a wide
variety of areas, from image processing to anomaly detection.
 Autoencoders are a specialized class of algorithms that can learn efficient
representations of input data with no need for labels.
 They are a class of artificial neural networks designed for unsupervised learning.
 Learning to compress and effectively represent input data without explicit labels
is the essential principle of an autoencoder.
 This is accomplished using a two-fold structure that consists of an encoder and a
decoder.
 The encoder transforms the input data into a reduced-dimensional representation,
which is often referred to as “latent space” or “encoding”.
 From that representation, a decoder rebuilds the initial input.
 This process of encoding and decoding forces the network to learn meaningful
patterns in the data and to identify its essential features.
 The general architecture of an autoencoder includes an encoder, a decoder, and a
bottleneck layer.
 Encoder
 The input layer takes the raw input data.
 The hidden layers progressively reduce the dimensionality of the input, capturing
important features and patterns. These layers compose the encoder.
 The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is
significantly reduced.
 This layer represents the compressed encoding of the input data.
 Decoder
 The decoder takes the encoded representation from the bottleneck layer and expands it
back to the dimensionality of the original input.
 The hidden layers progressively increase the dimensionality and aim to reconstruct the
original input.
 The output layer produces the reconstructed output, which ideally should be as close as
possible to the input data.
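The architecture above can be sketched in a few lines (assuming PyTorch; the layer sizes are arbitrary choices, with a 784-dimensional input and a 32-dimensional bottleneck):

```python
# Minimal autoencoder: encoder -> bottleneck -> decoder.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: progressively reduce dimensionality to the bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),      # bottleneck / latent space
        )
        # Decoder: expand the encoding back to the input dimensionality.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # compressed representation
        return self.decoder(z)               # reconstruction of the input

model = AutoEncoder()
x = torch.randn(16, 784)                     # a dummy mini-batch
loss = nn.MSELoss()(model(x), x)             # reconstruction loss
loss.backward()
```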
 These are some groundbreaking CNN architectures that were proposed to achieve
better accuracy and to reduce the computational cost.
 AlexNet
 This network was very similar to LeNet-5 but deeper, with 8 layers, more
filters, stacked convolutional layers, max pooling, dropout, data augmentation,
ReLU activations and SGD.
 AlexNet was the winner of the ImageNet ILSVRC-2012 competition, designed by
Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton.
 It was trained on two Nvidia GeForce GTX 580 GPUs; therefore, the network was
split into two pipelines.
 AlexNet has 5 convolutional layers and 3 fully connected layers, and consists of
approximately 60 million parameters. A major drawback of this network was that it
had too many hyper-parameters.
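The parameter count is easy to verify (a quick sketch assuming torchvision is installed):

```python
# Sanity-check AlexNet's size with torchvision's reference implementation.
from torchvision import models

alexnet = models.alexnet()                   # untrained AlexNet
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"AlexNet parameters: {n_params / 1e6:.1f} M")   # roughly 61 M
```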
 The major shortcoming of AlexNet's many hyper-parameters was addressed by VGG
Net, which replaced the large kernel-sized filters (11×11 and 5×5 in the first and second
convolution layers, respectively) with multiple 3×3 kernel-sized filters one after another.
 The architecture, developed by Simonyan and Zisserman, was the first runner-up of the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014.
 The architecture consists of 3×3 convolutional filters with a stride of 1 and "same"
padding to preserve spatial dimensions, and 2×2 max-pooling layers with a stride of 2.
 In total, there are 16 layers in the network; the input is an RGB image of
dimension 224×224×3, followed by 5 blocks of convolutions (filters: 64, 128,
256, 512, 512), each followed by max pooling.
 The output of these layers is fed into three fully connected layers and a softmax
function in the output layer.
 In total there are 138 million parameters in VGG Net.
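A back-of-the-envelope sketch of why stacked 3×3 kernels help (plain Python; the channel count C is an arbitrary choice): two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters.

```python
# Parameter count (ignoring biases) for the same 5x5 receptive field.
C = 64                                       # input and output channels

params_one_5x5 = 5 * 5 * C * C               # single 5x5 conv layer
params_two_3x3 = 2 * (3 * 3 * C * C)         # two stacked 3x3 conv layers

print(params_one_5x5, params_two_3x3)        # 102400 vs 73728
```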
 Drawbacks of VGG Net:
 1. Long training time
 2. Heavy model
 3. Computationally expensive
 4. Vanishing/exploding gradient problem
 The Inception network, also known as GoogLeNet, was proposed by developers at
Google in "Going Deeper with Convolutions" in 2014.
 The motivation for InceptionNet comes from the fact that salient parts of an
image can have large variation in size.
 Because of this, selecting the right kernel size becomes extremely difficult: big
kernels are suited to globally distributed features and small kernels to locally
distributed features.
 InceptionNet resolves this by stacking multiple kernels at the same level.
 Typically it applies 5×5, 3×3 and 1×1 filters in parallel, as sketched below.
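Here is a minimal Inception-style module (assuming PyTorch; the channel counts are arbitrary, and the 1×1 bottleneck reductions of the full GoogLeNet block are omitted for brevity):

```python
# Parallel 1x1, 3x3 and 5x5 convolutions at the same level, concatenated.
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)

    def forward(self, x):
        # Each branch preserves spatial size; outputs stack along channels.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

y = MiniInception(3)(torch.randn(1, 3, 32, 32))
print(y.shape)                               # torch.Size([1, 48, 32, 32])
```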
 ResNet, the winner of the ILSVRC-2015 competition, is a family of deep networks
with over 100 layers.
 Residual networks are similar to VGG nets in their sequential approach, but
they also use "skip connections" and batch normalization, which help train
deep layers without hampering performance.
 After VGG Nets, as CNNs grew deeper, they became hard to train because of the
vanishing gradient problem, which makes the derivatives vanishingly small.
 As a result, the overall performance saturates or even degrades.
 The idea of skip connections came from highway networks, where gated shortcut
connections were used.
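A minimal residual block sketch (assuming PyTorch; this is a simplified version of the basic ResNet block, without downsampling) showing the skip connection and batch normalization together:

```python
# Residual block: output = ReLU(F(x) + x), so gradients can flow via x.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)        # batch normalization
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # skip connection adds the input

y = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
print(y.shape)                               # torch.Size([1, 16, 8, 8])
```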
 While building and training neural networks, it is crucial to initialize the weights
appropriately to ensure a model with high accuracy.
 If the weights are not correctly initialized, it may give rise to the Vanishing Gradient
problem or the Exploding Gradient problem.
 Hence, selecting an appropriate weight initialization strategy is critical when training
DL models.
 The following notation should be kept in mind when studying weight
initialization techniques. It may vary across publications, but the form used here is
the most common, usually found in research papers.
 fan_in = number of input connections to the neuron
 fan_out = number of output connections from the neuron
 For example, for a neuron with three incoming connections and two outgoing
connections:
 fan_in = 3 (number of input connections to the neuron)
 fan_out = 2 (number of output connections from the neuron)
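This notation maps directly onto common initialization schemes; a small sketch (assuming PyTorch) for the same 3-in, 2-out layer:

```python
# fan_in/fan_out in practice: Xavier and He initialization.
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)                      # fan_in = 3, fan_out = 2
fan_out, fan_in = layer.weight.shape         # weight shape is (out, in)
print(fan_in, fan_out)                       # 3 2

# Xavier/Glorot: scales variance by both fan_in and fan_out.
nn.init.xavier_uniform_(layer.weight)

# He/Kaiming: scales variance by fan_in; suited to ReLU networks.
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
nn.init.zeros_(layer.bias)
```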
 Internal covariate shift is a major challenge encountered while training deep learning
models.
 Batch normalization was introduced to address this issue.
 This section covers the fundamentals of, and the need for, batch normalization.
 What is Batch Normalization?
 Batch normalization was introduced by Sergey Ioffe and Christian Szegedy in 2015 to
mitigate the internal covariate shift problem in neural networks.
 The normalization process involves calculating the mean and variance of each feature
in a mini-batch and then scaling and shifting the features using these statistics.
 This ensures that the input to each layer remains roughly in the same distribution,
regardless of changes in the distribution of earlier layers’ outputs.
 Consequently, Batch Normalization helps in stabilizing the training process, enabling
higher learning rates and faster convergence.
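The computation for one feature can be written out directly (a NumPy sketch; gamma and beta stand for the learnable scale and shift):

```python
# Batch normalization for a single feature over one mini-batch.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])           # one feature, batch of 4
gamma, beta, eps = 1.0, 0.0, 1e-5            # learnable scale and shift

mean = x.mean()                              # mini-batch mean
var = x.var()                                # mini-batch variance
x_hat = (x - mean) / np.sqrt(var + eps)      # normalize
y = gamma * x_hat + beta                     # scale and shift

print(y)                                     # ~[-1.34, -0.45, 0.45, 1.34]
```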
 Faster Convergence: Batch Normalization reduces internal covariate shift,
allowing for faster convergence during training.
 Higher Learning Rates: With Batch Normalization, higher learning rates can be
used without the risk of divergence.
 Regularization Effect: Batch Normalization introduces a slight regularization
effect that reduces the need for adding regularization techniques like dropout.
 What are Hyperparameters?
 Hyperparameters are the parameters that we set before training.
 Hyperparameters have a major impact on accuracy and efficiency while training
the model.
 Therefore, they need to be set carefully to obtain better and more efficient results.
 Hyperparameters are pre-established parameters that are not learned during the
training process. They control a machine learning model’s general behaviour,
including its architecture, regularisation strengths, and learning rates.
 The process of determining the ideal set of hyperparameters for a machine
learning model is known as hyperparameter optimization.
 Usually, strategies like grid search, random search, and more sophisticated ones
like genetic algorithms or Bayesian optimization are used to accomplish this.
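A minimal grid-search sketch (assuming scikit-learn; the model and the grid values are arbitrary choices):

```python
# Hyperparameter optimization by exhaustive grid search with CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```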