
Deep Learning: UNIT 3

Autoencoders and relation to PCA:


Autoencoders and Principal Component Analysis (PCA) are both techniques used for
dimensionality reduction and feature extraction, but they have key differences in how they
achieve this.

1. Principal Component Analysis (PCA)

PCA is a linear technique that finds an optimal set of orthogonal axes (principal components)
along which the data varies the most. It projects data onto these components to reduce
dimensionality while retaining as much variance as possible. PCA is mathematically
straightforward and uses Singular Value Decomposition (SVD) to compute the principal
components.

Linear transformation

Finds directions of maximum variance

Uses eigenvectors of the covariance matrix

Optimal in terms of minimizing reconstruction error for linear projections

Interpretable as rotations and projections in feature space

2. Autoencoders

Autoencoders are a type of neural network that learn efficient data representations in an
unsupervised manner. They consist of an encoder, which compresses input data into a lower-
dimensional latent space, and a decoder, which reconstructs the original data from this
compressed representation. Autoencoders can learn both linear and nonlinear transformations,
making them more flexible than PCA.

Can learn nonlinear mappings

Typically consist of multiple layers (deep autoencoders)

Minimize reconstruction error using neural network optimization techniques (e.g., backpropagation)

Can incorporate additional constraints like sparsity or denoising

Relation Between PCA and Autoencoders

A linear autoencoder (with a single hidden layer and linear activation functions) behaves
similarly to PCA. It finds a subspace that captures maximum variance, similar to PCA’s principal
components.
Unlike PCA, nonlinear autoencoders can capture more complex patterns in data by learning a
more flexible manifold structure.

Autoencoders, especially deep autoencoders, can learn hierarchical representations, which PCA cannot. PCA is deterministic and has a closed-form solution, while autoencoders require training with optimization methods.
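To make the relationship concrete, here is a minimal sketch (assuming PyTorch, NumPy, and scikit-learn are available; the data, dimensions, and training settings are purely illustrative) that fits a single-hidden-layer linear autoencoder and a 3-component PCA on the same centred data and compares their reconstruction errors:

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# Toy data: 500 samples, 10 features, centred (as PCA assumes)
X = np.random.randn(500, 10).astype(np.float32)
X -= X.mean(axis=0)

# PCA: closed-form, keeps the top 3 directions of maximum variance
pca = PCA(n_components=3).fit(X)
pca_recon = pca.inverse_transform(pca.transform(X))

# Linear autoencoder: one hidden layer, no activation function, no bias
encoder = nn.Linear(10, 3, bias=False)
decoder = nn.Linear(3, 10, bias=False)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
x = torch.from_numpy(X)

for step in range(2000):
    opt.zero_grad()
    recon = decoder(encoder(x))            # reconstruct from the 3-D code
    loss = ((recon - x) ** 2).mean()       # reconstruction error (MSE)
    loss.backward()
    opt.step()

# The two errors end up close: the learned 3-D subspace spans roughly the
# same space as the top 3 principal components, though the autoencoder's
# weight vectors need not be orthogonal or ordered by variance.
print(((pca_recon - X) ** 2).mean(), loss.item())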

Which One to Use?

Use PCA when you need a fast, interpretable, and optimal linear transformation.

Use Autoencoders when your data is complex and you suspect nonlinear structures that PCA
cannot capture.

Regularization In Autoencoders:
Regularization in autoencoders helps improve their generalization ability by preventing
overfitting and ensuring meaningful feature extraction. Various regularization techniques can
be applied to autoencoders, including:

1. L1 & L2 Regularization (Weight Decay)

L1 Regularization (Lasso) promotes sparsity in the weights, encouraging certain connections to be zero.

L2 Regularization (Ridge) prevents large weight values, leading to a more stable and smooth representation.
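A minimal sketch of how both penalties might be attached to an autoencoder's training step, assuming PyTorch; the architecture and penalty strengths are illustrative. L2 is passed to the optimizer as weight_decay, while L1 is added explicitly to the loss:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))

# L2 regularization (weight decay) handled directly by the optimizer
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

l1_lambda = 1e-5
x = torch.randn(32, 784)                     # a dummy mini-batch

opt.zero_grad()
recon = model(x)
loss = ((recon - x) ** 2).mean()             # reconstruction error
# L1 penalty: sum of absolute weight values pushes weights toward zero
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + l1_lambda * l1_penalty).backward()
opt.step()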

2. Sparse Autoencoders

Introduces a sparsity constraint on the hidden units using KL divergence or L1 regularization.

Ensures that only a subset of neurons activate, leading to efficient feature learning.

3. Denoising Autoencoders

Adds noise (e.g., Gaussian, salt-and-pepper) to the input and trains the network to reconstruct
the original clean data.

Encourages robustness and prevents the model from memorizing training data.

4. Contractive Autoencoders

Adds a penalty term on the Jacobian of the encoder to minimize sensitivity to small input
variations.

Forces the latent representation to be robust to slight changes in input.
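A hedged sketch of the contractive penalty for a single sigmoid encoder layer, assuming PyTorch; sizes and the penalty weight lam are illustrative. For h = sigmoid(Wx + b), the squared Frobenius norm of the encoder Jacobian dh/dx has the closed form used below:

import torch
import torch.nn as nn

enc = nn.Linear(784, 64)
dec = nn.Linear(64, 784)
lam = 1e-4
x = torch.randn(32, 784)

h = torch.sigmoid(enc(x))                     # hidden code
recon = dec(h)
recon_loss = ((recon - x) ** 2).mean()

# Jacobian of a sigmoid layer: J_ij = h_i (1 - h_i) * W_ij, so
# ||J||_F^2 = sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2
dh = (h * (1 - h)) ** 2                       # shape (batch, 64)
w_sq = (enc.weight ** 2).sum(dim=1)           # shape (64,)
contractive_penalty = (dh * w_sq).sum(dim=1).mean()

loss = recon_loss + lam * contractive_penalty
loss.backward()                               # gradients now include the penalty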

5. Variational Autoencoders (VAE)

Introduces a probabilistic framework by enforcing a prior distribution (e.g., Gaussian) on the latent space.

Uses KL divergence to regularize the latent distribution, ensuring structured and meaningful embeddings.
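For a diagonal Gaussian posterior and a standard normal prior, the KL regularizer has a closed form. A small sketch, assuming PyTorch; mu and logvar stand in for the outputs of an encoder and are illustrative:

import torch

# q(z|x) = N(mu, sigma^2), p(z) = N(0, I):
# KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
mu = torch.randn(32, 16)          # stand-in encoder outputs
logvar = torch.randn(32, 16)

kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

# Reparameterization trick: sample z = mu + sigma * eps so gradients flow
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps
# total_loss = reconstruction_loss + beta * kl   (beta = 1 for a plain VAE)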

6. Dropout Regularization

Randomly drops neurons during training to prevent over-reliance on specific features.

Encourages redundancy and robustness in learned representations.
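As a small illustration (PyTorch assumed; layer sizes and the dropout rate are arbitrary), dropout can be inserted between encoder layers and is switched off automatically in evaluation mode:

import torch.nn as nn

# Dropout randomly zeroes a fraction of activations during training;
# calling model.eval() disables it for inference.
encoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(256, 64),  nn.ReLU(),
)
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))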

7. Batch Normalization & Layer Normalization

Normalizes activations to stabilize training and reduce internal covariate shifts.

Improves generalization and speeds up convergence.

Autoencoders are a specialized class of algorithms that can learn efficient representations of input data without the need for labels; they are a class of artificial neural networks designed for unsupervised learning. Learning to compress and effectively represent input data without explicit labels is the essential principle of an autoencoder. This is accomplished using a two-part structure that consists of an encoder and a decoder. The encoder transforms the input data into a reduced-dimensional representation, often referred to as the “latent space” or “encoding”. From that representation, the decoder rebuilds the original input. This cycle of encoding and decoding forces the network to capture the essential features and meaningful patterns in the data.
Denoising Autoencoders:
A denoising autoencoder is a modification of the original autoencoder in which, instead of giving the original input, we give a corrupted or noisy version of the input to the encoder, while the decoder's loss is calculated with respect to the original input only. This results in more efficient learning, and the risk of the autoencoder becoming an identity function is significantly reduced.

Denoising Autoencoders (DAEs) are a type of autoencoder designed to remove noise from data
by learning a robust representation of the input. They are widely used in image processing,
speech enhancement, and feature learning.

How Denoising Autoencoders Work

• Corrupting the Input: A noisy version of the input is created by adding noise (e.g.,
Gaussian noise, salt-and-pepper noise, or occlusions).

• Encoding: The noisy input is passed through an encoder, which maps it to a lower-
dimensional latent space.

• Decoding: The decoder reconstructs the denoised version of the input from the latent
representation.

• Loss Function: The model is trained using a loss function that minimizes the difference
between the reconstructed output and the clean input.
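A minimal training-step sketch, assuming PyTorch; the architecture, noise level, and dummy batch are illustrative. The key point is that the noisy input is encoded but the loss is measured against the clean input:

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

clean = torch.rand(32, 784)                   # dummy batch of clean inputs
noisy = clean + 0.3 * torch.randn_like(clean) # corrupt with Gaussian noise

opt.zero_grad()
recon = decoder(encoder(noisy))               # encode and decode the NOISY input
loss = ((recon - clean) ** 2).mean()          # ...but compare against the CLEAN input
loss.backward()
opt.step()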

Applications of Denoising Autoencoders

• Image Denoising: Removing noise from images (e.g., medical imaging, photography).

• Speech Enhancement: Improving the quality of speech signals.

• Feature Learning: Extracting robust representations for downstream tasks like classification.

• Anomaly Detection: Identifying irregularities in data by comparing reconstructed outputs to the original input.
Sparse Autoencoders:
Sparse autoencoders impose a sparsity constraint on the hidden units so that only a small fraction of neurons activate for any given input. Common ways to enforce this are:

KL Divergence: Encourages the average activation of neurons to match a desired sparsity level.

L1 Regularization (Lasso): Promotes sparsity by penalizing large weights.

Hidden Layer Activation: Uses non-linear activation functions (ReLU, Sigmoid, Tanh) to control neuron activation; the average activation of neurons is kept low to enforce sparsity.
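A hedged sketch of the KL-divergence sparsity penalty, assuming PyTorch; the target sparsity rho and the weight beta are illustrative. The penalty compares the average activation of each hidden unit with the desired level rho:

import torch
import torch.nn as nn

enc = nn.Linear(784, 64)
dec = nn.Linear(64, 784)
rho, beta = 0.05, 1e-2                        # target sparsity and penalty weight
x = torch.rand(32, 784)

h = torch.sigmoid(enc(x))                     # activations lie in (0, 1)
recon = dec(h)
recon_loss = ((recon - x) ** 2).mean()

# Average activation of each hidden unit over the batch
rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
# KL(rho || rho_hat), summed over hidden units: small only when rho_hat ~ rho
kl = (rho * torch.log(rho / rho_hat)
      + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

loss = recon_loss + beta * kl
loss.backward()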

Bias Variance Tradeoff:


The bias-variance tradeoff is a fundamental concept in machine learning and statistics that
describes the balance between two sources of error that affect the performance of predictive
models:

1. Bias

Bias refers to the error introduced by approximating a real-world problem with a simplified
model.

High bias means the model makes strong assumptions about the data, leading to underfitting.

Example: A linear regression model used to fit a complex, highly non-linear dataset will have
high bias.

2. Variance

Variance refers to how much the model's predictions fluctuate based on the training data.

High variance means the model is too sensitive to small fluctuations in the training set, leading
to overfitting.
Example: A deep neural network that perfectly fits the training data but performs poorly on
new data has high variance.

The Tradeoff

Increasing model complexity reduces bias but increases variance.

Simplifying the model reduces variance but increases bias.

The goal is to find the optimal balance where both bias and variance are minimized to achieve
the lowest total error.
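The tradeoff can be observed numerically. The sketch below (NumPy assumed; the target function, noise level, and polynomial degrees are arbitrary choices) repeatedly fits polynomials of different degrees to small noisy samples and estimates the squared bias and the variance of each model class:

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x_test = np.linspace(0, 1, 50)

for degree in (1, 4, 9):                      # simple, moderate, very flexible
    preds = []
    for _ in range(200):                      # many training sets of 20 noisy points
        x = rng.uniform(0, 1, 20)
        y = true_f(x) + rng.normal(0, 0.3, 20)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")

Low-degree fits show high bias and low variance; high-degree fits show the reverse.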

Graphical Representation

The typical error curve shows:

• High bias: The model performs poorly on both training and test data.

• High variance: The model does well on training data but poorly on test data.

• Optimal point: A balance where both errors are minimized.

How to Manage the Tradeoff

• Regularization (e.g., L1/L2 penalties): Prevents overfitting by discouraging overly complex models.

• Cross-validation: Helps detect high variance and tune model complexity.

• Feature selection: Reducing irrelevant features can help control variance.

• Ensemble methods (e.g., bagging, boosting): Help reduce variance while maintaining low bias.

Bias is the difference between the values predicted by the machine learning model and the correct values. High bias gives a large error on both training and test data, so an algorithm should ideally be low-biased to avoid the problem of underfitting. With high bias, the predictions follow an overly simple form (for example, a straight line fitted to data that is not linear) and therefore do not fit the data set accurately. Such fitting is known as underfitting of the data, and it happens when the hypothesis is too simple or linear in nature.
Early Stopping:
In regularization by early stopping, we stop training the model when its performance on the validation set starts getting worse: increasing loss, decreasing accuracy, or a poorer value of the chosen scoring metric. If the error on the training dataset and the validation dataset are plotted together, both errors decrease with the number of iterations up to the point where the model starts to overfit. After this point, the training error still decreases but the validation error increases.

Even if training is continued after this point, early stopping essentially returns the set of parameters that were in use at that point, and so is equivalent to stopping training there. The final parameters returned therefore give the model lower variance and better generalization: the model at the time training is stopped typically generalizes better than the model with the least training error.


Early stopping can be thought of as implicit regularization, in contrast to explicit regularization via weight decay. The method is also efficient: it works with the training data that is available, which is not always plentiful, and it requires less training time than many other regularization methods. Repeating the early stopping process many times, however, may result in the model overfitting the validation dataset, just as overfitting can occur on the training data.

The number of iterations (i.e., epochs) taken to train the model can be considered a hyperparameter. An optimal value for this hyperparameter then has to be found (by hyperparameter tuning) for the learning model to perform at its best.

Early stopping is a regularization technique used in machine learning to prevent overfitting by stopping the training process when a model’s performance on a validation dataset starts to degrade.

How Early Stopping Works

• Monitor Performance – During training, the model’s loss or accuracy is evaluated on both the training and validation datasets.
• Detect Overfitting – If the validation loss starts increasing while the training loss
continues to decrease, the model is likely overfitting.

• Stop Training – Training is stopped when the validation loss (or another metric) has not
improved for a set number of epochs (patience).

• Use Best Model – The model is typically restored to the weights from the epoch with the
best validation performance.
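A minimal, self-contained sketch of this loop, assuming PyTorch; the model, synthetic data, learning rate, and patience value are all illustrative choices rather than recommendations:

import copy
import torch
import torch.nn as nn

# Tiny illustrative setup: a linear model on synthetic data with a held-out split
torch.manual_seed(0)
X = torch.randn(200, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(200, 1)
x_tr, y_tr, x_val, y_val = X[:150], y[:150], X[150:], y[150:]

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

patience, wait = 5, 0
best_loss, best_weights = float("inf"), None

for epoch in range(500):
    opt.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()     # one training step per "epoch"
    opt.step()

    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    if val_loss < best_loss:                  # validation improved: save a checkpoint
        best_loss, wait = val_loss, 0
        best_weights = copy.deepcopy(model.state_dict())
    else:
        wait += 1
        if wait >= patience:                  # no improvement for `patience` epochs
            break

model.load_state_dict(best_weights)           # restore the best-performing weights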

Dataset Augmentation:
The best way to make a machine learning model generalize better is to train it on more data. Of
course, in practice, the amount of data we have is limited. One way to get around this problem
is to create new data and add it to the training set.

Data augmentation is easiest for classification. A classifier takes a high-dimensional input x and summarizes it with a single category identity y, so its main task is to be invariant to a wide variety of transformations; we can therefore generate new training pairs (x, y) simply by transforming the inputs x.

This approach does not generalize easily to other problems. For example, in a density estimation problem it is not possible to generate new data without first solving the density estimation task itself.

Dataset augmentation is a technique used in deep learning to artificially expand the size and
diversity of training datasets by applying various transformations to the existing data. This helps
improve model generalization, reduce overfitting, and make models more robust to real-world
variations.
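For image classification, label-preserving transformations are commonly applied on the fly as each batch is drawn. A small sketch, assuming torchvision is available; the particular transformations, their parameters, and the dataset path are illustrative:

from torchvision import transforms

# Each training image is randomly transformed every time it is drawn, so the
# effective training set is much larger than the stored one.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
# e.g. train_set = torchvision.datasets.ImageFolder("path/to/train", transform=train_transform)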

Parameter Tying and Parameter Sharing:


One form of parameter tying regularises the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures can be designed so that many of the parameters in the classifier model are paired with corresponding parameters in the unsupervised model. While a parameter norm penalty is one way to regularise sets of parameters to be close to one another, a more prevalent approach is to use constraints that force sets of parameters to be equal. Because we view the various models or model components as sharing a single, unique set of parameters, this form of regularisation is commonly referred to as parameter sharing. A significant advantage of parameter sharing over regularising the parameters to be close (through a norm penalty) is that only the shared subset of parameters (the unique set) needs to be retained in memory. In models such as the convolutional neural network, this can result in a large reduction in memory footprint.

Convolutional neural networks (CNNs) used in computer vision are by far the most widespread and extensive use of parameter sharing. Many statistical properties of natural images are invariant to translation: a photo of a cat, for example, can be shifted one pixel to the right and still be a photo of a cat. CNNs take this property into account by sharing parameters across multiple image locations; the same feature (a hidden unit with the same weights) is computed at different locations in the input. This means that whether the cat appears in column i or column i + 1 of the image, we can find it with the same cat detector.

Thanks to parameter sharing, CNNs have been able to reduce the number of unique model parameters and to grow greatly in size without requiring a comparable increase in training data. This remains one of the best illustrations of how domain knowledge can be efficiently integrated into a network architecture.

In the context of machine learning, "parameter sharing" refers to the practice of using the same set of parameters across different parts of a model. This allows different sections to learn similar features and reduces the overall number of parameters needed, which is particularly useful in convolutional neural networks (CNNs), where the same feature might be present at different locations within an image. It makes the model more efficient and more robust by leveraging shared information across various parts of the data.
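A small illustration of the savings, assuming PyTorch; the input size and the number of feature maps are arbitrary:

import torch.nn as nn

# A fully connected layer mapping a 28x28 image to 10 feature maps of the same
# size would need roughly 784 * 7840 separate weights. A convolutional layer
# instead reuses one small kernel at every spatial location:
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, padding=1)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 10 * (3*3*1) weights + 10 biases = 100 parameters, shared across all positions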

Greedy Layer Wise Training:


Artificial intelligence has undergone a revolution thanks to neural networks, which have made significant strides possible in areas such as speech recognition, computer vision, and natural language processing. Training deep neural networks, however, can be difficult, particularly when working with big, complicated datasets. One method that tackles some of these issues is greedy layer-wise pre-training, which initializes the parameters of a deep neural network layer by layer.

Greedy layer-wise pre-training initializes the parameters of deep neural networks layer by layer, beginning with the first layer and working through each one that follows. At each step, a layer is trained as if it were a stand-alone model, taking its input from the layer before it and passing its output to the layer after it. The training objective is typically to develop useful representations of the input data.

Processes of Greedy Layer-Wise Pre-Training

The process of greedy layer-wise pre-training can be staged as follows:


• Initialization: The neural network's first layer is trained on its own using an autoencoder or another unsupervised learning strategy. The aim is to learn a collection of features that highlight important elements of the input data.

• Feature Extraction: Once the first layer has been trained, its activations are used as features to train the subsequent layer. As this process is repeated, each layer learns to represent the traits discovered by the layer before it at a higher level of abstraction.

• Fine-Tuning: Once every layer has been pretrained in this way, the network is adjusted as a whole using supervised learning methods. To maximize performance on a particular job, this entails simultaneously modifying all of the network's parameters using a labeled dataset.
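A compact sketch of this procedure, assuming PyTorch; the layer sizes, the random stand-in for unlabeled data, and the helper pretrain_layer are illustrative. Each layer is trained as the encoder of a small autoencoder on the activations of the previous layer, and the stack is then fine-tuned with a supervised head:

import torch
import torch.nn as nn

def pretrain_layer(layer, data, epochs=10, lr=1e-3):
    # Train `layer` as the encoder of a small autoencoder on `data`.
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        h = torch.relu(layer(data))
        loss = ((decoder(h) - data) ** 2).mean()
        loss.backward()
        opt.step()
    return torch.relu(layer(data)).detach()    # activations feed the next layer

layers = [nn.Linear(784, 256), nn.Linear(256, 64)]
x = torch.rand(512, 784)                       # stand-in for unlabeled data
for layer in layers:                           # greedy: one layer at a time
    x = pretrain_layer(layer, x)

# Fine-tuning: stack the pretrained layers, add a classifier head, and train
# the whole network on labeled data with a supervised loss.
net = nn.Sequential(layers[0], nn.ReLU(), layers[1], nn.ReLU(), nn.Linear(64, 10))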

Better Activation Functions:


An activation function is a mathematical function applied to the output of a neuron. It
introduces non-linearity into the model, allowing the network to learn and represent complex
patterns in the data. Without this non-linearity feature, a neural network would behave like a
linear regression model, no matter how many layers it has.

The activation function decides whether a neuron should be activated by calculating the
weighted sum of inputs and adding a bias term. This helps the model make complex decisions
and predictions by introducing non-linearities to the output of each neuron.

An artificial neural network (ANN) is an information-processing paradigm that draws inspiration from the brain. ANNs learn by example, much as people do. Through a learning process, an ANN is tailored to a particular application, such as pattern recognition or data classification. Learning changes the synaptic interconnections that exist between the neurons.

Which activation function to employ in the hidden layers and at the output layer of the network is one of the decisions you must make when creating a neural network. This section discusses a few of the alternatives.

The nerve impulse in neurology serves as a model for activation functions in computer science. A chain reaction permits a neuron to "fire" and send a signal to nearby neurons if the induced voltage between its interior and exterior exceeds a threshold value known as the action potential. The resulting series of activations, known as a "spike train", enables motor neurons to transfer commands from the brain to the limbs and sensory neurons to transmit sensations from the digits to the brain.

In artificial neural networks, an activation function is one that outputs a small value for small inputs and a larger value if its inputs exceed a threshold. An activation function "fires" if the inputs are big enough; otherwise, nothing happens. An activation function, then, acts as a gate that checks whether an incoming value is higher than a threshold value.

Activation functions are essential because they introduce non-linearities into neural networks and enable them to learn powerful operations. If the activation functions were removed, a feedforward neural network would collapse into a simple linear function or matrix transformation of its input.

By computing a weighted sum and then adding a bias to it, the activation function determines whether a neuron should be activated. Its purpose is to introduce non-linearity into a neuron's output.

Explanation: As we are aware, neurons in neural networks operate in accordance with their weights, biases, and corresponding activation functions. Based on the error, the parameters of the neural network are adjusted; this process is known as back-propagation. Activation functions make back-propagation possible because they supply the gradients, along with the error, needed to update the weights and biases.
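A quick numerical comparison of a few common activation functions, assuming PyTorch; the input range is arbitrary:

import torch

x = torch.linspace(-3, 3, 7)

sigmoid = torch.sigmoid(x)        # squashes to (0, 1); saturates for large |x|
tanh = torch.tanh(x)              # squashes to (-1, 1); zero-centred
relu = torch.relu(x)              # max(0, x); cheap and does not saturate for x > 0
leaky = torch.nn.functional.leaky_relu(x, negative_slope=0.01)  # small slope for x < 0

print(torch.stack([x, sigmoid, tanh, relu, leaky]))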

Better Weight Initialization Methods:


Weight initialization is an essential aspect of training neural networks, influencing their convergence speed, stability, and overall performance. Initializing the weights of a neural network properly can lead to quicker convergence during training and better generalization on unseen data.

A neural network may be considered as a function with learnable parameters, which are
commonly referred to as weights and biases. Now, when neural nets are first trained, these
parameters (typically the weights) are initialized in a variety of ways, including using constant
values like 0's and 1's, values sampled from some distribution (typically a uniform distribution
or normal distribution), and other sophisticated schemes such as Xavier Initialization.

A neural network's performance is heavily influenced by how its parameters are initialized when it first begins training. If we initialize them at random for each run, the results are almost certain to be non-reproducible and may even underperform. On the other hand, if we initialize them with constant values, the network may take an extremely long time to converge, and we also lose the benefit of randomness, which gives a neural net the ability to reach convergence faster via gradient-based learning. We clearly require a better initialization technique.

Challenges of Weight Initialisation

Weight initialization presents a hurdle owing to the non-linear activation functions employed in neural networks, such as sigmoid, tanh, and ReLU. These activation functions operate optimally within particular ranges. For example, the sigmoid function returns values between 0 and 1, whereas tanh returns values between -1 and 1. If the initial weights are too large or too small, the activations might become saturated, resulting in vanishing gradients or slow convergence.

Another problem is keeping the variation of activations and gradients consistent across the
network's layers. As the signal travels through numerous levels, it might increase or diminish,
compromising training stability. Proper weight initialization strategies strive to overcome these
problems while also ensuring robust and efficient neural network training.
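A short sketch of two widely used schemes, assuming PyTorch; the layer sizes are arbitrary. Xavier/Glorot initialization is usually paired with sigmoid or tanh activations, while He/Kaiming initialization accounts for ReLU zeroing half of its inputs:

import torch.nn as nn

layer_tanh = nn.Linear(256, 128)
layer_relu = nn.Linear(256, 128)

# Xavier/Glorot: keeps activation variance roughly constant across layers
nn.init.xavier_uniform_(layer_tanh.weight)
nn.init.zeros_(layer_tanh.bias)

# He/Kaiming: scaled for ReLU non-linearities
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity="relu")
nn.init.zeros_(layer_relu.bias)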
Batch Normalization:
Batch normalization (also known as batch norm) is a method used to make training of artificial
neural networks faster and more stable through normalization of the layers' inputs by re-
centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.[1]

The reasons behind the effectiveness of batch normalization remain under discussion. It was
believed that it can mitigate the problem of internal covariate shift, where parameter
initialization and changes in the distribution of the inputs of each layer affect the learning rate
of the network.[1] Recently, some scholars have argued that batch normalization does not
reduce internal covariate shift, but rather smooths the objective function, which in turn
improves the performance.[2] However, at initialization, batch normalization in fact induces
severe gradient explosion in deep networks, which is only alleviated by skip connections in
residual networks.[3] Others maintain that batch normalization achieves length-direction decoupling, and thereby accelerates neural networks.

Batch normalization was introduced to mitigate the internal covariate shift problem in neural
networks by Sergey Ioffe and Christian Szegedy in 2015. The normalization process involves
calculating the mean and variance of each feature in a mini-batch and then scaling and shifting
the features using these statistics. This ensures that the input to each layer remains roughly in
the same distribution, regardless of changes in the distribution of earlier layers' outputs.
Consequently, Batch Normalization helps in stabilizing the training process, enabling higher
learning rates and faster convergence.
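A minimal sketch of the computation described above, assuming PyTorch; the batch size, feature dimension, and initial values of the learnable scale and shift are illustrative:

import torch

x = torch.randn(32, 64)                       # a mini-batch of 64-dimensional activations
gamma = torch.ones(64)                        # learnable scale
beta = torch.zeros(64)                        # learnable shift
eps = 1e-5

mean = x.mean(dim=0)                          # per-feature mean over the batch
var = x.var(dim=0, unbiased=False)            # per-feature variance over the batch
x_hat = (x - mean) / torch.sqrt(var + eps)    # normalize each feature
y = gamma * x_hat + beta                      # scale and shift

# The built-in layer performs the same computation and also tracks running
# statistics for use at inference time:
# bn = torch.nn.BatchNorm1d(64); y = bn(x)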

Batch normalization is a deep learning approach that has been shown to significantly improve
the efficiency and reliability of neural network models. It is particularly useful for training very
deep networks, as it can help to reduce the internal covariate shift that can occur during
training.

Batch normalization is a technique for normalizing the interlayer outputs of a neural network. As a result, the next layer receives a “reset” of the output distribution from the preceding layer, allowing it to analyze the data more effectively.

The term “internal covariate shift” is used to describe the effect that updating the parameters
of the layers above it has on the distribution of inputs to the current layer during deep learning
training. This can make the optimization process more difficult and can slow down the
convergence of the model.
