DL-UNIT_3
1. Principal Component Analysis (PCA)
PCA is a linear technique that finds an optimal set of orthogonal axes (principal components)
along which the data varies the most. It projects data onto these components to reduce
dimensionality while retaining as much variance as possible. PCA is mathematically
straightforward and uses Singular Value Decomposition (SVD) to compute the principal
components.
PCA is therefore a purely linear transformation of the data.
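As a concrete illustration, here is a minimal PCA sketch using NumPy's SVD; the data matrix is random and the number of components is an arbitrary choice (in practice, a library routine such as scikit-learn's PCA gives the same result):

import numpy as np

def pca(X, n_components):
    # Center the data: PCA requires zero-mean features
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance explained by each retained component
    explained_variance = (S[:n_components] ** 2) / (len(X) - 1)
    # Project the data onto the principal components
    return X_centered @ components.T, components, explained_variance

X = np.random.randn(200, 10)                      # stand-in data: 200 samples, 10 features
Z, components, variance = pca(X, n_components=2)  # Z is the 2-D projection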
2. Autoencoders
Autoencoders are a type of neural network that learn efficient data representations in an
unsupervised manner. They consist of an encoder, which compresses input data into a lower-
dimensional latent space, and a decoder, which reconstructs the original data from this
compressed representation. Autoencoders can learn both linear and nonlinear transformations,
making them more flexible than PCA.
A linear autoencoder (with a single hidden layer and linear activation functions) behaves much
like PCA: it finds a subspace that captures maximum variance, essentially the span of PCA's
principal components.
Unlike PCA, nonlinear autoencoders can capture more complex patterns in data by learning a
more flexible manifold structure.
Autoencoders, especially deep autoencoders, can learn hierarchical representations, which PCA
cannot. PCA is deterministic and has a closed-form solution, while autoencoders must be trained
with iterative optimization methods.
Use PCA when you need a fast, interpretable, and optimal linear transformation.
Use Autoencoders when your data is complex and you suspect nonlinear structures that PCA
cannot capture.
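The contrast can be made concrete with a small example. Below is a minimal autoencoder sketch in PyTorch; the layer sizes, learning rate, and random input batch are illustrative assumptions rather than settings from any particular experiment.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(              # compress the input into the latent space
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(              # reconstruct the input from the latent code
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                             # reconstruction (MSE) loss

x = torch.randn(64, 784)                           # stand-in batch of inputs
optimizer.zero_grad()
loss = loss_fn(model(x), x)                        # the target is the input itself
loss.backward()
optimizer.step()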
Regularization In Autoencoders:
Regularization in autoencoders helps improve their generalization ability by preventing
overfitting and ensuring meaningful feature extraction. Various regularization techniques can
be applied to autoencoders, including:
1. L2 Regularization (Ridge)
Penalizes large weight values, leading to a more stable and smooth representation (see the
combined sketch after this list).
2. Sparse Autoencoders
Ensures that only a subset of neurons activate, leading to efficient feature learning.
3. Denoising Autoencoders
Adds noise (e.g., Gaussian, salt-and-pepper) to the input and trains the network to reconstruct
the original clean data.
Encourages robustness and prevents the model from memorizing training data.
4. Contractive Autoencoders
Adds a penalty term on the Jacobian of the encoder to minimize sensitivity to small input
variations.
5. Variational Autoencoders (VAEs)
Uses KL divergence to regularize the latent distribution, ensuring structured and meaningful
embeddings.
6. Dropout Regularization
Randomly deactivates a fraction of neurons during training so the network does not rely on any
single unit.
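As a rough illustration of how several of these regularizers are applied in practice, the PyTorch sketch below combines L2 regularization (via the optimizer's weight decay), a simple L1-style sparsity penalty on the latent code (a common stand-in for the KL-based sparsity penalty), and dropout; all sizes and coefficients are arbitrary assumptions:

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Dropout(p=0.2),                      # dropout regularization in the encoder
    nn.Linear(128, 32))
decoder = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784))

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-5)   # weight_decay = L2 penalty

x = torch.randn(64, 784)                    # stand-in input batch
optimizer.zero_grad()
z = encoder(x)
reconstruction = decoder(z)
sparsity_penalty = 1e-3 * z.abs().mean()    # encourages sparse latent activations
loss = nn.functional.mse_loss(reconstruction, x) + sparsity_penalty
loss.backward()
optimizer.step()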
Autoencoders are a specialized class of algorithms that can learn efficient representations of
input data with no need for labels. It is a class of artificial neural networks designed for
unsupervised learning. Learning to compress and effectively represent input data without
specific labels is the essential principle of an autoencoder. This is accomplished using a
two-part structure that consists of an encoder and a decoder. The encoder transforms the input
data into a reduced-dimensional representation, often referred to as the “latent space” or
“encoding”. From that representation, the decoder rebuilds the original input. The process of
encoding and decoding forces the network to capture the essential features and meaningful
patterns in the data.
Denoising Autoencoders:
A denoising autoencoder is a modification of the original autoencoder in which, instead of the
original input, a corrupted or noisy version of the input is given to the encoder, while the
decoder's loss is still calculated with respect to the original, clean input. This results in more
efficient learning, and the risk of the autoencoder collapsing into an identity function is
significantly reduced.
Denoising Autoencoders (DAEs) are a type of autoencoder designed to remove noise from data
by learning a robust representation of the input. They are widely used in image processing,
speech enhancement, and feature learning.
• Corrupting the Input: A noisy version of the input is created by adding noise (e.g.,
Gaussian noise, salt-and-pepper noise, or occlusions).
• Encoding: The noisy input is passed through an encoder, which maps it to a lower-
dimensional latent space.
• Decoding: The decoder reconstructs the denoised version of the input from the latent
representation.
• Loss Function: The model is trained using a loss function that minimizes the difference
between the reconstructed output and the clean input.
• Image Denoising: Removing noise from images (e.g., medical imaging, photography) is a
typical application.
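A minimal denoising training step, assuming simple fully connected encoder/decoder networks like those in the earlier sketches (sizes and noise level are illustrative assumptions), might look as follows:

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x_clean = torch.rand(64, 784)                        # stand-in clean batch
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)  # corrupt the input with Gaussian noise

optimizer.zero_grad()
reconstruction = decoder(encoder(x_noisy))              # encode/decode the NOISY input
loss = nn.functional.mse_loss(reconstruction, x_clean)  # loss is measured against the CLEAN input
loss.backward()
optimizer.step()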
(Sparse autoencoders, described earlier, likewise rely on non-linear activation functions such as
ReLU, Sigmoid, and Tanh together with a sparsity penalty to control which neurons activate.)
Bias-Variance Tradeoff:
1. Bias
Bias refers to the error introduced by approximating a real-world problem with a simplified
model.
High bias means the model makes strong assumptions about the data, leading to underfitting.
Example: A linear regression model used to fit a complex, highly non-linear dataset will have
high bias.
2. Variance
Variance refers to how much the model's predictions fluctuate based on the training data.
High variance means the model is too sensitive to small fluctuations in the training set, leading
to overfitting.
Example: A deep neural network that perfectly fits the training data but performs poorly on
new data has high variance.
The Tradeoff
The goal is to find the optimal balance where both bias and variance are minimized to achieve
the lowest total error.
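For squared-error loss, this balance is captured by the standard decomposition of the expected test error at a point:

Expected test error = Bias² + Variance + Irreducible noise

A more flexible model lowers bias but raises variance, and vice versa, so the lowest total error lies where the sum of the two terms is smallest.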
Graphical Representation
• High bias: The model performs poorly on both training and test data.
• High variance: The model does well on training data but poorly on test data.
• Ensemble methods (e.g., bagging, boosting) help reduce variance while maintaining low bias.
Bias is the difference between the values predicted by the machine learning model and the
correct values. High bias gives a large error on both training and testing data. An algorithm
should be kept low-biased to avoid the problem of underfitting. With high bias, the predicted
values follow an overly simple (for example, straight-line) pattern that does not fit the data set
accurately; such fitting is known as underfitting. It happens when the hypothesis is too simple
or too linear in nature.
Early Stopping:
In regularization by early stopping, we stop training the model when performance on the
validation set starts getting worse: increasing loss, decreasing accuracy, or poorer values of the
chosen scoring metric. If the error on the training dataset and the validation dataset are plotted
together, both errors decrease with the number of iterations up to the point where the model
starts to overfit. After this point, the training error keeps decreasing while the validation error
increases.
Even if training continues beyond this point, early stopping essentially returns the set of
parameters that were in use at that point, and so is equivalent to stopping training there. The
final parameters returned therefore give the model low variance and better generalization. The
model at the time training is stopped has better generalization performance than the model with
the lowest training error.
Early stopping can be thought of as implicit regularization, in contrast to explicit regularization
such as weight decay. It is also efficient: it needs no additional training data, which is not always
available, and it typically requires less training time than other regularization methods. However,
repeating the early stopping process many times may result in the model overfitting the
validation dataset, just as overfitting can occur on the training data.
The number of iterations (i.e., epochs) taken to train the model can be considered a
hyperparameter, and an optimal value for it has to be found (by hyperparameter tuning) for the
best performance of the learning model.
• Stop Training – Training is stopped when the validation loss (or another metric) has not
improved for a set number of epochs (patience).
• Use Best Model – The model is typically restored to the weights from the epoch with the
best validation performance.
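A minimal sketch of patience-based early stopping is shown below; model, optimizer, train_loader, val_loader and the helpers train_one_epoch and evaluate are hypothetical placeholders, not part of any specific library:

import copy

max_epochs = 100
patience = 5                                   # epochs to wait without improvement
best_val_loss = float("inf")
best_weights = None
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)       # hypothetical training helper
    val_loss = evaluate(model, val_loader)                 # hypothetical validation helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())   # remember the best weights
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:         # stop training
            break

model.load_state_dict(best_weights)                        # restore the best model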
Dataset Augmentation:
The best way to make a machine learning model generalize better is to train it on more data. Of
course, in practice, the amount of data we have is limited. One way to get around this problem
is to create new data and add it to the training set.
Data augmentation is easiest for classification: a classifier takes a high-dimensional input x and
summarizes it with a single category identity y, so its main task is to be invariant to a wide
variety of transformations. We can generate new samples (x, y) simply by transforming the
inputs x of existing training examples.
This approach is not easily generalized to other problems, such as density estimation: there it is
not possible to generate new data without already having solved the density estimation problem.
Dataset augmentation is a technique used in deep learning to artificially expand the size and
diversity of training datasets by applying various transformations to the existing data. This helps
improve model generalization, reduce overfitting, and make models more robust to real-world
variations.
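For image data, such transformations are usually applied on the fly during training; the torchvision sketch below shows a typical pipeline, where the particular transforms and their parameters are illustrative choices:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror the image
    transforms.RandomRotation(degrees=10),                # small random rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop and rescale
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # lighting variations
    transforms.ToTensor(),
])
# Applying train_transform to each training image produces a different random
# variant every epoch, effectively enlarging and diversifying the dataset.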
Parameter Sharing:
Convolutional neural networks (CNNs) used in computer vision are by far the most widespread
and extensive users of parameter sharing. Many statistical properties of natural images are
insensitive to translation: a photo of a cat, for example, can be shifted one pixel to the right and
still be a photo of a cat. CNNs take this property into account by sharing parameters across
multiple image locations. The same feature (a hidden unit with the same weights) is computed
over different locations in the input. This means that whether the cat appears in column i or
column i + 1 of the image, we can find it with the same cat detector.
Thanks to parameter sharing, CNNs have been able to greatly reduce the number of unique
model parameters and to grow substantially in size without requiring a comparable increase in
training data. It remains one of the best illustrations of how domain knowledge can be
efficiently integrated into the network architecture.
In the context of machine learning, "parameter sharing" refers to the practice of using the same
set of parameters across different parts of a model, allowing different sections to learn similar
features and reducing the overall number of parameters needed. It is particularly useful in
convolutional neural networks (CNNs), where the same feature may appear at different
locations within an image; sharing weights makes the model more efficient and robust by
leveraging shared information across different parts of the data.
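The saving from parameter sharing can be seen by comparing parameter counts; in the sketch below, a small convolutional layer is contrasted with a fully connected layer on a hypothetical 32x32 RGB input:

import torch.nn as nn

# Convolution: one 3x3 kernel per input/output channel pair, reused at every spatial location
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
# Fully connected: a separate weight for every input/output pair (32x32 RGB -> 16 maps of 30x30)
fc = nn.Linear(3 * 32 * 32, 16 * 30 * 30)

conv_params = sum(p.numel() for p in conv.parameters())   # 448 parameters
fc_params = sum(p.numel() for p in fc.parameters())       # roughly 44 million parameters
print(conv_params, fc_params)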
Greedy Layer-wise Pre-training:
Greedy layer-wise pre-training is used to initialize the parameters of deep neural networks
layer by layer, beginning with the first layer and working through each one that follows. At each
step, a layer is trained as if it were a stand-alone model, taking its input from the previously
trained layer and producing output that will feed the layer after it. The training objective is
typically to develop useful representations of the input data.
• Feature extraction: Once the first layer has been trained, its activations are used as features to
train the subsequent layer. As this process is repeated, each layer learns a higher-level
abstraction of the features discovered by the layer before it (see the sketch below).
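A simplified sketch of this procedure with stacked autoencoders is given below; the layer sizes, number of steps, and random stand-in data are assumptions for illustration only:

import torch
import torch.nn as nn

x = torch.randn(256, 784)                       # stand-in unlabeled data
layer_sizes = [784, 256, 64]
trained_layers = []
inputs = x

for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
    decoder = nn.Linear(out_dim, in_dim)
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    for _ in range(100):                        # train this layer as a stand-alone autoencoder
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(inputs)), inputs)
        loss.backward()
        optimizer.step()

    trained_layers.append(encoder)
    inputs = encoder(inputs).detach()           # its activations become features for the next layer

# The pre-trained encoders can then be stacked and fine-tuned end to end.
pretrained_stack = nn.Sequential(*trained_layers)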
Activation Functions:
The activation function decides whether a neuron should be activated, based on the weighted
sum of its inputs plus a bias term. It helps the model make complex decisions and predictions by
introducing non-linearity into the output of each neuron.
An artificial neural network (ANN) is an information-processing paradigm inspired by the brain.
Like people, ANNs learn by example. Through a learning process, an ANN is configured for a
particular application, such as pattern classification or data classification. Learning changes the
synaptic connections that exist between the neurons.
Which activation function to use in the hidden layers and at the output layer of the network is
one of the decisions you must make when designing a neural network. This section discusses a
few of the alternatives.
The nerve impulse in neurology serves as a model for activation functions in computer science.
A neuron "fires" and sends a signal to neighbouring neurons, in a chain reaction, if the induced
voltage between its interior and exterior exceeds a threshold value known as the action
potential. The resulting series of activations, known as a "spike train," enables motor neurons to
carry commands from the brain to the limbs and sensory neurons to transmit sensations from
the digits to the brain.
In artificial neural networks, an activation function is one that outputs a small value for small
inputs and a larger value once its inputs exceed a threshold. An activation function "fires" if the
inputs are big enough; otherwise, nothing happens. An activation function, then, is a gate that
checks whether an incoming value is higher than a threshold value.
Activation functions are useful because they introduce non-linearities into neural networks and
enable them to learn powerful operations. If the activation functions were removed, a
feedforward neural network could be rewritten as a simple linear function or matrix
transformation of its input.
Given the weighted sum of the inputs plus the bias, the activation function determines whether
a neuron should be turned on; its purpose is to introduce non-linearity into the neuron's output.
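The most common activation functions can be written down directly; the NumPy sketch below is only meant to make the "threshold/firing" intuition concrete:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes any input into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes any input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # passes positive inputs, zeroes out the rest

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # weighted sums plus bias for five neurons
print(sigmoid(z), tanh(z), relu(z))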
Explanation: Neurons in neural networks operate according to their weights, biases, and
corresponding activation functions. Based on the error, the weights and biases inside a neural
network are adjusted; this process is known as back-propagation. Activation functions make
back-propagation possible, since they supply the gradients, together with the error, required to
update the biases and weights.
Weight Initialization:
A neural network may be considered as a function with learnable parameters, commonly
referred to as weights and biases. When neural nets are first trained, these parameters (typically
the weights) are initialized in a variety of ways, including using constant values like 0s and 1s,
values sampled from some distribution (typically a uniform or normal distribution), and more
sophisticated schemes such as Xavier initialization.
A neural network's performance is heavily influenced by how its parameters are initialized when
it first begins training. If we initialize purely at random on every run, the results are (nearly)
non-reproducible and may underperform. On the other hand, if we initialize with constant
values, the network may take an extremely long time to converge, and we also lose the benefit
of randomness, which helps a neural net converge faster via gradient-based learning. We
therefore need a better initialization technique.
Weight initialization presents a hurdle owing to the non-linear activation functions employed in
neural networks, such as sigmoid, tanh, and ReLU. These activation functions operate well only
within particular ranges. For example, the sigmoid function returns values between 0 and 1,
whereas tanh returns values between -1 and 1. If the initial weights are too large or too small,
the activations can become saturated, resulting in vanishing gradients or slow convergence.
Another problem is keeping the variance of activations and gradients consistent across the
network's layers. As the signal travels through many layers, it can grow or shrink, compromising
training stability. Proper weight initialization strategies aim to overcome these problems and
ensure robust and efficient neural network training.
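A short PyTorch sketch of Xavier (Glorot) initialization is shown below; the network architecture is an arbitrary example, and the same pattern applies to convolutional layers as well:

import torch.nn as nn

def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)   # variance scaled by fan-in and fan-out
        nn.init.zeros_(module.bias)              # biases are commonly initialized to zero

model = nn.Sequential(
    nn.Linear(784, 256), nn.Tanh(),
    nn.Linear(256, 10))
model.apply(init_weights)                        # applies init_weights to every submodule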
Batch Normalization:
Batch normalization (also known as batch norm) is a method used to make training of artificial
neural networks faster and more stable through normalization of the layers' inputs by re-
centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.[1]
The reasons behind the effectiveness of batch normalization remain under discussion. It was
believed that it can mitigate the problem of internal covariate shift, where parameter
initialization and changes in the distribution of the inputs of each layer affect the learning rate
of the network.[1] Recently, some scholars have argued that batch normalization does not
reduce internal covariate shift, but rather smooths the objective function, which in turn
improves the performance.[2] However, at initialization, batch normalization in fact induces
severe gradient explosion in deep networks, which is only alleviated by skip connections in
residual networks.[3] Others maintain that batch normalization achieves length-direction
decoupling and thereby accelerates neural network training.
Batch normalization was introduced to mitigate the internal covariate shift problem in neural
networks by Sergey Ioffe and Christian Szegedy in 2015. The normalization process involves
calculating the mean and variance of each feature in a mini-batch and then scaling and shifting
the features using these statistics. This ensures that the input to each layer remains roughly in
the same distribution, regardless of changes in the distribution of earlier layers' outputs.
Consequently, Batch Normalization helps in stabilizing the training process, enabling higher
learning rates and faster convergence.
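In symbols, for a mini-batch with per-feature mean μ_B and variance σ²_B, batch normalization computes x̂ = (x − μ_B) / √(σ²_B + ε) and outputs y = γ·x̂ + β, where γ and β are learned. The PyTorch sketch below carries out this computation by hand; torch.nn.BatchNorm1d performs the same steps (plus running statistics for inference):

import torch

x = torch.randn(64, 10)                     # mini-batch: 64 samples, 10 features
eps = 1e-5
mean = x.mean(dim=0)                        # per-feature mean over the batch
var = x.var(dim=0, unbiased=False)          # per-feature (biased) variance over the batch
x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature

gamma = torch.ones(10)                      # learnable scale, initialized to 1
beta = torch.zeros(10)                      # learnable shift, initialized to 0
y = gamma * x_hat + beta                    # re-scaled and re-shifted output
# torch.nn.BatchNorm1d(10) performs the same computation in training mode.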
Batch normalization is a deep learning approach that has been shown to significantly improve
the efficiency and reliability of neural network models. It is particularly useful for training very
deep networks, as it can help to reduce the internal covariate shift that can occur during
training.
Batch normalization is a technique for normalizing the inter-layer outputs of a neural network
during training. As a result, the next layer receives a "reset" of the output distribution from the
preceding layer, allowing it to analyze the data more effectively.
The term "internal covariate shift" describes the effect that updating the parameters of the
preceding layers has on the distribution of inputs to the current layer during deep learning
training. This can make the optimization process more difficult and slow down the convergence
of the model.