CS601_Machine Learning_Unit 2 New

Uploaded by

okchaitanya568
Copyright
© © All Rights Reserved

Chameli Devi Group of Institutions, Indore

Department of Computer Science and Engineering

UNIT-II
CS601 - MACHINE LEARNING

1
SYLLABUS & COURSE OUTCOME (UNIT-II)
• Unit-II: Linearity vs non-linearity, activation functions
like sigmoid, ReLU, etc., weights and bias, loss function,
gradient descent, multilayer network, backpropagation,
weight initialization, training, testing, unstable gradient
problem, auto encoders, batch normalization, dropout,
L1 and L2 regularization, momentum, tuning hyper
parameters.

• CO601.2: Student will be able to analyze a problem and identify the
computing requirements appropriate for its solution based on BPN.

2
TOPICS TO BE COVERED…
• Linearity Vs Non-Linearity
• Activation functions like Sigmoid, ReLU, etc.
• Weights, Bias, and Loss function
• Gradient Descent
• Multilayer Network
• Introduction to Back Propagation Network
• Back Propagation Training Algorithm
• Unstable Gradient Problem
• Auto Encoders
• Batch normalization, Dropout
• L1 and L2 Regularization
• Momentum
• Tuning Hyper-parameters
3
LINEARITY VS NON-LINEARITY
• A linear model uses a linear function as its prediction function or as
a crucial part of it.
• A linear function takes a fixed number of numerical inputs x1, x2, …,
xn and, with the weights w0, …, wn as the parameters of the model,
computes w0 + w1·x1 + … + wn·xn.
• If the prediction function is the linear function itself, we can perform
regression, i.e. predict a numerical label.
• Other (more complex) response functions can be applied on top of the
linear function. The logistic function is very commonly used; it leads
to logistic regression, which predicts a number between 0 and 1 and is
typically used to learn the probability of a binary outcome in a noisy
setting.
4
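As a minimal sketch of the two prediction functions described on this slide, the Python snippet below evaluates a linear function and then applies the logistic function on top of it. The weights and inputs are illustrative values chosen for this example, not values from the course material.

```python
import math

def linear(x, w):
    """Linear function: w[0] is the bias w0, w[1:] pair with inputs x1..xn."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def logistic(z):
    """Logistic response function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, 2.0, -1.0]   # illustrative weights w0, w1, w2
x = [1.0, 3.0]         # illustrative inputs x1, x2

z = linear(x, w)       # regression output: 0.5 + 2.0*1.0 - 1.0*3.0
p = logistic(z)        # logistic regression output, read as a probability
```

Used on its own, `linear` gives a regression prediction; wrapped in `logistic`, the same weights give a value between 0 and 1.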
LINEARITY VS NON-LINEARITY
• A non-linear model is a model which is not a linear model. Typically
these are more powerful (they can represent a larger class of
functions) but much harder to train.
• Nonlinear regression is a statistical technique that helps describe
nonlinear relationships in experimental data.
• Nonlinear regression models are generally assumed to be
parametric, where the model is described as a nonlinear equation.
• Parametric nonlinear regression models the dependent variable
(also called the response) as a function of a combination of
nonlinear parameters and one or more independent variables
(called predictors). The model can be univariate (single response
variable) or multivariate (multiple response variables).

5
BIOLOGICAL NEURAL NETWORK

6
ARTIFICIAL NEURAL NETWORK

7
COMPARISON BNN VS ANN

8
ACTIVATION FUNCTIONS

9
ACTIVATION FUNCTIONS

10
WEIGHTS AND BIAS

11
LOSS FUNCTION
• A loss function, or cost function, is a wrapper around our model's predict
function that tells us “how good” the model is at making predictions for a
given set of parameters.

• The loss function has its own curve and its own derivatives. The slope of
this curve tells us how to change our parameters to make the model
more accurate.

• We use the model to make predictions and the cost function to update our
parameters. Many different cost functions are available; popular ones
include MSE (L2) and cross-entropy loss.

• Strictly speaking, the loss function computes the error for a single
training example, while the cost function is the average of the losses
over the entire training set.
12
LOSS FUNCTION EXAMPLES
• 1. Squared Error Loss
➢ Squared error loss for each training example, also known as L2 loss, is
the square of the difference between the actual value y and the predicted
value ŷ:
$L = (y - \hat{y})^2$

• 2. Absolute Error Loss
➢ Absolute error for each training example is the distance between the
predicted and the actual values, irrespective of the sign. Absolute error is
also known as the L1 loss:
$L = |y - \hat{y}|$
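The two per-example losses, and the cost obtained by averaging them over a training set, can be sketched in a few lines of Python. The actual and predicted values are made-up numbers for illustration only.

```python
def squared_error(y_true, y_pred):
    """L2 loss for a single training example."""
    return (y_true - y_pred) ** 2

def absolute_error(y_true, y_pred):
    """L1 loss for a single training example."""
    return abs(y_true - y_pred)

def cost(loss_fn, ys_true, ys_pred):
    """Cost: the average of the per-example losses over the whole set."""
    losses = [loss_fn(t, p) for t, p in zip(ys_true, ys_pred)]
    return sum(losses) / len(losses)

ys_true = [3.0, 5.0, 2.0]   # illustrative actual values
ys_pred = [2.0, 5.0, 4.0]   # illustrative predictions

mse = cost(squared_error, ys_true, ys_pred)    # (1 + 0 + 4) / 3
mae = cost(absolute_error, ys_true, ys_pred)   # (1 + 0 + 2) / 3
```

Note how the squared error weights the largest miss (2 vs 4) more heavily than the absolute error does.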

13
GRADIENT DESCENT
• Gradient descent is by far the most popular optimization strategy
used in machine learning and deep learning at the moment.
• It is used when training models, can be combined with many learning
algorithms, and is easy to understand and implement.
• Gradient descent is an optimization algorithm for finding a local
minimum of a differentiable function. It is used to find the values of
a function's parameters (coefficients) that minimize a cost function
as far as possible.
• Example: Imagine a blindfolded man who wants to climb to the top of
a hill in as few steps as possible.
• He might start climbing the hill by taking really big steps in the
steepest direction, which he can do as long as he is not close to the
top.
• As he comes closer to the top, however, his steps will get smaller and
smaller to avoid overshooting it. This process can be described
mathematically using the gradient.
14
WORKING OF GRADIENT DESCENT
• Instead of climbing up a hill, think of gradient descent as hiking
down to the bottom of a valley. This is a better analogy because
gradient descent is a minimization algorithm.
• The equation below describes what gradient descent does: b is the
next position of our climber, while a represents his current position.
The minus sign refers to the minimization part of gradient descent.
The gamma (γ) in the middle is a weighting factor (the learning rate),
and the gradient term ∇f(a) points in the direction of steepest ascent,
so subtracting it moves in the direction of steepest descent.
$b = a - \gamma \nabla f(a)$
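A minimal sketch of this update rule in Python, assuming the illustrative function f(a) = (a - 3)^2, whose minimum is at a = 3. The function, starting point and γ value are assumptions made for this example.

```python
def grad_f(a):
    """Gradient of the illustrative function f(a) = (a - 3)^2."""
    return 2.0 * (a - 3.0)

def gradient_descent(a, gamma, steps):
    """Repeatedly apply the update: next = current - gamma * gradient."""
    for _ in range(steps):
        a = a - gamma * grad_f(a)
    return a

# Start at a = 0 and take 100 steps with gamma = 0.1.
a_final = gradient_descent(a=0.0, gamma=0.1, steps=100)
```

Each step shrinks the distance to the minimum by a constant factor here, so `a_final` ends up very close to 3.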

15
IMPORTANCE OF LEARNING RATE
• The size of the steps gradient descent takes in the direction of the
local minimum is determined by the learning rate, which controls how
fast or slowly we move towards the optimal weights.
• For gradient descent to reach the local minimum, we must set the
learning rate to an appropriate value that is neither too low nor
too high. If it is too high, the steps may overshoot the minimum and
bounce back and forth; if it is too low, gradient descent will
eventually get there but may take a long time.

16
IMPORTANCE OF LEARNING RATE
• A good way to make sure gradient descent runs properly is by
plotting the cost function as the optimization runs.
• Put the number of iterations on the x-axis and the value of the cost
function on the y-axis. This lets you see the value of your cost
function after each iteration of gradient descent and makes it easy
to spot how appropriate your learning rate is.
• If gradient descent is working properly, the cost function should
decrease after every iteration.
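The effect of the learning rate on the cost curve can be sketched by recording the cost after every iteration. The example below again assumes the illustrative function f(a) = (a - 3)^2 and three hand-picked learning rates; the recorded lists are what one would plot against the iteration number.

```python
def cost(a):
    """Illustrative cost function with its minimum at a = 3."""
    return (a - 3.0) ** 2

def cost_history(gamma, steps=20, a=0.0):
    """Run gradient descent and record the cost after every iteration."""
    history = [cost(a)]
    for _ in range(steps):
        a = a - gamma * 2.0 * (a - 3.0)   # gradient of cost is 2 * (a - 3)
        history.append(cost(a))
    return history

too_low = cost_history(0.01)   # decreases, but very slowly
good = cost_history(0.1)       # decreases steadily
too_high = cost_history(1.1)   # overshoots; the cost grows instead
```

For a well-chosen rate the recorded cost falls at every step, while a rate that is too high makes it grow.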

17
TYPES OF GRADIENT DESCENT
• Batch Gradient Descent
➢ Batch gradient descent, also called vanilla gradient descent, calculates the
error for each example within the training dataset, but updates the model only
after all training examples have been evaluated.
➢ One such full pass over the training dataset is called a training epoch.
• Stochastic Gradient Descent
➢ By contrast, stochastic gradient descent (SGD) updates the parameters after
each training example, one by one. Depending on the problem, this can make SGD
faster than batch gradient descent.
➢ One advantage is that the frequent updates give a fairly detailed picture of
the rate of improvement.
• Mini-Batch Gradient Descent
➢ Mini-batch gradient descent is the go-to method since it combines the
concepts of SGD and batch gradient descent.
➢ It simply splits the training dataset into small batches and performs an update
for each of those batches. This creates a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent.
18
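The three variants differ only in how many examples contribute to each update, which the following Python sketch makes explicit through a `batch_size` argument. The one-weight model y = w·x, the noise-free data and the settings are assumptions chosen to keep the example short.

```python
import random

def train(data, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent
    anything in between     -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):                 # one epoch = one pass over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                  # one update per batch
    return w

data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]   # true weight is 2
w_batch = train(data, batch_size=len(data))
w_sgd = train(data, batch_size=1)
w_mini = train(data, batch_size=2)
```

On this toy data all three variants recover a weight close to 2; they differ in how many updates they perform per epoch.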
MULTILAYER NETWORK

19
MULTILAYER NETWORK

20
INTRODUCTION TO BACK PROPAGATION NETWORK

• Back propagation is a supervised learning technique for neural
networks that calculates the gradient of the error function with
respect to the network's weights, which gradient descent then uses
to adjust them.
• It’s short for the backward propagation of errors, since the error is
computed at the output and distributed backwards throughout the
network’s layers.
• When an artificial neural network produces an error at its output,
the algorithm calculates the gradient of the error function with
respect to the network’s various weights.
• The gradient for the final layer of weights is calculated first, with the
first layer’s gradient of weights calculated last. Partial calculations of
the gradient from one layer are reused to determine the gradient for
the previous layer.
• The point of this backwards method is to calculate the gradient at
each layer more efficiently than the traditional approach of
calculating each layer’s gradient separately.
21
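The reuse of partial results can be seen in a very small network. The sketch below assumes one input, one sigmoid hidden unit and one linear output with a squared-error loss; the network shape, weights and data point are illustrative assumptions, not the network from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)    # hidden activation
    y_hat = w2 * h         # linear output
    return h, y_hat

def loss(x, y, w1, w2):
    return 0.5 * (forward(x, w1, w2)[1] - y) ** 2

def backward(x, y, w1, w2):
    """Gradients of the loss, computed from the output layer backwards."""
    h, y_hat = forward(x, w1, w2)
    delta_out = y_hat - y                            # error at the output
    grad_w2 = delta_out * h                          # final layer first
    delta_hidden = delta_out * w2 * h * (1.0 - h)    # reuses delta_out
    grad_w1 = delta_hidden * x                       # first layer last
    return grad_w1, grad_w2

x, y, w1, w2 = 1.5, 1.0, 0.4, -0.3   # illustrative values
grad_w1, grad_w2 = backward(x, y, w1, w2)
```

A quick way to sanity-check such a sketch is to compare the result with a finite-difference estimate of the same gradients.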


INTRODUCTION TO BACK PROPAGATION NETWORK

22
BACK PROPAGATION NETWORK ALGORITHM

23
BACK PROPAGATION NETWORK ALGORITHM

24
BACK PROPAGATION NETWORK ALGORITHM

25
BACK PROPAGATION NETWORK ALGORITHM

26
BACK PROPAGATION NETWORK ALGORITHM

27
BACK PROPAGATION NETWORK ALGORITHM

28
BACK PROPAGATION NETWORK ALGORITHM

29
EXAMPLE BACK PROPAGATION NETWORK

30
WEIGHT INITIALIZATION

• While building and training neural networks, it is crucial to
initialize the weights appropriately to ensure a model with high
accuracy. If the weights are not correctly initialized, it may give
rise to the vanishing gradient problem or the exploding
gradient problem. Hence, selecting an appropriate weight
initialization strategy is critical when training DL models.

31
WEIGHT INITIALIZATION

Terminology or Notations
• The following notations must be kept in mind while studying
the weight initialization techniques. They may vary across
publications; the ones used here are the most common and are
usually found in research papers.

• fan_in = Number of input paths towards the neuron

• fan_out = Number of output paths leaving the neuron

32
WEIGHT INITIALIZATION

• Example: Consider a neuron, part of a Deep Neural Network, with
three incoming and two outgoing connections.

For this neuron,
fan_in = 3 (number of input paths towards the neuron)
fan_out = 2 (number of output paths leaving the neuron)
33
WEIGHT INITIALIZATION TECHNIQUES
• Zero Initialization
• Random Initialization
• Xavier/Glorot Initialization
• Normalized Xavier/Glorot Initialization
• He Uniform Initialization
• He Normal Initialization
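As a rough sketch of how fan_in and fan_out enter two of these techniques, the Python functions below use the commonly cited formulas: a uniform range of ±sqrt(6 / (fan_in + fan_out)) for normalized Xavier/Glorot and a normal distribution with standard deviation sqrt(2 / fan_in) for He normal. The function names and the 3-by-2 layer shape are assumptions made for this example.

```python
import math
import random

def normalized_xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """Weights drawn from a normal distribution with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)                  # fixed seed for repeatability
w_xavier = normalized_xavier_uniform(3, 2, rng)   # fan_in = 3, fan_out = 2
w_he = he_normal(3, 2, rng)
```

Both functions return a fan_in by fan_out weight matrix as a list of rows.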

34
TRAINING

• ML (machine learning) model training is the process of teaching an algorithm to make predictions or
identify patterns by exposing it to labeled data, and then repeatedly refining its parameters to minimize
the difference between its predictions and the true values in the data.
How it works:
• Data Collection: A dataset containing both input features and corresponding target values is collected.
• Data Preprocessing: The data is prepared by cleaning, transforming, and normalizing it to make it suitable
for the chosen ML model.
• Model Selection: A suitable ML algorithm is chosen based on the problem and the nature of the data.
• Model Training: The algorithm is trained using the prepared data. The algorithm iteratively adjusts its
parameters based on the discrepancy between its predictions and the true values, aiming to minimize this
difference.
• Evaluation: The trained model's performance is evaluated using unseen test data to assess its ability to
make accurate predictions on new, unknown data.
• Hyperparameter Tuning: The algorithm's parameters that are not learned from the data but are set before
training (hyperparameters) are tuned to optimize the model's performance.
• Deployment: The trained and evaluated model is deployed for making predictions or solving real-world
problems.
35
TESTING

Testing in machine learning involves evaluating a model's performance on unseen data
to ensure its accuracy, reliability, and generalization to new scenarios, and to identify
potential biases or issues.
Why is testing important in Machine Learning?
• Generalization: Machine learning models are trained on a specific dataset, but their
goal is to perform well on unseen data. Testing helps determine how well the model
generalizes to new, real-world scenarios.
• Accuracy and Reliability: Testing ensures that the model is accurate and reliable in
its predictions or decisions.
• Identifying Issues: Testing can reveal biases, errors, or areas where the model needs
improvement.
• Model Selection: Different models can be tested and compared to determine which
one performs best for a specific task.
• Real-world Performance: Testing provides insights into how the model will perform
in a real-world environment, which is crucial for deploying and using machine learning
systems effectively.
36
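The split between training and testing can be sketched as follows. The example assumes a hypothetical dataset generated from y = 3x plus small bounded noise, a one-weight model fitted by least squares, and an 80/20 split; all of these are illustrative choices.

```python
import random

def fit_slope(points):
    """Least-squares slope for the model y = w * x (no intercept)."""
    return sum(x * y for x, y in points) / sum(x * x for x, _ in points)

def mean_squared_error(points, w):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Hypothetical dataset: y = 3x plus noise in [-0.1, 0.1].
dataset = [(i / 10, 3.0 * (i / 10) + rng.uniform(-0.1, 0.1))
           for i in range(1, 101)]
rng.shuffle(dataset)

train_set, test_set = dataset[:80], dataset[80:]   # 80/20 split
w = fit_slope(train_set)                           # training sees only train_set
train_mse = mean_squared_error(train_set, w)
test_mse = mean_squared_error(test_set, w)         # evaluation on unseen data
```

A test error close to the training error suggests the model generalizes; a much larger test error would point to overfitting.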
UNSTABLE GRADIENT PROBLEM
• The problem of exploding or vanishing gradients occurs while
training a neural network. It mainly affects the weights in the
earlier layers of the network.
• Stochastic gradient descent calculates the gradient of the loss with
respect to the weights in the network. By the chain rule, the gradient
of the loss with respect to any given weight is a product of
derivatives that depend on components residing later in the network.
• The earlier the layer, the more terms this product contains. If these
terms are small, i.e. less than 1 (or large, i.e. greater than 1), the
product becomes very small (very large). When this product, scaled by
the learning rate, is subtracted from the weight, it barely changes
the weight (changes it drastically), leaving it far from the optimal
value (moving it far past the optimal value).
• This is the vanishing (exploding) gradient problem, and it defeats
the prime objective of gradient descent.
37
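The effect of multiplying many such terms can be illustrated numerically. The sketch below assumes a chain of sigmoid layers in which every layer contributes one factor (weight times sigmoid derivative); the weights of 1 and 8 and the depth of 10 are illustrative assumptions.

```python
import math

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(depth, weight, z=0.0):
    """Product of one (weight * sigmoid') factor per layer."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_prime(z)
    return scale

vanishing = gradient_scale(depth=10, weight=1.0)   # factors of 0.25 each
exploding = gradient_scale(depth=10, weight=8.0)   # factors of 2.0 each
```

Ten factors of 0.25 shrink the gradient to below one millionth, while ten factors of 2 inflate it roughly a thousandfold.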
INTRODUCTION TO AUTO ENCODERS
• An auto encoder neural network is an unsupervised machine
learning algorithm that applies back propagation, setting the target
values to be equal to the inputs.
• Auto encoders are used to reduce the size of our inputs into a smaller
representation. If anyone needs the original data, they can
reconstruct an approximation of it from the compressed data.
• An auto encoder can learn non-linear transformations with a non-linear
activation function and multiple layers.
• It is not restricted to dense layers. It can use convolutional
layers, which are better suited to video, image and series data.
• It is more efficient to learn several layers with an auto encoder rather
than learn one huge transformation.
• An auto encoder provides a representation of each layer as the
output.
• It can make use of pre-trained layers from another model to apply
transfer learning to enhance the encoder/decoder.
38
ARCHITECTURE OF AUTO ENCODERS

39
ARCHITECTURE OF AUTO ENCODERS
• Encoder: This part of the network compresses the input into
a latent space representation. The encoder layer encodes the input
image as a compressed representation in a reduced dimension. The
compressed image is a distorted version of the original image.

• Code: This part of the network represents the compressed input
which is fed to the decoder.

• Decoder: This layer decodes the encoded image back to the
original dimension. The decoded image is a lossy reconstruction of
the original image, reconstructed from the latent space
representation.
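The encoder, code and decoder roles can be sketched with a deliberately tiny linear example that compresses a 2-D input into a 1-D code. The weights below are fixed by hand for illustration (a real auto encoder would learn them with back propagation), and the inputs are made-up points.

```python
W = [0.6, 0.8]   # hypothetical fixed weights (a unit vector), not learned

def encode(x):
    """Encoder: compress a 2-D input into a 1-D code."""
    return sum(wi * xi for wi, xi in zip(W, x))

def decode(code):
    """Decoder: expand the 1-D code back to the original dimension."""
    return [code * wi for wi in W]

def reconstruction_error(x):
    """Squared distance between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

on_axis = reconstruction_error([3.0, 4.0])    # input lies along W
off_axis = reconstruction_error([4.0, 3.0])   # part of the input is lost
```

The first input is reconstructed almost exactly, while the second loses information in the bottleneck, which is what makes the reconstruction lossy.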

40
TYPES OF AUTO ENCODERS
• Convolution Auto encoders
➢ Auto encoders in their traditional formulation do not take into account the fact that a
signal can be seen as a sum of other signals. Convolutional auto encoders use the
convolution operator to exploit this observation. They learn to encode the input as a
set of simple signals and then try to reconstruct the input from them, or to modify the
geometry or the reflectance of the image.
• Sparse Auto encoders
➢ Sparse auto encoders offer us an alternative method for introducing an information
bottleneck without requiring a reduction in the number of nodes at our hidden
layers.
➢ They penalize activations of hidden layers so that only a few nodes are encouraged to
activate when a single sample is fed into the network.
• Deep Auto encoders
➢ The first layer of the deep auto encoder is used for first-order features in the raw
input. The second layer is used for second-order features corresponding
to patterns in the appearance of first-order features. Deeper layers of the deep auto
encoder tend to learn even higher-order features.
➢ A deep auto encoder is composed of two symmetrical deep-belief networks:
➢ First four or five shallow layers representing the encoding half of the net.
41
BATCH NORMALIZATION
• Batch normalization is an important technique that we add to our
model. It normalizes the inputs to each layer, acts as a regularizer,
helps the back propagation process, and can be applied to most models
to help them converge better.
• How does batch normalization work?
• Batch normalization is a layer that we add between the layers of
the neural network; it continuously takes the output from the
previous layer and normalizes it before sending it to the next layer.

42
BATCH NORMALIZATION
• This has the effect of stabilizing the neural network. Batch
normalization is also used to maintain the distribution of the data.
• The problem we have in neural networks is internal covariate
shift: while we are training the network, the distribution of each
layer's inputs changes, and the model trains more slowly.
• To maintain a similar distribution of data, batch normalization
normalizes the outputs over each mini-batch to mean 0 and standard
deviation 1 (μ=0, σ=1).
• With this technique, the model typically trains faster, and its
accuracy is often higher than that of a model that does not use
batch normalization.
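The normalization step itself is short. The sketch below normalizes the outputs of a single unit over one mini-batch; it deliberately omits the learnable scale and shift parameters that full batch normalization layers usually add afterwards, and the small `eps` constant (an assumption here) only guards against division by zero.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize one unit's outputs over a mini-batch to mean 0, std 1."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(var + eps) for v in batch]

outputs = [2.0, 4.0, 6.0, 8.0]      # illustrative outputs of the previous layer
normalized = batch_norm(outputs)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
```

After normalization the mini-batch has a mean of about 0 and a variance of about 1, whatever the scale of the original outputs.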

43
DROPOUT
• Dropout is implemented per-layer in a neural network.
• It can be used with most types of layers, such as dense fully
connected layers, convolutional layers, and recurrent layers such as
the long short-term memory network layer.
• Dropout may be implemented on any or all hidden layers in the
network as well as the visible or input layer. It is not used on the
output layer.
• The term “dropout” refers to dropping out units (hidden and
visible) in a neural network.
• Simply put, dropout refers to ignoring a randomly chosen set of units
(i.e. neurons) during the training phase. By “ignoring”, we mean that
these units are not considered during a particular forward or
backward pass.

44
DROPOUT

45
DROPOUT
• More technically, at each training stage, individual nodes are either
dropped out of the net with probability 1-p or kept with probability p, so
that a reduced network is left; incoming and outgoing edges to a
dropped-out node are also removed.
• Neural networks are the building blocks of deep-learning
architectures. They consist of one input layer, one or more hidden
layers, and an output layer.
• When we train our neural network (or model) by updating each of its
weights, it might become too dependent on the dataset we are using.
Therefore, when this model has to make a prediction or classification, it
will not give satisfactory results. This is known as over-fitting.
• We might understand this problem through a real-world example: If a
student of mathematics learns only one chapter of a book and then
takes a test on the whole syllabus, he will probably fail.
• To overcome this problem, we use a technique that was introduced by
Geoffrey Hinton in 2012. This technique is known as dropout.
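The keep/drop rule described on this slide (keep a unit with probability p, drop it with probability 1-p) can be sketched as below. The division of kept activations by p is the common "inverted dropout" variant, added here as an assumption so that the expected activation stays unchanged; it is not stated on the slide.

```python
import random

def dropout(activations, p, rng):
    """Keep each unit with probability p (scaled by 1/p), else output 0."""
    return [a / p if rng.random() < p else 0.0 for a in activations]

rng = random.Random(0)               # fixed seed for repeatability
activations = [1.0] * 1000           # illustrative layer outputs
dropped = dropout(activations, p=0.8, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
```

A fresh random mask is drawn on every training pass, and at test time no units are dropped.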
46
REGULARIZATION
• Regularization is a technique to discourage the complexity of the
model. It does this by adding a penalty term to the loss function,
which helps to solve the overfitting problem.
• Let’s understand how penalizing the loss function helps
simplify the model.
• The loss function is the sum of squared differences between the
actual values and the predicted values:
$L = \sum_{i} (y_i - \hat{y}_i)^2$

47
REGULARIZATION
• As the degree of the input features increases, the model becomes
more complex and tries to fit all the data points.
• When we penalize the weights θ_3 and θ_4 and make them very
small, close to zero, the corresponding terms become negligible,
which helps simplify the model.

• Regularization works on the assumption that smaller weights generate
a simpler model and thus help avoid overfitting.

48
L1 REGULARIZATION
• LASSO regression (L1 regularization) adds to its cost function a
penalty term equal to a hyper-parameter α times the sum of the
absolute values of the coefficients:
$\sum_{i} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |\theta_j|$

• On the one hand, if we do not apply any penalty (set α = 0), the
above formula turns into regular OLS regression, which may
overfit.
• On the other hand, the model will probably underfit if we apply a
very large penalty (a large α value), because all coefficients are
then penalized heavily, the most important ones included.

49
L2 REGULARIZATION
• Ridge regression (L2 regularization) adds the “squared magnitude” of
the coefficients times lambda as the penalty term:
$\sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} \theta_j^2$

• If lambda (λ) is 0, the formula becomes regular OLS regression. The
penalty term of the cost function increases the bias of the model and
makes the fit on the training data worse, in exchange for reduced
overfitting.
• L2 regularization favours simpler models. Instead of shrinking
weights to exactly zero, its shrinkage slows down as a weight
approaches 0: in each iteration, L2 removes a small percentage of
each weight, so the weights never reach exactly 0.
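The two penalty terms, and the percentage-wise shrinkage that L2 causes under gradient descent, can be sketched as follows. The coefficient values, α, λ and learning rate are illustrative assumptions.

```python
def l1_penalty(theta, alpha):
    """LASSO penalty: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(t) for t in theta)

def l2_penalty(theta, lam):
    """Ridge penalty: lambda times the sum of squared coefficient values."""
    return lam * sum(t * t for t in theta)

def l2_shrink(weight, lr, lam, steps):
    """Apply only the L2 part of a gradient step: w <- w - lr * 2 * lam * w."""
    for _ in range(steps):
        weight -= lr * 2.0 * lam * weight
    return weight

theta = [0.5, -2.0, 0.0, 3.0]            # illustrative coefficients
l1 = l1_penalty(theta, alpha=0.1)        # 0.1 * (0.5 + 2 + 0 + 3)
l2 = l2_penalty(theta, lam=0.1)          # 0.1 * (0.25 + 4 + 0 + 9)
shrunk = l2_shrink(3.0, lr=0.1, lam=0.1, steps=100)
```

Each L2 step multiplies the weight by the same factor (0.98 here), so the weight keeps getting smaller but stays above zero.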

50
MOMENTUM
• Momentum methods in the context of machine learning refer to a group of
tricks and techniques designed to speed up convergence of first order
optimization methods like gradient descent (and its many variants).
• They essentially work by adding what’s called the momentum term to the
update formula for gradient descent, thereby damping its natural
“zigzagging behavior,” especially in long narrow valleys of the cost function.
• Momentum also helps the algorithm avoid getting stuck in a poor local
minimum. Think of it as a marble rolling around on a curved surface. We
want to get to the lowest point. Momentum lets the marble roll through
a lot of small dips and makes it more likely to find a better local
solution.
• Setting the momentum too high makes it more likely to overshoot (the
marble goes through the local minimum but the momentum carries it back
upwards for a bit). This leads to longer learning times.
• Finding the correct value of the momentum will depend on the particular
problem: the smoothness of the function, how many local minima you
expect, how “deep” the sub-optimal local minima are expected to be, etc.
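A minimal sketch of gradient descent with a momentum term, written in the common "velocity" form: the velocity is a decaying sum of past gradient steps, and the weight moves by the velocity. The cost function f(w) = (w - 3)^2, the learning rate and the momentum coefficient `beta` are assumptions chosen for this example.

```python
def grad(w):
    """Gradient of the illustrative cost f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def momentum_descent(w, lr, beta, steps):
    """Gradient descent with momentum.

    velocity <- beta * velocity - lr * gradient
    w        <- w + velocity
    """
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad(w)
        w = w + velocity
    return w

w_final = momentum_descent(w=0.0, lr=0.1, beta=0.9, steps=300)
```

With `beta` this high the iterate overshoots the minimum and oscillates around it before settling, which mirrors the marble being carried back upwards for a bit.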
51
MOMENTUM

52
TUNING HYPER PARAMETERS
• Hyper parameters cannot be directly learned from the regular
training process and are usually fixed before the actual training process
begins. They express important properties of the model
such as its complexity or how fast it should learn.
• Some examples of model hyper parameters include:
➢ The penalty in a Logistic Regression classifier, i.e. L1 or L2 regularization.
➢ The learning rate for training a neural network.
➢ The C and sigma hyper parameters for support vector machines.
➢ The k in k-nearest neighbours.
• Models can have many hyper parameters, and finding the best
combination of them can be treated as a search problem. Two
widely used strategies for hyper parameter tuning are:
• GridSearchCV
• RandomizedSearchCV
53
GRIDSEARCHCV
• The machine learning model is evaluated for a range of hyper parameter
values. This approach is called GridSearchCV because it searches for the
best set of hyper parameters from a grid of hyper parameter values.
• For example, suppose we want to set two hyper parameters, C and Alpha, of a
Logistic Regression classifier model, each with a different set of candidate values.
• The grid search technique will construct many versions of the model,
one for every possible combination of hyper parameter values, and will
return the best one.

• Drawback: GridSearchCV goes through every combination of hyper
parameter values in the grid, which makes grid search
computationally very expensive.
54
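The exhaustive search idea can be sketched in plain Python. This is not scikit-learn's GridSearchCV; it is a hand-written illustration in which `validation_score` is a hypothetical stand-in for training a model and measuring its cross-validated score, and the grid values are made up.

```python
import itertools

def validation_score(c, alpha):
    """Hypothetical stand-in for a cross-validated model score.

    It simply peaks at c = 1.0, alpha = 0.01 so the search has a known answer.
    """
    return -((c - 1.0) ** 2) - ((alpha - 0.01) ** 2)

def grid_search(grid, score_fn):
    """Evaluate every combination of values in the grid; return the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    evaluated = 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, evaluated

grid = {"c": [0.1, 1.0, 10.0], "alpha": [0.001, 0.01, 0.1]}
best_params, best_score, evaluated = grid_search(grid, validation_score)
```

With three candidate values for each of two hyper parameters, nine models are evaluated; the count multiplies with every extra hyper parameter, which is the cost the drawback above refers to.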
RANDOMIZEDSEARCHCV
• RandomizedSearchCV addresses the drawback of GridSearchCV, as it
goes through only a fixed number of hyper parameter settings. It
moves within the grid in random fashion to find the best set of hyper
parameters. This approach reduces unnecessary computation.
• RandomizedSearchCV implements a “fit” and a “score” method. It
also implements “score_samples”, “predict”, “predict_proba”,
“decision_function”, “transform” and “inverse_transform”.
• The parameters of the estimator used to apply these methods are
optimized by cross-validated search over parameter settings.
• In contrast to GridSearchCV, not all parameter values are tried out,
but rather a fixed number of parameter settings is sampled from the
specified distributions. The number of parameter settings that are
tried is given by n_iter.
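The sampling idea can be sketched in plain Python as well. This is not scikit-learn's RandomizedSearchCV; `validation_score` is again a hypothetical stand-in for a cross-validated score, and the log-uniform sampling ranges are assumptions. Only the name `n_iter` is taken from the text above.

```python
import random

def validation_score(c, alpha):
    """Hypothetical stand-in for a cross-validated model score."""
    return -((c - 1.0) ** 2) - ((alpha - 0.01) ** 2)

def random_search(distributions, score_fn, n_iter, seed=0):
    """Sample n_iter parameter settings and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sample(rng) for name, sample in distributions.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Each hyper parameter is described by a sampling function, not a fixed list.
distributions = {
    "c": lambda rng: 10 ** rng.uniform(-1, 1),        # between 0.1 and 10
    "alpha": lambda rng: 10 ** rng.uniform(-3, -1),   # between 0.001 and 0.1
}
best_params, best_score = random_search(distributions, validation_score, n_iter=20)
```

The number of model evaluations is fixed by `n_iter` regardless of how many hyper parameters are searched, at the price of no longer guaranteeing that the best grid point is visited.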

55