CS601_Machine Learning_Unit 2 New
UNIT-II
CS601 - MACHINE LEARNING
1
SYLLABUS & COURSE OUTCOME (UNIT-II)
• Unit-II: Linearity vs non-linearity, activation functions
like sigmoid, ReLU, etc., weights and bias, loss function,
gradient descent, multilayer network, backpropagation,
weight initialization, training, testing, unstable gradient
problem, auto encoders, batch normalization, dropout,
L1 and L2 regularization, momentum, tuning hyper
parameters.
TOPICS TO BE COVERED…
• Linearity Vs Non-Linearity
• Activation functions like Sigmoid, ReLU, etc.
• Weights, Bias, and Loss function
• Gradient Descent
• Multilayer Network
• Introduction to Back Propagation Network
• Back Propagation Training Algorithm
• Unstable Gradient Problem
• Auto Encoders
• Batch normalization, Dropout
• L1 and L2 Regularization
• Momentum
• Tuning Hyper-parameters
LINEARITY VS NON-LINEARITY
• A linear model uses a linear function for its prediction function or as
a crucial part of its prediction function.
• A linear function takes a fixed number of numerical inputs x1, x2,…,
xn and weights w0,…,wn as the parameters of the model, and returns
the weighted sum $w_0 + w_1 x_1 + \dots + w_n x_n$.
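As a rough illustration of the linear function described above, the sketch below treats w0 as the bias term and pairs each remaining weight with one input. The function name `linear_predict` and the sample numbers are made up for this example.

```python
def linear_predict(x, w):
    """Return w[0] + w[1]*x[0] + ... + w[n]*x[n-1] for inputs x and weights w."""
    if len(w) != len(x) + 1:
        raise ValueError("expected one weight per input plus the bias w[0]")
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


weights = [1.0, 2.0, -3.0]  # w0 (bias), w1, w2
inputs = [4.0, 0.5]         # x1, x2
prediction = linear_predict(inputs, weights)
print(prediction)  # 1 + 2*4 - 3*0.5 = 7.5
```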
BIOLOGICAL NEURAL NETWORK
ARTIFICIAL NEURAL NETWORK
COMPARISON BNN VS ANN
ACTIVATION FUNCTIONS
WEIGHTS AND BIAS
LOSS FUNCTION
• A loss function, or cost function, is a wrapper around the model's predict
function that tells us “how good” the model is at making predictions for a
given set of parameters.
• The loss function has its own curve and its own derivatives. The slope of
this curve tells us how to change the parameters to make the model
more accurate.
• We use the model to make predictions and the cost function to update the
parameters. Many different cost functions are available; popular choices
include mean squared error (MSE, or L2 loss) and cross-entropy loss.
• Strictly speaking, the loss function computes the error for a single training
example, while the cost function is the average of the losses over the entire
training set.
LOSS FUNCTION EXAMPLES
• 1. Squared Error Loss
The squared error loss for a training example, also known as L2 loss, is
the square of the difference between the actual value y and the predicted
value ŷ:
$L = (y - \hat{y})^2$
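The difference between the per-example loss and the averaged cost can be sketched in a few lines of Python. The function names and the sample values below are illustrative assumptions, not part of the original slides.

```python
def squared_error(y_true, y_pred):
    """Squared error (L2) loss for a single training example."""
    return (y_true - y_pred) ** 2


def mse_cost(ys_true, ys_pred):
    """Cost: the average of the per-example losses over the whole set."""
    losses = [squared_error(t, p) for t, p in zip(ys_true, ys_pred)]
    return sum(losses) / len(losses)


actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
cost = mse_cost(actual, predicted)  # per-example losses are 1, 0 and 4
print(cost)
```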
GRADIENT DESCENT
• Gradient descent is by far the most popular optimization strategy
used in machine learning and deep learning at the moment.
• It is used to train models, can be combined with any algorithm whose
cost function is differentiable, and is easy to understand and implement.
• Gradient descent is an optimization algorithm for finding a local
minimum of a differentiable function. It is used to find the values of a
function's parameters (coefficients) that minimize a cost function as
far as possible.
• Example: Imagine a blindfolded man who wants to climb to the top of
a hill in as few steps as possible.
• He might start climbing the hill by taking really big steps in the
steepest direction, which he can do as long as he is not close to the
top.
• As he comes closer to the top, however, his steps will get smaller and
smaller to avoid overshooting it. This process can be described
mathematically using the gradient.
WORKING OF GRADIENT DESCENT
• Instead of climbing up a hill, think of gradient descent as hiking
down to the bottom of a valley. This is a better analogy because
gradient descent minimizes a given function.
• The equation below describes what gradient descent does:
$b = a - \gamma \nabla f(a)$
Here b is the next position of our climber, while a represents his
current position. The minus sign refers to the minimization part of
gradient descent. The gamma (γ) in the middle is a weighting factor
(the learning rate). The gradient term ∇f(a) points in the direction of
steepest ascent, so subtracting it moves us in the direction of steepest
descent.
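A minimal sketch of this update rule (next position = current position minus gamma times the gradient), assuming a toy function f(a) = (a - 3)^2 chosen only for illustration. Its minimum is at a = 3.

```python
def gradient_descent(grad, start, gamma, steps):
    """Repeatedly apply b = a - gamma * grad(a), starting from `start`."""
    a = start
    for _ in range(steps):
        a = a - gamma * grad(a)
    return a


def grad_f(a):
    # Derivative of the illustrative function f(a) = (a - 3)**2.
    return 2.0 * (a - 3.0)


minimum = gradient_descent(grad_f, start=0.0, gamma=0.1, steps=100)
print(minimum)  # very close to 3.0
```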
IMPORTANCE OF LEARNING RATE
• The size of the steps gradient descent takes towards the local
minimum is determined by the learning rate, which controls how fast
or slow we move towards the optimal weights.
• For gradient descent to reach the local minimum, we must set the
learning rate to an appropriate value, neither too low nor too high.
If it is too high, the updates may overshoot the minimum and bounce
back and forth or diverge; if it is too low, gradient descent will
eventually get there, but it may take a very long time.
IMPORTANCE OF LEARNING RATE
• A good way to make sure gradient descent runs properly is by
plotting the cost function as the optimization runs.
• Put the number of iterations on the x-axis and the value of the cost
function on the y-axis. This helps you see the value of your cost
function after each iteration of gradient descent and provides a way
to easily spot how appropriate your learning rate is.
• If gradient descent is working properly, the cost function should
decrease after every iteration.
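The check described above can be sketched without any plotting library by recording the cost after each iteration and inspecting the resulting history. The cost function (w - 3)^2 and the three learning rates are assumptions picked for illustration; the recorded lists are what would go on the y-axis of the plot.

```python
def cost(w):
    return (w - 3.0) ** 2  # illustrative cost with its minimum at w = 3


def run(learning_rate, steps=20, w=0.0):
    """Run gradient descent and return the cost after each iteration."""
    history = [cost(w)]
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)  # gradient of cost is 2*(w - 3)
        history.append(cost(w))
    return history


def is_decreasing(history):
    return all(later < earlier for earlier, later in zip(history, history[1:]))


good = run(0.1)       # cost shrinks steadily
too_low = run(0.001)  # cost shrinks, but very slowly
too_high = run(1.1)   # cost grows: every step overshoots the minimum

print(is_decreasing(good), is_decreasing(too_high))
```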
TYPES OF GRADIENT DESCENT
• Batch Gradient Descent
Batch gradient descent, also called vanilla gradient descent, calculates the
error for each example within the training dataset, but the model is updated
only after all training examples have been evaluated.
One full pass over the training dataset is called a training epoch.
• Stochastic Gradient Descent
By contrast, stochastic gradient descent (SGD) updates the parameters for
each training example, one by one. Depending on the problem, this can make
SGD faster than batch gradient descent.
One advantage is that the frequent updates give us a fairly detailed picture of
the rate of improvement.
• Mini-Batch Gradient Descent
Mini-batch gradient descent is the go-to method since it combines the
concepts of SGD and batch gradient descent.
It simply splits the training dataset into small batches and performs an update
for each of those batches. This creates a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent.
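One way to see that the three variants differ only in how many examples feed each update is a single training loop with a `batch_size` parameter. The sketch below fits y ≈ w·x on made-up, noise-free data; the function name, data and settings are assumptions for illustration.

```python
import random


def train(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ≈ w*x with gradient descent, updating once per batch."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):  # one epoch = one full pass over the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [2.0 * x for x in xs]  # true weight is 2.0

w_batch = train(xs, ys, batch_size=len(xs))  # batch: one update per epoch
w_sgd = train(xs, ys, batch_size=1)          # stochastic: one update per example
w_mini = train(xs, ys, batch_size=2)         # mini-batch: one update per small batch
print(w_batch, w_sgd, w_mini)
```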
MULTILAYER NETWORK
INTRODUCTION TO BACK PROPAGATION NETWORK
BACK PROPAGATION NETWORK ALGORITHM
EXAMPLE BACK PROPAGATION NETWORK
WEIGHT INITIALIZATION
Terminology or Notations
• The following notations should be kept in mind while studying
the weight initialization techniques. Notations vary across
publications; the ones used here are the most common and are
usually found in research papers.
WEIGHT INITIALIZATION
• Zero Initialization
• Random Initialization
• Xavier/Glorot Initialization
• He Uniform Initialization
• He Normal Initialization
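Three of the schemes listed above can be sketched with the commonly quoted formulas: Xavier/Glorot uniform draws from ±sqrt(6 / (fan_in + fan_out)), He uniform from ±sqrt(6 / fan_in), and He normal from a Gaussian with standard deviation sqrt(2 / fan_in). The names `fan_in` and `fan_out` (the number of inputs and outputs of a layer) are assumed notation, since the notation slide did not survive in this text.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_uniform(fan_in, fan_out, rng):
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
w_xavier = xavier_uniform(64, 32, rng)  # 64 x 32 weight matrix
w_he_u = he_uniform(64, 32, rng)
w_he_n = he_normal(64, 32, rng)
print(len(w_xavier), len(w_xavier[0]))
```

Zero initialization would simply fill the same matrix with 0.0, which leaves every unit in a layer computing the same output and receiving the same update.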
TRAINING
• ML (machine learning) model training is the process of teaching an algorithm to make predictions or
identify patterns by exposing it to labeled data, and then repeatedly refining its parameters to minimize
the difference between its predictions and the true values in the data.
How it works:
• Data Collection: A dataset containing both input features and corresponding target values is collected.
• Data Preprocessing: The data is prepared by cleaning, transforming, and normalizing it to make it suitable
for the chosen ML model.
• Model Selection: A suitable ML algorithm is chosen based on the problem and the nature of the data.
• Model Training: The algorithm is trained using the prepared data. The algorithm iteratively adjusts its
parameters based on the discrepancy between its predictions and the true values, aiming to minimize this
difference.
• Evaluation: The trained model's performance is evaluated using unseen test data to assess its ability to
make accurate predictions on new, unknown data.
• Hyperparameter Tuning: The algorithm's parameters that are not learned from the data but are set before
training (hyperparameters) are tuned to optimize the model's performance.
• Deployment: The trained and evaluated model is deployed for making predictions or solving real-world
problems.
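The core of this workflow (collect data, hold out a test set, train, evaluate on unseen data) can be sketched end to end on a toy problem. Everything below is an illustrative assumption: the synthetic data follows y = 3x + 1, the model is a single line w·x + b, and the helper names are made up.

```python
import random


def split(data, test_fraction, seed=0):
    """Shuffle the data and hold out a fraction of it as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit(train, learning_rate=0.01, epochs=5000):
    """Training: adjust w and b iteratively to reduce the mean squared error."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in train) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(model, data):
    """Evaluation: mean squared error of the model on the given examples."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


data = [(float(x), 3.0 * x + 1.0) for x in range(10)]  # data collection
train_set, test_set = split(data, test_fraction=0.2)   # hold out unseen data
model = fit(train_set)                                 # model training
test_error = mse(model, test_set)                      # evaluation
print(model, test_error)
```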
TESTING
ARCHITECTURE OF AUTO ENCODERS
• Encoder: This part of the network compresses the input into
a latent-space representation. The encoder layer encodes the input
image as a compressed representation in a reduced dimension. The
compressed image is a distorted version of the original image.
TYPES OF AUTO ENCODERS
• Convolutional Auto encoders
Auto encoders in their traditional formulation do not take into account the fact that a
signal can be seen as a sum of other signals. Convolutional auto encoders use the
convolution operator to exploit this observation. They learn to encode the input in a
set of simple signals and then try to reconstruct the input from them, or to modify the
geometry or the reflectance of the image.
• Sparse Auto encoders
Sparse auto encoders offer us an alternative method for introducing an information
bottleneck without requiring a reduction in the number of nodes at our hidden
layers.
They penalize the activations of hidden layers so that only a few nodes are encouraged
to activate when a single sample is fed into the network.
• Deep Auto encoders
The first layer of the deep auto encoder is used for first-order features in the raw
input. The second layer is used for second-order features corresponding
to patterns in the appearance of first-order features. Deeper layers of the deep auto
encoder tend to learn even higher-order features.
A deep auto encoder is composed of two symmetrical deep-belief networks: the first
typically has four or five shallow layers representing the encoding half of the net,
and the second mirrors it as the decoding half.
BATCH NORMALIZATION
• Batch normalization is an important component that we can add to a
model. It acts as a regularizer, normalizes the inputs to each layer,
helps during back propagation, and can be added to most models to
help them converge better.
• How does batch normalization work?
• Batch normalization is a layer that we add between the layers of
the neural network. It takes the output from the previous layer and
normalizes it before sending it to the next layer.
• This has the effect of stabilizing the neural network. Batch
normalization is also used to maintain the distribution of the data.
• The problem we have in neural networks is internal covariate
shift: as the network trains, the distribution of each layer's inputs
changes because the parameters of the preceding layers change, and
the model trains more slowly.
• To keep the distribution of the data similar, batch normalization
normalizes each layer's outputs over the mini-batch to mean 0 and
standard deviation 1 (μ=0, σ=1).
• With this technique the model trains faster, and it often reaches
higher accuracy than a model that does not use batch normalization.
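The normalization step can be sketched for the outputs of one unit over one mini-batch. The learnable scale `gamma` and shift `beta`, and the small constant `eps` that avoids division by zero, are not mentioned in the text above; they are included here as assumptions because they appear in the usual formulation of batch normalization.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one unit's outputs over a mini-batch to roughly mean 0, std 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [2.0, 4.0, 6.0, 8.0]  # one unit's outputs for a batch of four
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(norm_mean, norm_var)  # close to 0 and 1
```

This sketch covers only the training-time computation on a single batch; handling of inference-time statistics is left out.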
DROPOUT
• Dropout is implemented per-layer in a neural network.
• It can be used with most types of layers, such as dense fully
connected layers, convolutional layers, and recurrent layers such as
the long short-term memory network layer.
• Dropout may be implemented on any or all hidden layers in the
network as well as the visible or input layer. It is not used on the
output layer.
• The term “dropout” refers to dropping out units (hidden and
visible) in a neural network.
• Simply put, dropout refers to ignoring a randomly chosen set of
units (i.e. neurons) during the training phase. By “ignoring”, we
mean these units are not considered during a particular forward or
backward pass.
• More technically, at each training stage, individual nodes are either
dropped out of the net with probability 1-p or kept with probability p, so
that a reduced network is left; incoming and outgoing edges to a
dropped-out node are also removed.
• Neural networks are the building blocks of deep-learning
architectures. They consist of one input layer, one or more hidden
layers, and an output layer.
• When we train our neural network (or model) by updating each of its
weights, it might become too dependent on the dataset we are using.
Therefore, when this model has to make a prediction or classification
on new data, it will not give satisfactory results. This is known as
over-fitting.
• We might understand this problem through a real-world example: If a
student of mathematics learns only one chapter of a book and then
takes a test on the whole syllabus, he will probably fail.
• To overcome this problem, we use a technique that was introduced by
Geoffrey Hinton and colleagues in 2012. This technique is known as
dropout.
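The keep-with-probability-p rule described above can be sketched for a single layer's activations. Dividing the surviving activations by p ("inverted dropout") is a common convention assumed here, not something stated in the slides; it keeps the expected activation unchanged so that nothing needs rescaling at test time.

```python
import random


def dropout(activations, p_keep, rng, training=True):
    """Keep each unit with probability p_keep and drop it otherwise."""
    if not training:
        return list(activations)  # dropout is switched off outside training
    mask = [1.0 if rng.random() < p_keep else 0.0 for _ in activations]
    # Assumed convention: rescale kept units by 1 / p_keep (inverted dropout).
    return [a * m / p_keep for a, m in zip(activations, mask)]


rng = random.Random(0)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, p_keep=0.8, rng=rng)
kept_fraction = sum(1 for d in dropped if d != 0.0) / len(dropped)
print(kept_fraction)  # roughly 0.8
```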
REGULARIZATION
• Regularization is a technique to discourage the complexity of the
model. It does this by penalizing the loss function. This helps to
solve the overfitting problem.
• Let’s understand how penalizing the loss function helps
simplify the model.
• The loss function is the sum of squared differences between the actual
values and the predicted values:
$L = \sum_{i} (y_i - \hat{y}_i)^2$
• As the degree of the input features increases, the model becomes
more complex and tries to fit all the data points.
• When we penalize the weights θ_3 and θ_4 and make them very
small, close to zero, those terms become negligible, which
simplifies the model.
L1 REGULARIZATION
• LASSO regression (L1 regularization) adds a penalty term to its cost
function: a hyper-parameter α times the sum of the absolute values of
the coefficients:
$\text{Cost} = \sum_{i} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |\theta_j|$
• On the one hand, if we do not apply any penalty (α = 0), the
above formula reduces to ordinary least squares (OLS) regression,
which may overfit.
• On the other hand, the model will probably underfit if we apply a
very large penalty (a large α value), because all coefficients are
penalized heavily, the most important ones included.
L2 REGULARIZATION
• Ridge regression (L2 regularization) adds the “squared magnitude” of
the coefficients times lambda (λ) as the penalty term:
$\text{Cost} = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} \theta_j^2$
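The two penalties differ only in how the coefficients enter the sum, which a short sketch makes concrete. The function name, the `kind` switch and the numbers are illustrative assumptions; `strength` plays the role of α for L1 and λ for L2.

```python
def penalized_cost(y_true, y_pred, weights, strength, kind):
    """Sum of squared errors plus an L1 or L2 penalty on the weights."""
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    if kind == "l1":
        penalty = sum(abs(w) for w in weights)  # LASSO: absolute values
    elif kind == "l2":
        penalty = sum(w ** 2 for w in weights)  # Ridge: squared magnitudes
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return squared_error + strength * penalty


y_true = [1.0, 2.0]
y_pred = [1.5, 2.0]
weights = [3.0, -4.0]

l1_cost = penalized_cost(y_true, y_pred, weights, strength=0.1, kind="l1")
l2_cost = penalized_cost(y_true, y_pred, weights, strength=0.1, kind="l2")
no_penalty = penalized_cost(y_true, y_pred, weights, strength=0.0, kind="l1")
print(l1_cost, l2_cost, no_penalty)  # about 0.95, 2.75 and 0.25
```

With `strength=0.0` the penalty vanishes and only the squared error remains, matching the unpenalized case discussed above.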
MOMENTUM
• Momentum methods in the context of machine learning refer to a group of
tricks and techniques designed to speed up convergence of first order
optimization methods like gradient descent (and its many variants).
• They essentially work by adding what’s called the momentum term to the
update formula for gradient descent, thereby damping its natural
“zigzagging behavior,” especially in long narrow valleys of the cost function.
• Another reason we do this is to help the algorithm avoid getting stuck in a
poor local minimum. Think of it as a marble rolling around on a curved
surface. We want to get to the lowest point. The marble having momentum
will allow it to roll through a lot of small dips and make it more likely to
find a better local solution.
• Having momentum too high means you will be more likely to overshoot (the
marble goes through the local minimum but the momentum carries it back
upwards for a bit). This will lead to longer learning times.
• Finding the correct value of the momentum will depend on the particular
problem: the smoothness of the function, how many local minima you
expect, how “deep” the sub-optimal local minima are expected to be, etc.
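A small sketch of the effect in a long narrow valley, assuming one common formulation of the momentum update: velocity ← β·velocity − γ·gradient, followed by position ← position + velocity. The cost function f(x, y) = 0.5·(x² + 10·y²) and the settings are illustrative assumptions; setting β = 0 recovers plain gradient descent.

```python
def grad(point):
    x, y = point
    return (x, 10.0 * y)  # gradient of f(x, y) = 0.5 * (x**2 + 10 * y**2)


def descend(start, learning_rate, beta, steps):
    """Gradient descent with a momentum term; beta = 0 means no momentum."""
    point = start
    velocity = (0.0, 0.0)
    for _ in range(steps):
        g = grad(point)
        velocity = tuple(beta * v - learning_rate * gi for v, gi in zip(velocity, g))
        point = tuple(p + v for p, v in zip(point, velocity))
    return point


def distance_to_minimum(point):
    return (point[0] ** 2 + point[1] ** 2) ** 0.5  # the minimum is at (0, 0)


plain = descend((1.0, 1.0), learning_rate=0.09, beta=0.0, steps=50)
with_momentum = descend((1.0, 1.0), learning_rate=0.09, beta=0.5, steps=50)
print(distance_to_minimum(plain), distance_to_minimum(with_momentum))
```

On this particular function the run with momentum ends much closer to the minimum after the same number of steps; other functions and settings may behave differently.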
TUNING HYPER PARAMETERS
• Hyper parameters cannot be learned directly from the regular
training process and are usually fixed before the actual training
begins. They express important properties of the model, such as its
complexity or how fast it should learn.
• Some examples of model hyper parameters include:
The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
The learning rate for training a neural network.
The C and sigma hyper parameters for support vector machines.
The k in k-nearest neighbours.
• Models can have many hyper parameters, and finding the best
combination of them can be treated as a search problem. Two
widely used strategies for hyper parameter tuning are:
• GridSearchCV
• RandomizedSearchCV
GRIDSEARCHCV
• In grid search, the machine learning model is evaluated for a range of
hyper parameter values. scikit-learn implements this approach as
GridSearchCV: it searches for the best set of hyper parameters from a grid
of hyper parameter values, scoring each candidate with cross-validation.
• For example, suppose we want to tune two hyper parameters, C and Alpha,
of a Logistic Regression Classifier model, each over a different set of values.
• The grid search technique constructs a version of the model for every
possible combination of hyper parameters and returns the best one.
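The search over all combinations can be sketched with the standard library alone. This is not scikit-learn's GridSearchCV: `toy_score` is a hypothetical stand-in for the cross-validated score of a model trained with the given C and alpha, and the grid values are made up.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, alpha):
    # Hypothetical score surface that peaks at C = 1.0, alpha = 0.01.
    return 1.0 - (C - 1.0) ** 2 - (alpha - 0.01) ** 2


grid = {"C": [0.1, 1.0, 10.0], "alpha": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

With three values per hyper parameter the grid has 3 × 3 = 9 combinations, so the cost of grid search grows quickly as more hyper parameters or values are added.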