NN and Optimization Regularization
Back Propagation
Dr. Santosh Kumar Vipparthi
Dept. of C.S.E
Website: https://visionintelligence.github.io/
Malaviya National Institute of Technology (MNIT), Jaipur
What is a Feature?
f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”
y = f(x)
System Framework
[Figure: testing pipeline: Test Image → Image Features → Learned model → Prediction]
Slide credit: D. Hoiem and L. Lazebnik
Image Retrieval
MMI Dataset
Macro Expression
Disgust Expression
Happy expression
Credit: https://www.shutterstock.com
Slide credit: S K Vipparthi
Generalization
LTP
Fig: Example of obtaining LBP and LTP for the 3 × 3 pattern
Slide credit: S K Vipparthi
Example
Credits to:
1. http://cs231n.stanford.edu/
2. http://cs231n.github.io/optimization-2/
3. http://neuralnetworksanddeeplearning.com/chap2.ht3
4. https://mattmazur.com/2015/03/17/
Neural Network
neuron
[Figure: fully connected network with an input layer (x1, x2, …, xN), hidden layers (Layer 1, Layer 2, …, Layer L), and an output layer (y1, y2, …, yM)]
e.g. x = -2, y = 5, z = -4
Want: the gradient of the output with respect to each of the inputs x, y and z.
Chain rule: the gradient with respect to an input is the upstream gradient multiplied by the local gradient.
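The example values can be traced numerically. The sketch below assumes the function is f(x, y, z) = (x + y) · z with the intermediate q = x + y, as in the cs231n notes credited earlier; the expression itself did not survive in the slide text, so treat it as an assumption rather than the slide's exact content.

```python
# Assumed function (not shown in the slide text): f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # intermediate node, q = 3
f = q * z            # output, f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0          # gradient of the output with respect to itself
df_dq = z * df_df    # local gradient of q * z with respect to q is z
df_dz = q * df_df    # local gradient of q * z with respect to z is q
df_dx = 1.0 * df_dq  # local gradient of x + y with respect to x is 1
df_dy = 1.0 * df_dq  # local gradient of x + y with respect to y is 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```

Each input gradient is the upstream gradient times a local gradient, which is all that backpropagation does at every node.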
[Figure: gradients flowing backward through a gate f; the sigmoid gate is shown as an example]
f(x) = max(0, x) (elementwise)
4096-d input vector → 4096-d output vector
Q: What is the size of the Jacobian matrix?
[4096 x 4096!]
In practice we process an entire minibatch (e.g. 100) of examples at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\
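The full Jacobian never has to be built for an elementwise function. The following is a minimal sketch, assuming NumPy and a 6-d vector as a small stand-in for the 4096-d one; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(6)          # small stand-in for the 4096-d input
upstream = rng.standard_normal(6)   # gradient arriving from the layer above

# Full Jacobian of f(x) = max(0, x): diagonal, 1 where x > 0 and 0 elsewhere
jacobian = np.diag((x > 0).astype(float))
grad_full = jacobian.T @ upstream

# Equivalent elementwise form that never builds the matrix
grad_fast = upstream * (x > 0)

print(np.allclose(grad_full, grad_fast))
```

Because the Jacobian is diagonal, the backward pass reduces to masking the upstream gradient, which costs 4096 multiplications instead of 4096 x 4096.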
[Figure: the same network drawn layer by layer, with input x, hidden activations a1, a2, …, and output y]
$a_1 = \sigma(W_1 x + b_1)$
$a_2 = \sigma(W_2 a_1 + b_2)$
……
$y = \sigma(W_L a_{L-1} + b_L)$
$y = \sigma(W_L \cdots \sigma(W_2\, \sigma(W_1 x + b_1) + b_2) \cdots + b_L)$
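The nested expression is just a loop over layers. Here is a minimal sketch, assuming NumPy, a sigmoid for 𝜎, and hypothetical layer sizes (3 inputs, two hidden layers of 4 units, 2 outputs) that are not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Apply a = sigmoid(W a_prev + b) layer by layer and return the final output."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Hypothetical layer sizes: 3 inputs -> 4 hidden -> 4 hidden -> 2 outputs
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

y = forward(rng.standard_normal(3), weights, biases)
print(y.shape)  # (2,)
```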
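[Figure: a single neuron with weighted inputs (e.g. I2 with weight w2), bias b, pre-activation z, activation a, and output Out]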
• We are going to use a neural network with:
• two inputs,
• two hidden neurons,
• two output neurons.
• Additionally, the hidden and output neurons
will include a bias.
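[Figure: input layer (i1, i2), hidden layer (h1, h2) and output layer (o1, o2); weights w1 to w4 connect the inputs to the hidden layer, weights w5 to w8 connect the hidden layer to the outputs, and the biases b1 and b2 feed the hidden and output layers; the figure also marks the input values and the targets]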
Basic Structure of NN
Here are the initial weights, the biases, and training inputs/outputs:
Inputs: i1 = 0.05, i2 = 0.10
Hidden-layer weights: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30
Output-layer weights: w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55
Biases: b1 = 0.35, b2 = 0.60
Targets: 0.01 for o1, 0.99 for o2
Example of NN
Forward Pass
Let's see what the neural network currently predicts given the weights and biases above and
inputs of 0.05 and 0.10.
=> Output for hidden layer with sigmoid activation function:
$net_{h1} = w_1 \cdot i_1 + w_2 \cdot i_2 + b_1 \cdot 1 = 0.15 \cdot 0.05 + 0.20 \cdot 0.10 + 0.35 \cdot 1 = 0.3775$
$out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$ (sigmoid activation function)
$out_{h1} = \frac{1}{1 + e^{-0.3775}} = 0.5932699$
similarly,
out h2 = 0.5968843
Repeat above process for the output layer neurons, using the output from the hidden layer
neurons as inputs.
$E_{total} = \sum \frac{1}{2}(target - output)^2$
$E_{o1} = \frac{1}{2}(0.01 - 0.75136507)^2 = 0.274811083$
$E_{o2} = 0.023560026$
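The forward pass can be checked with a few lines of Python. This is a sketch under stated assumptions: w4 = 0.30 is assumed because it is the value that reproduces out h2 = 0.5968843, and each hidden neuron is assumed to use its own pair of weights (w1, w2 for h1 and w3, w4 for h2).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # w4 = 0.30 is an assumption (see lead-in)
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)

# Output layer, fed by the hidden outputs
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# Squared-error terms and their sum
e_o1 = 0.5 * (t1 - out_o1) ** 2
e_o2 = 0.5 * (t2 - out_o2) ** 2
e_total = e_o1 + e_o2

print(out_h1, out_h2, out_o1, out_o2, e_total)
```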
[Figure: the example network, showing how out h1 and out h2 feed net o1 and net o2, which in turn determine the total error E total]
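Backward Propagation
$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}$
$E_{total} = \frac{1}{2}(target_{o1} - out_{o1})^2 + \frac{1}{2}(target_{o2} - out_{o2})^2$
Derivative w.r.t. out o1:
$\frac{\partial E_{total}}{\partial out_{o1}} = -(target_{o1} - out_{o1}) + 0 = 0.74136507$
$out_{o1} = \frac{1}{1 + e^{-net_{o1}}}$
$\frac{\partial out_{o1}}{\partial net_{o1}} = out_{o1}(1 - out_{o1}) = 0.18681560$
$\frac{\partial net_{o1}}{\partial w_5} = out_{h1} = 0.5932699$
(Constants are shown in red on the slide.)
$\frac{\partial E_{total}}{\partial w_5} = 0.74136507 \times 0.18681560 \times 0.5932699 = 0.082167041$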
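Updating weight w5:
w5_new = $w_5 - \eta \cdot \frac{\partial E_{total}}{\partial w_5}$ = 0.40 − 0.5 × 0.082167041 = 0.35891648
(The learning rate η = 0.5 is the value implied by the updated weights listed here.)
Find the updated values of the weights w6, w7, w8 and bias b2 with the
same procedure.
w6_new = 0.408666186
w7_new = 0.511301270
w8_new = 0.561370121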
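The same procedure for all four output-layer weights can be sketched as follows. The learning rate of 0.5 is an assumption (it is the value that reproduces the updated weights listed above), and the hidden outputs are hard-coded from the forward pass.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.5                              # assumed learning rate (see lead-in)
out_h = [0.593269992, 0.596884378]     # hidden outputs from the forward pass
w = {"w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}
b2 = 0.60
targets = [0.01, 0.99]

out_o = [sigmoid(w["w5"] * out_h[0] + w["w6"] * out_h[1] + b2),
         sigmoid(w["w7"] * out_h[0] + w["w8"] * out_h[1] + b2)]

# delta_k = dE/dout_k * dout_k/dnet_k for each output neuron
delta = [-(t - o) * o * (1 - o) for t, o in zip(targets, out_o)]

# dE/dw = delta of the receiving neuron times the hidden output it multiplies
grads = {"w5": delta[0] * out_h[0], "w6": delta[0] * out_h[1],
         "w7": delta[1] * out_h[0], "w8": delta[1] * out_h[1]}
new_w = {name: w[name] - eta * grads[name] for name in w}

print(new_w)
```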
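[Figure: the example network with the errors Eo1 and Eo2 at the two outputs]
$E_{total} = E_{o1} + E_{o2}$
$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$
$E_{o1} = \frac{1}{2}(target_{o1} - out_{o1})^2$
$E_{o2} = \frac{1}{2}(target_{o2} - out_{o2})^2$
(Eo1 and Eo2 do not depend directly on out h1, so the chain rule is applied through net o1 and net o2.)
$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}$
First term:
$\frac{\partial E_{o1}}{\partial net_{o1}} = \frac{\partial E_{o1}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} = 0.74136507 \times 0.18681560 = 0.1384985$
$net_{o1} = w_5 \cdot out_{h1} + w_6 \cdot out_{h2} + b_2 \cdot 1$
$\frac{\partial net_{o1}}{\partial out_{h1}} = w_5 = 0.40$
$\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial out_{h1}} = 0.1384985 \times 0.40 = 0.0553994$
Second term:
$\frac{\partial E_{o2}}{\partial out_{o2}} = -(target_{o2} - out_{o2}) = -(0.99 - 0.772928) = -0.217072$
$out_{o2} = \frac{1}{1 + e^{-net_{o2}}}$
$\frac{\partial out_{o2}}{\partial net_{o2}} = out_{o2}(1 - out_{o2}) = 0.1755100$
$\frac{\partial E_{o2}}{\partial net_{o2}} = \frac{\partial E_{o2}}{\partial out_{o2}} \cdot \frac{\partial out_{o2}}{\partial net_{o2}} = (-0.217072) \times 0.1755100 = -0.0380983$
$\frac{\partial net_{o2}}{\partial out_{h1}} = w_7 = 0.50$
$\frac{\partial E_{o2}}{\partial out_{h1}} = \frac{\partial E_{o2}}{\partial net_{o2}} \cdot \frac{\partial net_{o2}}{\partial out_{h1}} = -0.0380983 \times 0.50 = -0.0190491$
Sum:
$\frac{\partial E_{total}}{\partial out_{h1}} = 0.0553994 + (-0.0190491) = 0.0363503$
Remaining factors:
$out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$
$\frac{\partial out_{h1}}{\partial net_{h1}} = out_{h1}(1 - out_{h1}) = 0.5932699 \times (1 - 0.5932699) = 0.2413007$
$net_{h1} = i_1 \cdot w_1 + i_2 \cdot w_2 + b_1 \cdot 1$
$\frac{\partial net_{h1}}{\partial w_1} = i_1 = 0.05$
$\frac{\partial E_{total}}{\partial w_1} = 0.0363503 \times 0.2413007 \times 0.05 = 0.00043856$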
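Updating weight w1:
w1_new = $w_1 - \eta \cdot \frac{\partial E_{total}}{\partial w_1}$ = 0.15 − 0.5 × 0.00043856 = 0.14978072
Repeating the procedure for the other hidden-layer weights:
w2_new = 0.19956143
w3_new = 0.24975114
w4_new = 0.29950229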
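The hidden-layer gradients can be sketched in the same way. Assumptions: learning rate 0.5, w4 = 0.30, h1 receives (w1, w2) and h2 receives (w3, w4) as in the net h1 expression, and all gradients are computed with the original output-layer weights before anything is updated.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eta = 0.5                               # assumed learning rate
i = [0.05, 0.10]
w_h = [[0.15, 0.20], [0.25, 0.30]]      # rows: h1 (w1, w2), h2 (w3, w4)
w_o = [[0.40, 0.45], [0.50, 0.55]]      # rows: o1 (w5, w6), o2 (w7, w8)
b1, b2 = 0.35, 0.60
t = [0.01, 0.99]

# Forward pass
out_h = [sigmoid(sum(w * x for w, x in zip(row, i)) + b1) for row in w_h]
out_o = [sigmoid(sum(w * h for w, h in zip(row, out_h)) + b2) for row in w_o]

# Output deltas: dE/dnet_o for each output neuron
delta_o = [-(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, out_o)]

# dE_total/dout_h sums the contributions coming back from both output neurons
dE_dout_h = [sum(delta_o[k] * w_o[k][j] for k in range(2)) for j in range(2)]
delta_h = [dE_dout_h[j] * out_h[j] * (1 - out_h[j]) for j in range(2)]

# dE/dw for a hidden weight = delta of its hidden neuron times its input
grad_w_h = [[delta_h[j] * i[m] for m in range(2)] for j in range(2)]
new_w_h = [[w_h[j][m] - eta * grad_w_h[j][m] for m in range(2)] for j in range(2)]

print(new_w_h)
```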
[Figure: the example network with inputs 0.05 and 0.10, current outputs 0.7513 and 0.7729, and targets 0.01 and 0.99]
Example of NN
• Finally, we’ve updated all of our weights! When we fed forward the 0.05 and
0.1 inputs originally, the error on the network was 0.298371109.
• After this first round of backpropagation, the total error is now down to
0.291027924.
• It might not seem like much, but after repeating this process 10,000 times, for
example, the error plummets to 0.0000351085.
• At this point, when we feed forward 0.05 and 0.1, the two output neurons
generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
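Repeating the update in a loop shows the error falling. The following is a sketch under several assumptions: NumPy, a learning rate of 0.5, w4 = 0.30, and biases held fixed (only the eight weights are updated); the matrix layout is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])
W1 = np.array([[0.15, 0.20], [0.25, 0.30]])   # rows: h1, h2
W2 = np.array([[0.40, 0.45], [0.50, 0.55]])   # rows: o1, o2
b1, b2 = 0.35, 0.60                            # kept fixed (assumption)
eta = 0.5                                      # assumed learning rate

def total_error(W1, W2):
    out_o = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
    return 0.5 * np.sum((t - out_o) ** 2)

errors = [total_error(W1, W2)]
for step in range(10_000):
    # Forward pass
    out_h = sigmoid(W1 @ x + b1)
    out_o = sigmoid(W2 @ out_h + b2)
    # Backward pass: both deltas use the weights from before this update
    delta_o = (out_o - t) * out_o * (1 - out_o)
    delta_h = (W2.T @ delta_o) * out_h * (1 - out_h)
    # Gradient descent step on both layers
    W2 = W2 - eta * np.outer(delta_o, out_h)
    W1 = W1 - eta * np.outer(delta_h, x)
    errors.append(total_error(W1, W2))

print(errors[0], errors[1], errors[-1])
```

The first two printed values correspond to the error before and after one round of backpropagation; the last one is the error after 10,000 rounds.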
Drawbacks of Neural Networks
❑ The number of trainable parameters becomes extremely large.
Shift left
[Figure: input → output, one of class “1”, class “2”, class “3”]
Classification (2)
● Each input can have only one label
○ One prediction per output class
■ The network will have “k” outputs (number of classes)
[Figure: input → Network → output scores / logits: 0.1 for class “1”, 2 for class “2”, 1 for class “3”]
Classification (3)
● How can we create a loss function to improve the
scores?
○ Somehow write the labels (ground truth of the data)
into a vector → One-hot encoding
○ Non-probabilistic interpretation → hinge loss
○ Probabilistic interpretation: need to transform the
scores into a probability function → Softmax
Softmax
• Softmax layer as the output layer
Ordinary Layer
$y_1 = \sigma(z_1)$
$y_2 = \sigma(z_2)$
$y_3 = \sigma(z_3)$
In general, the output of the network can be any value.
May not be easy to interpret.
● Convert scores into probabilities
○ From 0.0 to 1.0
○ Probability for all classes adds to 1.0
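[Figure: input → Network → output scores (0.1, 2, 1), which are converted to probabilities (0.1, 0.7, 0.2)]
Softmax Layer
$y_i = \frac{e^{z_i}}{\sum_{j=1}^{3} e^{z_j}}$
Example: $z_1 = 3$, $z_2 = 1$, $z_3 = -3$
$e^{z_1} \approx 20$, $e^{z_2} \approx 2.7$, $e^{z_3} \approx 0.05$
$y_1 \approx 0.88$, $y_2 \approx 0.12$, $y_3 \approx 0$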
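Softmax itself is only a few lines. This is a minimal sketch, assuming the three scores are z = (3, 1, -3); subtracting the maximum before exponentiating is a common numerical-stability step that is not on the slide.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalise so the outputs sum to 1."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.0, 1.0, -3.0])
print([round(p, 2) for p in probs])  # [0.88, 0.12, 0.0]
```

Every output lies between 0 and 1 and the outputs add up to 1, so they can be read as class probabilities.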
One-hot encoding
● Transform each label into a vector (with only 1 and 0)
○ Length equal to the total number of classes “k”
○ Value of 1 for the correct class and 0 elsewhere
1 0 0
0 1 0
0 0 1
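One-hot encoding can be sketched as below; the helper name `one_hot` and the 0-based class indices are illustrative choices, not something the slides prescribe.

```python
def one_hot(label, num_classes):
    """Return a length-k list with 1 at the index of the correct class and 0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1 if i == label else 0 for i in range(num_classes)]

labels = [0, 1, 2]                        # hypothetical class indices for k = 3 classes
encoded = [one_hot(c, 3) for c in labels]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```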
Multi-label classification (1)
● Outputs can be matched to more than one label
○ “car”, “automobile”, “motor vehicle” can all be applied to the
same image of a car.
● Use sigmoid at each output independently instead of
softmax
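With an independent sigmoid per output, several labels can be active at once. In this sketch the logits, the extra "cat" label, and the 0.5 decision threshold are all hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

labels = ["car", "automobile", "motor vehicle", "cat"]   # "cat" is a made-up extra label
scores = [2.0, 1.5, 0.5, -3.0]                           # hypothetical logits
probs = [sigmoid(s) for s in scores]

# Each output is thresholded on its own; 0.5 is an assumed threshold
predicted = [name for name, p in zip(labels, probs) if p > 0.5]
print(predicted)  # ['car', 'automobile', 'motor vehicle']
```

Unlike softmax outputs, these probabilities do not have to sum to 1, which is what allows more than one label per input.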
Multi-label classification (2)
Some Basic Concepts
• Generalization
• Underfitting-Overfitting
• Bias-Variance
Regularization
• Parameter Norm-Penalties (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout
Optimization
• Stochastic Gradient Descent
• Parameter Initialization
• Adagrad
• RMSProp
• Batch Normalization
Data Management for Training and Evaluation
[Figure: the Complete Dataset is split into a Training Set, a Validation Set (optional), and a Testing Set. The training set is divided into Batch 1, Batch 2, …, Batch M; every epoch (Epoch 1, Epoch 2, …, Epoch N) passes over all batches, followed by Validate and Test.]
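The split, batch and epoch structure can be sketched as follows. The 70/15/15 split, the batch size of 10 and the 3 epochs are hypothetical numbers, and integer IDs stand in for real examples.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)                  # stand-in for 100 examples
rng.shuffle(data)

# Hypothetical 70/15/15 split into training, validation and testing sets
train, val, test = data[:70], data[70:85], data[85:]

batch_size, num_epochs = 10, 3
batches_seen = 0
for epoch in range(num_epochs):
    rng.shuffle(train)                 # new batch composition every epoch
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        batches_seen += 1              # a training step on `batch` would go here
    # validation on `val` would go here, once per epoch
# the final evaluation on `test` happens once, after training

print(batches_seen)  # 21
```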
Generalization
• The ability of a trained model to perform well over
the test data is known as its Generalization ability.
There are two kinds of problems that can afflict
machine learning models in general:
- Even after the model has been fully trained such
that its training error is small, it exhibits a high test
error rate. This is known as the problem of
Overfitting.
- The training error fails to come down in spite of
several epochs of training. This is known as the
problem of Underfitting.
Recipe for Learning
Don’t forget overfitting!
Modify the Network
Better optimization Strategy
Preventing Overfitting
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Underfitting-Overfitting
• Overfitting becomes especially problematic as you make your
model increasingly complex.
• Underfitting is a related issue where your model is not
complex enough to capture the underlying trend in the data.
• The problem of overfitting is not limited to computers;
humans are often no better.
Underfitting-Overfitting
• For instance, say you had a bad experience with
XYZ Airline: maybe the service wasn’t good, or
the airline was riddled with delays.
• You might be tempted to say that all flights on XYZ
Airline are bad.
• This is overfitting: we overgeneralize from what
might have been just one bad day.
Underfitting-Overfitting
Epochs
Source: https://meditationsonbianddatascience.com/2017/05/11/overfitting-underfitting-how-well-does-your-model-fit/
Bias-Variance
• Bias is the difference between your
model's expected predictions and the true values.
• That might sound strange because shouldn't you
"expect" your predictions to be close to the true
values?
• Well, it's not always that easy because some
algorithms are simply too rigid to learn complex
signals from the dataset.
Bias-Variance
• Imagine fitting a linear regression to a dataset that
has a non-linear pattern:
Figure: The relationship between model capacity and the concepts of underfitting and overfitting, shown by plotting the
training and test errors as a function of model capacity. When the capacity is low, both the training and test errors
are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then
starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
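The capacity effect can be explored with polynomial fits of increasing degree. This sketch assumes NumPy, a synthetic sin-shaped dataset with noise, and polynomial degree as a stand-in for model capacity; none of these come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + 0.2 * rng.standard_normal(n)   # non-linear signal plus noise
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):              # degree stands in for model capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])     # (training error, test error)
```

The training error can only go down as the degree grows, while the test error typically stops improving and then worsens once the model starts fitting the noise.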
Regularization in Machine Learning
• How to make an algorithm/model perform well not
just on the training data, but also on new inputs?
Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training
iterations over the dataset, or epochs). In this example, a network is trained on MNIST. Observe that the training
objective decreases consistently over time, but the validation set average loss eventually begins to increase again,
forming an asymmetric U-shaped curve
Bagging and Other Ensemble
Methods
• Bagging is a technique for reducing generalization
error by combining several models.
• The idea is to train several different models
separately, then have all of the models vote on the
output for test examples.
• This is an example of a general strategy in machine
learning called model averaging. Techniques
employing this strategy are known as ensemble
methods
Bagging and Other Ensemble
Methods
A cartoon depiction of how bagging works. Suppose we train an ‘8’ detector on the dataset depicted above, containing an ‘8’,
a ‘6’ and a ‘9’. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of
these datasets by sampling with replacement. The first dataset omits the ‘9’ and repeats the ‘8’. On this dataset, the detector
learns that a loop on top of the digit corresponds to an ‘8’. On the second dataset, we repeat the ‘9’ and omit the ‘6’. In this
case, the detector learns that a loop on the bottom of the digit corresponds to an ‘8’. Each of these individual classification
rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops
of the ‘8’ are present.
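Bagging can be sketched with any base model. The example below assumes NumPy, a synthetic regression dataset, and degree-5 polynomial fits as the individual models; the function name `fit_one` and the choice of 25 models are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + 0.3 * rng.standard_normal(40)

def fit_one(x, y, degree=5):
    """Fit one base model on a bootstrap resample (sampling with replacement)."""
    idx = rng.integers(0, len(x), size=len(x))
    return np.polyfit(x[idx], y[idx], degree)

# Train several models separately, each on its own resampled dataset
models = [fit_one(x, y) for _ in range(25)]

x_new = np.linspace(-1.0, 1.0, 5)
all_preds = np.array([np.polyval(m, x_new) for m in models])   # shape (25, 5)
bagged = all_preds.mean(axis=0)                                 # model averaging

print(bagged)
```

For classification the averaging step would be replaced by a vote over the models' predicted classes.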
Dropout
Training:
➢ Pick a mini-batch and update: $\theta^t \leftarrow \theta^{t-1} - \eta \nabla C(\theta^{t-1})$
➢ The network used for each update is thinner!
Testing:
➢ No dropout
⚫ If the dropout rate at training is p%, multiply all the weights by (1-p)%
➢ When people team up, if everyone expects their partner to do the work, nothing gets done
in the end.
➢ However, if you know your partner may drop out, you will do better.
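Ensemble
[Figure: several networks are trained from the Training Set; for testing data x their outputs y1, y2, y3, y4 are averaged]
Dropout is a kind of ensemble.
➢ With M neurons there are 2^M possible networks
➢ Using one mini-batch to train one network
➢ Some parameters in the network are shared
Dropout is a kind of ensemble.
[Figure: averaging the outputs y1, y2, y3, … of the thinned networks gives approximately the same y as the full network with all the weights multiplied by (1-p)%]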
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1-p)% (where p% is the dropout
rate) when testing?
Training of Dropout: assume the dropout rate is 50%, so on average half of the weighted inputs (w1, w2, w3, w4) contribute to z.
Testing of Dropout: no dropout.
- With the weights from training unchanged: z′ ≈ 2z
- With the weights multiplied by (1-p)%, i.e. 0.5 × w1, 0.5 × w2, 0.5 × w3, 0.5 × w4: z′ ≈ z
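The scaling argument can be checked numerically. This sketch assumes NumPy, a single neuron with four hypothetical weights and inputs, and a dropout rate of 50%; it compares the average pre-activation seen during training with the two test-time options.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # dropout rate
w = np.array([0.2, -0.4, 0.6, 0.8])       # hypothetical weights
x = np.array([1.0, 2.0, 3.0, 4.0])        # hypothetical inputs

# Training: each input is dropped with probability p; average z over many masks
masks = rng.random((100_000, 4)) >= p     # True means the input is kept
z_train_mean = float(np.mean(masks @ (w * x)))

# Testing: no dropout
z_test_unscaled = float(w @ x)                # about 2x the training average when p = 0.5
z_test_scaled = float(((1 - p) * w) @ x)      # matches the training average

print(z_train_mean, z_test_unscaled, z_test_scaled)
```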
Usage of initializers
Initializations define the way to set the initial random weights of layers.
Available initializers
1. Zeros
2. Ones
3. Constant
4. Random Normal
5. Random Uniform
6. Truncated Normal
7. Variance scaling
8. Orthogonal
9. Identity
10. Lecun_uniform
11. Glorot_normal
12. Glorot_uniform
13. He_normal
14. Lecun_normal
15. Custom initialization
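Two of the listed schemes can be sketched directly in NumPy. This is an illustration of the underlying formulas, not any particular library's implementation: the function names are made up here, and the He variant uses a plain (untruncated) normal distribution as a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """Zero-mean normal with standard deviation sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W_glorot = glorot_uniform(256, 128)
W_he = he_normal(256, 128)
print(W_glorot.std(), W_he.std())
```

Both schemes scale the random weights by the layer's fan-in (and fan-out) so that activations neither blow up nor vanish as they pass through many layers.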
Optimization
• Optimization is the most essential ingredient in the
recipe of machine learning algorithms. It starts with
defining some kind of loss function/cost function
and ends with minimizing it using one or the other
optimization routine.
• The choice of optimization algorithm can make a
difference between getting a good accuracy in
hours or days.
How Learning Differs from Pure
Optimization
• Typically, the cost function with respect to the training
set can be written as:
Batch gradient descent
Pros:
Guaranteed to converge to global minimum for convex error
surfaces and to a local minimum for non-convex surfaces.
Cons:
Very slow.
Intractable for datasets that do not fit in memory.
No online learning.
Stochastic gradient descent
Computes an update for each example $(x^{(i)}, y^{(i)})$.
Update equation: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)})$
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # reshuffle the training examples every epoch
    for example in data:
        # gradient of the loss on a single example
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
Listing: Code for stochastic gradient descent update
Stochastic gradient descent
Pros
Much faster than batch gradient descent.
Allows online learning.
Cons
High variance updates.
Sebastian Ruder 24.11.17
SGD shows the same convergence behaviour as batch gradient descent if the learning rate
is slowly decreased (annealed) over time.
Mini-batch gradient descent
Performs an update for every mini-batch of n examples.
Update equation: $\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)  # reshuffle before splitting into mini-batches
    for batch in get_batches(data, batch_size=50):
        # gradient of the loss on one mini-batch
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
Listing: Code for mini-batch gradient descent update
Mini-batch gradient descent
Pros
Reduces variance of updates.
Can exploit matrix multiplication primitives.
Cons
Mini-batch size is a hyperparameter. Common sizes are 50-256.
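The three variants differ only in how many examples feed each update, which a single function with a `batch_size` argument can show. This is a runnable sketch on a synthetic least-squares problem; the data, the `gradient` and `train` helpers, the learning rate and the epoch count are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.standard_normal(200)

def gradient(w, Xb, yb):
    """Gradient of the cost 1/(2n) * sum((Xb w - yb)^2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, nb_epochs=100, learning_rate=0.05):
    w = np.zeros(2)
    for _ in range(nb_epochs):
        order = rng.permutation(len(y))           # shuffle once per epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w = w - learning_rate * gradient(w, X[idx], y[idx])
    return w

w_batch = train(batch_size=len(y))   # batch gradient descent: one update per epoch
w_sgd = train(batch_size=1)          # stochastic gradient descent: one update per example
w_mini = train(batch_size=50)        # mini-batch gradient descent: four updates per epoch
print(w_batch, w_sgd, w_mini)
```

With the same number of epochs, the smaller batch sizes perform many more updates, which is why they get closer to the true weights here.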
• In this case, we make the explicit assumption that there is a linear relationship between X and
Y—that is, for each one-unit increase in X, we see a constant increase (or decrease) in Y.
Linear Regression with GD
• Our goal is to learn the model parameters (in this
case, β0 and β1) that minimize error in the model’s
predictions.
• To find the best parameters:
• Define a cost function, error function or loss
function, that measures how inaccurate our
model’s predictions are.
• Find the parameters that minimize loss, i.e. make
our model as accurate as possible.
Linear Regression with GD
A note on dimensionality:
• Our example is 2-dimensional for simplicity, but you’ll
typically have more features (x’s) and coefficients (betas)
in your model.
• For example, you may add more relevant variables to
improve the accuracy of your model's predictions. The
same principles generalize to higher dimensions, though
things get much harder to visualize beyond 3 dimensions.
Linear Regression with GD
• Mathematically, we look at the difference between each real
data point (y) and our model’s prediction (ŷ).
• This is a measure of how well our data fits the line.
$Cost = \frac{1}{2n}\sum_{i=1}^{n}\bigl((\beta_1 x_i + \beta_0) - y_i\bigr)^2$
n = no. of observations.
Linear Regression with GD
[Figure: left panel, data points (X vs. Y) with the candidate line; right panel, COST plotted against β1]
if β1=0, then Cost(β1=0) = 2.3
Linear Regression with GD
[Figure: left panel, data points (X vs. Y) with the candidate line; right panel, COST plotted against β1]
Linear Regression with GD
For Different Values of β1, it turns out to be like this
[Figure: left panel, data points (X vs. Y); right panel, COST plotted against β1 for different values of β1]
$\beta_1 = \beta_1 - \eta \frac{\partial}{\partial \beta_1} Cost(\beta_0, \beta_1)$
Linear Regression with GD
$Cost = \frac{1}{2n}\sum_{i=1}^{n}\bigl((\beta_1 x_i + \beta_0) - y_i\bigr)^2$
$y = \beta_0 + \beta_1 x + \epsilon$
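Gradient descent on this cost can be sketched in a few lines. The three data points below are hypothetical (they lie exactly on y = x, and they happen to give a cost of about 2.3 at β0 = β1 = 0), and the learning rate of 0.1 and the 2000 iterations are assumed values.

```python
# Hypothetical data lying exactly on y = x, so the best fit is beta0 = 0, beta1 = 1
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
n = len(xs)

def cost(beta0, beta1):
    """Cost = 1/(2n) * sum(((beta1 * x + beta0) - y)^2)."""
    return sum(((beta1 * x + beta0) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

beta0, beta1 = 0.0, 0.0
eta = 0.1                                  # learning rate (assumed value)
for _ in range(2000):
    residuals = [(beta1 * x + beta0) - y for x, y in zip(xs, ys)]
    grad0 = sum(residuals) / n                                # dCost/dbeta0
    grad1 = sum(r * x for r, x in zip(residuals, xs)) / n     # dCost/dbeta1
    # Update both parameters simultaneously
    beta0, beta1 = beta0 - eta * grad0, beta1 - eta * grad1

print(cost(0.0, 0.0), beta0, beta1)
```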
Batch Normalization
• The intuition: Most of the deep models are compositions of many layers (or
functions) and the gradient with respect to one layer is taken considering the
other layers to be constant.
Optimization Strategies… Batch Normalization
• Mathematical Intuition: BN normalizes the hidden units' activation
values so that the distribution of these activations remains the same during
training.
• Without it, the distribution of the hidden activations changes during training.
This change is called internal covariate shift, and it slows down the training
of the network a lot.
Optimization Strategies… Batch Normalization
• Now we can normalize the kth hidden unit activation using the formula below.
Optimization Strategies… Batch Normalization
• For this, we introduce 2 new variables, one for learning the mean and the other
for the variance.
• These parameters are learned and updated along with weights and biases
during training. The final normalized, scaled and shifted version of the hidden
activation for the kth hidden unit is given below.
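The formulas referred to on these slides did not survive in the text. For orientation, the commonly used formulation is sketched here; the symbols m (mini-batch size), ε (a small constant for numerical stability), γ_k (learned scale) and β_k (learned shift) are assumed names, and this β_k is unrelated to the regression coefficients used earlier.

```latex
\mu_k = \frac{1}{m}\sum_{i=1}^{m} x_k^{(i)}, \qquad
\sigma_k^2 = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_k^{(i)} - \mu_k\bigr)^2

\hat{x}_k^{(i)} = \frac{x_k^{(i)} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}, \qquad
y_k^{(i)} = \gamma_k\,\hat{x}_k^{(i)} + \beta_k
```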
Optimization Strategies… Batch Normalization
• During testing or inference phase we can’t apply the same BN as we did during
training because we might pass only one sample at a time so it doesn’t make
sense to find mean and variance on a single sample.
• We compute the running average of mean and variance of kth unit during
training and use those mean and variance values with trained batch-norm
parameters during testing phase.
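The training and testing behaviour can be sketched together. This is a minimal NumPy illustration, not a reference implementation: the class name `BatchNorm1D`, the momentum of 0.9 for the running averages, and ε = 1e-5 are all assumptions, and the backward pass is omitted.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_units, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_units)     # learned scale
        self.beta = np.zeros(num_units)     # learned shift
        self.running_mean = np.zeros(num_units)
        self.running_var = np.ones(num_units)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            # statistics of the current mini-batch, one value per hidden unit
            mean, var = x.mean(axis=0), x.var(axis=0)
            # running averages, stored for use at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(3)
for _ in range(200):
    batch = 5.0 + 2.0 * rng.standard_normal((64, 3))   # activations with mean 5, std 2
    out = bn.forward(batch, training=True)

single = bn.forward(np.array([[5.0, 5.0, 5.0]]), training=False)   # one sample at test time
print(out.mean(axis=0), single)
```

At test time the single sample is normalized with the stored running statistics, so no batch is needed.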
Recipe for Learning
Don’t forget overfitting!
Modify the Network
Better optimization Strategy
Preventing Overfitting
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/