
Deep Neural Networks and

Back Propagation
Dr. Santosh Kumar Vipparthi
Dept. of C.S.E
Website: https://visionintelligence.github.io/
Malaviya National Institute of Technology (MNIT), Jaipur
What is a Feature?

A feature is a significant piece of information extracted from data or an image that provides a more detailed understanding of that data or image.
A Picture Is Worth More Than A Thousand Words

Image analysis (understanding), image processing, and computer vision play an important role in society today because:
 A picture gives a much clearer impression of a situation or an object.
 Having an accurate visual perspective of things has high social, technical, and economic value.
The machine learning framework

• Apply a prediction function to a feature representation of the image to get the desired output:

f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”

Slide credit: L. Lazebnik


The machine learning framework

y = f(x)

where y is the output, f the prediction function, and x the image feature.

• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set.

• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x).

Slide credit: L. Lazebnik


Machine learning structure

System Framework

Slide credit: L. Lazebnik


Steps

Training: training images → image features → training (with training labels) → learned model.
Testing: test image → image features → learned model → prediction.

Slide credit: D. Hoiem and L. Lazebnik
Image Retrieval

Slide credit: L. Lazebnik


Object Detection & Classification

Slide credit: L. Lazebnik


Macro vs. Micro Expressions

Disgust Happy Surprise

Anger Fear Sad

MMI Dataset

Macro Expression

Slide credit: S K Vipparthi


Sample Micro Expression

Disgust Expression

Happy expression

Slide credit: S K Vipparthi


Regular vs. Aerial View

Regular View Aerial View

Difference between regular and aerial view

Slide credit: S K Vipparthi


Sample Results

Slide credit: S K Vipparthi


Image Captioning

Automatically describing the content of an image: generating a reasonable description in plain English. NIC (Neural Image Caption) is a model that takes an image as input and generates such a description.

Credit: https://www.shutterstock.com
Slide credit: S K Vipparthi
Generalization

Training set (labels known) vs. test set (labels unknown).

How well does a learned model generalize from the data it was trained on to a new test set?

Slide credit: L. Lazebnik
Local Binary & Ternary Patterns (LBP & LTP)
Pattern
[Figure: a 5×5 image patch; the central 3×3 neighborhood is thresholded against its center pixel (value 9) and the resulting bits are weighted by powers of two (1, 2, 4, 8, 16, 32, 64, 128) to give the LBP code. For LTP, neighbors are compared against the range [c − τ, c + τ]; with τ = 2 and c = 9 the range is [7, 11], yielding a +1/0/−1 ternary pattern that is split into upper and lower binary patterns.]

Fig: Example of obtaining LBP and LTP for the 3 × 3 pattern. Slide credit: S K Vipparthi
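To make the construction concrete, here is a minimal NumPy sketch of LBP and LTP for a single 3 × 3 patch. The bit-weight layout (which neighbour gets 1, 2, 4, …, 128) is one common convention and an assumption here; the τ = 2 threshold reproduces the [7, 11] range around a centre of 9 mentioned on the slide.

import numpy as np

def lbp_code(patch, weights):
    # LBP: threshold the 8 neighbours against the centre pixel
    # (1 if neighbour >= centre, else 0), then sum the binary weights.
    c = patch[1, 1]
    bits = (patch >= c).astype(int)
    bits[1, 1] = 0                      # the centre itself carries no bit
    return int((bits * weights).sum())

def ltp_codes(patch, weights, tau=2):
    # LTP: three-valued pattern (+1 if >= c+tau, -1 if <= c-tau, else 0),
    # conventionally split into an "upper" and a "lower" binary code.
    c = patch[1, 1]
    upper = (patch >= c + tau).astype(int)
    lower = (patch <= c - tau).astype(int)
    upper[1, 1] = lower[1, 1] = 0
    return int((upper * weights).sum()), int((lower * weights).sum())

# One common weight layout (an assumption; conventions differ)
weights = np.array([[1,   2,  4],
                    [128, 0,  8],
                    [64, 32, 16]])
patch = np.array([[7,  5,  2],
                  [17, 9,  1],
                  [8,  9, 14]])          # a 3x3 window with centre 9
print(lbp_code(patch, weights))
print(ltp_codes(patch, weights, tau=2))  # tau=2 -> range [7, 11] around 9

Listing: a sketch of LBP/LTP codes for one patch; a full descriptor histograms these codes over the whole image.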
Example

Slide credit: L. Lazebnik


Quantitative Analysis

TABLE II: Recognition accuracy comparison on the MMI dataset

Methods                  6-Class Exp.   7-Class Exp.
LBP [9]                  76.5           81.7
Two-Phase [10]           75.4           82.0
LDP [11]                 80.5           84.0
LDN [12]                 80.5           83.0
LDTexP [13]              83.4           86.0
LDTerP [14]              80.6           80.0
Spatio-Temporal* [25]    81.2           -
QUEST                    83.05          84.0

TABLE III: Recognition accuracy comparison on the GEMEP-FERA dataset

Methods                  5-Class Exp.   6-Class Exp.
LBP [9]                  92.2           87.8
Two-Phase [10]           88.6           85.0
LDP [11]                 94.0           90.0
LDN [12]                 93.4           91.0
LDTexP [13]              94.0           91.8
QUEST                    94.3           91.33

Slide credit: S K Vipparthi


Learning: Feed Forward & Backpropagation in Neural Networks

Credits to:
1. http://cs231n.stanford.edu/
2. http://cs231n.github.io/optimization-2/
3. http://neuralnetworksanddeeplearning.com/chap2.html
4. https://mattmazur.com/2015/03/17/
Neural Network

A network of neurons: input layer (x1 … xN), then Layer 1, Layer 2, …, Layer L (the hidden layers), then the output layer (y1 … yM).

Deep means many hidden layers.

Slide credit: L. Lazebnik
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradients of the output with respect to x, y, and z, computed backward through the graph using the chain rule.

Slide credit: Fei-Fei Li
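The equations on these slides were figures and did not survive extraction. The running example in the cs231n lecture credited above is f(x, y, z) = (x + y)·z; assuming that function, a minimal sketch of the forward and backward passes:

x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass via the chain rule, starting from df/df = 1
dfdz = q             # d(q*z)/dz = q       -> 3
dfdq = z             # d(q*z)/dq = z       -> -4
dfdx = dfdq * 1.0    # dq/dx = 1, so chain -> -4
dfdy = dfdq * 1.0    # dq/dy = 1, so chain -> -4
print(f, dfdx, dfdy, dfdz)   # -12.0 -4.0 -4.0 3.0

With x = -2, y = 5, z = -4 this gives f = -12 and gradients (-4, -4, 3), matching the cs231n walkthrough.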


"local gradient" × upstream gradient

At every gate f, backpropagation multiplies the gate's local gradient (known from the forward pass) by the gradient flowing back from above, and passes the products on to its inputs.

Slide credit: Fei-Fei Li


Another example (building toward the sigmoid gate below):

Working backward gate by gate, each step is [local gradient] x [upstream gradient], e.g.:

(-1) * (-0.20) = 0.20
add gate: [1] x [0.2] = 0.2 (both inputs!)
mul gates: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2

Slide credit: Fei-Fei Li


sigmoid function

Because dσ/dx = σ(x)(1 − σ(x)), the whole sigmoid gate has a simple local gradient; with output 0.73:

(0.73) * (1 - 0.73) = 0.2

Slide credit: Fei-Fei Li
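A quick numeric check of that identity, assuming the gate input is 1.0 as in the cs231n example:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

s = sigmoid(1.0)             # ~0.73
local_grad = s * (1.0 - s)   # ~0.1966, i.e. the slide's (0.73)*(1-0.73) = 0.2
print(round(s, 2), round(local_grad, 2))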


Patterns in backward flow

add gate: gradient distributor (passes the upstream gradient, unchanged, to all inputs)
max gate: gradient router (sends the entire upstream gradient to the largest input only)
mul gate: gradient switcher (scales the upstream gradient by the other input's value)

Slide credit: Fei-Fei Li
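A small numeric sketch of the three patterns, with an assumed upstream gradient of 2.0:

up = 2.0   # assumed upstream gradient dL/dz in every case

# add gate distributes: z = x + y, dz/dx = dz/dy = 1
print(up * 1.0, up * 1.0)                   # 2.0 2.0 (unchanged, to all inputs)

# max gate routes: z = max(x, y), the whole gradient goes to the winner
x, y = 4.0, -1.0
print((up, 0.0) if x > y else (0.0, up))    # (2.0, 0.0)

# mul gate switches: z = x * y, dz/dx = y and dz/dy = x
x, y = 3.0, -5.0
print(up * y, up * x)                       # -10.0 6.0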


Gradients add at branches

Slide credit: Fei-Fei Li


Gradients for vectorized code (x, y, z are now vectors)

The "local gradient" of a gate f is now the Jacobian matrix: the derivative of each element of z with respect to each element of x.

Slide credit: Fei-Fei Li


Vectorized operations

f(x) = max(0, x), applied elementwise to a 4096-d input vector, produces a 4096-d output vector.

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Slide credit: Fei-Fei Li
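The practical consequence: for an elementwise function the Jacobian is diagonal, so it is never materialized. A minimal NumPy sketch:

import numpy as np

x = np.random.randn(4096)
upstream = np.random.randn(4096)            # dL/dy flowing back

# Naive view: a 4096 x 4096 Jacobian, diagonal with 1 wherever x > 0
J = np.diag((x > 0).astype(float))
grad_naive = J @ upstream

# What is actually done: use the diagonal structure, never build J
grad = upstream * (x > 0)
assert np.allclose(grad, grad_naive)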


Neural Network

With weight matrices W1 … WL and bias vectors b1 … bL, each layer computes

a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL aL-1 + bL)

so the whole network is

y = f(x) = σ(WL … σ(W2 σ(W1 x + b1) + b2) … + bL)

Since each layer is a matrix operation, parallel computing techniques can be used to speed up the computation.

Slide credit: L. Lazebnik
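A minimal NumPy sketch of this layer-by-layer forward pass; the layer sizes are hypothetical, chosen only to show the shapes:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)      # one layer: affine map, then nonlinearity
    return a

rng = np.random.default_rng(0)
shapes = [(5, 4), (3, 5), (2, 3)]   # 4 inputs -> 5 -> 3 -> 2 outputs
weights = [rng.standard_normal(s) for s in shapes]
biases = [rng.standard_normal(s[0]) for s in shapes]
print(forward(rng.standard_normal(4), weights, biases))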


Back Propagation In NN

• Every hidden node and output node has 2 values:
  • Net value: z = w1 * i1 + w2 * i2 + bias
  • Out value: a, the output of the activation function (sigmoid / tanh / ReLU) applied to z; for the sigmoid, a = 1 / (1 + e^(-z)).

[Diagram: inputs I1, I2 with weights w1, w2 and bias b feed a node that computes z, then a, then Out.]
• We are going to use a neural network with:
• two inputs,
• two hidden neurons,
• two output neurons.
• Additionally, the hidden and output neurons
will include a bias.
[Diagram: input layer (i1, i2), hidden layer (h1, h2), output layer (o1, o2); weights w1-w4 between input and hidden, w5-w8 between hidden and output, biases b1 and b2.]

Basic Structure of NN
Here are the initial weights, the biases, and the training inputs/outputs:

Inputs: i1 = 0.05, i2 = 0.10
Weights: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30, w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55
Biases: b1 = 0.35, b2 = 0.60
Targets: o1 = 0.01, o2 = 0.99
(Initial outputs, computed below: out o1 = 0.7513, out o2 = 0.7729.)

Example of NN
Forward Pass
Let's see what the neural network currently predicts given the weights and biases above and inputs of 0.05 and 0.10.
=> Output for the hidden layer with the sigmoid activation function:

net h1 = w1 * i1 + w2 * i2 + b1 * 1

net h1 = 0.15 * 0.05 + 0.20 * 0.10 + 0.35 * 1 = 0.3775

out h1 = 1 / (1 + e^(-net h1)) (sigmoid activation function)

out h1 = 1 / (1 + e^(-0.3775)) = 0.5932699

similarly,

out h2 = 0.5968843

Repeat the above process for the output layer neurons, using the outputs from the hidden layer neurons as inputs.

net o1 = w5 * out h1 + w6 * out h2 + b2 * 1

net o1 = 0.40 x 0.5932699 + 0.45 x 0.5968843 + 0.60 x 1 = 1.105905967

out o1 = 1 / (1 + e^(-1.105906)) = 0.75136507 (but the target is 0.01)

out o2 = 0.772928 (but the target is 0.99)


Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:

E total = Σ ½ (target − output)²

E total = E o1 + E o2

E o1 = ½ (0.01 − 0.75136507)² = 0.274811083

E o2 = 0.023560026

E total = E o1 + E o2 = 0.298371109
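As a check on the arithmetic above, a short sketch reproducing the forward pass and the total error with the slide's numbers (w4 = 0.30, as inferred from out h2):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

out_h1 = sigmoid(w1*i1 + w2*i2 + b1)            # 0.5932699...
out_h2 = sigmoid(w3*i1 + w4*i2 + b1)            # 0.5968843...
out_o1 = sigmoid(w5*out_h1 + w6*out_h2 + b2)    # 0.7513650...
out_o2 = sigmoid(w7*out_h1 + w8*out_h2 + b2)    # 0.7729284...

E_total = 0.5*(t1 - out_o1)**2 + 0.5*(t2 - out_o2)**2
print(E_total)                                  # ~0.298371109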


Backward Propagation

For the output layer:

∂E_total/∂w5 = ∂E_total/∂out_o1 * ∂out_o1/∂net_o1 * ∂net_o1/∂w5

E total = ½ (target o1 − out o1)² + ½ (target o2 − out o2)²

Derivative w.r.t. out o1:

∂E_total/∂out_o1 = −(target o1 − out o1) + 0 = 0.74136507

out o1 = 1 / (1 + e^(-net o1))

∂out_o1/∂net_o1 = out o1 (1 − out o1) = 0.18681560

net o1 = w5 x out h1 + w6 x out h2 + b2 x 1

∂net_o1/∂w5 = out h1 = 0.5932699
Backward Propagation

∂E_total/∂w5 = ∂E_total/∂out_o1 * ∂out_o1/∂net_o1 * ∂net_o1/∂w5
             = 0.74136507 * 0.18681560 * 0.5932699 = 0.082167041

Update of weight w5:

w5_new = w5 − η x ∂E_total/∂w5

w5_new = 0.40 − 0.5 x 0.082167 = 0.358916

η is the learning rate, here 0.5.

w5 is now updated to w5_new; in the next forward pass, w5_new is used.

Find the updated values of weights w6, w7, w8 and bias b2 with the same procedure:

w6_new = 0.408666186
w7_new = 0.511301270
w8_new = 0.561370121

*Remember: the new values are only used in the next forward pass, after all the weights have been updated.
Next, we'll continue the backwards pass by calculating new values for w1.

For the hidden layer:

[Diagram: w1 feeds h1 (net, out), which feeds both outputs o1 and o2; E total = E o1 + E o2.]

∂E_total/∂w1 = ∂E_total/∂out_h1 * ∂out_h1/∂net_h1 * ∂net_h1/∂w1

E total = E o1 + E o2

E o1 = ½ (target o1 − out o1)²
E o2 = ½ (target o2 − out o2)²

(E o1 and E o2 do not depend directly on out h1, so we expand each through its net value.)

∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1

We take each term separately.

∂E_o1/∂out_h1 = ∂E_o1/∂net_o1 * ∂net_o1/∂out_h1

∂E_o1/∂net_o1 = ∂E_o1/∂out_o1 * ∂out_o1/∂net_o1 (both already calculated)
              = 0.74136507 x 0.18681560 = 0.1384985

net o1 = w5 * out h1 + w6 * out h2 + b2 * 1

∂net_o1/∂out_h1 = w5 = 0.40

∂E_o1/∂out_h1 = 0.1384985 x 0.40 = 0.0553994
∂E_o2/∂out_h1 = ∂E_o2/∂net_o2 * ∂net_o2/∂out_h1

∂E_o2/∂net_o2 = ∂E_o2/∂out_o2 * ∂out_o2/∂net_o2 (this time, neither has been calculated yet)

E o2 = ½ (target o2 − out o2)²

∂E_o2/∂out_o2 = −(target o2 − out o2) = −(0.99 − 0.772928) = −0.217072

out o2 = 1 / (1 + e^(-net o2))

∂out_o2/∂net_o2 = out o2 (1 − out o2) = (0.7729284)(1 − 0.7729284) = 0.1755100

∂E_o2/∂net_o2 = (−0.217072) * (0.1755100) = −0.0380983

net o2 = w7 * out h1 + w8 * out h2 + b2 * 1

∂net_o2/∂out_h1 = w7 = 0.50

∂E_o2/∂out_h1 = −0.0380983 * 0.50 = −0.0190491

∂E_total/∂w1 = ∂E_total/∂out_h1 * ∂out_h1/∂net_h1 * ∂net_h1/∂w1

∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1 = 0.0553994 + (−0.0190491) = 0.0363503

out h1 = 1 / (1 + e^(-net h1))

∂out_h1/∂net_h1 = out h1 x (1 − out h1) = 0.5932699 x (1 − 0.5932699) = 0.2413007

net h1 = i1 * w1 + i2 * w2 + b1 * 1

∂net_h1/∂w1 = i1 = 0.05

∂E_total/∂w1 = 0.0363503 * 0.2413007 * 0.05 = 0.00043856
Update of weight w1:

w1_new = w1 − η x ∂E_total/∂w1

w1_new = 0.15 − 0.5 * 0.00043856 = 0.149780

With the same procedure, weights w2, w3, w4 and bias b1 are computed:

w2_new = 0.19956143
w3_new = 0.24975114
w4_new = 0.29950229
• Finally, we've updated all of our weights! When we originally fed forward the 0.05 and 0.1 inputs, the error on the network was 0.298371109.
• After this first round of backpropagation, the total error is down to 0.291027924.
• It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085.
• At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs. the 0.01 target) and 0.984065734 (vs. the 0.99 target).
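Putting the whole walkthrough together, a compact sketch of the full training loop on this one example. The biases are held fixed here, which is what reproduces the quoted error values; the slides compute bias updates the same way:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2, t1, t2, eta = 0.05, 0.10, 0.01, 0.99, 0.5
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]   # w1..w8
b1, b2 = 0.35, 0.60                                    # held fixed here

for step in range(10000):
    # Forward pass
    h1 = sigmoid(w[0]*i1 + w[1]*i2 + b1)
    h2 = sigmoid(w[2]*i1 + w[3]*i2 + b1)
    o1 = sigmoid(w[4]*h1 + w[5]*h2 + b2)
    o2 = sigmoid(w[6]*h1 + w[7]*h2 + b2)
    # dE/dnet at each output and hidden unit (the chain-rule pieces above)
    d_o1 = -(t1 - o1) * o1 * (1 - o1)
    d_o2 = -(t2 - o2) * o2 * (1 - o2)
    d_h1 = (d_o1*w[4] + d_o2*w[6]) * h1 * (1 - h1)
    d_h2 = (d_o1*w[5] + d_o2*w[7]) * h2 * (1 - h2)
    # Simultaneous update of all eight weights
    grads = [d_h1*i1, d_h1*i2, d_h2*i1, d_h2*i2,
             d_o1*h1, d_o1*h2, d_o2*h1, d_o2*h2]
    w = [wi - eta*g for wi, g in zip(w, grads)]

E = 0.5*(t1 - o1)**2 + 0.5*(t2 - o2)**2
print(o1, o2, E)

After the first iteration the error is approximately 0.2910, and after 10,000 iterations it falls to roughly the 3.5e-5 quoted above.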
Drawbacks of Neural Networks
❑ The number of trainable parameters becomes extremely large.
❑ Little or no invariance to shifting, scaling, and other forms of distortion (e.g. shifting the input left changes activations throughout the network).

Slide credit: Fei-Fei Li

Definition of Loss
In a supervised deep learning context, the loss function measures the quality of a particular set of parameters, based on how well the output of the network agrees with the ground-truth labels in the training data.
Nomenclature
loss function
=
cost function
=
objective function
=
error function
Loss function (1)

How well does our deep network do with the training data? The network maps an input to an output; comparing that output against the labels (ground truth) yields an error, which is a function of the parameters (weights, biases).
Common types of loss functions (1)
● Loss functions depend on the type of task:
○ Regression: the network predicts continuous, numeric variables
■ Example: length of fishes in images, temperature from latitude/longitude
■ Absolute value, square error
Common types of loss functions (2)
● Loss functions depend on the type of task:
○ Classification: the network predicts categorical variables (fixed number of classes)
■ Example: classify email as spam, predict student grades from essays
■ Hinge loss, cross-entropy loss
Absolute value, L1-norm
● Very intuitive loss function
○ produces sparser solutions
■ good in high-dimensional spaces
■ fast at prediction time
○ less sensitive to outliers

Square error, Euclidean loss, L2-norm
● Very common loss function
○ often more precise than the L1-norm
○ penalizes large errors more strongly
○ sensitive to outliers
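A tiny sketch contrasting the two on the same residuals; note how a single large error dominates the L2 loss:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 8.0])           # one large, outlier-like error

l1 = np.abs(y_true - y_pred).mean()          # absolute value / L1 loss
l2 = ((y_true - y_pred) ** 2).mean()         # square error / L2 loss
print(l1, l2)   # the single 5.0 error contributes 25.0 to L2, only 5.0 to L1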
Classification (1)
We want the network to classify the input into a fixed number
of classes

class “3”
class “1”
class “2”
Classification (2)
● Each input can have only one label
○ One prediction per output class
■ The network will have "k" outputs (number of classes)

Example: the network maps an input to scores / logits 0.1, 2, 1 for classes "1", "2", "3".
Classification (3)
● How can we create a loss function to improve the
scores?
○ Somehow write the labels (ground truth of the data)
into a vector → One-hot encoding
○ Non-probabilistic interpretation → hinge loss
○ Probabilistic interpretation: need to transform the
scores into a probability function → Softmax
Softmax
• Softmax layer as the output layer

With an ordinary output layer, y_i = σ(z_i): in general the outputs of the network can be any values, and may not be easy to interpret.

● The softmax instead converts scores into probabilities
○ each between 0.0 and 1.0
○ summing to 1.0 over all classes

Example: scores / logits 0.1, 2, 1 become probabilities 0.1, 0.7, 0.2.

Softmax
• Softmax layer as the output layer

y_i = e^(z_i) / Σ_j e^(z_j)

Properties: 1 > y_i > 0 and Σ_i y_i = 1.

Example with three units: z = (3, 1, −3) gives e^z ≈ (20, 2.7, 0.05), so y ≈ (0.88, 0.12, ≈0).
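A minimal sketch of the softmax with the slide's scores; subtracting the max before exponentiating is the usual numerical-stability trick:

import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max first for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))   # ~[0.88, 0.12, 0.00]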
One-hot encoding
● Transform each label into a vector (with only 1s and 0s)
○ Length equal to the total number of classes "k"
○ Value of 1 for the correct class and 0 elsewhere

class "1" → (1, 0, 0); class "2" → (0, 1, 0); class "3" → (0, 0, 1)
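A small sketch of one-hot encoding for k = 3 classes:

import numpy as np

def one_hot(labels, k):
    # Turn integer class labels 0..k-1 into one-hot vectors of length k
    out = np.zeros((len(labels), k))
    out[np.arange(len(labels)), labels] = 1.0
    return out

print(one_hot([0, 1, 2], k=3))   # rows: class "1", "2", "3" above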
Multi-label classification (1)
● Outputs can be matched to more than one label
○ "car", "automobile", "motor vehicle" can all apply to the same image of a car.
● Use a sigmoid at each output independently instead of softmax

Multi-label classification (2)
● Cross-entropy loss for multi-label classification: sum the binary cross-entropy over the independent sigmoid outputs.
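A hedged sketch of this setup: independent sigmoids per output, with the summed binary cross-entropy as the loss (the labels and scores are made up for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_cross_entropy(logits, targets):
    # Binary cross-entropy summed over independent sigmoid outputs
    p = sigmoid(logits)
    return -(targets * np.log(p) + (1 - targets) * np.log(1 - p)).sum()

logits = np.array([2.0, 1.5, -1.0])    # made-up scores for three labels
targets = np.array([1.0, 1.0, 0.0])    # more than one label can be "on"
print(multilabel_cross_entropy(logits, targets))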


Regularization and Optimization in
Deep Learning
Credits:

1. Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville
2. An Introduction to Statistical Learning, Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
3. http://cs231n.stanford.edu/
4. https://medium.com/@tm2761/regularization-hyperparameter-tuning-in-a-neural-network-f77c18c36cd3
5. https://chatbotslife.com/regularization-in-deep-learning-f649a45d6e0
6. https://srdas.github.io/DLBook/ImprovingModelGeneralization.html
7. http://laid.delanover.com/difference-between-l1-and-l2-regularization-implementation-and-visualization-in-tensorflow/
8. https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
9. https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-i-20ae75984cb2
10. https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135
11. https://medium.com/deeper-learning/glossary-of-deep-learning-batch-normalisation-8266dcd2fa82
12. https://medium.com/@krishna_84429/understanding-batch-normalization-1eaca8f2f63e
Some Basic Concepts
• Generalization
• Underfitting-Overfitting
• Bias-Variance

Regularization
• Parameter Norm-Penalties (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout

Optimization
• Stochastic Gradient Descent
• Parameter Initialization
• Adagrad
• RMSProp
• Batch Normalization
Data Management for Training and Evaluation

The complete dataset is split into a training set, a testing set, and (optionally) a validation set. The training set is processed in batches (Batch 1 … Batch M); one pass over all batches is an epoch (Epoch 1, Epoch 2, …, Epoch N), with validation and testing performed between epochs.
Generalization
• The ability of a trained model to perform well over
the test data is known as its Generalization ability.
There are two kinds of problems that can afflict
machine learning models in general:
- Even after the model has been fully trained such
that its training error is small, it exhibits a high test
error rate. This is known as the problem of
Overfitting.
- The training error fails to come down in spite of several epochs of training. This is known as the problem of Underfitting.
Recipe for Learning

Don't forget: overfitting! Modify the network and use a better optimization strategy, while preventing overfitting.

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Underfitting-Overfitting
• Overfitting becomes especially problematic as you make your model increasingly complex.
• Underfitting is the related issue where your model is not complex enough to capture the underlying trend in the data.
• The problem of overfitting is not limited to computers; humans are often no better.
Underfitting-Overfitting
• For instance, say you had a bad experience with an
XYZ Airline, maybe the service wasn’t good, or that
the airline was riddled with delays.
• You might be tempted to say that all flights on XYZ
airline are bad.
• This is called overfitting: we overgeneralize from something that might otherwise just have been us having a bad day.
Underfitting-Overfitting

Source: Quora: Luis Argerich


Underfitting-Overfitting

[Figure: training vs. test error as a function of epochs.]

Source: https://meditationsonbianddatascience.com/2017/05/11/overfitting-underfitting-how-well-does-your-model-fit/
Bias-Variance
• Bias is the difference between your
model's expected predictions and the true values.
• That might sound strange because shouldn't you
"expect" your predictions to be close to the true
values?
• Well, it's not always that easy because some
algorithms are simply too rigid to learn complex
signals from the dataset.
Bias-Variance
• Imagine fitting a linear regression to a dataset that
has a non-linear pattern:

No matter how many more observations you collect, a linear regression won't be able to model the curves in that data! This is known as under-fitting.
Bias-Variance
• Variance refers to your algorithm's sensitivity to
specific sets of training data.
• High variance algorithms will produce drastically
different models depending on the training set.
Bias-Variance
• For example, imagine an algorithm that fits a
completely unconstrained, super-flexible model to
the same dataset from above:

As you can see, this unconstrained


model has basically memorized the
training set, including all the noise.
This is known as over-fitting.
Bias-Variance
Here's what those models tell you about the chosen algorithms.
Bias-Variance
• Finally, an optimal balance of bias and variance
leads to a model that is neither overfit nor underfit:

This is the ultimate goal of


supervised machine learning –
To isolate the signal from the
dataset while ignoring the noise!
How to Combat Overfitting?
• Two ways to combat overfitting:
1. Use more training data. The more you have, the harder it is to
overfit the data by learning too much from any single training
example.

2. Use regularization. Add a penalty to the loss function for building an overly complex model.
How to Combat Overfitting?

• The fist piece of the sum is our normal cost function.


• The second piece is a regularization term that adds a
penalty for large beta coefficients
• With these two elements in place, the cost function now
balances between two priorities: explaining the training
data and preventing that explanation from becoming
overly specific.
Regularization in Machine Learning

This illustrates the relationship between model capacity and the concepts of underfitting and overfitting by plotting the training and test errors as a function of model capacity. When the capacity is low, both the training and test errors are high. As the capacity increases, the training error steadily decreases, while the test error initially decreases but then starts to increase due to overfitting. Hence the optimal model capacity is the one at which the test error is at a minimum.
Regularization in Machine Learning
• How to make an algorithm/model perform well not
just on the training data, but also on new inputs?

• Many strategies are designed explicitly to reduce


the test error, possibly at the expense of increased
training error. These strategies are known
collectively as regularization.
How to Regularize?
• Put extra constraints on the model. Example: Add
restrictions on the parameter values.
• Add extra terms (as penalties) in the objective
function. Indirectly putting constraint on the
parameter values.
Regularization Techniques
• Parameter Norm-Penalties. (L1-norm and L2-norm)
• Dataset Augmentation
• Early Stopping
• Bagging and other ensemble methods
• Dropout
Parameter Norm-Penalties
Limit the capacity of models such as neural networks, linear regression, or logistic regression by adding a parameter norm penalty Ω(θ) to the objective function J:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective function J.

Setting α to zero results in no regularization; larger values of α correspond to more regularization.
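A minimal sketch of the penalized objective J̃(θ) = J(θ) + αΩ(θ), with Ω chosen as either the L1 or (half) squared L2 norm; the α value is arbitrary:

import numpy as np

def regularized_loss(theta, data_loss, alpha, norm="l2"):
    # J~(theta) = J(theta) + alpha * Omega(theta)
    if norm == "l1":
        omega = np.abs(theta).sum()           # L1: encourages sparse weights
    else:
        omega = 0.5 * (theta ** 2).sum()      # L2: weight decay
    return data_loss + alpha * omega

theta = np.array([0.5, -2.0, 0.0, 3.0])
print(regularized_loss(theta, data_loss=1.23, alpha=0.1))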
Dataset Augmentation
• The best way to make a machine learning model generalize better is to
train it on more data.
• How to generate more data?
- Label more data.
- Create fake data.
- Injecting noise in the data
- Inject noise to the model parameters
- Inject noise to the output
- For image data: rotation, translation, and other transformations; injected noise
Early Stopping
• When training large models with sufficient
representational capacity to overfit the task, it is
often observed that training error decreases
steadily over time, but validation set error begins to
rise again.

• This means we can obtain a model with better


validation set error (and thus, hopefully better test
set error) by returning to the parameter setting at
the point in time with the lowest validation set
error.
Early Stopping

Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training
iterations over the dataset, or epochs). In this example, a network is trained on MNIST. Observe that the training
objective decreases consistently over time, but the validation set average loss eventually begins to increase again,
forming an asymmetric U-shaped curve
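A sketch of the early-stopping logic described above; train_step and validate stand in for your own training and validation-evaluation routines and are assumptions here:

def train_with_early_stopping(train_step, validate, max_epochs=200, patience=10):
    # Keep the parameters from the epoch with the lowest validation loss;
    # stop once validation loss has not improved for `patience` epochs.
    best_loss, best_params, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_step()            # one epoch of training (placeholder)
        val_loss = validate(params)      # evaluate on the validation set
        if val_loss < best_loss:
            best_loss, best_params, wait = val_loss, params, 0
        else:
            wait += 1
            if wait >= patience:
                break                    # validation error has begun to rise
    return best_params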
Bagging and Other Ensemble
Methods
• Bagging is a technique for reducing generalization
error by combining several models.
• The idea is to train several different models
separately, then have all of the models vote on the
output for test examples.
• This is an example of a general strategy in machine
learning called model averaging. Techniques
employing this strategy are known as ensemble
methods
Bagging and Other Ensemble
Methods

A cartoon depiction of how bagging works. Suppose we train an ‘8’ detector on the dataset depicted above, containing an ‘8’,
a ‘6’ and a ‘9’. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of
these datasets by sampling with replacement. The first dataset omits the ‘9’ and repeats the ‘8’. On this dataset, the detector
learns that a loop on top of the digit corresponds to an ‘8’. On the second dataset, we repeat the ‘9’ and omit the ‘6’. In this
case, the detector learns that a loop on the bottom of the digit corresponds to an ‘8’. Each of these individual classification
rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops
of the ‘8’ are present.
Dropout
Pick a mini-batch and update θ^t ← θ^(t−1) − η ∇C(θ^(t−1))

Training:
➢ Each time before computing the gradients
⚫ Each neuron has probability p% of being dropped out
⚫ The structure of the network is changed (it becomes thinner!), and the new, thinner network is used for training
➢ For each mini-batch, we resample the dropout neurons
Dropout
Testing:
➢ No dropout
⚫ If the dropout rate at training is p%, multiply all the weights by (1−p)%
⚫ Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
Dropout - Intuition

➢ When teaming up, if everyone expects their partner to do the work, nothing gets done in the end.
➢ However, if you know your partner may drop out, you will do better.
➢ When testing, no one actually drops out, so we obtain good results.


Dropout is a kind of ensemble.

Ensemble: train a bunch of networks with different structures on different training sets (Set 1 … Set 4 → Network 1 … Network 4); at test time, feed the testing data x to every network and average their outputs y1 … y4.

Dropout is a kind of ensemble.

Training of dropout: with M neurons there are 2^M possible thinned networks. Each mini-batch (mini-batch 1, 2, 3, 4, …) trains one of these networks, and some parameters in the networks are shared.

Testing of dropout: use the full network on the testing data x with all the weights multiplied by (1−p)%; this approximates averaging the outputs y1, y2, y3, … of the ensemble: average ≈ y.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1−p)% (the dropout rate) when testing?

Training of dropout (assume the dropout rate is 50%): on average, half of the inputs w1 … w4 to a unit are dropped, so the expected pre-activation is about z.
Testing of dropout (no dropout, weights from training): with all inputs present, the pre-activation would be z′ ≈ 2z. Multiplying each weight by (1−p)% = 0.5 restores z′ ≈ z.
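A small NumPy sketch of this expectation argument: dropping units with probability p during training, versus scaling by (1 − p) at test time:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate
a = rng.standard_normal(1000)            # activations feeding a unit
W = rng.standard_normal(1000)            # that unit's incoming weights

# Training: each input is dropped with probability p
mask = rng.random(1000) >= p
z_train = (a * mask) @ W

# Testing: no dropout, but weights scaled by (1 - p)
z_test = (1 - p) * (a @ W)
# Averaged over many random masks, E[z_train] equals z_test,
# which is why the (1 - p) scaling keeps z' ~ z.
print(z_train, z_test)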
Usage of initializers

Initializers define the way to set the initial random weights of layers.

Available initializers
1. Zeros
2. Ones
3. Constant
4. Random normal
5. Random uniform
6. Truncated normal
7. Variance scaling
8. Orthogonal
9. Identity
10. lecun_uniform
11. glorot_normal
12. glorot_uniform
13. he_normal
14. lecun_normal
15. Custom initialization
Optimization
• Optimization is the most essential ingredient in the
recipe of machine learning algorithms. It starts with
defining some kind of loss function/cost function
and ends with minimizing it using one or the other
optimization routine.
• The choice of optimization algorithm can make the difference between reaching good accuracy in hours or in days.
How Learning Differs from Pure Optimization
• Typically, the cost function with respect to the training set can be written as:

J(θ) = E_{(x,y)∼p̂_data} [ L(f(x; θ), y) ]

• where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output.
• The objective is to minimize this objective function.
Gradient descent
Gradient descent is a way to minimize an objective function J(θ):
θ ∈ R^d: model parameters
η: learning rate
∇_θ J(θ): gradient of the objective function with respect to the parameters

Update equation: θ = θ − η · ∇_θ J(θ)

Figure: Optimization with gradient descent

Slide credit: Sebastian Ruder
Gradient descent variants
1. Batch gradient descent
2. Stochastic gradient descent
3. Mini-batch gradient descent
Difference: amount of data used per update

Slide credit: Sebastian Ruder
Batch gradient descent
Computes the gradient with the entire dataset.
Update equation: θ = θ − η · ∇_θ J(θ)

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

Listing: Code for the batch gradient descent update (Sebastian Ruder)
Batch gradient descent

Pros:
Guaranteed to converge to global minimum for convex error
surfaces and to a local minimum for non-convex surfaces.
Cons:
Very slow.
Intractable for datasets that do not fit in memory.
No online learning.
Stochastic gradient descent
Computes the update for each example x(i), y(i).
Update equation: θ = θ − η · ∇_θ J(θ; x(i); y(i))

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad

Listing: Code for the stochastic gradient descent update (Sebastian Ruder)
Stochastic gradient descent
Pros
Much faster than batch gradient descent.
Allows online learning.
Cons
High variance updates.
Sebastian Ruder

Figure: SGD fluctuation (Source: Wikipedia)


Batch gradient descent vs. SGD fluctuation

Figure: Batch gradient descent vs. SGD fluctuation (Source: wikidocs.net)

SGD shows the same convergence behaviour as batch gradient descent if the learning rate is slowly decreased (annealed) over time.

Slide credit: Sebastian Ruder
Mini-batch gradient descent
Performs an update for every mini-batch of n examples.
Update equation: θ = θ − η · ∇_θ J(θ; x(i:i+n); y(i:i+n))

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad

Listing: Code for the mini-batch gradient descent update (Sebastian Ruder)
Mini-batch gradient descent

Pros
Reduces variance of updates.
Can exploit matrix multiplication primitives.
Cons
Mini-batch size is a hyperparameter. Common sizes are 50-256.

Typically the algorithm of choice. Usually referred to as SGD even when mini-batches are used.

Slide credit: Sebastian Ruder
Comparison of trade-offs of gradient descent variants

Method                        Accuracy                Update Speed   Memory Usage   Online Learning
Batch gradient descent        Good                    Slow           High           No
Stochastic gradient descent   Good (with annealing)   High           Low            Yes
Mini-batch gradient descent   Good                    Medium         Medium         Yes

Table: Comparison of trade-offs of gradient descent variants (Sebastian Ruder)
Regression with GD
Regression
Y = f(X) + ε, where X = (x1, x2…xn)
Training: machine learns f from labeled training data
Test: machine predicts Y from unlabeled testing data

Note: X can be a tensor with an any number of


dimensions. A 1D tensor is a vector (1 row,
many columns), a 2D tensor is a matrix (many rows,
many columns), and then you can have
tensors with 3, 4, 5 or more dimensions (e.g. a 3D
tensor with rows, columns, and depth).
Linear Regression (LR)
• Linear regression is a parametric method, which
means it makes an assumption about the form of
the function relating X and Y.
• Our model will be a function that predicts ŷ given a specific x:

y = β0 + β1 · x + ε

• In this case, we make the explicit assumption that there is a linear relationship between X and Y: that is, for each one-unit increase in X, we see a constant increase (or decrease) in Y.
Linear Regression with GD
• Our goal is to learn the model parameters (in this
case, β0 and β1) that minimize error in the model’s
predictions.
• To find the best parameters:
• Define a cost function, error function or loss
function, that measures how inaccurate our
model’s predictions are.
• Find the parameters that minimize loss, i.e. make
our model as accurate as possible.
Linear Regression with GD
A note on dimensionality:
• Our example is 2-dimensional for simplicity, but you’ll
typically have more features (x’s) and coefficients (betas)
in your model.
• For example: When adding more relevant variables to
improve the accuracy of your model predictions. The
same principles generalize to higher dimensions, though
things get much harder to visualize beyond 3 dimensions.
Linear Regression with GD
• Mathematically, we look at the difference between each real
data point (y) and our model’s prediction (ŷ).
• This is a measure of how well our data fits the line.

Cost = Σ_i ((β1·x_i + β0) − y_i)² / (2n)

where n = number of observations.
Linear Regression with GD

Source: Github Alykhan Tejani


Linear Regression with GD
• We see that the Cost
function is really a
function of two variables:
β0 and β1.
• All the rest of the
variables are determined,
since X, Y, and n are given
during training.
• We want to try to
minimize this function.
Linear Regression with GD
Idea: choose β0 and β1 so that ŷ is close to y for our training examples (x, y).

Remember: y = β0 + β1·x + ε
Goal: minimize over β0, β1 the function Cost(β0, β1)
• ŷ: for fixed β0 and β1, this is a function of x
• Cost: a function of the parameters β0 and β1

Linear Regression with GD
For simplicity, let's assume ŷ = β1·x.
Then our goal becomes: minimize over β1 the function Cost(β1).
Now, if β1 = 1, what will be the value of the cost function?

Cost(β1) = Σ_i (ŷ_i − y_i)² / (2n)
Linear Regression with GD
Take the training data x = (1, 2, 3), y = (1, 2, 3) and evaluate the cost for several values of β1:

• β1 = 0.5: predictions (0.5, 1, 1.5), Cost(0.5) = 0.58
• β1 = 1: predictions (1, 2, 3), Cost(1) = 0
• β1 = 0: predictions (0, 0, 0), Cost(0) = 2.3
• β1 = −0.5: predictions (−0.5, −1, −1.5), Cost(−0.5) = 5.25

Plotting Cost(β1) for the different values of β1 traces out a convex (bowl-shaped) curve with its minimum at β1 = 1.

Linear Regression with GD
• Remember our objective/goal: minimize over β1 the function Cost(β1). Did we achieve that? Yes: at β1 = 1.
Linear Regression with GD
• Repeat until convergence:

β_j := β_j − η · ∂/∂β_j Cost(β0, β1)

where η is the learning rate.

• Simultaneous update:

β0 := β0 − η · ∂/∂β0 Cost(β0, β1)
β1 := β1 − η · ∂/∂β1 Cost(β0, β1)
Linear Regression with GD

Cost = Σ_i ((β1·x_i + β0) − y_i)² / (2n),  with y = β0 + β1·x + ε

Basically we need to find:

∂/∂β_j Cost(β0, β1)

j = 0: ∂/∂β0 Cost(β0, β1) = ∂/∂β0 [ Σ_i (β1·x_i + β0 − y_i)² / (2n) ] = Σ_i (β1·x_i + β0 − y_i) / n

j = 1: ∂/∂β1 Cost(β0, β1) = ∂/∂β1 [ Σ_i (β1·x_i + β0 − y_i)² / (2n) ] = Σ_i (β1·x_i + β0 − y_i)·x_i / n
Gradient Descent for Linear Regression
• Gradient descent algorithm:
Repeat until convergence:

β0 := β0 − η · ∂/∂β0 Cost(β0, β1)
β1 := β1 − η · ∂/∂β1 Cost(β0, β1)

Update β0 and β1 simultaneously.
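A minimal sketch of this algorithm on the toy data used earlier (x = y = (1, 2, 3)); the learning rate is an arbitrary choice:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
n = len(x)
b0, b1, eta = 0.0, 0.0, 0.1

for _ in range(1000):
    err = b1 * x + b0 - y                  # (beta1*x_i + beta0 - y_i)
    grad_b0 = err.sum() / n                # dCost/dbeta0
    grad_b1 = (err * x).sum() / n          # dCost/dbeta1
    b0, b1 = b0 - eta * grad_b0, b1 - eta * grad_b1   # simultaneous update

print(b0, b1)   # approaches beta0 ~ 0, beta1 ~ 1 on this toy data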
Optimization Algorithms:
Stochastic Gradient Descent

Learning rate selection. Source: https://www.jeremyjordan.me/nn-learning-rate/


Optimization Strategies… Batch Normalization

Batch Normalization

• Batch normalization (BN) is one of the most exciting innovations in Deep


learning that has significantly stabilized the learning process and allowed
faster convergence rates.

• The intuition: Most of the deep models are compositions of many layers (or
functions) and the gradient with respect to one layer is taken considering the
other layers to be constant.
Optimization Strategies… Batch Normalization
• Mathematical Intuition: BN is about normalizing the hidden units activation
values so that the distribution of these activations remains same during
training.

• During training of any deep neural network if the hidden activation


distribution changes because of the changes in the weights and bias values at
that layer, they cause rapid changes in the layer above it.

• This slows down training a lot. The change in the distribution of the hidden activations during training is called internal covariate shift, which affects the training speed of the network.
Optimization Strategies… Batch Normalization

We can view a single deep neural network as multiple subnetworks.


Optimization Strategies… Batch Normalization
• How to Normalize the Hidden Units?

• Consider we have d number of hidden units in a hidden layer of any Deep


neural network.

• We can represent the activation values of this layer as x = [x1, x2, …, xd].

• Now we can normalize the k-th hidden unit activation as x̂_k = (x_k − μ_k) / √(σ²_k + ε), where μ_k and σ²_k are the minibatch mean and variance of that unit.
Optimization Strategies… Batch Normalization
• For this, we introduce 2 new variables, one for learning the mean and the other for the variance.

• These parameters are learned and updated along with the weights and biases during training. The final normalized, scaled, and shifted version of the hidden activation for the k-th hidden unit is y_k = γ_k · x̂_k + β_k.
Optimization Strategies… Batch Normalization

Batch Normalization
Optimization Strategies… Batch Normalization

- Assume we have a minibatch of m training examples. We pass this minibatch to our neural network. At layer i we get the hidden activations matrix H_i (one row per example, one column per hidden unit).
- We then compute the mean and variance of each column and apply the batch normalization transformation.
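A minimal NumPy sketch of that transformation over a minibatch; γ = 1 and β = 0 here so the effect of the normalization itself is visible:

import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    # Normalize each column (hidden unit) of the m x d activations H
    # over the minibatch, then apply the learned scale and shift.
    mu = H.mean(axis=0)                      # per-unit minibatch mean
    var = H.var(axis=0)                      # per-unit minibatch variance
    H_hat = (H - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * H_hat + beta

H = np.random.randn(32, 4) * 3.0 + 5.0       # minibatch: m=32 examples, d=4 units
out = batch_norm(H, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1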
Optimization Strategies… Batch Normalization
BN during Inference/Testing

• During the testing or inference phase we can't apply the same BN as we did during training, because we might pass only one sample at a time, so it doesn't make sense to compute a mean and variance on a single sample.

• We compute the running average of mean and variance of kth unit during
training and use those mean and variance values with trained batch-norm
parameters during testing phase.