Part 13 MD
Ji Hui
Gradient descent revisited for NN
The objective function is the empirical risk over the training set,
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big),$$
and gradient descent updates all weights at once: $\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$.
Stochastic gradient descent (SGD)
SGD: a randomized gradient estimate that minimizes the function using a single randomly picked example,
$$\theta \leftarrow \theta - \eta\, \nabla_\theta \ell\big(f(x_i;\theta),\, y_i\big), \qquad i \text{ picked uniformly at random.}$$
It behaves like gradient descent in expectation, but with noise due to the variance of the estimate:
$$\mathbb{E}_i\big[\nabla_\theta \ell_i(\theta)\big] = \nabla_\theta L(\theta).$$
Mini-batch SGD: the one balancing approximation and randomness
Rather than using one sample, use a random small subset (mini-batch) $B$ of the training set,
$$\theta \leftarrow \theta - \eta\, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta),$$
where $|B|$ is the mini-batch size.
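A minimal sketch of mini-batch SGD on a toy least-squares problem, using plain PyTorch autograd; the data, batch size, and learning rate below are illustrative, not from the slides.

import torch

# Toy data: y = 3x + noise, so the learned weight should approach 3
torch.manual_seed(0)
x = torch.randn(1000, 1)
y = 3 * x + 0.1 * torch.randn(1000, 1)

w = torch.zeros(1, requires_grad=True)   # the single parameter to learn
lr, batch_size = 0.1, 32

for step in range(200):
    idx = torch.randint(0, x.size(0), (batch_size,))   # random mini-batch B
    loss = ((x[idx] * w - y[idx]) ** 2).mean()          # mini-batch loss
    loss.backward()                                     # stochastic gradient estimate
    with torch.no_grad():
        w -= lr * w.grad                                # SGD update
        w.grad.zero_()

print(w.item())   # close to 3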
Batch size and Epoch
Batch Size
A larger batch size gives a more accurate gradient estimate, but the gain does not scale linearly with the batch size
Small batches offer a regularizing effect, due to the random noise the mini-batch estimate adds to the process
Small batch sizes require a small learning rate, due to the high variance of the gradient estimate
One main limiting factor on batch size: the amount of GPU memory
It is crucial to select mini-batches randomly, rather than simply making a pass through the training data in its stored order:
Define the training set: say 10,000 examples
Indexing: assign a number to each example
Choose your batch size: say 100
Use a random number generator to generate a number in [1, 10000]
Select your samples sequentially starting from that index
An epoch is a single pass through the full training set.
Consider a 3-layer NN for binary classification
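The network figure on this slide did not survive conversion; a minimal PyTorch sketch of a 3-layer network consistent with the later code (class name Net, 16-dimensional input, one sigmoid output) might look as follows. The hidden widths of 32 are assumptions.

import torch
import torch.nn as nn

class Net(nn.Module):
    """3-layer fully connected network for binary classification."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)   # input layer -> hidden layer 1
        self.fc2 = nn.Linear(32, 32)   # hidden layer 1 -> hidden layer 2
        self.fc3 = nn.Linear(32, 1)    # hidden layer 2 -> output
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return torch.sigmoid(self.fc3(x))   # probability of class 1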
Binary classification of two classes: Classifying two Gaussians
centroid_1 = torch.bernoulli(0.5 * torch.ones(16))
x_1 = 4 * torch.randn(50,16) + centroid_1
centroid_2 = torch.bernoulli(0.5 * torch.ones(16))
x_2 = 4 * torch.randn(50,16) + centroid_2
y_1 = torch.zeros(50,1)
y_2 = torch.ones(50,1)
x = torch.cat((x_1,x_2), 0)
y = torch.cat((y_1,y_2), 0)
idx = torch.randperm(100)
# 80 training samples
x_train = x[idx[0:80],:]
y_train = y[idx[0:80]]
# 20 testing samples
x_test = x[idx[80:100],:]
y_test = y[idx[80:100]]
print('The size of training dataset: ', x_train.size(0))
print('The size of testing dataset: ', x_test.size(0))
How to prepare the mini-batches for one epoch
# Generating the index sets of the mini-batches for one epoch
seq = torch.randperm(80)
batch_size = 4
batch = torch.reshape(seq,(20,4))
batch
Initialize a model and define the loss
# Creating a network
net = Net()
# Define two training samples and their labels
input = torch.randn(2,16)
label = torch.Tensor([[1],[0]])
# Define a binary cross-entropy loss
criterion = nn.BCELoss()
output = net(input)
loss = criterion(output,label)
print('The prediction: ')
print(output)
print('The training label:')
print(label)
print('The loss: ', loss.item())
Training a model with SGD using 2 samples
import numpy as np
import matplotlib.pyplot as plt

ite_num = 1000
loss_record = torch.zeros(ite_num)
# Create the optimizer once, outside the loop
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
for i in np.arange(0, ite_num):
    optimizer.zero_grad()              # clear gradients from the previous iteration
    output = net(input)                # forward pass on the two samples
    loss = criterion(output, label)    # binary cross-entropy loss
    loss_record[i] = loss.item()
    loss.backward()                    # backpropagation
    optimizer.step()                   # one SGD update
print('The plot of loss vs. iteration')
plt.figure(figsize=(4,2)); plt.plot(loss_record);
Training a model on 80 training samples with mini-batch size 4
epoch_num = 100
batch_num = 20
batch_size = 4
loss_record = torch.zeros(epoch_num)
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
for i in np.arange(0, epoch_num):
    # Reshuffle the 80 training samples into 20 mini-batches of size 4
    seq = torch.randperm(80)
    batch = torch.reshape(seq, (batch_num, batch_size))
    for j in np.arange(0, batch_num):
        optimizer.zero_grad()
        input = x_train[batch[j], :]     # mini-batch of inputs
        label = y_train[batch[j]]        # corresponding labels
        output = net(input)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
    loss_record[i] = loss.item()         # loss of the last mini-batch in this epoch
Prediction by the model
input = x_test; label = y_test
output = net(input)
print('The predictions on test data')
torch.set_printoptions(precision=2,sci_mode=False)
print(torch.cat((label,output.detach()),1))
"Prediction accuracy" and "Training error vs Epoch"
loss = criterion(output, label)
print("The test loss: ", loss.item())
# Compare rounded predictions with the true labels (True means correct)
torch.eq(y_test, torch.round(output.detach()))
Load and save a model in PyTorch
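The slide's code did not survive conversion; a minimal sketch of the usual state_dict workflow (the file name 'model.pt' is illustrative):

# Save the learned parameters (the state_dict), not the whole Python object
torch.save(net.state_dict(), 'model.pt')

# Load them back into a freshly constructed network of the same architecture
net2 = Net()
net2.load_state_dict(torch.load('model.pt'))
net2.eval()   # switch to evaluation mode before making predictions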
Learning rate
Recall that the learning rate refers to the step size $\eta$ in gradient descent: $\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$.
Basics for adjusting learning rate
The initial learning rate is important.
If the error keeps getting worse or oscillates wildly, reduce the learning rate.
After finishing one epoch, it is usually good practice to lower the learning rate.
Lower the learning rate when the error stops decreasing, and check the error on a separate validation set.
Be careful not to lower the learning rate too early.
Common learning rate schedules
Constant: the learning rate remains the same for all epochs
Step decay: decrease the learning rate by a constant factor every few epochs
Inverse decay: $\eta_t = \dfrac{\eta_0}{1 + k\,t}$
Exponential decay: $\eta_t = \eta_0\, e^{-k\,t}$.
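A hedged sketch of how such schedules are commonly expressed with torch.optim.lr_scheduler; the step size and decay factors below are illustrative.

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.5 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Exponential decay alternative: multiply the learning rate by 0.95 every epoch
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()   # apply the decay at the end of the epoch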
How to determine learning rate when GPUs are limited
1. Check the initial loss of the network model
Use all data, and find a learning rate that makes the loss drop fast within 100 epochs.
Good learning rates to try:
4. Find a good learning rate decay strategy
6. Go to Step 4
Weight initialization
If two hidden units have exactly the same bias and exactly the same incoming and outgoing
weights, they will always get exactly the same gradient, so they can never learn to compute
different features
Weights must be initialized to preserve the variance of the activations
Initialization must be coordinated with the choice of non-linear activation function and data
normalization
When using steepest descent, shifting the input makes a big difference
It usually helps to transform each component of the input vector so that it has zero mean
over the whole training set
When using steepest descent, scaling the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has unit
variance over the whole training set.
Weight initialization and activation function
For ReLU, randomly draw each weight from $\mathcal{N}\!\big(0, \tfrac{2}{n}\big)$ (He initialization), where $n$ is the number of neurons in the input.
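A sketch of this initialization in PyTorch, written both by hand and with the built-in helper; the layer sizes are illustrative.

import math
import torch.nn as nn

layer = nn.Linear(16, 32)

# By hand: weights ~ N(0, 2/n), where n is the fan-in of the layer
n = layer.weight.size(1)
nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(2.0 / n))
nn.init.zeros_(layer.bias)

# Built-in equivalent (He / Kaiming initialization for ReLU)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')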
Ways to speed up mini-batch learning
Using momentum:
View a weight as a "particle" whose value represents its position.
Instead of using the gradient to change the position of the particle, use it to change the
velocity of the particle.
Use a separate adaptive learning rate for each parameter
Slowly adjust the rate using the consistency of the gradient for that parameter
Take a fancy method from the optimization literature that makes use of curvature information
SGD+Momentum
Intuition: imagine a ball on the error surface. The ball starts off by following the gradient, but
once it has velocity, it no longer does steepest descent; its momentum keeps it going in the
previous direction.
Recall that SGD reads $w \leftarrow w - \eta\, \nabla \ell_i(w)$. With momentum, the gradient changes a velocity instead:
$$v \leftarrow \mu\, v - \eta\, \nabla \ell_i(w), \qquad w \leftarrow w + v,$$
where $\mu \in [0, 1)$ is the momentum coefficient.
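A hedged sketch of the same update, both via the built-in optimizer and written out by hand for a single tensor; the coefficient 0.9 and the toy loss are illustrative.

import torch

# Built-in: SGD with momentum
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# The update written out by hand for a single parameter tensor
w = torch.randn(16, requires_grad=True)
loss = (w ** 2).sum()            # toy loss, just to produce a gradient
loss.backward()

mu, lr = 0.9, 0.01
v = torch.zeros_like(w)          # velocity, initialized to zero
with torch.no_grad():
    v = mu * v - lr * w.grad     # the gradient changes the velocity ...
    w += v                       # ... and the velocity changes the position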
Nesterov Momentum
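The slide's formula did not survive conversion; the standard Nesterov update evaluates the gradient at the look-ahead point $w + \mu v$, which PyTorch exposes through the nesterov flag:

# Nesterov momentum: v <- mu*v - lr*grad(w + mu*v);  w <- w + v
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, nesterov=True)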
AdaGrad and RMSprop
AdaGrad: it uses an adaptive learning rate for each weight.
Define $g_t = \nabla_\theta L(\theta_t)$, the gradient w.r.t. $\theta$, and accumulate its square, $r_t = r_{t-1} + g_t \odot g_t$.
The update for each parameter is then
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t.$$
RMSprop uses the same update but replaces the running sum by an exponentially decaying average, $r_t = \rho\, r_{t-1} + (1-\rho)\, g_t \odot g_t$.
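The corresponding built-in optimizers; the learning rates and smoothing constant below are just common defaults, not values from the slides.

# AdaGrad: per-parameter step sizes scaled by the accumulated squared gradients
optimizer = torch.optim.Adagrad(net.parameters(), lr=0.01)

# RMSprop: same idea with an exponentially decaying average (alpha plays the role of rho)
optimizer = torch.optim.RMSprop(net.parameters(), lr=0.001, alpha=0.99)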
Adaptive Moment Estimation (Adam)
Besides storing an exponentially decaying average of past squared gradients like RMSprop,
Adam also keeps an exponentially decaying average of past gradients, like momentum.
Define the first moment $m_t$ and the second moment $v_t$:
$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$$
with bias-corrected estimates $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$, and the update
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t.$$
Default parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
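The built-in optimizer with these defaults written out explicitly:

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)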
Base learning rate is important for training
All of these methods still have the base learning rate as a hyperparameter.
Data augmentation: Deep learning models need data to be trained.
Data augmentation: we construct new training samples based on the ones we have.
Data augmentation also reduces overfitting and improves model generalization, as it
increases the diversity of the training data.
Augmenting the dataset using basic image manipulations
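A hedged sketch using torchvision's standard image transforms; the specific transforms and parameters are illustrative.

from torchvision import transforms

# Basic image manipulations, applied on the fly to each training image
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # mirror with probability 0.5
    transforms.RandomCrop(32, padding=4),                  # random shift via padded cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # small photometric changes
    transforms.ToTensor(),
])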
Illustration
Batch normalization
Motivation: internal covariate shift from one layer to the next makes training harder, requiring a
smaller learning rate and
careful initialization
Why does internal covariate shift cause trouble in training?
Recall that the goal of each layer is to model the input from the layer below it
When the statistical distribution of the input of a hidden layer keeps changing over the
iterations, that layer keeps trying to adapt to the new distribution, which slows down
convergence.
It is as if the goal of each hidden layer keeps changing rather than staying fixed.
Batch normalization: keep the distribution of each layer's input stable during training.
Batch normalization (BN)
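The slide's equations did not survive conversion; the standard per-mini-batch normalization is:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta,$$
where $m$ is the mini-batch size and $\gamma, \beta$ are learnable scale and shift parameters.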
BN layer in Network
Not all distributions should be normalized
Let the model decide how the distribution should look
Introduce two learnable parameters: a scale $\gamma$ and a shift $\beta$
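A hedged sketch of placing a BN layer in a small fully connected network; the layer sizes are illustrative, and nn.BatchNorm1d learns the $\gamma, \beta$ above.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),   # normalize each feature over the mini-batch, then apply gamma and beta
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)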
Generalization of learning by Neural network
There are two basic modes of operation in learning
Learning mode: learns how to make predictions by training on labelled samples
Test mode: makes predictions on new data, i.e., data not in the training set or
cross-validation set
Model capacity: the amount of freedom a NN has to model the data
The higher the NN's capacity, the higher the percentage of training samples the NN will fit
nearly perfectly
The capacity can be changed by changing the number of layers, the number of nodes, the
degree of non-linearity, or the number of iterations.
Model generalization: how well the NN performs on new inputs not seen before.
Trade-off between Capacity and Generalization
A NN may have high capacity, but poor generalization
Lowering capacity might improve a NN's generalization
Performance of NN
Under-fitting (capacity is poor)
Under-fitting happens when the learner has not found a solution that fits the training data to
an acceptable level
A learner that underfits the training data will miss important aspects of the data.
Over-fitting (generalization is poor)
Over-fitting happens when a learner with too high a capacity represents the training data
nearly perfectly, but performs poorly on unseen input data.
Example: high-order polynomial regression
Optimal capacity and overcoming over-fitting
Regularization for avoiding over-fitting
What is regularization?
In general: any method to prevent overfitting or help the optimization
Specifically: additional terms in the cost function, or training techniques, that prevent
overfitting or help the optimization
Regularization as hard constraints
Training objective:
$$\min_f \; \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \quad \text{subject to } f \in \mathcal{H}.$$
When $f$ can be parameterized by weights $\theta$:
$$\min_\theta \; \hat{L}(\theta), \quad \text{subject to } \Omega(\theta) \le r.$$
Adding noise to the input during training
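A minimal sketch of this idea plugged into the mini-batch loop above; the noise level 0.1 is an illustrative choice.

# Perturb each mini-batch with zero-mean Gaussian noise before the forward pass
noisy_input = input + 0.1 * torch.randn_like(input)
output = net(noisy_input)
loss = criterion(output, label)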
Regularization as optimization
$$\min_\theta \; \hat{L}(\theta), \quad \text{subject to } \Omega(\theta) \le r.$$
Or, equivalently, convert the problem to an unconstrained one by adding a penalty term to the objective:
$$\min_\theta \; \hat{L}(\theta) + \lambda\, \Omega(\theta), \qquad \lambda \ge 0.$$
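For the common choice $\Omega(\theta) = \|\theta\|_2^2$, PyTorch optimizers expose the penalty weight directly as weight decay; a short sketch:

# L2 regularization (weight decay): penalizes the squared norm of the weights
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)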
Early stopping: don't train the network down to too small a training error
Early stopping
Advantage
Efficient: runs along with training; only needs to store an extra copy of the weights
Simple: no change to the model or algorithm
Disadvantage: needs validation data
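A minimal early-stopping sketch around the mini-batch loop above; the patience value and the use of x_test as a stand-in validation set are assumptions.

import copy

best_loss, best_state, patience, wait = float('inf'), None, 10, 0
for epoch in range(200):
    # ... one epoch of mini-batch training on (x_train, y_train) goes here ...
    with torch.no_grad():
        val_loss = criterion(net(x_test), y_test).item()    # validation loss
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
        best_state = copy.deepcopy(net.state_dict())        # keep a copy of the best weights
    else:
        wait += 1
        if wait >= patience:                                 # stop after `patience` epochs without improvement
            break
net.load_state_dict(best_state)                              # restore the best weights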
Dropout
Basic idea
Randomly remove units (or layers) during training
Put them all back during testing
Left: a network with 2 hidden layers. Right: an example of a thinned net produced by
applying dropout to the network on the left. Crossed units have been dropped.
Mathematical model of dropout
For a dropout-enabled NN, units are multiplied by independent random Bernoulli variables.
For each mini-batch, the Bernoulli variables are independently sampled to generate an NN.
Dropout is implemented by adding a module that drops activations at random on each batch.
Torch: Dropout layer in NN
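The slide's code did not survive conversion; a hedged sketch of using nn.Dropout in a network like the one above (the drop probability 0.5 and layer sizes are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each activation is zeroed with probability 0.5 during training
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

model.train()   # dropout active: random units are dropped on each forward pass
model.eval()    # dropout disabled: the layer acts as the identity at test time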