
Visual Information Interpretation

Basics on deep learning: Part II

Ji Hui

1
Gradient descent revisited for NN
The objective function (empirical risk over the $N$ training samples):

$\min_{\theta} L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)$

Gradient descent revisited

$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$

A batch algorithm running on all samples
Prone to getting trapped at a local minimizer
Motivation from stochastic optimization
Scale: estimate the function and its gradient from a small subset of the data, with enough iterations
Convergence: introduce randomness into the iterations to avoid convergence to poor local minima.

2
Stochastic gradient descent (SGD)

SGD: a randomized gradient estimate that minimizes the function using a single randomly
picked example. At iteration $t$, pick an index $i_t$ uniformly at random and update

$\theta_{t+1} = \theta_t - \alpha \nabla \ell(f_{\theta_t}(x_{i_t}), y_{i_t})$

It behaves like gradient descent in expectation, since $\mathbb{E}_{i_t}[\nabla \ell_{i_t}(\theta)] = \nabla L(\theta)$, but with noise due to the variance of the estimate.
Mini-batch SGD: the variant balancing approximation quality and randomness
Rather than using one sample, use a random small subset $B_t$:

$\theta_{t+1} = \theta_t - \frac{\alpha}{|B_t|} \sum_{i \in B_t} \nabla \ell_i(\theta_t)$

where $B_t \subset \{1, \dots, N\}$ is a randomly chosen mini-batch.

3
Batch size and Epoch
Batch Size
A larger batch size gives a more accurate gradient estimate, but the return does not scale linearly with the batch size
Small batches offer a regularizing effect due to the random noise added by the mini-batch sampling
Small batch sizes require a small learning rate, due to the high variance in the gradient estimate
One main limiting factor on batch size: the amount of GPU memory
It is crucial to select mini-batches randomly, which is better practice than passing through the training data in a fixed sequential order:
Define the training set: say 10,000 examples
Indexing: assign numbers to them
Choose your batch size: say 100
Use a random number generator to generate a number in [1, 10000]
Sequentially select your batch of samples starting from that index
An epoch is a single pass through the full training set.

4
Consider a 3-layer NN for binary classification

# Define a 3-layer NN with 2 hidden fully-connected layers

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a network model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Fully connected linear mapping: x --> Wx + b
        # Input dimension restricted to 16
        self.fc1 = nn.Linear(16, 8)
        self.fc2 = nn.Linear(8, 4)
        self.fc3 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

5
Binary classification of two classes: Classifying two Gaussians
centroid_1 = torch.bernoulli(0.5 * torch.ones(16))
x_1 = 4 * torch.randn(50,16) + centroid_1
centroid_2 = torch.bernoulli(0.5 * torch.ones(16))
x_2 = 4 * torch.randn(50,16) + centroid_2
y_1 = torch.zeros(50,1)
y_2 = torch.ones(50,1)
x = torch.cat((x_1,x_2), 0)
y = torch.cat((y_1,y_2), 0)

idx = torch.randperm(100)
# 80 training samples
x_train = x[idx[0:80],:]
y_train = y[idx[0:80]]
# 20 testing samples
x_test = x[idx[80:100],:]
y_test = y[idx[80:100]]
print('The size of training dataset: ', x_train.size(0))
print('The size of testing dataset: ', x_test.size(0))

6
How to prepare the mini-batches for one epoch
# Generating the index set of mini-batches for one epoch
seq = torch.randperm(80)
batch_size = 4
batch = torch.reshape(seq,(20,4))
batch

7
Initialize a model and define the loss
# Creating a network
net = Net()
# Define two training samples and their labels
input = torch.randn(2,16)
label = torch.Tensor([[1],[0]])
# Define a binary cross-entropy loss
criterion = nn.BCELoss()
output = net(input)
loss = criterion(output,label)
print('The prediction: ')
print(output)
print('The training label:')
print(label)
print('The loss: ', loss.item())

8
Training a model with SGD using 2 samples
import matplotlib.pyplot as plt

ite_num = 1000
loss_record = torch.zeros(ite_num)
# Create the optimizer once, outside the training loop
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
for i in range(ite_num):
    optimizer.zero_grad()
    output = net(input)
    loss = criterion(output, label)
    loss_record[i] = loss.item()
    loss.backward()
    optimizer.step()
print('The plot of loss vs. iteration')
plt.figure(figsize=(4,2)); plt.plot(loss_record);

9
Training a model on 80 training samples with mini-batch size 4

epoch_num = 100
batch_num = 20
batch_size = 4
loss_record = torch.zeros(epoch_num)
# Create the optimizer once, outside the training loop
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
for i in range(epoch_num):
    # Reshuffle the training set at the start of each epoch
    seq = torch.randperm(80)
    batch = torch.reshape(seq, (batch_num, batch_size))
    for j in range(batch_num):
        optimizer.zero_grad()
        input = x_train[batch[j], :]
        label = y_train[batch[j]]
        output = net(input)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
    # Record the loss of the last mini-batch in this epoch
    loss_record[i] = loss.item()

10
Prediction by the model
input = x_test; label = y_test
output = net(input)
print('The predictions on test data')
torch.set_printoptions(precision=2,sci_mode=False)
print(torch.cat((label,output.detach()),1))

11
"Prediction accuracy" and "Training error vs Epoch"
loss = criterion(output, label)
print("The test loss: ", loss.item())
torch.eq(y_test, torch.round(output.detach()))

accuracy = torch.sum(torch.eq(y_test, torch.round(output.detach()))).item() / 20 * 100

print('The prediction accuracy: ', accuracy, '%')
print('The plot of training loss vs epoch')
plt.figure(figsize=(5,3)); plt.plot(loss_record);

12
Load and save a model in PyTorch

Save a trained model with full network structure and parameters


torch.save(net, 'trained-net.pkl')

Load a trained model for processing the data


model = torch.load('trained-net.pkl')
output = model(input)

13
Learning rate
Recall that the learning rate refers to the step size $\alpha$ in GD: $\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$

What goes wrong with learning rate


If the learning rate is too big, the weights slosh to and fro across the ravine (oscillation)
If the learning rate is too small, training is slow and easily trapped in local minima.
What we would like to achieve
Move quickly in directions with small but consistent gradients
Move slowly in directions with big but inconsistent gradients.

14
Basics for adjusting learning rate
Initial learning rate is important.
If the error keeps getting worse or oscillates wildly, reduce the learning rate.
After finishing one epoch, it is usually good practice to lower the learning rate.
Lower the learning rate when the error stops decreasing, and check the error on a
separate validation set
Be careful about lowering the learning rate too early
Common learning rate schedules
Constant: the learning rate remains the same for all epochs
Step decay: decrease by a fixed factor (e.g., halve the rate) every few epochs
Inverse decay: $\alpha_t = \alpha_0 / (1 + k t)$
Exponential decay: $\alpha_t = \alpha_0 e^{-k t}$.
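The schedules above map directly onto PyTorch's built-in schedulers. A minimal sketch, assuming the model net and the mini-batch loop from the earlier slides; the decay factors are illustrative:

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
# Step decay: multiply the learning rate by 0.5 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# Exponential decay would instead use:
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
for epoch in range(100):
    # ... run the mini-batch loop for one epoch here ...
    scheduler.step()   # update the learning rate once per epoch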

15
How to determine learning rate when GPUs are limited
1. Check initial loss of the network model

Sanity check the loss with initial weights


2. Overfit a small training set
Try to train to 100% training accuracy on a small sample of the training data (~5-10
mini-batches); fiddle with the learning rate and weight initialization
3. Find a learning rate that makes the loss go down

Use all data, find a learning rate that makes the loss drop fast within 100 epochs.
Good learning rates to try: e.g., 1e-1, 1e-2, 1e-3, 1e-4
4. Find good learning rate decay strategy

Turn on learning rate decay; fiddle with different possibilities


5. Look at loss and accuracy curves

6. Go to Step 4
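A minimal sketch of step 2, assuming the x_train/y_train tensors and the net/criterion from the earlier slides; the subset size and learning rate are illustrative:

small_x = x_train[0:20, :]   # roughly 5 mini-batches of size 4
small_y = y_train[0:20]
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)
for it in range(500):
    optimizer.zero_grad()
    loss = criterion(net(small_x), small_y)
    loss.backward()
    optimizer.step()
# If the capacity, learning rate and initialization are reasonable,
# this loss should approach zero (100% training accuracy).
print('Loss on the small subset: ', loss.item())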
16
Weight initialization

If two hidden units have exactly the same bias and exactly the same incoming and outgoing
weights, they will always get exactly the same gradient, so they can never learn to compute
different features
Weights must be initialized to preserve the variance of the activations
Initialization must be coordinated with the choice of non-linear activation function and data
normalization
When using steepest descent, shifting the input makes a big difference
It usually helps to transform each component of the input vector so that it has zero mean
over the whole training set
When using steepest descent, scaling the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has unit
variance over the whole training set.

17
Weight initialization and activation function

Weight initialization also depends on the choice of activation functions


For a tanh unit, randomly sample from the uniform distribution

$U\!\left[-\sqrt{\tfrac{6}{n_{in}+n_{out}}},\; \sqrt{\tfrac{6}{n_{in}+n_{out}}}\right]$ (Xavier/Glorot initialization)

where $n_{in}$ / $n_{out}$ denote the number of input/output variables.

For a sigmoid, randomly sample from $U\!\left[-4\sqrt{\tfrac{6}{n_{in}+n_{out}}},\; 4\sqrt{\tfrac{6}{n_{in}+n_{out}}}\right]$

For ReLU, randomly draw from $\mathcal{N}(0, 2/n)$ (He initialization), where $n$ is the number of neurons in the
input.
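A minimal sketch of these schemes using PyTorch's built-in initializers, assuming they are applied to the Net class from the earlier slide:

def init_weights(m):
    if isinstance(m, nn.Linear):
        # Xavier/Glorot uniform initialization, suited to tanh/sigmoid activations
        nn.init.xavier_uniform_(m.weight)
        # For ReLU layers one would instead use He initialization:
        # nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

net = Net()
net.apply(init_weights)   # recursively applies init_weights to every sub-module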

18
Ways to speed up mini-batch learning

Using momentum:
View a weight as a "particle", whose value represents its position.
Instead of using the gradient to change the position of the particle, use it to change the
velocity of the particle.
Use separate adaptive learning rates for each parameter
Slowly adjust the rate using the consistency of the gradient for that parameter
Take a fancy method from the optimization literature that makes use of curvature information

19
SGD+Momentum
Intuition: Imagine a ball on the error surface. The ball starts off by following the gradient, but
once it has velocity, it no longer does steepest descent. Indeed, its momentum makes it
keep going in the previous direction
Recall that SGD reads

$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$

SGD+Momentum reads

$v_{t+1} = \rho v_t - \alpha \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$

where $\rho$ is the friction factor, often set to 0.9 or 0.99.

SGD+Momentum builds up "velocity" as a running mean of gradients.
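A minimal sketch, assuming the model net from the earlier slides; PyTorch's SGD optimizer implements this update through its momentum argument (values are illustrative):

optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
# Nesterov momentum (next slide) only needs one extra flag:
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, nesterov=True)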

20
Nesterov Momentum
Nesterov Momentum: evaluate the gradient at the "look-ahead" point $\theta_t + \rho v_t$

$v_{t+1} = \rho v_t - \alpha \nabla L(\theta_t + \rho v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$

Reformulated version: using $\tilde{\theta}_t = \theta_t + \rho v_t$,

$v_{t+1} = \rho v_t - \alpha \nabla L(\tilde{\theta}_t), \qquad \tilde{\theta}_{t+1} = \tilde{\theta}_t + v_{t+1} + \rho (v_{t+1} - v_t)$

21
AdaGrad and RMSProp
AdaGrad: it uses an adaptive learning rate for each weight
Define $g_t = \nabla_\theta L(\theta_t)$, the gradient w.r.t. $\theta$.
The update for each parameter is then

$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{G_t} + \epsilon} \, g_t$

where $G_t = \sum_{\tau \le t} g_\tau^2$ : the (element-wise) sum of the squared gradients up to time $t$.

RMSProp
Instead of storing all previous squared gradients, one may recursively define a decaying
average of all past squared gradients

$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$

The learning rate is then modified as follows

$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{E[g^2]_t} + \epsilon} \, g_t$
22
Adaptive Moment Estimation (Adam)
Besides storing an exponentially decaying average of past squared gradients like RMSprop,
Adam also keeps an exponentially decaying average of past gradients, like momentum.
Define the first moment $m_t$ and the second moment $v_t$:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

Define their bias-corrected versions:

$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$

Then Adam applies an RMSProp-style adaptive learning rate using the corrected moments

$\theta_{t+1} = \theta_t - \dfrac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$

Default parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
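A minimal sketch, assuming the model net from the earlier slides; both optimizers are available in torch.optim, so switching from plain SGD changes only one line of the training loop (values are illustrative):

# RMSProp with decaying-average factor gamma = 0.9
optimizer = torch.optim.RMSprop(net.parameters(), lr=0.001, alpha=0.9)
# Adam with the default parameters listed above
optimizer = torch.optim.Adam(net.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)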

23
Base learning rate is important for training
All methods have learning rate as a hyperparameter.

24
Data augmentation: deep learning models need data to be trained.
Data augmentation: we construct new training samples from the ones we have.
Data augmentation also reduces overfitting and improves the model's generalization, as it
increases the diversity of the training data.
Augmenting the dataset using basic image manipulations

Image flipping, cropping, rotations, and translations


Photometric transforms: Change the color space of the
image using contrast, sharpening, white balancing, color
jittering.
Mix images together, randomly erase segments of an
image
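A minimal sketch of these basic manipulations using torchvision (not part of the original slide); the transform parameters and the 32x32 crop size are illustrative assumptions:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                  # flipping
    T.RandomCrop(32, padding=4),                                    # cropping / translation
    T.RandomRotation(degrees=15),                                   # rotation
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),    # photometric jitter
    T.ToTensor(),
    T.RandomErasing(p=0.25),                                        # randomly erase a segment
])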

25
Illustration

26
Batch normalization

Motivation: internal covariate shift from one layer to the next makes training
harder, forcing
a smaller learning rate
careful initialization
Why does internal covariate shift cause trouble in training?
Recall that the goal of each layer is to model the input coming from the layer below it
When the statistical distribution of the input to a hidden layer keeps changing during the
iterations, the hidden layer keeps trying to adapt to that new distribution, hence slowing
down convergence.
It is as if the goal of each hidden layer keeps changing rather than staying fixed.
Batch normalization: keep the distribution of the input to each layer stable during training.

27
Batch normalization (BN)

Batch normalization: Normalize the input for the batch


The procedure of a BN layer, for a mini-batch $B = \{x_1, \dots, x_m\}$ of activations:

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

The BN layer is usually put right before the activation function

(Figure: data distribution before BN vs. data distribution after BN)

28
BN layer in Network
Not all distributions should be normalized
Let the model decide how the distribution should look
Introduce two learnable parameters $\gamma$ (scale) and $\beta$ (shift): $y_i = \gamma \hat{x}_i + \beta$

BN in test time (different from training time)


The mean $\mu$ and variance $\sigma^2$ are not updated on test data.
During testing, the mean and variance estimated on the training set are used for inference.
Benefits of using a BN layer: it makes training easier!
Improves gradient flow
Allows higher learning rates, faster convergence
Networks become more robust to initialization
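A minimal sketch of adding BN layers to the Net class from the earlier slide (the modified class name is illustrative); each BN layer sits right before its activation:

class NetBN(nn.Module):
    def __init__(self):
        super(NetBN, self).__init__()
        self.fc1 = nn.Linear(16, 8)
        self.bn1 = nn.BatchNorm1d(8)   # learnable gamma/beta per feature
        self.fc2 = nn.Linear(8, 4)
        self.bn2 = nn.BatchNorm1d(4)
        self.fc3 = nn.Linear(4, 1)
    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return torch.sigmoid(self.fc3(x))

# net.train() uses mini-batch statistics; net.eval() switches to the running
# statistics accumulated during training, matching the test-time behaviour above.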

29
Generalization of learning by Neural network
There are two basic modes of operations in learning
Learning mode: learns how to make predictions by training on labelled samples
Test mode: makes predictions on new data, i.e., data not in the training set or
cross-validation set.
Model capacity: the amount of freedom a NN has to model the data
The higher the NN's capacity, the higher the percentage of the samples the NN will fit
nearly perfectly
The capacity can be changed by changing the number of layers, the number of nodes, the degree of
non-linearity, and the number of iterations.
Model generalization: how well the NN performs on new inputs not seen before.
Trade-off between capacity and generalization
A NN may have high capacity, but poor generalization
Lowering capacity might improve a NN's generalization

30
Performance of NN
Under-fitting (capacity is poor)
Under-fitting happens when the learner has not found a solution that fits the training data to
an acceptable level
A learner that underfits the training data will miss important aspects of the data.
Over-fitting (generalization is poor)
Over-fitting happens when using a learner with too high a capacity, which represents the training
data nearly perfectly but performs poorly on unseen new input data.
Example: high-order polynomial regression

31
Optimal capacity and overcoming over-fitting

Optimal capacity: associated with the transition from under-fitting to over-fitting.


Optimal capacity should increase with size of training data

32
Regularization for avoiding over-fitting
What is regularization?
In general: any method that prevents overfitting or helps the optimization
Specifically: additional terms in the cost function, or training techniques, that prevent overfitting
or help the optimization
Regularization as hard constraints
Training objective

$\min_f \; \sum_{i} \ell(f(x_i), y_i) \quad$ subject to $\; f \in \mathcal{F}$ (a restricted family of functions).

When $f$ can be parameterized by weights $\theta$,

$\min_\theta \; \sum_{i} \ell(f_\theta(x_i), y_i) \quad$ subject to $\; \Omega(\theta) \le r$.
33
Adding noise to the input during training

Adding noise to the input: equivalent to preferring smaller weights, as shown below

Equivalence to weight decay
Consider a linear model $f(x) = w^\top x$ with squared loss, and add Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ to the input.
After adding noise, the (expected) loss is

$\mathbb{E}_\epsilon\big[(w^\top(x + \epsilon) - y)^2\big] = (w^\top x - y)^2 + \sigma^2 \|w\|_2^2$

i.e., the original loss plus a weight-decay penalty. Equivalent to Wiener estimation.

34


Adding noise to the weight during training
For the loss on each data point, add a noise term $\epsilon \sim \mathcal{N}(0, \eta I)$ to the weights before computing the
prediction

Use $\theta + \epsilon$ instead of $\theta$ for the prediction

The corresponding loss is now

$\mathbb{E}_\epsilon\big[(f_{\theta + \epsilon}(x) - y)^2\big]$

By Taylor expansion, we have

$f_{\theta + \epsilon}(x) \approx f_\theta(x) + \epsilon^\top \nabla_\theta f_\theta(x)$

The loss can then be approximated by

$(f_\theta(x) - y)^2 + \eta \,\|\nabla_\theta f_\theta(x)\|_2^2$

Equivalent to Tikhonov regularization

35
Regularization as optimization

One might formulate regularization via some inequality constraint

$\min_\theta \; L(\theta) \quad$ subject to $\; \Omega(\theta) \le r$

Or, equivalently, convert the problem to an unconstrained problem

$\min_\theta \; L(\theta) + \lambda\, \Omega(\theta)$

where $\lambda \ge 0$ is some pre-defined regularization parameter


For example, $\Omega(\theta) = \|\theta\|_2^2$ (weight decay).
36
Early stopping: don't train the network to a too-small training error

Training is stopped at the point of smallest error on the validation data


When training, also monitor the validation error
Every time the validation error improves, store a copy of the weights
When the validation error has not improved for some time, stop
Return the stored copy of the weights (see the sketch below)

(Figure: training set error vs. validation set error curves)
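A minimal sketch of this procedure (not from the original slides), assuming a held-out validation split x_val/y_val plus the net, criterion and optimizer defined earlier; the patience value is illustrative:

import copy

best_val, best_weights, patience, wait = float('inf'), None, 10, 0
for epoch in range(200):
    # ... run one epoch of mini-batch training here ...
    with torch.no_grad():
        val_loss = criterion(net(x_val), y_val).item()
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        best_weights = copy.deepcopy(net.state_dict())  # store a copy of the weights
    else:
        wait += 1
        if wait >= patience:   # validation error has not improved for `patience` epochs
            break
net.load_state_dict(best_weights)  # return the stored copy of the weights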

37
Early stopping

Advantages
Efficient: runs alongside training; only one extra copy of the weights needs to be stored
Simple: no change to the model or algorithm
Disadvantage: needs validation data

38
Dropout
Basic idea
Randomly remove units (or layers) during training
Put them all back during testing

Left: a network with 2 hidden layers. Right: an example of a thinned net produced by
applying dropout to the network on the left. Crossed units have been dropped.

39
Mathematical model of dropout
For a dropout-enabled NN, units are multiplied by independent random Bernoulli variables.

For each unit activation $y_j$, we have $\tilde{y}_j = r_j \, y_j$, where $r_j \sim \text{Bernoulli}(p)$

For each mini-batch, the Bernoulli variables are independently sampled to generate an NN.
Dropout is implemented by adding a module that drops activations at random on each batch.

40
Torch: Dropout layer in NN
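A minimal sketch of adding an nn.Dropout layer to the Net class from the earlier slide (the class name and dropout probability are illustrative):

class NetDropout(nn.Module):
    def __init__(self):
        super(NetDropout, self).__init__()
        self.fc1 = nn.Linear(16, 8)
        self.fc2 = nn.Linear(8, 4)
        self.fc3 = nn.Linear(4, 1)
        self.drop = nn.Dropout(p=0.5)   # drops activations at random on each batch
    def forward(self, x):
        x = self.drop(F.relu(self.fc1(x)))
        x = self.drop(F.relu(self.fc2(x)))
        return torch.sigmoid(self.fc3(x))

# net.train() enables dropout; net.eval() disables it so that all units are used at test time.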

41
