Part 13 MD
Ji Hui
Gradient descent revisited for NN
The objective function is the empirical risk over the training set,
$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big),$$
and gradient descent updates all weights at once: $\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$.
Stochastic gradient descent (SGD)
SGD: a randomized gradient estimate that minimizes the function using a single randomly picked example,
$$\theta \leftarrow \theta - \eta\, \nabla_\theta \ell\big(f(x_i;\theta),\, y_i\big), \qquad i \text{ picked uniformly at random.}$$
It behaves like gradient descent in expectation, but with noise due to the variance of the estimate:
$$\mathbb{E}_i\big[\nabla_\theta \ell_i(\theta)\big] = \nabla_\theta L(\theta).$$
Mini-batch SGD: the one balancing approximation and randomness
Rather than using one sample, use a random small subset (mini-batch) $B$ of the training set,
$$\theta \leftarrow \theta - \eta\, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell_i(\theta),$$
where $|B|$ is the mini-batch size.
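A minimal sketch of mini-batch SGD on a toy least-squares problem, using plain PyTorch autograd; the data, batch size, and learning rate below are illustrative, not from the slides.

import torch

# Toy data: y = 3x + noise, so the learned weight should approach 3
torch.manual_seed(0)
x = torch.randn(1000, 1)
y = 3 * x + 0.1 * torch.randn(1000, 1)

w = torch.zeros(1, requires_grad=True)   # the single parameter to learn
lr, batch_size = 0.1, 32

for step in range(200):
    idx = torch.randint(0, x.size(0), (batch_size,))   # random mini-batch B
    loss = ((x[idx] * w - y[idx]) ** 2).mean()          # mini-batch loss
    loss.backward()                                     # stochastic gradient estimate
    with torch.no_grad():
        w -= lr * w.grad                                # SGD update
        w.grad.zero_()

print(w.item())   # close to 3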
Batch size and Epoch
Batch Size
A larger batch size gives a more accurate gradient estimate, but the gain does not scale linearly with the batch size
Small batches offer a regularizing effect, due to the random noise the mini-batch estimate adds to the process
Small batch sizes require a small learning rate, due to the high variance of the gradient estimate
One main limiting factor on batch size: the amount of GPU memory
It is crucial to select mini-batches randomly, rather than simply making a pass through the training data in its stored order:
Define the training set: say 10,000 examples
Indexing: assign a number to each example
Choose your batch size: say 100
Use a random number generator to generate a number in [1, 10000]
Select your samples sequentially starting from that index
An epoch is a single pass through the full training set.
Consider a 3-layer NN for binary classification
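The network figure on this slide did not survive conversion; a minimal PyTorch sketch of a 3-layer network consistent with the later code (class name Net, 16-dimensional input, one sigmoid output) might look as follows. The hidden widths of 32 are assumptions.

import torch
import torch.nn as nn

class Net(nn.Module):
    """3-layer fully connected network for binary classification."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)   # input layer -> hidden layer 1
        self.fc2 = nn.Linear(32, 32)   # hidden layer 1 -> hidden layer 2
        self.fc3 = nn.Linear(32, 1)    # hidden layer 2 -> output
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return torch.sigmoid(self.fc3(x))   # probability of class 1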
Binary classification of two classes: Classifying two Gaussians
centroid_1 = torch.bernoulli(0.5 * torch.ones(16))
x_1 = 4 * torch.randn(50,16) + centroid_1
centroid_2 = torch.bernoulli(0.5 * torch.ones(16))
x_2 = 4 * torch.randn(50,16) + centroid_2
y_1 = torch.zeros(50,1)
y_2 = torch.ones(50,1)
x = torch.cat((x_1,x_2), 0)
y = torch.cat((y_1,y_2), 0)
idx = torch.randperm(100)
# 80 training samples
x_train = x[idx[0:80],:]
y_train = y[idx[0:80]]
# 20 testing samples
x_test = x[idx[80:100],:]
y_test = y[idx[80:100]]
print('The size of training dataset: ', x_train.size(0))
print('The size of testing dataset: ', x_test.size(0))
How to prepare the mini-batches for one epoch
# Generating the index sets of the mini-batches for one epoch
seq = torch.randperm(80)
batch_size = 4
batch = torch.reshape(seq,(20,4))
batch
Initialize a model and define the loss
# Creating a network
net = Net()
# Define two training samples and their labels
input = torch.randn(2,16)
label = torch.Tensor([[1],[0]])
# Define a binary cross-entropy loss
criterion = nn.BCELoss()
output = net(input)
loss = criterion(output,label)
print('The prediction: ')
print(output)
print('The training label:')
print(label)
print('The loss: ', loss.item())
Training a model with SGD using 2 samples
import numpy as np
import matplotlib.pyplot as plt

ite_num = 1000
loss_record = torch.zeros(ite_num)
# Create the optimizer once, outside the loop
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
for i in np.arange(0, ite_num):
    optimizer.zero_grad()              # clear gradients from the previous iteration
    output = net(input)                # forward pass on the two samples
    loss = criterion(output, label)    # binary cross-entropy loss
    loss_record[i] = loss.item()
    loss.backward()                    # backpropagation
    optimizer.step()                   # one SGD update
print('The plot of loss vs. iteration')
plt.figure(figsize=(4,2)); plt.plot(loss_record);
Training a model on 80 training samples with mini-batch size 4
epoch_num = 100
batch_num = 20
batch_size = 4
loss_record = torch.zeros(epoch_num)
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
for i in np.arange(0, epoch_num):
    # Reshuffle the 80 training samples into 20 mini-batches of size 4
    seq = torch.randperm(80)
    batch = torch.reshape(seq, (batch_num, batch_size))
    for j in np.arange(0, batch_num):
        optimizer.zero_grad()
        input = x_train[batch[j], :]     # mini-batch of inputs
        label = y_train[batch[j]]        # corresponding labels
        output = net(input)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
    loss_record[i] = loss.item()         # loss of the last mini-batch in this epoch
Prediction by the model
input = x_test; label = y_test
output = net(input)
print('The predictions on test data')
torch.set_printoptions(precision=2,sci_mode=False)
print(torch.cat((label,output.detach()),1))
"Prediction accuracy" and "Training error vs Epoch"
loss = criterion(output, label)
print("The test loss: ", loss.item())
# Compare rounded predictions with the true labels (True means correct)
torch.eq(y_test, torch.round(output.detach()))
Load and save a model in PyTorch
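The slide's code did not survive conversion; a minimal sketch of the usual state_dict workflow (the file name 'model.pt' is illustrative):

# Save the learned parameters (the state_dict), not the whole Python object
torch.save(net.state_dict(), 'model.pt')

# Load them back into a freshly constructed network of the same architecture
net2 = Net()
net2.load_state_dict(torch.load('model.pt'))
net2.eval()   # switch to evaluation mode before making predictions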
Learning rate
Recall that the learning rate refers to the step size $\eta$ in gradient descent: $\theta \leftarrow \theta - \eta\, \nabla_\theta L(\theta)$.
Basics for adjusting learning rate
The initial learning rate is important.
If the error keeps getting worse or oscillates wildly, reduce the learning rate.
After finishing one epoch, it is usually good practice to lower the learning rate.
Lower the learning rate when the error stops decreasing, and check the error on a separate validation set.
Be careful not to lower the learning rate too early.
Common learning rate schedules
Constant: the learning rate remains the same for all epochs
Step decay: decrease the learning rate by a constant factor every few epochs
Inverse decay: $\eta_t = \dfrac{\eta_0}{1 + k\,t}$
Exponential decay: $\eta_t = \eta_0\, e^{-k\,t}$.
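A hedged sketch of how such schedules are commonly expressed with torch.optim.lr_scheduler; the step size and decay factors below are illustrative.

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.5 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Exponential decay alternative: multiply the learning rate by 0.95 every epoch
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(100):
    # ... one epoch of mini-batch training goes here ...
    scheduler.step()   # apply the decay at the end of the epoch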
How to determine learning rate when GPUs are limited
1. Check the initial loss of the network model
Use all data, and find a learning rate that makes the loss drop fast within 100 epochs.
Good learning rates to try:
4. Find a good learning rate decay strategy
6. Go to Step 4
Weight initialization
If two hidden units have exactly the same bias and exactly the same incoming and outgoing
weights, they will always get exactly the same gradient, so they can never learn to compute
different features
Weights must be initialized to preserve the variance of the activations
Initialization must be coordinated with the choice of non-linear activation function and data
normalization
When using steepest descent, shifting the input makes a big difference
It usually helps to transform each component of the input vector so that it has zero mean
over the whole training set
When using steepest descent, scaling the input values makes a big difference.
It usually helps to transform each component of the input vector so that it has unit
variance over the whole training set.
Weight initialization and activation function
For ReLU, randomly draw each weight from $\mathcal{N}\!\big(0, \tfrac{2}{n}\big)$ (He initialization), where $n$ is the number of neurons in the input.
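A sketch of this initialization in PyTorch, written both by hand and with the built-in helper; the layer sizes are illustrative.

import math
import torch.nn as nn

layer = nn.Linear(16, 32)

# By hand: weights ~ N(0, 2/n), where n is the fan-in of the layer
n = layer.weight.size(1)
nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(2.0 / n))
nn.init.zeros_(layer.bias)

# Built-in equivalent (He / Kaiming initialization for ReLU)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')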
Ways to speed up mini-batch learning
Using momentum:
View a weight as a "particle" whose value represents its position.
Instead of using the gradient to change the position of the particle, use it to change the
velocity of the particle.
Use a separate adaptive learning rate for each parameter
Slowly adjust the rate using the consistency of the gradient for that parameter
Take a fancy method from the optimization literature that makes use of curvature information
SGD+Momentum
Intuition: imagine a ball on the error surface. The ball starts off by following the gradient, but
once it has velocity, it no longer does steepest descent; its momentum keeps it going in the
previous direction.
Recall that SGD reads $w \leftarrow w - \eta\, \nabla \ell_i(w)$. With momentum, the gradient changes a velocity instead:
$$v \leftarrow \mu\, v - \eta\, \nabla \ell_i(w), \qquad w \leftarrow w + v,$$
where $\mu \in [0, 1)$ is the momentum coefficient.
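A hedged sketch of the same update, both via the built-in optimizer and written out by hand for a single tensor; the coefficient 0.9 and the toy loss are illustrative.

import torch

# Built-in: SGD with momentum
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# The update written out by hand for a single parameter tensor
w = torch.randn(16, requires_grad=True)
loss = (w ** 2).sum()            # toy loss, just to produce a gradient
loss.backward()

mu, lr = 0.9, 0.01
v = torch.zeros_like(w)          # velocity, initialized to zero
with torch.no_grad():
    v = mu * v - lr * w.grad     # the gradient changes the velocity ...
    w += v                       # ... and the velocity changes the position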
Nesterov Momentum
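The slide's formula did not survive conversion; the standard Nesterov update evaluates the gradient at the look-ahead point $w + \mu v$, which PyTorch exposes through the nesterov flag:

# Nesterov momentum: v <- mu*v - lr*grad(w + mu*v);  w <- w + v
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, nesterov=True)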
AdaGrad and RMSprop
AdaGrad: it uses an adaptive learning rate for each weight.
Define $g_t = \nabla_\theta L(\theta_t)$, the gradient w.r.t. $\theta$, and accumulate its square, $r_t = r_{t-1} + g_t \odot g_t$.
The update for each parameter is then
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t.$$
RMSprop uses the same update but replaces the running sum by an exponentially decaying average, $r_t = \rho\, r_{t-1} + (1-\rho)\, g_t \odot g_t$.
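The corresponding built-in optimizers; the learning rates and smoothing constant below are just common defaults, not values from the slides.

# AdaGrad: per-parameter step sizes scaled by the accumulated squared gradients
optimizer = torch.optim.Adagrad(net.parameters(), lr=0.01)

# RMSprop: same idea with an exponentially decaying average (alpha plays the role of rho)
optimizer = torch.optim.RMSprop(net.parameters(), lr=0.001, alpha=0.99)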
Adaptive Moment Estimation (Adam)
Besides storing an exponentially decaying average of past squared gradients like RMSprop,
Adam also keeps an exponentially decaying average of past gradients, like momentum.
Define the first moment $m_t$ and the second moment $v_t$:
$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$$
with bias-corrected estimates $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$, and the update
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t.$$
Default parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
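The built-in optimizer with these defaults written out explicitly:

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)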
Base learning rate is important for training
All of these methods still have the base learning rate as a hyperparameter.
Data augmentation: Deep learning models need data to be trained.
Data augmentation: we construct new training samples based on the ones we have.
Data augmentation also reduces overfitting and improves model generalization, as it
increases the diversity of the training data.
Augmenting the dataset using basic image manipulations
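A hedged sketch using torchvision's standard image transforms; the specific transforms and parameters are illustrative.

from torchvision import transforms

# Basic image manipulations, applied on the fly to each training image
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # mirror with probability 0.5
    transforms.RandomCrop(32, padding=4),                  # random shift via padded cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # small photometric changes
    transforms.ToTensor(),
])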
Illustration
Batch normalization
Motivation: internal covariate shift from one layer to the next makes training harder, requiring a
smaller learning rate and
careful initialization
Why does internal covariate shift cause trouble in training?
Recall that the goal of each layer is to model the input from the layer below it
When the statistical distribution of the input of a hidden layer keeps changing over the
iterations, that layer keeps trying to adapt to the new distribution, which slows down
convergence.
It is as if the goal of each hidden layer keeps changing rather than staying fixed.
Batch normalization: keep the distribution of each layer's input stable during training.
Batch normalization (BN)
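The slide's equations did not survive conversion; the standard per-mini-batch normalization is:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta,$$
where $m$ is the mini-batch size and $\gamma, \beta$ are learnable scale and shift parameters.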
BN layer in Network
Not all distributions should be normalized
Let the model decide how the distribution should look
Introduce two learnable parameters: a scale $\gamma$ and a shift $\beta$
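A hedged sketch of placing a BN layer in a small fully connected network; the layer sizes are illustrative, and nn.BatchNorm1d learns the $\gamma, \beta$ above.

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),   # normalize each feature over the mini-batch, then apply gamma and beta
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)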
Generalization of learning by Neural network
There are two basic modes of operation in learning
Learning mode: learns how to make predictions by training on labelled samples
Test mode: makes predictions on new data, i.e., data not in the training set or
cross-validation set
Model capacity: the amount of freedom a NN has to model the data
The higher the NN's capacity, the higher the percentage of training samples the NN will fit
nearly perfectly
The capacity can be changed by changing the number of layers, the number of nodes, the
degree of non-linearity, or the number of iterations.
Model generalization: how well the NN performs on new inputs not seen before.
Trade-off between Capacity and Generalization
A NN may have high capacity, but poor generalization
Lowering capacity might improve a NN's generalization
Performance of NN
Under-fitting (capacity is poor)
Under-fitting happens when the learner has not found a solution that fits the training data to
an acceptable level
A learner that underfits the training data will miss important aspects of the data.
Over-fitting (generalization is poor)
Over-fitting happens when a learner with too high a capacity represents the training data
nearly perfectly, but performs poorly on unseen input data.
Example: high-order polynomial regression
Optimal capacity and overcoming over-fitting
Regularization for avoiding over-fitting
What is regularization?
In general: any method to prevent overfitting or help the optimization
Specifically: additional terms in the cost function, or training techniques, that prevent
overfitting or help the optimization
Regularization as hard constraints
Training objective:
$$\min_f \; \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big), \quad \text{subject to } f \in \mathcal{H}.$$
When $f$ can be parameterized by weights $\theta$:
$$\min_\theta \; \hat{L}(\theta), \quad \text{subject to } \Omega(\theta) \le r.$$
Adding noise to the input during training
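A minimal sketch of this idea plugged into the mini-batch loop above; the noise level 0.1 is an illustrative choice.

# Perturb each mini-batch with zero-mean Gaussian noise before the forward pass
noisy_input = input + 0.1 * torch.randn_like(input)
output = net(noisy_input)
loss = criterion(output, label)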
Regularization as optimization
$$\min_\theta \; \hat{L}(\theta), \quad \text{subject to } \Omega(\theta) \le r.$$
Or, equivalently, convert the problem to an unconstrained one by adding a penalty term to the objective:
$$\min_\theta \; \hat{L}(\theta) + \lambda\, \Omega(\theta), \qquad \lambda \ge 0.$$
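For the common choice $\Omega(\theta) = \|\theta\|_2^2$, PyTorch optimizers expose the penalty weight directly as weight decay; a short sketch:

# L2 regularization (weight decay): penalizes the squared norm of the weights
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)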
Early stopping: don't train the network down to too small a training error
Early stopping
Advantage
Efficient: runs along with training; only needs to store an extra copy of the weights
Simple: no change to the model or algorithm
Disadvantage: needs validation data
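A minimal early-stopping sketch around the mini-batch loop above; the patience value and the use of x_test as a stand-in validation set are assumptions.

import copy

best_loss, best_state, patience, wait = float('inf'), None, 10, 0
for epoch in range(200):
    # ... one epoch of mini-batch training on (x_train, y_train) goes here ...
    with torch.no_grad():
        val_loss = criterion(net(x_test), y_test).item()    # validation loss
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
        best_state = copy.deepcopy(net.state_dict())        # keep a copy of the best weights
    else:
        wait += 1
        if wait >= patience:                                 # stop after `patience` epochs without improvement
            break
net.load_state_dict(best_state)                              # restore the best weights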
Dropout
Basic idea
Randomly remove units (or layers) during training
Put them all back during testing
Left: a network with 2 hidden layers. Right: an example of a thinned net produced by
applying dropout to the network on the left. Crossed units have been dropped.
Mathematical model of dropout
For a dropout-enabled NN, units are multiplied by independent random Bernoulli variables.
For each mini-batch, the Bernoulli variables are independently sampled to generate an NN.
Dropout is implemented by adding a module that drops activations at random on each batch.
Torch: Dropout layer in NN
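The slide's code did not survive conversion; a hedged sketch of using nn.Dropout in a network like the one above (the drop probability 0.5 and layer sizes are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each activation is zeroed with probability 0.5 during training
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

model.train()   # dropout active: random units are dropped on each forward pass
model.eval()    # dropout disabled: the layer acts as the identity at test time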