CS 230 - Deep Learning Tips and Tricks Cheatsheet
follows:
$$x_i \longleftarrow \gamma \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
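The normalization step can be sketched in NumPy as follows. This is a minimal training-time illustration only: the function name `batch_norm`, the default `eps`, and the (batch_size, features) layout are assumptions made for the example, and the running statistics that frameworks typically keep for inference are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch_size, features)."""
    mu = x.mean(axis=0)                      # per-feature batch mean, mu_B
    var = x.var(axis=0)                      # per-feature batch variance, sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, (almost) unit variance
    return gamma * x_hat + beta              # learnable scale and shift

# Hypothetical mini-batch: 64 examples, 4 features, far from zero mean / unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma set to 1 and beta to 0, each output feature has approximately zero mean and unit variance over the batch; other values of gamma and beta let the layer rescale and shift that normalized output.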
Training a neural network
Definitions
❐ Epoch ― In the context of training a model, an epoch is one iteration in which the model
sees the whole training set to update its weights.
❐ Mini-batch gradient descent ― During the training phase, weights are usually not updated
on the whole training set at once, because of the computational cost, nor on a single data
point, because the resulting updates are too noisy. Instead, the update step is done on
mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.
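As an illustration, one epoch of mini-batch iteration might look like the sketch below. The helper `iterate_minibatches` and the toy arrays are hypothetical, and reshuffling the data at every epoch is a common convention rather than something the definition requires.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled mini-batches that together cover the training set once (one epoch)."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], y[batch]

# Toy training set: 10 examples with 2 features each
rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

# With batch_size=4 the last batch is smaller (4 + 4 + 2 examples)
batches = list(iterate_minibatches(X, y, batch_size=4, rng=rng))
```

In a training loop, each yielded pair would feed one weight update, and the loop over batches would be repeated once per epoch.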
❐ Loss function ― In order to quantify how a given model performs, the loss function L is
usually used to evaluate to what extent the actual outputs y are correctly predicted by the
model outputs z.
❐ Cross-entropy loss ― In the context of binary classification in neural networks, the
cross-entropy loss L(z, y) is commonly used and is defined as follows:
$$L(z, y) = -\Big[y \log(z) + (1 - y) \log(1 - z)\Big]$$
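A minimal NumPy sketch of the standard binary cross-entropy is given below, with z the predicted probability and y the label in {0, 1}. The clipping constant `eps` is an assumption added for numerical safety and is not part of the definition.

```python
import numpy as np

def binary_cross_entropy(z, y, eps=1e-12):
    """Element-wise loss L(z, y) = -[y log(z) + (1 - y) log(1 - z)]."""
    z = np.clip(z, eps, 1 - eps)   # avoid log(0) for saturated predictions
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

# Two predictions for the same positive label: a confident one and a poor one
predictions = np.array([0.9, 0.1])
labels = np.array([1.0, 1.0])
losses = binary_cross_entropy(predictions, labels)
```

The confident correct prediction gets a small loss and the confident wrong one a much larger loss, which is the behaviour the loss is meant to capture.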
Parameter tuning
Weights initialization
❐ Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier
initialization produces initial weights that take into account characteristics that are
unique to the architecture, such as the number of inputs and outputs of each layer.
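The text above does not give a formula, so the sketch below assumes the commonly used uniform variant, where weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The function name and the layer sizes are hypothetical.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix using the uniform Xavier/Glorot rule."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))   # bound depends on the layer's size
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Hypothetical fully connected layer with 256 inputs and 128 outputs
rng = np.random.default_rng(0)
W = xavier_uniform(256, 128, rng)
```

Under this rule the weight variance is 2 / (fan_in + fan_out), so wider layers start with smaller weights.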
❐ Transfer learning ― Training a deep learning model requires a lot of data and, more
importantly, a lot of time. It is often useful to take weights that were pre-trained on huge
datasets, which took days or weeks to train, and leverage them for our use case. Depending
on how much data we have at hand, here are the different ways to leverage this:
Training size | Illustration | Explanation
Optimizing convergence
❐ Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace
the weights get updated. It can be fixed or adaptively changed. The current most popular
method is called Adam, which is a method that adapts the learning rate.
❐ Adaptive learning rates ― Letting the learning rate vary when training a model can reduce
the training time and improve the numerical solution that is reached. While the Adam optimizer
is the most commonly used technique, others can also be useful. They are summed up in the
table below:
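Method | Explanation | Update of w | Update of b
Momentum | • Dampens oscillations • Improvement to SGD • 2 parameters to tune | $w \longleftarrow w - \alpha v_{dw}$ | $b \longleftarrow b - \alpha v_{db}$
RMSprop | • Root Mean Square propagation • Speeds up learning algorithm by controlling oscillations | $w \longleftarrow w - \alpha \frac{dw}{\sqrt{s_{dw}}}$ | $b \longleftarrow b - \alpha \frac{db}{\sqrt{s_{db}}}$
Adam | • Adaptive Moment estimation • Most popular method • 4 parameters to tune | $w \longleftarrow w - \alpha \frac{v_{dw}}{\sqrt{s_{dw}} + \epsilon}$ | $b \longleftarrow b - \alpha \frac{v_{db}}{\sqrt{s_{db}} + \epsilon}$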
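The Momentum and Adam updates for w can be sketched as below (the updates for b are identical in form). The table only gives the final update, so several details here are assumptions: v is taken to be an exponential moving average of the gradient and s one of the squared gradient, the default decay rates are the usual ones, and the bias-correction step comes from the usual formulation of Adam.

```python
import numpy as np

def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """One Momentum update: w <- w - alpha * v_dw."""
    v = beta * v + (1 - beta) * dw              # moving average of gradients
    return w - alpha * v, v

def adam_step(w, dw, v, s, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: w <- w - alpha * v_dw / (sqrt(s_dw) + eps)."""
    v = beta1 * v + (1 - beta1) * dw            # first moment (as in Momentum)
    s = beta2 * s + (1 - beta2) * dw ** 2       # second moment (squared gradients)
    v_hat = v / (1 - beta1 ** t)                # bias correction (assumption, see lead-in)
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * v_hat / (np.sqrt(s_hat) + eps), v, s

def grad(w):
    """Gradient of the toy objective f(w) = w^2."""
    return 2.0 * w

# Minimize f(w) = w^2 from w = 5 with each optimizer
w_m, v_m = 5.0, 0.0
for _ in range(200):
    w_m, v_m = momentum_step(w_m, grad(w_m), v_m, alpha=0.1)

w_a, v_a, s_a = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w_a, v_a, s_a = adam_step(w_a, grad(w_a), v_a, s_a, t, alpha=0.05)
```

Both runs drive w toward the minimizer at 0. The step sizes and iteration counts were chosen for this toy problem and are not recommendations.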
Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter
1 − p.
❐ Weight regularization ― In order to make sure that the weights are not too large and that
the model is not overfitting the training set, regularization techniques are usually performed on
the model weights. The main ones are summed up in the table below:
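LASSO | Ridge | Elastic Net
• Shrinks coefficients to 0 • Good for variable selection | Makes coefficients smaller | Tradeoff between variable selection and small coefficients
$... + \lambda \|\theta\|_1$ | $... + \lambda \|\theta\|_2^2$ | $... + \lambda \left[(1 - \alpha) \|\theta\|_1 + \alpha \|\theta\|_2^2\right]$
$\lambda \in \mathbb{R}$ | $\lambda \in \mathbb{R}$ | $\lambda \in \mathbb{R}, \alpha \in [0, 1]$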
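The three penalty terms can be written directly in NumPy. This is a minimal sketch: the function names and the example weight vector are hypothetical, and in practice the penalty is added to the data loss (the "..." in the table).

```python
import numpy as np

def lasso_penalty(theta, lam):
    """L1 penalty: lambda * ||theta||_1."""
    return lam * np.sum(np.abs(theta))

def ridge_penalty(theta, lam):
    """Squared L2 penalty: lambda * ||theta||_2^2."""
    return lam * np.sum(theta ** 2)

def elastic_net_penalty(theta, lam, alpha):
    """lambda * [(1 - alpha) * ||theta||_1 + alpha * ||theta||_2^2], alpha in [0, 1]."""
    l1 = np.sum(np.abs(theta))
    l2 = np.sum(theta ** 2)
    return lam * ((1 - alpha) * l1 + alpha * l2)

theta = np.array([0.5, -2.0, 0.0])
l1_term = lasso_penalty(theta, lam=0.1)
l2_term = ridge_penalty(theta, lam=0.1)
mixed_term = elastic_net_penalty(theta, lam=0.1, alpha=0.5)
```

Setting alpha to 0 recovers the LASSO penalty and setting it to 1 recovers the Ridge penalty, which is the tradeoff the Elastic Net column describes.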
❐ Early stopping ― This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.
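One way to make "reaches a plateau or starts to increase" concrete is a patience counter, sketched below. The `patience` and `min_delta` parameters are assumptions borrowed from common practice and are not part of the definition above; the validation-loss history is made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would be stopped."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss                        # new best validation loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1    # plateau or increase
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1                 # never triggered: train to the end

# Validation loss improves until epoch 2, then starts to increase
history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.9]
stop = early_stopping_epoch(history, patience=3)
```

Here training stops at epoch 5, after three consecutive epochs without improvement over the best loss of 0.7. Implementations usually also restore the weights saved at the best epoch.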
Good practices
❐ Overfitting small batch ― When debugging a model, it is often useful to make quick tests
to see if there is any major issue with the architecture of the model itself. In particular, in order
to make sure that the model can be properly trained, a mini-batch is passed inside the network
to see if it can overfit on it. If it cannot, it means that the model is either too complex or not
complex enough to even overfit on a small batch, let alone a normal-sized training set.
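The check can be sketched as follows, with a logistic-regression model standing in for the network. Everything here is illustrative: the function name, the four-example batch, the learning rate, and the number of steps are all assumptions chosen so that the example runs quickly.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def overfit_small_batch(X, y, steps=2000, lr=0.5):
    """Train on a single small batch and report the final loss and accuracy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = sigmoid(X @ w + b)
        error = z - y                        # gradient of cross-entropy w.r.t. the logit
        w -= lr * (X.T @ error) / len(X)
        b -= lr * error.mean()
    z = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))
    accuracy = np.mean((z > 0.5) == (y > 0.5))
    return loss, accuracy

# Hypothetical mini-batch of 4 examples with 2 features
X_small = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y_small = np.array([1.0, 1.0, 0.0, 0.0])
loss, accuracy = overfit_small_batch(X_small, y_small)
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

A loss close to zero and perfect accuracy on the batch suggest that the forward pass, gradients, and update rule are wired correctly. A loss that stays high points to a bug or a capacity problem worth investigating before training on the full set.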
❐ Gradient checking ― Gradient checking is a method used during the implementation of the
backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.
Type | Numerical gradient | Analytical gradient
Formula | $\frac{df}{dx}(x) \approx \frac{f(x + h) - f(x - h)}{2h}$ | $\frac{df}{dx}(x) = f'(x)$
Comments | • Expensive; loss has to be computed two times per dimension • Used to verify correctness of analytical implementation • Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation) | • 'Exact' result • Direct computation • Used in the final implementation
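A gradient check built on the centered-difference formula might look like the sketch below. The test function f(x) = sum(x^3), the step h = 1e-5, and the relative-error measure are assumptions chosen for illustration; in a real check, f would be the loss and the analytical gradient would come from the backward pass.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered difference: two evaluations of f per dimension of x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = h
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

def f(x):
    return np.sum(x ** 3)

def analytical_gradient(x):
    return 3 * x ** 2      # derivative of x^3, computed by hand

x = np.array([1.0, -2.0, 0.5])
num = numerical_gradient(f, x)
ana = analytical_gradient(x)

# Relative error between the two gradients; a small value indicates agreement
rel_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
```

The loop makes the cost visible: the function is evaluated twice per dimension, which is why the numerical gradient serves only as a check and the analytical one is used in the final implementation.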