NN optimizers
Optimizers are the algorithms that train neural networks by efficiently minimizing the loss function during training. Here's an overview of the most commonly used optimization methods:
1. Gradient Descent
Gradient descent is the simplest and most fundamental optimization algorithm. It adjusts the neural network's parameters in the direction that decreases the loss function, based on its gradient. There are three main variations (a minimal update sketch follows this list):
● Batch Gradient Descent: Computes gradients using the entire training dataset in
one update, providing stable but slower convergence.
● Stochastic Gradient Descent (SGD): Updates parameters based on the gradient
from a single randomly selected training example, leading to faster but noisy
convergence.
● Mini-Batch Gradient Descent: Combines the advantages of both by computing
gradients on small subsets (mini-batches) of data.
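A minimal sketch in Python of the mini-batch update rule described above; grad_fn, params, and the data arrays are hypothetical placeholders rather than anything from the original text.

import numpy as np

def sgd_step(params, grads, lr=0.01):
    # Vanilla update: move each parameter against its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

def train_epoch(params, X, y, grad_fn, batch_size=32, lr=0.01):
    # Shuffle once per epoch, then update on small subsets of the data.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grads = grad_fn(params, X[batch], y[batch])  # gradients on the mini-batch only
        params = sgd_step(params, grads, lr)
    return params

Setting batch_size to len(X) recovers batch gradient descent, while batch_size=1 gives plain SGD.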
2. Momentum
Momentum is an improvement upon SGD that accelerates training and smooths out oscillations by keeping an exponentially decaying record of previous updates. This lets the optimizer build up velocity along consistent gradient directions and move more steadily toward a minimum.
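A minimal sketch of the momentum update, assuming a single parameter array; the momentum coefficient beta (commonly around 0.9) and the variable names are illustrative.

def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + grad   # exponentially decaying sum of past gradients
    param = param - lr * velocity       # step along the smoothed direction
    return param, velocity

Because the velocity term averages out gradient components that flip sign between steps, updates oscillate less and progress faster along consistent directions.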
3. AdaGrad
AdaGrad adapts the learning rate for each parameter individually, giving larger updates to
infrequent parameters and smaller updates to frequent parameters. This method is effective
for sparse data but tends to decrease the learning rate too aggressively over time, potentially
leading to slower convergence.
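A sketch of the AdaGrad update; accum is the per-parameter running sum of squared gradients, and the names and defaults are illustrative.

import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    accum = accum + grad ** 2                          # only ever grows
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

Since accum never shrinks, the effective learning rate keeps falling, which is the aggressive decay mentioned above.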
4. RMSProp
RMSProp improves upon AdaGrad by addressing its overly aggressive learning-rate decay.
It maintains a moving average of squared gradients to normalize the gradient updates,
adapting the learning rate more effectively during training.
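A sketch of the RMSProp update; rho is the decay rate of the moving average (0.9 is a common default), and the names are illustrative.

import numpy as np

def rmsprop_step(param, grad, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2       # moving average, not a running sum
    param = param - lr * grad / (np.sqrt(sq_avg) + eps)
    return param, sq_avg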
5. Adam
Adam (Adaptive Moment Estimation) combines the ideas from momentum and RMSProp. It
keeps moving averages of both the gradients and their squares, which allows it to adapt
learning rates individually for each parameter. Adam is widely used due to its efficiency and
relatively straightforward tuning.
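A sketch of the Adam update with bias-corrected first and second moments; the defaults shown are the commonly used ones, and the names are illustrative.

import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1 ** t)                 # bias correction; t is the step count, starting at 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v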
6. AdamW
AdamW modifies Adam by decoupling weight decay from the gradient-based update: the decay is applied directly to the parameters instead of being added to the gradients as an L2 penalty. This makes regularization more predictable, often improves generalization, and has made AdamW a common default for training transformer models.
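A sketch of the decoupled update, differing from the Adam sketch above only in the weight_decay * param term applied outside the adaptive scaling; names and defaults are illustrative.

import numpy as np

def adamw_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v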
7. Nadam
Nadam combines Adam with Nesterov momentum, applying the momentum look-ahead to Adam's first-moment estimate. This can speed up convergence and in some cases gives better results than standard Adam.
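The optimizers above are all available in standard frameworks. A usage sketch, assuming PyTorch, with a toy linear model and random data purely for illustration:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # SGD with momentum
# Other choices, one at a time:
# torch.optim.Adagrad(model.parameters(), lr=0.01)
# torch.optim.RMSprop(model.parameters(), lr=0.001)
# torch.optim.Adam(model.parameters(), lr=0.001)
# torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# torch.optim.NAdam(model.parameters(), lr=0.002)

# One training step: forward pass, loss, backpropagation, update, reset gradients.
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()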
Learning Rate Schedulers
Learning rate schedulers systematically adjust the learning rate during training (a usage sketch follows the list):
● Step Decay: Decreases learning rate by a fixed factor after a specified number of
epochs.
● Exponential Decay: Gradually reduces learning rate exponentially over time.
● Cosine Annealing: Periodically adjusts learning rates following a cosine curve,
promoting improved exploration of the optimization landscape.
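A usage sketch of these schedulers, assuming PyTorch; each one wraps an optimizer and is typically stepped once per epoch (the toy model and epoch count are illustrative).

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)   # step decay
# torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)               # exponential decay
# torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)             # cosine annealing

for epoch in range(30):
    # ... run the training loop for this epoch, calling opt.step() per batch ...
    sched.step()   # then adjust the learning rate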
Second-Order Methods
These methods use second-order derivatives (information about the curvature of the loss surface):