Technical_writing (2)
Vu Quang Nam, Phan Duc Dung, Nguyen Thi Ha, Dinh Ngoc Duyen
[email protected] [email protected] [email protected] [email protected]
Abstract
Gradient descent optimization algorithms have become increasingly prevalent, but they are often treated as black-
box tools because their strengths and limitations are not easily explained. This article seeks to equip readers with
a deeper understanding of how these algorithms behave, enabling them to apply the methods effectively. The
discussion includes an exploration of various gradient descent variants, an overview of common challenges, an
introduction to widely-used optimization techniques, a review of parallel and distributed architectures, and an
investigation of supplementary strategies for improving gradient descent performance.
1 Introduction
Gradient descent stands out as one of the most widely used algorithms for optimization, especially as the dominant method
for training neural networks. Meanwhile, modern Deep Learning libraries come equipped with implementations of various
algorithms designed to enhance the performance of gradient descent (e.g., lasagne's2, caffe's3, and keras'4 documentation).
However, these methods are frequently treated as black-box optimizers, with limited practical understanding available
about their specific strengths and weaknesses.
This article seeks to equip readers with a deeper intuition about the behavior of different gradient descent optimization
algorithms, enabling them to apply these methods more effectively. Section 2 begins by exploring the various forms of
gradient descent. Section 3 then outlines the common challenges encountered during training. Following this, Section 4
introduces key optimization algorithms, discussing their motivations for addressing these challenges and detailing the
derivation of their update rules. In Section 5, we examine parallel and distributed strategies for enhancing gradient descent.
Finally, Section 6 highlights supplementary techniques that can further optimize the gradient descent process.
Gradient descent is a fundamental approach for minimizing an objective function J(θ), which is parameterized by the
model’s parameters θ ∈ Rd . The method involves iteratively updating these parameters in the direction opposite to the
gradient of the objective function ∇θ J(θ) with respect to θ. The step size of these updates is controlled by a learning rate η,
which determines how far the parameters move at each step to approach a (local) minimum. Essentially, the algorithm
follows the slope of the objective function downhill, like descending into a valley.5
2 Gradient Descent Variants
There are three main variants of gradient descent, which differ based on the amount of data used to compute the gradient of
the objective function. This choice involves a trade-off between the accuracy of parameter updates and the time required
for these updates.
1 This paper originally appeared as a blog post at http://sebastianruder.com/optimizing-gradient-descent/index.html on 19 January 2016.
2 http://lasagne.readthedocs.org/en/latest/modules/updates.html
3 http://caffe.berkeleyvision.org/tutorial/solver.html
4 http://keras.io/optimizers/
5 For a detailed introduction to gradient descent and its application in optimizing neural networks, refer to http://cs231n.github.io/optimization-1/
2.1 Batch Gradient Descent
Also known as vanilla gradient descent, batch gradient descent calculates the gradient of the cost function with respect to
the parameters θ using the entire training dataset:
θ = θ − η · ∇θ J(θ) (1)
Since it requires calculating gradients for the full dataset for each update, batch gradient descent can be slow and impractical
for large datasets that do not fit in memory. It also does not support online updates with new examples.
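In code, batch gradient descent can be sketched as follows; data, params, nb_epochs, learning_rate, loss_function, and evaluate_gradient are the same placeholders used in the snippets later in this section:

for i in range(nb_epochs):
    # one update per epoch, computed over the entire training set
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad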
For a specified number of epochs, we initially calculate the gradient vector params_grad of the loss function for the
entire dataset in relation to our parameter vector params. It’s important to note that modern deep learning libraries offer
automatic differentiation, which efficiently computes the gradient with respect to some parameters. If you manually derive
the gradients, performing gradient checking is advisable6.
Subsequently, we update our parameters in the direction of the gradients, with the learning rate dictating the size of each
update. Batch gradient descent is guaranteed to converge to the global minimum for convex functions and a local minimum
for non-convex functions.
2.2 Stochastic Gradient Descent
In contrast, stochastic gradient descent (SGD) updates the parameters for each training example x(i) and label y(i):
θ = θ − η · ∇θ J(θ; x(i) ; y(i) )    (2)
Batch gradient descent involves redundant calculations for large datasets, as it recalculates gradients for similar examples
prior to each parameter update. In contrast, stochastic gradient descent (SGD) eliminates this redundancy by updating
parameters one at a time, making it generally much faster and suitable for online learning. However, SGD performs frequent
updates with high variance, which causes the objective function to fluctuate heavily, as illustrated in Figure 1.
While batch gradient descent converges to the minimum within the basin of parameters, the fluctuations of SGD allow it to
explore new and potentially better local minima. However, this can complicate the convergence to the exact minimum, as
SGD tends to overshoot. Nonetheless, studies have shown that by gradually reducing the learning rate, SGD can exhibit
convergence behavior similar to that of batch gradient descent, likely converging to either a local or global minimum for
non-convex and convex optimization, respectively. The code for SGD simply includes a loop over the training examples,
evaluating the gradient for each one. Additionally, it’s important to shuffle the training data at every epoch, as detailed in
Section 6.1.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
2.3 Mini-batch Gradient Descent
Mini-batch gradient descent combines the advantages of both methods by performing updates for every mini-batch of n
training examples:
θ = θ − η · ∇θ J(θ; x(i:i+n) ; y (i:i+n) ) (3)
6 Refer to https://cs231n.github.io/neural-networks-3/ for some great tips on how to check gradients properly.
Figure 1: SGD fluctuation (Source: Wikipedia)
This approach (a) decreases the variance of parameter updates, leading to more stable convergence, and (b) utilizes highly
optimized matrix operations found in advanced deep learning libraries, making the computation of gradients for a mini-batch
very efficient. Typical mini-batch sizes range from 50 to 256, though this can vary based on specific applications. Mini-batch
gradient descent is generally preferred for training neural networks, and the term SGD is often used even when mini-batches
are involved. Note that in the subsequent modifications of SGD, we will omit the parameters x(i:i+n) and y (i:i+n) for
simplicity.
In code, rather than iterating over individual examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
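The get_batches helper is not defined in the snippet above; a minimal sketch of one possible implementation, assuming data is an indexable sequence such as a list or NumPy array, could be:

def get_batches(data, batch_size=50):
    # yield successive mini-batches of at most batch_size examples
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]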
3 Challenges
However, vanilla mini-batch gradient descent does not ensure effective convergence and presents several challenges that
need to be addressed:
– Selecting an appropriate learning rate can be challenging. A rate that is too low results in sluggish convergence, while
a rate that is too high can impede convergence and lead to fluctuations in the loss function around the minimum or
even cause divergence.
– Learning rate schedules [18] aim to adjust the learning rate during training, such as through annealing, where the
rate is reduced based on a predefined schedule or when the change in the objective between epochs falls below a
certain threshold. However, these schedules and thresholds must be set in advance and cannot adapt to the specific
characteristics of the dataset [4].
– Applying the same learning rate for all parameter updates can be problematic. In cases where the data is sparse and
features exhibit varying frequencies, it may be preferable to update some features more significantly than others,
particularly those that occur infrequently.
– Another significant challenge in minimizing the highly non-convex error functions typical of neural networks is
avoiding entrapment in the many suboptimal local minima. Dauphin et al. [20] argue that the real issue arises not from
local minima but from saddle points—locations where one dimension inclines while another declines. These saddle
points are often encircled by flat regions of similar error, making it particularly difficult for SGD to escape, as the
gradient approaches zero in all directions.
4 Gradient Descent Optimization Algorithms
This section describes some algorithms commonly utilized by the Deep Learning community to address the stated challenges.
Algorithms that are computationally impractical for high-dimensional datasets, such as second-order methods like Newton’s
method7 , will not be covered.
4.1 Momentum
Stochastic Gradient Descent (SGD) struggles to navigate ravines—regions where the surface has steep curvature in one
direction but gentle curvature in another [19]. These areas, commonly found near local optima, cause SGD to oscillate
along the steep slopes while making slow progress at the bottom toward the optimum (see Figure 2a).
Momentum [17] addresses this issue by accelerating SGD in the desired direction and minimizing oscillations, as illustrated
in Figure 2b. It achieves this by adding a fraction γ of the previous update vector8 to the current one:
vt = γvt−1 + η∇θ J(θ)
θ = θ − vt    (4)
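As an illustration, the following self-contained sketch applies the momentum update to a toy one-dimensional objective J(θ) = (θ − 3)²; the objective, learning rate, and number of steps are chosen purely for demonstration:

def grad(theta):                       # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0                    # parameter and accumulated update vector (velocity)
eta, gamma = 0.1, 0.9                  # learning rate and momentum coefficient
for t in range(100):
    v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                  # theta = theta - v_t
print(theta)                           # approaches the minimum at 3.0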
4.2 Nesterov Accelerated Gradient
A ball rolling downhill that simply follows the slope is not ideal. We want a more intelligent ball—one that anticipates
upcoming slopes and adjusts its speed accordingly. Nesterov Accelerated Gradient (NAG) [14] introduces this level of
foresight into the momentum term.
NAG leverages the momentum term γvt−1 to estimate where the parameters θ will be in the next step. By using this
predicted position, we can compute the gradient based on this approximation rather than the current position. This method
allows us to anticipate and correct our updates effectively:
vt = γvt−1 + η∇θ J(θ − γvt−1 )
θ = θ − vt    (5)
The value of the momentum term γ is typically set around 0.9. Unlike standard momentum, which calculates the current
gradient and then applies an update in that direction, NAG takes a larger initial step in the direction of the previous
momentum (depicted as a brown vector), computes the gradient, and then makes a corrective adjustment (green vector)
(see Figure 3).
This anticipatory behavior prevents overshooting and enhances responsiveness, which has significantly improved the performance of RNNs on a number of tasks9 [3].
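For comparison, a sketch of the NAG update on the same toy objective J(θ) = (θ − 3)² (again with illustrative constants) differs from plain momentum only in where the gradient is evaluated:

def grad(theta):                           # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, gamma = 0.1, 0.9
for t in range(100):
    lookahead = theta - gamma * v          # approximate next position using the previous momentum
    v = gamma * v + eta * grad(lookahead)  # gradient evaluated at the look-ahead point, not at theta
    theta = theta - v
print(theta)                               # approaches the minimum at 3.0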
7 https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization
8 Some implementations exchange the signs in the equations.
Figure 3: Nesterov update (Source: G. Hinton’s lecture 6c)
4.3 Adagrad
Adagrad [6] is an optimization algorithm that adjusts the learning rate for each parameter, allowing for larger updates on
infrequent parameters and smaller updates on frequent ones. This feature makes it particularly effective for sparse data. For
instance, Dean et al. [5] demonstrated that Adagrad enhanced the robustness of SGD, using it to train large-scale neural
networks at Google, which famously learned to identify cats in YouTube videos. Additionally, Pennington et al. [16]
employed Adagrad for training GloVe word embeddings, benefiting from its ability to provide larger updates for less
frequent words.
Unlike standard SGD, which uses a fixed learning rate for all parameters, Adagrad calculates a unique learning rate for
each parameter at every time step. Let gt,i represent the gradient of the objective function with respect to parameter i at
time t. The update for parameter θi becomes:
θt+1,i = θt,i − η / √(Gt,ii + ε) · gt,i    (6)
Here, Gt is a diagonal matrix where each diagonal element, Gt,ii , is the sum of the squares of the gradients for parameter i
up to time t. The smoothing term ε prevents division by zero and is typically set around 10−8 . Without the square root
operation, the algorithm performs significantly worse.
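As a small illustration of this per-parameter scaling, the following sketch runs Adagrad on a toy two-parameter quadratic whose coordinates produce gradients of very different magnitudes; the objective and constants (including a learning rate larger than the default mentioned below) are purely illustrative:

import numpy as np

def grad(theta):
    # toy objective (theta_0 - 1)**2 + 0.1 * (theta_1 - 5)**2:
    # the second coordinate yields much smaller gradients than the first
    return np.array([2.0 * (theta[0] - 1.0), 0.2 * (theta[1] - 5.0)])

theta = np.zeros(2)
G = np.zeros(2)                            # running sum of squared gradients, one entry per parameter
eta, eps = 0.5, 1e-8
for t in range(500):
    g = grad(theta)
    G += g ** 2                            # accumulate squared gradients coordinate-wise
    theta -= eta * g / (np.sqrt(G) + eps)  # each parameter receives its own effective step size
print(theta)                               # both coordinates move towards (1, 5) despite unequal gradients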
Adagrad simplifies optimization by removing the need for manual tuning of the learning rate. A default value of η = 0.01 is
commonly used. However, a drawback of Adagrad is the continuous accumulation of squared gradients in Gt , which causes
the learning rate to diminish over time. This eventually reduces the algorithm’s ability to learn, prompting the development
of newer methods to address this limitation.
4.4 RMSprop
RMSprop is an adaptive learning rate optimization technique introduced by Geoff Hinton in Lecture 6e of his Coursera
course. Although it remains unpublished, RMSprop has gained widespread use in the field of deep learning.
Both RMSprop and Adadelta were independently created to address the problem of Adagrad's rapidly decreasing learning rates.
9 Refer to http://cs231n.github.io/neural-networks-3/ for another explanation of the intuitions behind NAG, while Ilya Sutskever gives a more detailed overview in his PhD thesis [8].
RMSprop adjusts the learning rate by dividing it by an exponentially decaying average E[g²]t of squared gradients:
E[g²]t = γ E[g²]t−1 + (1 − γ) gt²
θt+1 = θt − η / √(E[g²]t + ε) · gt    (7)
Hinton suggests setting the decay γ to 0.9, with η = 0.001 being a commonly used default learning rate.
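A minimal self-contained sketch of this update on a toy one-dimensional objective follows; the decay of 0.9 matches the suggestion above, while the learning rate and iteration count are illustrative choices so the toy run stays short:

import math

def grad(theta):                      # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, avg_sq = 0.0, 0.0
gamma, eta, eps = 0.9, 0.01, 1e-8     # eta larger than the 0.001 default, purely for the toy problem
for t in range(2000):
    g = grad(theta)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2  # decaying average of squared gradients
    theta -= eta * g / (math.sqrt(avg_sq) + eps)    # divide the learning rate by its root
print(theta)                          # settles in a small neighbourhood of the minimum at 3.0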
4.5 Adam
Adaptive Moment Estimation (Adam) [10] is another optimization algorithm that calculates adaptive learning rates for
each parameter. Like Adadelta and RMSprop, Adam uses an exponentially decaying average of past squared gradients (vt ).
Additionally, it incorporates an exponentially decaying average of past gradients (mt ), similar to the momentum method:
mt = β1 mt−1 + (1 − β1 ) gt
vt = β2 vt−1 + (1 − β2 ) gt²    (8)
Here, mt and vt represent estimates of the first moment (mean) and second moment (uncentered variance) of the gradients,
respectively, which gives the method its name.
Since both mt and vt are initialized to zero, they are initially biased towards zero, especially when the decay rates (β1 and
β2 ) are close to 1. To address this, Adam uses bias-corrected estimates for the first and second moments:
m̂t = mt / (1 − β1^t )
v̂t = vt / (1 − β2^t )    (9)
The parameters are then updated using these corrected estimates, following a rule similar to those in Adadelta and RMSprop:
θt+1 = θt − η · m̂t / (√v̂t + ε)    (10)
The authors recommend default values of β1 = 0.9, β2 = 0.999, and ε = 10−8 . Empirical results indicate that Adam
performs well in practice and often outperforms other adaptive learning rate methods.
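As a self-contained illustration, the sketch below applies the Adam update to a toy one-dimensional objective J(θ) = (θ − 3)²; the recommended β1, β2, and ε are used, while the learning rate and step count are chosen only to keep the toy run short:

import math

def grad(theta):                      # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8  # eta chosen large for the toy; 0.001 is a common default
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g             # decaying average of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2        # decaying average of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
print(theta)                                    # moves towards the minimum at 3.0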
5 Parallelizing and Distributing SGD
Given the pervasiveness of large-scale data solutions and the accessibility of inexpensive clusters, enhancing SGD by
distributing its processes for greater speed becomes a natural option. Stochastic Gradient Descent, by its very nature,
operates in a sequential manner: each step incrementally moves closer to the minimum. Running it this way provides good
convergence, but it can be time-consuming, particularly on large datasets. Conversely, running SGD asynchronously is faster,
but suboptimal communication between workers may result in poor convergence. Moreover, we can also parallelize SGD
on a single machine without the need for a large computing cluster. The following are algorithms and frameworks that have
been proposed to optimize parallelized and distributed SGD.
5.1 Hogwild!
Niu et al. [15] introduce an update scheme named Hogwild!, which enables parallel execution of SGD updates on CPUs.
Instead of locking the parameters, processors can directly access shared memory. This approach is effective only when the
input data is sparse, as each update adjusts only a small portion of all parameters. Their findings demonstrate that, under
these conditions, the update scheme attains a near-optimal convergence rate as it is unlikely that processors will overwrite
useful information.
5.2 DownPour SGD
Downpour SGD is an asynchronous version of SGD employed by Dean et al. [7] in their DistBelief framework at Google,
which preceded TensorFlow. This method operates by running multiple model replicas in parallel on subsets of the training
data. These models transmit their updates to a parameter server, which is split across many machines. Each machine handles
the storage and updating of a portion of the model’s parameters. However, because replicas do not interact with each other,
such as sharing weights or updates, their parameters remain prone to divergence, which can impede convergence.
5.3 Delay-tolerant Algorithms for SGD
McMahan and Streeter [12] expand AdaGrad into a parallel context by developing delay-tolerant algorithms that adapt not
only to past gradients but also to update delays; these have been shown to work well in practice.
5.4 TensorFlow
TensorFlow [1] is a framework recently open-sourced by Google for implementing and deploying large-scale machine
learning models. Built upon their experience with DistBelief, it is already utilized internally to execute computations on
a large range of mobile devices as well as on large-scale distributed systems. The distributed version, released in April
2016, splits a computation graph into subgraphs for each device, with communication occurring through Send/Receive
node pairs.
5.5 Elastic Averaging SGD
Zhang et al. [22] introduce Elastic Averaging SGD (EASGD), which connects the parameters of the workers of asynchronous
SGD to a center variable stored on the parameter server using an elastic force. This permits the local variables to fluctuate
further from the center variable, theoretically enabling greater exploration of the parameter space. They show empirically
that this enhanced exploration capability improves performance by discovering new local optima.
6 Additional Strategies for Optimizing SGD
Finally, we present additional strategies that can be used alongside any of the previously mentioned algorithms to enhance
the performance of SGD. For a comprehensive summary of some other common tricks, refer to [11].
6.1 Shuffling and Curriculum Learning
In general, we want to avoid presenting training examples to the model in an organized sequence, as this could introduce
biases into the optimization process. Therefore, shuffling the training dataset after completing each epoch is commonly
suggested. Conversely, in scenarios where the goal is to tackle progressively harder problems, supplying the training
examples in a meaningful order may actually lead to improved performance and better convergence. This strategy is known
as Curriculum Learning [2]. Zaremba and Sutskever [21] illustrated that training LSTMs to process simple programs was
achievable only through the use of Curriculum Learning. Moreover, they demonstrated that combining or mixing strategies
yielded better results than the straightforward approach of ordering examples by ascending difficulty.
6.2 Batch Normalization
To facilitate learning, parameter values are often standardized initially by initializing them with zero mean and unit variance.
As training progresses and we update parameters to different extents, we lose this standardization, which slows down
training and amplifies changes as the network becomes deeper. Batch normalization [9] restores these normalizations for
every mini-batch and changes are backpropagated through the operation as well. By making normalization part of the
model architecture, higher learning rates can be utilized and less attention needs to be paid to the initialization parameters.
Batch normalization additionally acts as a regularizer, often decreasing or even eliminating the need for Dropout.
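For intuition, a minimal sketch of the batch-normalization transform for one mini-batch of activations is given below (forward pass only; the learnable scale gamma and shift beta, the feature count, and the sample mini-batch are illustrative):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations with shape (batch_size, num_features)
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # re-standardize to zero mean and unit variance
    return gamma * x_hat + beta              # learnable scale and shift, trained by backpropagation

x = np.random.randn(50, 4) * 10.0 + 3.0      # a mini-batch whose activations have drifted
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))         # approximately zero mean and unit variance per feature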
6.3 Early stopping
As Geoff Hinton describes, "Early stopping is beautiful free lunch". Therefore, it is crucial to continuously observe the
error on a validation set during training and terminate the process, after allowing for some patience, if the validation error
fails to improve sufficiently.
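A minimal sketch of this rule is shown below; train_one_epoch, validation_error, params, and max_epochs are hypothetical placeholders for the user's own training loop, evaluation routine, parameters, and epoch budget:

# train_one_epoch(params) and validation_error(params) are assumed placeholders
best_error, best_params = float("inf"), None
patience, bad_epochs, min_delta = 5, 0, 1e-4      # illustrative patience and improvement threshold
for epoch in range(max_epochs):
    params = train_one_epoch(params)
    error = validation_error(params)
    if error < best_error - min_delta:            # validation error improved sufficiently
        best_error, best_params, bad_epochs = error, params, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # no sufficient improvement for patience epochs
            break                                 # stop training and keep best_params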
6.4 Gradient Noise
Neelakantan et al. [13] add noise that follows a Gaussian distribution N(0, σt²) to each gradient update:
σt² = η / (1 + t)^γ    (12)
They demonstrate that adding this disturbance makes networks more resilient to poor initialization and assists in training,
especially for deep and intricate models. They propose that the added noise provides the model with more opportunities to
avoid and identify new local minima, which are more frequent in deeper architectures.
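A small sketch of this annealed noise schedule, applied to an arbitrary gradient vector, is shown below; the values η = 0.01 and γ = 0.55 are illustrative choices, and add_gradient_noise is a hypothetical helper name:

import numpy as np

rng = np.random.default_rng(0)

def add_gradient_noise(grad, t, eta=0.01, gamma_decay=0.55):
    # sigma_t^2 = eta / (1 + t)**gamma_decay, so the noise is annealed towards zero over training
    sigma = np.sqrt(eta / (1.0 + t) ** gamma_decay)
    return grad + rng.normal(0.0, sigma, size=np.shape(grad))

# usage inside any of the earlier update loops:
# g = add_gradient_noise(evaluate_gradient(loss_function, batch, params), t)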
7 Conclusion
In this article, we first examined the three variants of gradient descent, among which mini-batch gradient descent is the
most popular. We then reviewed the algorithms most frequently used to optimize SGD: Momentum, Nesterov accelerated
gradient, Adagrad, RMSprop, and Adam, as well as several algorithms and frameworks for parallelized and distributed
(asynchronous) SGD. Lastly, we considered additional strategies for improving SGD, such as shuffling, curriculum learning,
batch normalization, early stopping, and gradient noise.
References
[1] Martín Abadi et al. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”. In: (Mar.
2016). DOI: 10.48550/arXiv.1603.04467.
[2] Y. Bengio et al. “Curriculum learning”. In: vol. 60. June 2009, p. 6. DOI: 10.1145/1553374.1553380.
[3] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. “Advances in optimizing recurrent net-
works”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012), pp. 8624–8628.
URL: https://api.semanticscholar.org/CorpusID:12485056.
[4] C. Darken, J. Chang, and J. Moody. “Learning rate schedules for faster stochastic gradient search”. In: Neural Networks
for Signal Processing II: Proceedings of the 1992 IEEE Workshop (1992), pp. 1–11.
[5] Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. In: Advances in neural information processing systems
(Oct. 2012).
[6] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive Subgradient Methods for Online Learning and Stochastic
Optimization”. In: Journal of Machine Learning Research 12 (July 2011), pp. 2121–2159.
[7] Diandian Gu et al. Demystifying Developers’ Issues in Distributed Training of Deep Learning Software. Dec. 2021.
DOI: 10.48550/arXiv.2112.06222.
[8] Geoffrey E. Hinton and Ilya Sutskever. “Training Recurrent Neural Networks”. In: 2013. URL: https://api.
semanticscholar.org/CorpusID:61713861.
[9] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift”. In: (Feb. 2015).
[10] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980
(2014). URL: https://api.semanticscholar.org/CorpusID:6628106.
[11] Yann Lecun et al. “Efficient BackProp”. In: (Aug. 2000).
[12] H.B. McMahan and M. Streeter. “Delay-tolerant algorithms for asynchronous distributed online learning”. In:
Advances in Neural Information Processing Systems 4 (Jan. 2014), pp. 2915–2923.
[13] Arvind Neelakantan et al. “Adding Gradient Noise Improves Learning for Very Deep Networks”. In: (Nov. 2015).
DOI: 10.48550/arXiv.1511.06807.
[14] Yurii Nesterov. “A method for unconstrained convex minimization problem with the rate of convergence o(1/k2 )”.
In: 1983. URL: https://api.semanticscholar.org/CorpusID:202149403.
[15] Feng Niu et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”. In: NIPS 24
(June 2011).
[16] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vectors for Word Representation”.
In: vol. 14. Jan. 2014, pp. 1532–1543. DOI: 10.3115/v1/D14-1162.
[17] Ning Qian. “On the momentum term in gradient descent learning algorithms”. In: Neural Networks: The Official
Journal of the International Neural Network Society 12.1 (1999), pp. 145–151. URL: https://api.semanticscholar.org/CorpusID:2783597.
[18] Herbert Robbins and Sutton Monro. “A Stochastic Approximation Method”. In: The Annals of Mathematical Statistics
22.3 (1951), pp. 400–407.
[19] Richard S. Sutton. “Two problems with backpropagation and other steepest-descent learning procedures for networks”.
In: (1986). URL: https://escholarship.org/uc/item/2z66t36g.
[20] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. “Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization”. In: arXiv (2014), pp. 1–14.
[21] Wojciech Zaremba and Ilya Sutskever. “Learning to Execute”. In: (Oct. 2014).
[22] Sixin Zhang, Anna Choromanska, and Yann Lecun. “Deep learning with Elastic Averaging SGD”. In: (Dec. 2014).