Technical_writing (2)
Vu Quang Nam, Phan Duc Dung, Nguyen Thi Ha, Dinh Ngoc Duyen
[email protected] [email protected] [email protected] [email protected]
Abstract
Gradient descent optimization algorithms have become increasingly prevalent, but they are often treated as black-
box tools because their strengths and limitations are not easily explained. This article seeks to equip readers with
a deeper understanding of how these algorithms behave, enabling them to apply the methods effectively. The
discussion includes an exploration of various gradient descent variants, an overview of common challenges, an
introduction to widely-used optimization techniques, a review of parallel and distributed architectures, and an
investigation of supplementary strategies for improving gradient descent performance.
1 Introduction
Gradient descent stands out as one of the most widely used algorithms for optimization, especially as the dominant method
for training neural networks. Meanwhile, modern Deep Learning libraries come equipped with implementations of various
algorithms designed to enhance the performance of gradient descent (e.g., lasagne's2, caffe's3, and keras'4 documentation).
However, these methods are frequently treated as black-box optimizers, with limited practical understanding available
about their specific strengths and weaknesses.
This article seeks to equip readers with a deeper intuition about the behavior of different gradient descent optimization
algorithms, enabling them to apply these methods more effectively. Section 2 begins by exploring the various forms of
gradient descent. Section 3 then outlines the common challenges encountered during training. Following this, Section 4
introduces key optimization algorithms, discussing their motivations for addressing these challenges and detailing the
derivation of their update rules. In Section 5, we examine parallel and distributed strategies for enhancing gradient descent.
Finally, Section 6 highlights supplementary techniques that can further optimize the gradient descent process.
Gradient descent is a fundamental approach for minimizing an objective function J(θ), which is parameterized by the
model’s parameters θ ∈ Rd . The method involves iteratively updating these parameters in the direction opposite to the
gradient of the objective function ∇θ J(θ) with respect to θ. The step size of these updates is controlled by a learning rate η,
which determines how far the parameters move at each step to approach a (local) minimum. Essentially, the algorithm
follows the slope of the objective function downhill, like descending into a valley.5
2 Gradient Descent Variants
There are three main variants of gradient descent, which differ based on the amount of data used to compute the gradient of
the objective function. This choice involves a trade-off between the accuracy of parameter updates and the time required
for these updates.
1 This paper originally appeared as a blog post at http://sebastianruder.com/optimizing-gradient-descent/index.html on 19 January 2016.
2 http://lasagne.readthedocs.org/en/latest/modules/updates.html
3 http://caffe.berkeleyvision.org/tutorial/solver.html
4 http://keras.io/optimizers/
5 For a detailed introduction to gradient descent and its application in optimizing neural networks, refer to http://cs231n.github.io/optimization-1/
2.1 Batch Gradient Descent
Also known as vanilla gradient descent, batch gradient descent calculates the gradient of the cost function with respect to
the parameters θ using the entire training dataset:
θ = θ − η · ∇θ J(θ) (1)
Since it requires calculating gradients for the full dataset for each update, batch gradient descent can be slow and impractical
for large datasets that do not fit in memory. It also does not support online updates with new examples.
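In code, batch gradient descent can be sketched as follows; data, params, nb_epochs, learning_rate, loss_function, and evaluate_gradient are the same placeholders used in the snippets later in this section:

for i in range(nb_epochs):
    # one update per epoch, computed over the entire training set
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad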
For a specified number of epochs, we initially calculate the gradient vector params_grad of the loss function for the
entire dataset in relation to our parameter vector params. It’s important to note that modern deep learning libraries offer
automatic differentiation, which efficiently computes the gradient with respect to some parameters. If you manually derive
the gradients, performing gradient checking is advisable6.
Subsequently, we update our parameters in the direction of the gradients, with the learning rate dictating the size of each
update. Batch gradient descent is guaranteed to converge to the global minimum for convex functions and a local minimum
for non-convex functions.
2.2 Stochastic Gradient Descent
In contrast, stochastic gradient descent (SGD) updates the parameters for each training example x(i) and label y(i):
θ = θ − η · ∇θ J(θ; x(i) ; y(i) )    (2)
Batch gradient descent involves redundant calculations for large datasets, as it recalculates gradients for similar examples
prior to each parameter update. In contrast, stochastic gradient descent (SGD) eliminates this redundancy by updating
parameters one at a time, making it generally much faster and suitable for online learning. However, SGD performs frequent
updates with high variance, which causes the objective function to fluctuate heavily, as illustrated in Figure 1.
While batch gradient descent converges to the minimum within the basin of parameters, the fluctuations of SGD allow it to
explore new and potentially better local minima. However, this can complicate the convergence to the exact minimum, as
SGD tends to overshoot. Nonetheless, studies have shown that by gradually reducing the learning rate, SGD can exhibit
convergence behavior similar to that of batch gradient descent, likely converging to either a local or global minimum for
non-convex and convex optimization, respectively. The code for SGD simply includes a loop over the training examples,
evaluating the gradient for each one. Additionally, it’s important to shuffle the training data at every epoch, as detailed in
Section 6.1.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
2.3 Mini-batch Gradient Descent
Mini-batch gradient descent combines the advantages of both methods by performing updates for every mini-batch of n
training examples:
θ = θ − η · ∇θ J(θ; x(i:i+n) ; y (i:i+n) ) (3)
6 Refer to https://cs231n.github.io/neural-networks-3/ for some great tips on how to check gradients properly.
Figure 1: SGD fluctuation (Source: Wikipedia)
This approach (a) decreases the variance of parameter updates, leading to more stable convergence, and (b) utilizes highly
optimized matrix operations found in advanced deep learning libraries, making the computation of gradients for a mini-batch
very efficient. Typical mini-batch sizes range from 50 to 256, though this can vary based on specific applications. Mini-batch
gradient descent is generally preferred for training neural networks, and the term SGD is often used even when mini-batches
are involved. Note that in the subsequent modifications of SGD, we will omit the parameters x(i:i+n) and y (i:i+n) for
simplicity.
In code, rather than iterating over individual examples, we now iterate over mini-batches of size 50:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
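The get_batches helper is not defined in the snippet above; a minimal sketch of one possible implementation, assuming data is an indexable sequence such as a list or NumPy array, could be:

def get_batches(data, batch_size=50):
    # yield successive mini-batches of at most batch_size examples
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]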
3 Challenges
However, vanilla mini-batch gradient descent does not ensure effective convergence and presents several challenges that
need to be addressed:
– Selecting an appropriate learning rate can be challenging. A rate that is too low results in sluggish convergence, while
a rate that is too high can impede convergence and lead to fluctuations in the loss function around the minimum or
even cause divergence.
– Learning rate schedules [18] aim to adjust the learning rate during training, such as through annealing, where the
rate is reduced based on a predefined schedule or when the change in the objective between epochs falls below a
certain threshold. However, these schedules and thresholds must be set in advance and cannot adapt to the specific
characteristics of the dataset [4].
– Applying the same learning rate for all parameter updates can be problematic. In cases where the data is sparse and
features exhibit varying frequencies, it may be preferable to update some features more significantly than others,
particularly those that occur infrequently.
– Another significant challenge in minimizing the highly non-convex error functions typical of neural networks is
avoiding entrapment in the many suboptimal local minima. Dauphin et al. [20] argue that the real issue arises not from
local minima but from saddle points—locations where one dimension inclines while another declines. These saddle
points are often encircled by flat regions of similar error, making it particularly difficult for SGD to escape, as the
gradient approaches zero in all directions.
4 Gradient Descent Optimization Algorithms
This section describes some algorithms commonly utilized by the Deep Learning community to address the stated challenges.
Algorithms that are computationally impractical for high-dimensional datasets, such as second-order methods like Newton’s
method7 , will not be covered.
4.1 Momentum
Stochastic Gradient Descent (SGD) struggles to navigate ravines—regions where the surface has steep curvature in one
direction but gentle curvature in another [19]. These areas, commonly found near local optima, cause SGD to oscillate
along the steep slopes while making slow progress at the bottom toward the optimum (see Figure 2a).
Momentum [17] addresses this issue by accelerating SGD in the desired direction and minimizing oscillations, as illustrated
in Figure 2b. It achieves this by adding a fraction γ of the previous update vector8 to the current one:
vt = γvt−1 + η∇θ J(θ)
θ = θ − vt    (4)
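As an illustration, the following self-contained sketch applies the momentum update to a toy one-dimensional objective J(θ) = (θ − 3)²; the objective, learning rate, and number of steps are chosen purely for demonstration:

def grad(theta):                       # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0                    # parameter and accumulated update vector (velocity)
eta, gamma = 0.1, 0.9                  # learning rate and momentum coefficient
for t in range(100):
    v = gamma * v + eta * grad(theta)  # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                  # theta = theta - v_t
print(theta)                           # approaches the minimum at 3.0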
4.2 Nesterov Accelerated Gradient
A ball rolling downhill that simply follows the slope is not ideal. We want a more intelligent ball—one that anticipates
upcoming slopes and adjusts its speed accordingly. Nesterov Accelerated Gradient (NAG) [14] introduces this level of
foresight into the momentum term.
NAG leverages the momentum term γvt−1 to estimate where the parameters θ will be in the next step. By using this
predicted position, we can compute the gradient based on this approximation rather than the current position. This method
allows us to anticipate and correct our updates effectively:
vt = γvt−1 + η∇θ J(θ − γvt−1 )
θ = θ − vt    (5)
The value of the momentum term γ is typically set around 0.9. Unlike standard momentum, which calculates the current
gradient and then applies an update in that direction, NAG takes a larger initial step in the direction of the previous
momentum (depicted as a brown vector), computes the gradient, and then makes a corrective adjustment (green vector)
(see Figure 3).
This anticipatory behavior prevents overshooting and enhances responsiveness, which has significantly improved the performance of RNNs on a number of tasks9 [3].
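For comparison, a sketch of the NAG update on the same toy objective J(θ) = (θ − 3)² (again with illustrative constants) differs from plain momentum only in where the gradient is evaluated:

def grad(theta):                           # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, v = 0.0, 0.0
eta, gamma = 0.1, 0.9
for t in range(100):
    lookahead = theta - gamma * v          # approximate next position using the previous momentum
    v = gamma * v + eta * grad(lookahead)  # gradient evaluated at the look-ahead point, not at theta
    theta = theta - v
print(theta)                               # approaches the minimum at 3.0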
7 https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization
8 Some implementations exchange the signs in the equations.
Figure 3: Nesterov update (Source: G. Hinton’s lecture 6c)
4.3 Adagrad
Adagrad [6] is an optimization algorithm that adjusts the learning rate for each parameter, allowing for larger updates on
infrequent parameters and smaller updates on frequent ones. This feature makes it particularly effective for sparse data. For
instance, Dean et al. [5] demonstrated that Adagrad enhanced the robustness of SGD, using it to train large-scale neural
networks at Google, which famously learned to identify cats in YouTube videos. Additionally, Pennington et al. [16]
employed Adagrad for training GloVe word embeddings, benefiting from its ability to provide larger updates for less
frequent words.
Unlike standard SGD, which uses a fixed learning rate for all parameters, Adagrad calculates a unique learning rate for
each parameter at every time step. Let gt,i represent the gradient of the objective function with respect to parameter i at
time t. The update for parameter θi becomes:
θt+1,i = θt,i − η / √(Gt,ii + ε) · gt,i    (6)
Here, Gt is a diagonal matrix where each diagonal element, Gt,ii , is the sum of the squares of the gradients for parameter i
up to time t. The smoothing term ε prevents division by zero and is typically set around 10−8 . Without the square root
operation, the algorithm performs significantly worse.
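As a small illustration of this per-parameter scaling, the following sketch runs Adagrad on a toy two-parameter quadratic whose coordinates produce gradients of very different magnitudes; the objective and constants (including a learning rate larger than the default mentioned below) are purely illustrative:

import numpy as np

def grad(theta):
    # toy objective (theta_0 - 1)**2 + 0.1 * (theta_1 - 5)**2:
    # the second coordinate yields much smaller gradients than the first
    return np.array([2.0 * (theta[0] - 1.0), 0.2 * (theta[1] - 5.0)])

theta = np.zeros(2)
G = np.zeros(2)                            # running sum of squared gradients, one entry per parameter
eta, eps = 0.5, 1e-8
for t in range(500):
    g = grad(theta)
    G += g ** 2                            # accumulate squared gradients coordinate-wise
    theta -= eta * g / (np.sqrt(G) + eps)  # each parameter receives its own effective step size
print(theta)                               # both coordinates move towards (1, 5) despite unequal gradients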
Adagrad simplifies optimization by removing the need for manual tuning of the learning rate. A default value of η = 0.01 is
commonly used. However, a drawback of Adagrad is the continuous accumulation of squared gradients in Gt , which causes
the learning rate to diminish over time. This eventually reduces the algorithm’s ability to learn, prompting the development
of newer methods to address this limitation.
4.4 RMSprop
RMSprop is an adaptive learning rate optimization technique introduced by Geoff Hinton in Lecture 6e of his Coursera
course. Although it remains unpublished, RMSprop has gained widespread use in the field of deep learning.
Both RMSprop and Adadelta were independently created to address the problem of Adagrad's rapidly decreasing learning rates.
9 Refer to http://cs231n.github.io/neural-networks-3/ for another explanation of the intuitions behind NAG, while Ilya Sutskever gives a more detailed overview in his PhD thesis [8].
RMSprop adjusts the learning rate by dividing it by an exponentially decaying average E[g²]t of squared gradients:
E[g²]t = γ E[g²]t−1 + (1 − γ) gt²
θt+1 = θt − η / √(E[g²]t + ε) · gt    (7)
Hinton suggests setting the decay γ to 0.9, with η = 0.001 being a commonly used default learning rate.
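A minimal self-contained sketch of this update on a toy one-dimensional objective follows; the decay of 0.9 matches the suggestion above, while the learning rate and iteration count are illustrative choices so the toy run stays short:

import math

def grad(theta):                      # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, avg_sq = 0.0, 0.0
gamma, eta, eps = 0.9, 0.01, 1e-8     # eta larger than the 0.001 default, purely for the toy problem
for t in range(2000):
    g = grad(theta)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2  # decaying average of squared gradients
    theta -= eta * g / (math.sqrt(avg_sq) + eps)    # divide the learning rate by its root
print(theta)                          # settles in a small neighbourhood of the minimum at 3.0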
4.5 Adam
Adaptive Moment Estimation (Adam) [10] is another optimization algorithm that calculates adaptive learning rates for
each parameter. Like Adadelta and RMSprop, Adam uses an exponentially decaying average of past squared gradients (vt ).
Additionally, it incorporates an exponentially decaying average of past gradients (mt ), similar to the momentum method:
mt = β1 mt−1 + (1 − β1 ) gt
vt = β2 vt−1 + (1 − β2 ) gt²    (8)
Here, mt and vt represent estimates of the first moment (mean) and second moment (uncentered variance) of the gradients,
respectively, which gives the method its name.
Since both mt and vt are initialized to zero, they are initially biased towards zero, especially when the decay rates (β1 and
β2 ) are close to 1. To address this, Adam uses bias-corrected estimates for the first and second moments:
m̂t = mt / (1 − β1^t )
v̂t = vt / (1 − β2^t )    (9)
The parameters are then updated using these corrected estimates, following a rule similar to those in Adadelta and RMSprop:
θt+1 = θt − η · m̂t / (√v̂t + ε)    (10)
The authors recommend default values of β1 = 0.9, β2 = 0.999, and ε = 10−8 . Empirical results indicate that Adam
performs well in practice and often outperforms other adaptive learning rate methods.
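As a self-contained illustration, the sketch below applies the Adam update to a toy one-dimensional objective J(θ) = (θ − 3)²; the recommended β1, β2, and ε are used, while the learning rate and step count are chosen only to keep the toy run short:

import math

def grad(theta):                      # gradient of the toy objective J(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8  # eta chosen large for the toy; 0.001 is a common default
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g             # decaying average of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2        # decaying average of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
print(theta)                                    # moves towards the minimum at 3.0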
5 Parallelizing and Distributing SGD
Given the pervasiveness of large-scale data solutions and the accessibility of inexpensive clusters, enhancing SGD by
distributing its processes for greater speed becomes a natural option. Stochastic Gradient Descent, by its very nature,
operates in a sequential manner: each step incrementally moves closer to the minimum. Running it this way provides good
convergence, but it can be time-consuming, particularly on large datasets. Conversely, running SGD asynchronously is faster,
but suboptimal communication between workers may result in poor convergence. Moreover, we can also parallelize SGD
on a single machine without the need for a large computing cluster. The following are algorithms and frameworks that have
been proposed to optimize parallelized and distributed SGD.
5.1 Hogwild!
Niu et al. [15] introduce an update scheme named Hogwild!, which enables parallel execution of SGD updates on CPUs.
Instead of locking the parameters, processors can directly access shared memory. This approach is effective only when the
input data is sparse, as each update adjusts only a small portion of all parameters. Their findings demonstrate that, under
these conditions, the update scheme attains a near-optimal convergence rate as it is unlikely that processors will overwrite
useful information.
5.2 DownPour SGD
Downpour SGD is an asynchronous version of SGD employed by Dean et al. [7] in their DistBelief framework at Google,
which preceded TensorFlow. This method operates by running multiple model replicas in parallel on subsets of the training
data. These models transmit their updates to a parameter server, which is split across many machines. Each machine handles
the storage and updating of a portion of the model’s parameters. However, because replicas do not interact with each other,
such as sharing weights or updates, their parameters remain prone to divergence, which can impede convergence.
5.3 Delay-tolerant Algorithms for SGD
McMahan and Streeter [12] expand AdaGrad into a parallel context by developing delay-tolerant algorithms that adapt not
only to past gradients but also to update delays; these have been shown to work well in practice.
5.4 TensorFlow
TensorFlow [1] is a framework recently open-sourced by Google for implementing and deploying large-scale machine
learning models. Built upon their experience with DistBelief, it is already utilized internally to execute computations on
a large range of mobile devices as well as on large-scale distributed systems. The distributed version, released in April
2016, splits a computation graph into subgraphs for each device, with communication occurring through Send/Receive
node pairs.
5.5 Elastic Averaging SGD
Zhang et al. [22] introduce Elastic Averaging SGD (EASGD), which connects the parameters of the workers of asynchronous
SGD to a center variable stored on the parameter server using an elastic force. This permits the local variables to fluctuate
further from the center variable, theoretically enabling greater exploration of the parameter space. They show empirically
that this enhanced exploration capability improves performance by discovering new local optima.
6 Additional Strategies for Optimizing SGD
Finally, we present additional strategies that can be used alongside any of the previously mentioned algorithms to enhance
the performance of SGD. For a comprehensive summary of some other common tricks, refer to [11].
6.1 Shuffling and Curriculum Learning
In general, we want to avoid presenting training examples to the model in an organized sequence, as this could introduce
biases into the optimization process. Therefore, shuffling the training dataset after completing each epoch is commonly
suggested. Conversely, in scenarios where the goal is to tackle progressively harder problems, supplying the training
examples in a meaningful order may actually lead to improved performance and better convergence. This strategy is known
as Curriculum Learning [2]. Zaremba and Sutskever [21] illustrated that training LSTMs to process simple programs was
achievable only through the use of Curriculum Learning. Moreover, they demonstrated that combining or mixing strategies
yielded better results than the straightforward approach of ordering examples by ascending difficulty.
6.2 Batch Normalization
To facilitate learning, parameter values are often standardized initially by initializing them with zero mean and unit variance.
As training progresses and we update parameters to different extents, we lose this standardization, which slows down
training and amplifies changes as the network becomes deeper. Batch normalization [9] restores these normalizations for
every mini-batch and changes are backpropagated through the operation as well. By making normalization part of the
model architecture, higher learning rates can be utilized and less attention needs to be paid to the initialization parameters.
Batch normalization additionally acts as a regularizer, often decreasing or even eliminating the need for Dropout.
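For intuition, a minimal sketch of the batch-normalization transform for one mini-batch of activations is given below (forward pass only; the learnable scale gamma and shift beta, the feature count, and the sample mini-batch are illustrative):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of activations with shape (batch_size, num_features)
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # re-standardize to zero mean and unit variance
    return gamma * x_hat + beta              # learnable scale and shift, trained by backpropagation

x = np.random.randn(50, 4) * 10.0 + 3.0      # a mini-batch whose activations have drifted
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))         # approximately zero mean and unit variance per feature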
6.3 Early stopping
As Geoff Hinton describes, "Early stopping is beautiful free lunch". Therefore, it is crucial to continuously observe the
error on a validation set during training and terminate the process, after allowing for some patience, if the validation error
fails to improve sufficiently.
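A minimal sketch of this rule is shown below; train_one_epoch, validation_error, params, and max_epochs are hypothetical placeholders for the user's own training loop, evaluation routine, parameters, and epoch budget:

# train_one_epoch(params) and validation_error(params) are assumed placeholders
best_error, best_params = float("inf"), None
patience, bad_epochs, min_delta = 5, 0, 1e-4      # illustrative patience and improvement threshold
for epoch in range(max_epochs):
    params = train_one_epoch(params)
    error = validation_error(params)
    if error < best_error - min_delta:            # validation error improved sufficiently
        best_error, best_params, bad_epochs = error, params, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # no sufficient improvement for patience epochs
            break                                 # stop training and keep best_params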
6.4 Gradient Noise
Neelakantan et al. [13] add noise that follows a Gaussian distribution N(0, σt²) to each gradient update:
σt² = η / (1 + t)^γ    (12)
They demonstrate that adding this disturbance makes networks more resilient to poor initialization and assists in training,
especially for deep and intricate models. They propose that the added noise provides the model with more opportunities to
avoid and identify new local minima, which are more frequent in deeper architectures.
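A small sketch of this annealed noise schedule, applied to an arbitrary gradient vector, is shown below; the values η = 0.01 and γ = 0.55 are illustrative choices, and add_gradient_noise is a hypothetical helper name:

import numpy as np

rng = np.random.default_rng(0)

def add_gradient_noise(grad, t, eta=0.01, gamma_decay=0.55):
    # sigma_t^2 = eta / (1 + t)**gamma_decay, so the noise is annealed towards zero over training
    sigma = np.sqrt(eta / (1.0 + t) ** gamma_decay)
    return grad + rng.normal(0.0, sigma, size=np.shape(grad))

# usage inside any of the earlier update loops:
# g = add_gradient_noise(evaluate_gradient(loss_function, batch, params), t)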
7 Conclusion
In this article, we first examined the three variants of gradient descent, among which mini-batch gradient descent is the
most popular. We then reviewed the algorithms most frequently used to optimize SGD: Momentum, Nesterov accelerated
gradient, Adagrad, RMSprop, and Adam, as well as several algorithms and frameworks for parallelized and distributed
(asynchronous) SGD. Lastly, we considered additional strategies for improving SGD, such as shuffling, curriculum learning,
batch normalization, early stopping, and gradient noise.
References
[1] Martín Abadi et al. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”. In: (Mar.
2016). DOI: 10.48550/arXiv.1603.04467.
[2] Y. Bengio et al. “Curriculum learning”. In: vol. 60. June 2009, p. 6. DOI: 10.1145/1553374.1553380.
[3] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. “Advances in optimizing recurrent net-
works”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012), pp. 8624–8628.
URL: https://api.semanticscholar.org/CorpusID:12485056.
[4] C. Darken, J. Chang, and J. Moody. “Learning rate schedules for faster stochastic gradient search”. In: Neural Networks
for Signal Processing II: Proceedings of the 1992 IEEE Workshop (1992), pp. 1–11.
[5] Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. In: Advances in neural information processing systems
(Oct. 2012).
[6] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive Subgradient Methods for Online Learning and Stochastic
Optimization”. In: Journal of Machine Learning Research 12 (July 2011), pp. 2121–2159.
[7] Diandian Gu et al. Demystifying Developers’ Issues in Distributed Training of Deep Learning Software. Dec. 2021.
DOI: 10.48550/arXiv.2112.06222.
[8] Geoffrey E. Hinton and Ilya Sutskever. “Training Recurrent Neural Networks”. In: 2013. URL: https://api.
semanticscholar.org/CorpusID:61713861.
[9] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift”. In: (Feb. 2015).
[10] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980
(2014). URL: https://api.semanticscholar.org/CorpusID:6628106.
[11] Yann Lecun et al. “Efficient BackProp”. In: (Aug. 2000).
[12] H.B. McMahan and M. Streeter. “Delay-tolerant algorithms for asynchronous distributed online learning”. In:
Advances in Neural Information Processing Systems 4 (Jan. 2014), pp. 2915–2923.
[13] Arvind Neelakantan et al. “Adding Gradient Noise Improves Learning for Very Deep Networks”. In: (Nov. 2015).
DOI: 10.48550/arXiv.1511.06807.
[14] Yurii Nesterov. “A method for unconstrained convex minimization problem with the rate of convergence o(1/k2 )”.
In: 1983. URL: https://api.semanticscholar.org/CorpusID:202149403.
[15] Feng Niu et al. “HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent”. In: NIPS 24
(June 2011).
[16] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vectors for Word Representation”.
In: vol. 14. Jan. 2014, pp. 1532–1543. DOI: 10.3115/v1/D14-1162.
[17] Ning Qian. “On the momentum term in gradient descent learning algorithms”. In: Neural Networks: The Official
Journal of the International Neural Network Society 12.1 (1999), pp. 145–151. URL: https://api.semanticscholar.org/CorpusID:2783597.
[18] Herbert Robbins and Sutton Monro. “A Stochastic Approximation Method”. In: The Annals of Mathematical Statistics
22.3 (1951), pp. 400–407.
[19] Richard S. Sutton. “Two problems with backpropagation and other steepest-descent learning procedures for networks”.
In: (1986). URL: https://escholarship.org/uc/item/2z66t36g.
[20] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. “Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization”. In: arXiv (2014), pp. 1–14.
[21] Wojciech Zaremba and Ilya Sutskever. “Learning to Execute”. In: (Oct. 2014).
[22] Sixin Zhang, Anna Choromanska, and Yann Lecun. “Deep learning with Elastic Averaging SGD”. In: (Dec. 2014).