
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer 1 Mitchell Stern 1 2

1 Google Brain, Mountain View, California, USA   2 University of California, Berkeley, California, USA. Correspondence to: Noam Shazeer <[email protected]>.

arXiv:1804.04235v1 [cs.LG] 11 Apr 2018

Abstract

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.

1. Introduction and Background

Gradient-based optimization forms the backbone of most modern approaches used to train deep neural networks. One of the simplest methods is stochastic gradient descent (SGD), wherein steps are taken along the direction of the negative gradient of the loss function evaluated on a minibatch. Building on this foundation, a variety of adaptive gradient-based methods have been proposed in which the gradient is divided by the componentwise square root of a vector summarizing the history of squared gradients, usually obtained through summation as in Adagrad (Duchi et al., 2011) or exponential averaging as in RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2015), and Adadelta (Zeiler, 2012). On convex problems, several of these methods offer theoretical advantages over SGD when gradients are sparse. While convergence guarantees have not yet been provided in the dense, non-convex setting in which most neural network training takes place, practitioners have nevertheless found these methods to empirically outperform SGD across a variety of domains.

The superior performance of these methods does come at a cost. Recent improvements in the computational capacity needed to train neural networks with larger numbers of parameters have far outstripped improvements in the memory capacity required to store those parameters during training. This has led to memory usage becoming an important constraint on model size. Adaptive optimization algorithms exacerbate this problem by requiring additional memory for extra accumulators, such as those required for momentum and per-coordinate gradient scaling. For example, Adam (Kingma & Ba, 2015) keeps two additional values for each parameter, tripling the memory requirements.

We propose a way to reduce memory usage while retaining the empirical benefits of adaptivity by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an n × m matrix, this reduces the memory requirements from O(nm) to O(n + m). We demonstrate empirically using Adam on a large-scale machine translation task known for its expensive models that our approach achieves comparable performance to that obtained using full accumulators.

Beyond this, we also investigate another issue related to Adam of recent interest. To further reduce memory requirements, we would like to run Adam without momentum, eliminating an additional auxiliary value per model parameter. But without making any other changes, eliminating momentum can cause training instability. We identify out-of-date second moment accumulators as a possible cause of this instability and propose two remedies.
Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves.

2. A Brief Review of Adam

Algorithm 1 Adam (Kingma & Ba, 2015)
1: Inputs: initial point x0, step sizes {αt} for t = 1, ..., T, first moment decay β1, second moment decay β2, regularization constant ε
2: Initialize m0 = 0 and v0 = 0
3: for t = 1 to T do
4:   gt = ∇ft(xt−1)
5:   mt = β1 mt−1 + (1 − β1) gt
6:   vt = β2 vt−1 + (1 − β2) gt^2
7:   m̂t = mt / (1 − β1^t)
8:   v̂t = vt / (1 − β2^t)
9:   xt = xt−1 − αt m̂t / (√v̂t + ε)
10: end for

We reproduce the pseudocode for the Adam optimizer in Algorithm 1 for reference (Kingma & Ba, 2015). The setup of the problem is as follows. Suppose we are trying to minimize the expected value of a noisy objective function f(x). At each step, we receive a stochastic realization ft, e.g. the loss computed on a random minibatch of data, and we compute the gradient gt of this function with respect to our previous parameters. We then update the exponential running averages of the first and second moments of the gradient mt and vt, compute bias-corrected versions m̂t and v̂t to account for the zero initialization, and finally make a parameter update to obtain a new iterate xt. This repeats for T steps, at which point we return the final iterate xT as our approximate solution.

The step size αt is often held constant over the course of training, but recent work in large-scale optimization suggests that performance can be improved on some problems through a linear ramp-up followed by some form of decay (Goyal et al., 2017; Vaswani et al., 2017). We use the latter with an inverse square root decay scheme in our experiments, finding it to yield more stable results.
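For concreteness, a single Adam update can be written in a few lines of NumPy. This sketch is ours (not part of the paper or any reference implementation), and the function and variable names are illustrative only:

import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Algorithm 1) for a parameter array x with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment moving average
    v = beta2 * v + (1 - beta2) * g**2       # second-moment moving average
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

Note that m and v each have the same shape as x, which is the memory overhead this paper aims to reduce.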
3. Factored Second Moment Estimation

Recent work has shown that for problems where vast quantities of data are available, e.g. language modeling and machine translation, task performance improves consistently as model size increases, even in the regime of models with several billions of parameters (Shazeer et al., 2017). As models continue to grow, the storage requirements of one or two auxiliary parameters per model parameter imposed by existing adaptive methods can be prohibitive, motivating the investigation of a low-memory alternative. In this section, we propose a novel approach in which model structure is exploited in order to reduce storage requirements without compromising empirical performance.

Suppose a subset of the model's parameters are arranged in a matrix, e.g. for use in a linear transformation. We denote this subset by W ⊆ x with W ∈ R^(n×m). Under standard practice, we would need to maintain an exponential moving average V ∈ R^(n×m) of the corresponding square gradients (∇W f(x))^2 for use in an adaptive update rule.

In cases where storing the full moving average is infeasible, we might instead seek to store moving averages of some low-rank matrices R ∈ R^(n×k) and S ∈ R^(k×m) with k ≪ n, m such that V ≈ RS at each step. We note that in general, moving averages of instantaneous factors of V may differ from instantaneous factors of the moving average, so standard techniques for low-rank approximation may not necessarily be applicable. We would also like these quantities to be fast to compute so that the approximation step does not become a bottleneck in the overall training procedure.

One common choice for low-rank approximation is to truncate the singular value decomposition at the top k singular values. This is known to give the optimal projection onto the space of rank-k matrices with respect to the Frobenius norm (Eckart & Young, 1936). While heavily tuned procedures exist for finding the top k singular values and vectors of a matrix, these quantities in general do not decompose over matrix addition, implying an incompatibility with exponential smoothing. Moreover, there is no guarantee that the entries of the approximation will be nonnegative, which is problematic given that we would like to scale the gradient by the componentwise inverse square root.

In search of a more suitable alternative, we turn to techniques from nonnegative matrix factorization. In addition to the Frobenius norm, another popular cost function in the literature is the generalized Kullback-Leibler divergence, also known as the I-divergence (Lee & Seung, 1999). For nonnegative scalar inputs, the I-divergence is given by the equation

    d(p, q) = p log(p/q) − p + q,

with the conventions that 0/0 = 0, 0 log 0 = 0, and p/0 = ∞ for p > 0. It is easily seen that d(p, q) ≥ 0 with equality iff p = q by setting x = p/q in the standard inequality x log x ≥ x − 1.
Under this cost function, we aim to minimize the total elementwise divergence subject to componentwise nonnegativity constraints:

    minimize over R ∈ R^(n×k), S ∈ R^(k×m):   Σ_{i=1..n} Σ_{j=1..m} d(Vij, [RS]ij)
    subject to Rij ≥ 0, Sij ≥ 0.                                                     (1)

Solving this problem for general rank-k factors is nontrivial, requiring for instance the use of an alternating minimization procedure (Finesso & Spreij, 2006). In the special case of rank-1 factors, however, we can derive an analytic solution.

Lemma 1. The solution set of the optimization problem (1) when k = 1 consists of all feasible pairs (R, S) satisfying RS = V 1m 1n^T V / (1n^T V 1m), where 1ℓ = (1, ..., 1)^T ∈ R^ℓ denotes a column vector of ℓ ones.

Proof. Let R and S be any feasible solution. Noting that [RS]ij = Ri Sj and expanding the loss, we have

    Σ_{i,j} d(Vij, [RS]ij)
      = Σ_{i,j} ( Vij log(Vij / [RS]ij) − Vij + [RS]ij )
      = Σ_{i,j} Vij log Vij − Σ_{i,j} Vij log Ri − Σ_{i,j} Vij log Sj − Σ_{i,j} Vij + Σ_{i,j} Ri Sj.

Setting the derivatives of this expression with respect to Ri and Sj equal to 0, we obtain the relations

    −(Σ_j Vij)/Ri + Σ_j Sj = 0   ⟹   Ri = (Σ_j Vij) / (Σ_j Sj),
    −(Σ_i Vij)/Sj + Σ_i Ri = 0   ⟹   Sj = (Σ_i Vij) / (Σ_i Ri).

Now note that for any minimizer (R, S), the solution (αR, S/α) is also a minimizer for any α > 0, since the loss only depends on the product RS. Hence we may break the symmetry by fixing the sum of the components of R at Σ_i Ri = Σ_i Σ_j Vij, in which case we obtain a canonical minimizer

    Ri = Σ_j Vij,    Sj = (Σ_i Vij) / (Σ_i Σ_j Vij),

or in vector form,

    R = V 1m,    S = 1n^T V / (1n^T V 1m).

By our discussion of symmetry above, it follows that the solution set consists more broadly of all pairs (R, S) satisfying RS = V 1m 1n^T V / (1n^T V 1m), and the claim follows.

We now note some important properties of this rank-1 projection. First, if V itself is a rank-1 matrix, then it will be exactly recovered as one would expect. Second, the projection can be expressed entirely in terms of the row sums V 1m and column sums 1n^T V, which in particular are linear functions of V. This convenient fact gives us the desired compatibility with exponential smoothing, since the row sums of the moving average equal the moving average of the row sums, and similarly for columns. Moreover, storing only the moving averages of these factors rather than the full matrix V yields considerable memory savings, requiring space proportional to n + m rather than nm.

We present a concrete implementation of Adam with factored second moment accumulators in Algorithm 2 for the case where the parameter set x can be viewed as a single matrix X. In the event that the parameter set is most suitably partitioned into multiple matrices (treating vectors and scalars as special cases), the steps can be performed in parallel for each matrix individually. We present the algorithm with β1 fixed at 0 so as to focus our attention on the second moments. First moments can be included as in Adam without modification if desired.

Algorithm 2 Adam for a matrix parameter X with factored second moments and first moment decay parameter β1 = 0.
1: Inputs: initial point X0 ∈ R^(n×m), step sizes {αt} for t = 1, ..., T, second moment decay β2, regularization constant ε
2: Initialize R0 = 0 and C0 = 0
3: for t = 1 to T do
4:   Gt = ∇ft(Xt−1)
5:   Rt = β2 Rt−1 + (1 − β2) (Gt^2) 1m
6:   Ct = β2 Ct−1 + (1 − β2) 1n^T (Gt^2)
7:   V̂t = (Rt Ct / 1n^T Rt) / (1 − β2^t)
8:   Xt = Xt−1 − αt Gt / (√V̂t + ε)
9: end for

In the implementation, we keep running averages of the row sums Rt and column sums Ct of the squared gradients. The full accumulator is then approximated as the outer product divided by the sum of all entries, Rt Ct / 1n^T Rt, and is subsequently scaled by the same bias correction factor as in Adam. We note that the normalization term in the denominator 1n^T Rt could equivalently be expressed as Ct 1m, so the treatment of row sums and column sums is not asymmetric despite the surface form of the approximation.
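As an illustration (ours, not the paper's reference code), one step of Algorithm 2 can be sketched in NumPy as follows; the function and variable names are hypothetical:

import numpy as np

def factored_adam_step(X, G, R, C, t, alpha=1e-3, beta2=0.999, eps=1e-8):
    """One step of Algorithm 2: Adam with factored second moments for a matrix parameter."""
    R = beta2 * R + (1 - beta2) * (G**2).sum(axis=1)   # moving average of row sums, shape (n,)
    C = beta2 * C + (1 - beta2) * (G**2).sum(axis=0)   # moving average of column sums, shape (m,)
    V_hat = np.outer(R, C) / R.sum()                   # rank-1 reconstruction Rt Ct / (1n^T Rt)
    V_hat = V_hat / (1 - beta2**t)                     # same bias correction as in Adam
    return X - alpha * G / (np.sqrt(V_hat) + eps), R, C

Only the vectors R and C (n + m numbers) persist between steps; the n × m estimate V_hat is rebuilt on the fly at each step.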
3.1. Relation to Prior Work

A preliminary version of this method was briefly mentioned in Appendix D of Shazeer et al. (2017). Also, Gupta et al. (2014) employ a similar technique, saving memory by averaging Adagrad accumulators across embedding vectors.
3.2. Experiments

We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases. See Table 2 (A) vs. (C) and (H) vs. (J).

We also tried simplified estimation schemes where the second-moment estimators for matrices were approximated by either the row means or the column means (but not their outer product). For this model, the results for the row-mean scheme were similar to baseline, but the results for the column-mean scheme were much worse. See Table 2 (D) and (E). We suspect that these results are due to the model's use of a shared weight matrix used both to represent the token embeddings and to produce the output probabilities. Each row in this matrix corresponds to one token in the vocabulary. Rows associated with very frequent tokens tend to receive gradients of much larger magnitude than rows associated with very infrequent tokens.

4. No Momentum

Adam requires two persistent accumulators per parameter for the first and second moments of the gradients. In Section 3, we reduced the memory requirements of the second-moment accumulator. To remove the need for a first-moment accumulator, we simply turn momentum off by setting β1 = 0.

4.1. Experiments

For a step size schedule similar to the one used in Vaswani et al. (2017), which includes a warmup period, model quality is similar without and with momentum (BLEU = 25.6 vs. 25.4) – see Table 2 (A) vs. (B), second to last column. Without the warmup period, the model without momentum becomes more unstable (BLEU = 0.1 vs. 23.1) – see Table 2 (A) vs. (B), last column. We hypothesize that removing the momentum unmasks an underlying problem with the stability of Adam, which we will discuss in the next section.

5. A Problem with Adam: Out-of-Date Second Moment Estimator

Reddi et al. (2018) discuss non-convergence issues when using a fast decay of the second-moment estimator (low β2). We observe the same issues in our experiments – see Table 1, first result column. On the other hand, slow decay (high β2) causes training instability when we turn off the step size warmup – see Table 1, second result column.

Table 1. BLEU scores for Transformer machine translation models trained with slow (β2 = 0.999) and fast (β2 = 0.9) second-moment decay, with and without step size warm-up. Fast decay has convergence problems. Slow decay has stability problems. Excerpted from Table 2 rows (A), (G).

    β2        With warm-up    No warm-up
    0.999     25.6            0.1
    0.9       18.4            15.6

We explain the instability as follows: A slow decay rate means that our second-moment estimator is based on gradients farther in the past. If the model is evolving rapidly, this could cause the estimates to have high error, leading to smaller-than-desired or (worse) larger-than-desired updates. To check whether this is happening, we observe the root-mean-square over all parameters x in a weight matrix or vector X, for a given timestep t, of the unscaled parameter update uxt = −gxt / √v̂xt. For brevity, we refer to this quantity as RMS(Ut):

    RMS(Ut) = RMS_{x∈X}(uxt) = sqrt( Mean_{x∈X}( (gxt)^2 / v̂xt ) ).

If Adam is functioning as intended, for each individual parameter x, the value v̂xt should be close to (gxt)^2, since this is precisely what v̂xt is designed to measure. Thus, the ratio (gxt)^2 / v̂xt should be close to 1, as should the mean of many such values. So for a large weight matrix X, a value of RMS(Ut) which is far from 1 is a sign that the second-moment estimator is not doing its job well.

In Figure 1, we plot RMS(Ut) for one particular weight matrix in a Transformer machine translation model (Vaswani et al., 2017) for training runs with β2 = 0.9 and β2 = 0.999. With fast decay (red), RMS(Ut) stays close to 1 as expected, while with slow decay (blue), it fluctuates significantly. Values larger than 1 indicate larger-than-desired parameter updates.

[Figure 1. With slow decay (β2 = 0.999), the second-moment estimator is out of date, leading to parameter updates that are larger or smaller than the intended value.]

The fact that slow decay causes both larger-than-desired updates and training instability supports our hypothesis that the large updates are the cause of the instability, but does not prove it. One competing hypothesis is that the instability causes the larger-than-desired updates. We refute this particular competing hypothesis by noting that the RMS(Ut) values plotted in Figure 1 are for training runs with step size warmup, neither of which exhibited instability. In the next section, we further support our hypothesis by showing that we can cure the instability by clipping the larger-than-desired updates.
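As a quick illustration (ours; names are hypothetical), RMS(Ut) for one weight matrix can be monitored with a couple of lines of NumPy:

import numpy as np

def rms_of_update(G, V_hat):
    """RMS(Ut) for one weight matrix: sqrt of the mean of g^2 / v_hat over its entries."""
    return float(np.sqrt(np.mean(G**2 / V_hat)))

A value far from 1 signals that the second-moment estimator is out of date relative to the current gradients.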
6. Update Clipping

To remove the larger-than-desired updates described in Section 5, we propose scaling down the updates on a weight vector or matrix X whenever RMS(Ut) exceeds a threshold value d. We define the clipped unscaled update Ût as:

    Ût = Ut / max(1, RMS(Ut)/d).

The actual parameter update is then the product αt Ût of the step size and the clipped unscaled update, as in Algorithm 4.

6.1. Comparison to Gradient Clipping

Gradient clipping is a popular heuristic used for training neural networks in which the gradient is scaled down before an update if needed to ensure that its norm never exceeds some fixed threshold (Pascanu et al., 2013). For stochastic gradient descent, the update direction is exactly the gradient, so this also has the effect of putting an upper bound on the distance traveled in each step. While gradient clipping is also applied to adaptive methods in practice, the norm of the update direction may still exceed the user-imposed threshold due to the presence of additional per-parameter scaling factors. In update clipping, we cap the norm of the actual update rather than just the gradient.

6.2. Experiments

We added update clipping to the previously described slow-decay experiments. For the experiment without learning rate warmup, update clipping with d = 1 significantly ameliorated the instability problem – see Table 2 (A) vs. (H). With d = 2, the instability was not improved. Update clipping did not significantly affect the experiments with warmup (with no instability problems).
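In code form (our sketch; names are hypothetical), update clipping is a single rescaling of the unscaled update:

import numpy as np

def clip_update(U, d=1.0):
    """Scale down the unscaled update U if its root-mean-square exceeds the threshold d."""
    rms = float(np.sqrt(np.mean(U**2)))
    return U / max(1.0, rms / d)

With d = 1 this leaves well-behaved updates untouched and shrinks only those whose RMS exceeds the threshold.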
7. Increasing Decay Parameter

An alternative solution to the problems described in Section 5 is to use an increasing schedule of β2, as proposed by Reddi et al. (2018). Perhaps this can give us the best of both worlds – see Table 1, where different decay rates are better in different situations.

7.1. In Adam

We point out here that Adam already uses an increasing decay parameter if we rewrite the bias correction as a correction to β2. To do this, we define β̂2t = β2 (1 − β2^(t−1)) / (1 − β2^t), and we compute v̂t directly in terms of v̂t−1 as follows:

    v̂t = vt / (1 − β2^t)
       = (β2 vt−1 + (1 − β2) gt^2) / (1 − β2^t)
       = [β2 (1 − β2^(t−1)) / (1 − β2^t)] v̂t−1 + [(1 − β2) / (1 − β2^t)] gt^2
       = β̂2t v̂t−1 + [((1 − β2^t) − (β2 − β2^t)) / (1 − β2^t)] gt^2
       = β̂2t v̂t−1 + [1 − β2 (1 − β2^(t−1)) / (1 − β2^t)] gt^2
       = β̂2t v̂t−1 + (1 − β̂2t) gt^2.

This, along with similar logic for β1, leads to the alternative formulation of Adam in Algorithm 3.

Algorithm 3 Equivalent formulation of Adam where bias adjustments have been replaced by decay-rate adjustments.
1: Inputs: initial point x0, step sizes {αt} for t = 1, ..., T, first moment decay β1, second moment decay β2, regularization constant ε
2: for t = 1 to T do
3:   gt = ∇ft(xt−1)
4:   β̂1t = β1 (1 − β1^(t−1)) / (1 − β1^t)
5:   β̂2t = β2 (1 − β2^(t−1)) / (1 − β2^t)
6:   m̂t = β̂1t m̂t−1 + (1 − β̂1t) gt
7:   v̂t = β̂2t v̂t−1 + (1 − β̂2t) gt^2
8:   xt = xt−1 − αt m̂t / (√v̂t + ε)
9: end for

In our reformulation of Adam, the corrected decay parameter β̂2t = β2 (1 − β2^(t−1)) / (1 − β2^t) starts at 0 when t = 1 and asymptotically approaches β2 for large values of t.

7.2. Proposed Alternative

Alternatively, we propose the family of schedules

    β̂2t = 1 − 1/t^c,    t ≥ 1,
parameterized by a scalar c > 0 controlling the rate of increase.

By inspection, it is clear that this schedule starts at 0 for t = 1 and increases toward 1 as t tends to ∞. This allows us to benefit from the stability properties of a low β̂2t at the start of training while still realizing the gains in performance due to a high β̂2t as the run progresses.

Less obviously, this schedule also eliminates the need for bias correction. To see why, we begin by expanding the recursive definition of vt to arrive at

    vt = Σ_{i=1..t} (1 − β̂2i) [ Π_{j=i+1..t} β̂2j ] gi^2.

Taking expectations of both sides, we have

    E[vt] = E[ Σ_{i=1..t} (1 − β̂2i) [ Π_{j=i+1..t} β̂2j ] gi^2 ]
          = Σ_{i=1..t} (1 − β̂2i) [ Π_{j=i+1..t} β̂2j ] E[gi^2]
          = Σ_{i=1..t} (1 − β̂2i) [ Π_{j=i+1..t} β̂2j ] E[gt^2]
            + Σ_{i=1..t} (1 − β̂2i) [ Π_{j=i+1..t} β̂2j ] (E[gi^2] − E[gt^2]).

We would like the expected moving average E[vt] to be as close as possible to the true second moment E[gt^2]. If we assume as in Kingma & Ba (2015) that the gradient distribution is stationary or that the errors E[gi^2] − E[gt^2] are sufficiently small, then it suffices to check that our proposed decay schedule satisfies

    Σ_{i=1..t} (1 − β̂2i) Π_{j=i+1..t} β̂2j = 1,

since this would imply E[vt] and E[gt^2] are equal in the stationary case or equal up to a small error term otherwise. We will also require that for all i ≥ 1,

    lim_{t→∞} (1 − β̂2i) Π_{j=i+1..t} β̂2j = 0,

which means that the contributions of past gradients will go to 0 as training progresses rather than retaining nontrivial weight for all time.

We verify the first property with a simple induction argument. At time t = 1, we have 1 − β̂21 = 1 as desired. Then if the equality holds at time t − 1, we have

    Σ_{i=1..t} (1 − β̂2i) Π_{j=i+1..t} β̂2j
      = β̂2t Σ_{i=1..t−1} (1 − β̂2i) Π_{j=i+1..t−1} β̂2j + (1 − β̂2t)
      = β̂2t + (1 − β̂2t) = 1,

which completes the argument. We remark that this proof in fact goes through for any schedule for which β̂21 = 0.

The second condition is more restrictive in comparison. For the proposed schedule, we would like it to be true that

    lim_{t→∞} (1 − (1 − 1/i^c)) Π_{j=i+1..t} (1 − 1/j^c)
      = (1/i^c) [ Π_{j=2..i} (1 − 1/j^c) ]^(−1) lim_{t→∞} Π_{j=2..t} (1 − 1/j^c) = 0

for all i ≥ 1. Using the standard result that for a sequence 0 ≤ an < 1, the infinite product Π_n (1 − an) converges to a nonzero value iff the series Σ_n an converges, we see that the limit above will be 0 iff the series Σ_{j=2..∞} 1/j^c diverges, which is only true for c ≤ 1. Hence the decay parameter must not increase too fast, as otherwise past gradients will maintain a weight bounded away from 0 for the full duration of training. In the special case where c = 1, we note that vt = Σ_{i=1..t} gi^2 / t reduces to a simple arithmetic moving average of the history of squared gradients.
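As a quick numerical check (ours; not from the paper), the weights that this schedule places on past squared gradients can be computed directly, confirming that they sum to 1 without any bias correction:

import numpy as np

def schedule_weights(t, c=0.8):
    """Weight on gi^2 in vt, for i = 1..t, under the schedule beta2hat_i = 1 - i**(-c)."""
    beta = 1.0 - np.arange(1, t + 1, dtype=float) ** (-c)   # beta2hat_1, ..., beta2hat_t
    suffix = np.cumprod(beta[::-1])[::-1]                   # suffix[i] = beta[i] * beta[i+1] * ... * beta[t-1]
    tail = np.append(suffix[1:], 1.0)                       # tail[i] = product of beta[i+1:] (empty product = 1)
    return (1.0 - beta) * tail

print(schedule_weights(1000, c=0.8).sum())   # ~1.0, since beta2hat_1 = 0

For c ≤ 1, the weight on any fixed gi^2 also decays to 0 as t grows, matching the second condition above.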

7.3. Experiments

We added this alternative to our experimental baseline – see Table 2 lines (A) vs. (K), (L), (M). The schedule β̂2t = 1 − t^−0.5 did in fact maintain both stability and convergence. When combined with update clipping, this method produced similar results to constant high β2 with update clipping – see Table 2 lines (H) vs. (N).

8. Relative Step Size

Instead of defining the optimization algorithm in terms of absolute step sizes {αt} for t = 1, ..., T, we propose defining the optimization algorithm in terms of relative step sizes {ρt} for t = 1, ..., T, which get multiplied by the scale of the parameters. We define the scale of a parameter vector or matrix as the root-mean-square of its components, lower-bounded by a small constant ε2. The reason for this lower bound is to allow zero-initialized parameters to escape 0. Combining this with the other proposals in this paper gives the Adafactor algorithm defined in Algorithms 4 and 5. Proposed hyperparameters for Adafactor are listed in Algorithm 6.
Algorithm 4 Adafactor for weight matrices.
1: Inputs: initial point X0 ∈ R^(n×m), relative step sizes {ρt} for t = 1, ..., T, second moment decay {β̂2t} for t = 1, ..., T such that β̂21 = 0, regularization constants ε1 and ε2, clipping threshold d
2: for t = 1 to T do
3:   αt = max(ε2, RMS(Xt−1)) ρt
4:   Gt = ∇ft(Xt−1)
5:   Rt = β̂2t Rt−1 + (1 − β̂2t) (Gt^2 + ε1 1n 1m^T) 1m
6:   Ct = β̂2t Ct−1 + (1 − β̂2t) 1n^T (Gt^2 + ε1 1n 1m^T)
7:   V̂t = Rt Ct / 1n^T Rt
8:   Ut = Gt / √V̂t
9:   Ût = Ut / max(1, RMS(Ut)/d)
10:  Xt = Xt−1 − αt Ût
11: end for

Algorithm 5 Adafactor for weight vectors.
1: Inputs: initial point X0 ∈ R^n, relative step sizes {ρt} for t = 1, ..., T, second moment decay {β̂2t} for t = 1, ..., T such that β̂21 = 0, regularization constants ε1 and ε2, clipping threshold d
2: for t = 1 to T do
3:   αt = max(ε2, RMS(Xt−1)) ρt
4:   Gt = ∇ft(Xt−1)
5:   V̂t = β̂2t V̂t−1 + (1 − β̂2t) (Gt^2 + ε1 1n)
6:   Ut = Gt / √V̂t
7:   Ût = Ut / max(1, RMS(Ut)/d)
8:   Xt = Xt−1 − αt Ût
9: end for

Algorithm 6 Proposed hyperparameters for Adafactor
1: ε1 = 10^−30
2: ε2 = 10^−3
3: d = 1
4: ρt = min(10^−2, 1/√t)
5: β̂2t = 1 − t^−0.8
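To make the pieces concrete, here is a compact NumPy sketch of one Adafactor step for a weight matrix, following Algorithms 4 and 6. This is our illustration, not the reference Tensor2Tensor implementation, and the names are ours:

import numpy as np

def adafactor_matrix_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor update (Algorithm 4) for a matrix parameter X with gradient G."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))                   # relative step size (Algorithm 6)
    beta2_t = 1.0 - t ** (-0.8)                           # increasing decay schedule, 0 at t = 1
    alpha_t = max(eps2, np.sqrt(np.mean(X**2))) * rho_t   # scale the step by the RMS of the parameters
    G2 = G**2 + eps1
    R = beta2_t * R + (1 - beta2_t) * G2.sum(axis=1)      # row sums of squared gradients
    C = beta2_t * C + (1 - beta2_t) * G2.sum(axis=0)      # column sums of squared gradients
    V_hat = np.outer(R, C) / R.sum()                      # factored second-moment estimate
    U = G / np.sqrt(V_hat)                                # unscaled update
    U_hat = U / max(1.0, np.sqrt(np.mean(U**2)) / d)      # update clipping
    return X - alpha_t * U_hat, R, C

R and C should be initialized to zero vectors of lengths n and m and carried across steps; only these O(n + m) values persist in the optimizer state.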

8.1. Experiments

To examine the potential benefit of relative step size, we use a version of Transformer (Vaswani et al., 2017) where the token-embedding parameters are not reused in the softmax layer. The authors cleverly initialize the embedding parameters with standard deviation 1/√dmodel, similarly to the other model parameters, and then scale them up in the computation by a factor of √dmodel so that the embeddings start out with unit norm. This allows the same absolute step size to work for both the embedding parameters and the other weight matrices in the model. We test Adam and Adafactor with this "clever" embedding scheme, but also with two more naive schemes. In the first, we initialize the embedding parameters with standard deviation 1 and do not scale them in the computation. In the second, we initialize the embedding parameters with standard deviation 1/√dmodel, and do not scale them in the computation. For the Adam experiments, we use the hyperparameters and step size scheme from Vaswani et al. (2017). For the Adafactor experiments, we use our recommended hyperparameters listed in Algorithm 6. All models are trained for 50,000 steps with batch size 16,384 tokens (unlike the other experiments in this paper). Results are given in Table 3. Adafactor proves more resilient to the more naive parameter initialization and scaling schemes.

Table 3. Relative step sizes are more resilient to differently-scaled embedding parameters.

    Emb init. σ     Multiplier    BLEU (Adam)    BLEU (Adafactor)
    1/√dmodel       √dmodel       26.4           26.6
    1               1             25.8           26.4
    1/√dmodel       1             24.2           25.4

9. Experimental Setup

We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository.

Models were trained for 100,000 steps. Each training batch contained sentence pairs containing approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. These batches are about 8 times smaller than the ones used by Vaswani et al. (2017). This causes our results to be slightly worse, but significantly speeds up training times (less than two hours each on one Google TPU v2).

In one set of experiments, we followed a similar step size schedule as Vaswani et al. (2017) consisting of a linear warmup followed by inverse-square root decay, given by αt = 0.1 · min(10^−6 · t, 1/√t). In order to test training stability, we ran a second set of experiments where the initial warmup was replaced by a flat learning rate: αt = 0.1 · min(10^−2, 1/√t). For the experiments with relative step sizes, we used schedules ρt = min(10^−6 · t, 1/√t) and ρt = min(10^−2, 1/√t).

In addition, we tested plain SGD with learning rate schemes equal to the step size schemes above, multiplied by various constants, since SGD also requires little (zero) additional memory cost.
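As a small illustration (ours), the two step size schedules used here can be written as:

import numpy as np

def warmup_then_decay(t):
    """alpha_t = 0.1 * min(1e-6 * t, 1/sqrt(t)): linear warmup, then inverse-square-root decay."""
    return 0.1 * min(1e-6 * t, 1.0 / np.sqrt(t))

def flat_then_decay(t):
    """alpha_t = 0.1 * min(1e-2, 1/sqrt(t)): flat learning rate, then inverse-square-root decay."""
    return 0.1 * min(1e-2, 1.0 / np.sqrt(t))

The relative step size schedules ρt used for Adafactor are the same expressions without the leading factor of 0.1.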
Table 2. BLEU scores for Transformer on WMT '14 En → De translation task (higher is better). Each optimization scheme was tested with and without a warmup period. For the tests with warmup, st = min(10^−6 · t, 1/√t). For the tests without warmup, st = min(10^−2, 1/√t).

         Factored 2nd-moment   β̂1t    β̂2t           Clipping d   (Relative) step size   BLEU with warmup   BLEU no warmup
    (A)  –                     0      β2 = 0.999     –            αt = 0.1 · st          25.6               0.1
    (B)  –                     0.9    β2 = 0.999     –            αt = 0.1 · st          25.4               23.1
    (C)  yes                   0      β2 = 0.999     –            αt = 0.1 · st          25.4               0.2
    (D)  use row-mean          0      β2 = 0.999     –            αt = 0.1 · st          25.2               0.3
    (E)  use col-mean          0      β2 = 0.999     –            αt = 0.1 · st          0.3                0.5
    (F)  –                     0      β2 = 0.99      –            αt = 0.1 · st          25.0               0.4
    (G)  –                     0      β2 = 0.9       –            αt = 0.1 · st          18.4               15.6
    (H)  –                     0      β2 = 0.999     1.0          αt = 0.1 · st          25.4               21.5
    (I)  –                     0      β2 = 0.999     2.0          αt = 0.1 · st          25.7               0.2
    (J)  yes                   0      β2 = 0.999     1.0          αt = 0.1 · st          25.6               22.4
    (K)  –                     0      1 − t^−0.5     –            αt = 0.1 · st          25.6               21.1
    (L)  –                     0      1 − t^−0.8     –            αt = 0.1 · st          25.6               0.1
    (M)  –                     0      1 − t^−1.0     –            αt = 0.1 · st          25.4               0.1
    (N)  –                     0      1 − t^−0.8     1.0          αt = 0.1 · st          25.9               22.4
    (O)  yes                   0      1 − t^−0.8     1.0          ρt = st                25.0               25.5
    (P)  yes                   0.9    1 − t^−0.8     1.0          ρt = st                24.9               25.3
    (Q)  SGD                                                      lr = 1 · st            0.6                0.8
         SGD                                                      lr = 10 · st           8.2                9.1
         SGD                                                      lr = 100 · st          22.9               diverged
         SGD                                                      lr = 150 · st          24.0               diverged
         SGD                                                      lr = 200 · st          24.3               diverged
         SGD                                                      lr = 300 · st          diverged           diverged

9.1. Results

Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty α = 0.6. Higher scores are better. Note that the scores listed should not be compared to those in Vaswani et al. (2017), due to both our shorter training regime and various improvements in the open-source version of the model over the published version.

The schemes with warmup mostly achieved very similar results. Fast decay of the second-moment estimator (G) was significantly worse.

Without warmup, the baseline (A) becomes unstable. The instability is relieved by any of momentum (B), fast decay (G), variable decay (K), and update clipping (H). It is not clear whether relative step size has an effect on stability, since the step sizes used in the experiments are not directly comparable.

Rows (J) and (N) demonstrate algorithms with sub-linear additional memory requirements which attain comparable convergence and stability results to Adam with momentum. Results for SGD (Q) were poorer and less stable than Adam, and highly dependent on choice of learning rate.

10. Conclusion

On a popular machine translation task, we have demonstrated similar quality results to Adam, using a sublinear amount of extra space for accumulators. This should enable training of significantly larger models on the same memory-constrained hardware. We have also introduced update clipping, a potentially more-generally-useful technique for stabilizing adaptive gradient methods.

Code for running Adafactor is available in the open-source Tensor2Tensor library.

Acknowledgements

Thanks to Łukasz Kaiser, the Tensor2Tensor team and the open-source community for helping test and debug Adafactor. Also thanks to Geoffrey Hinton, who asserted that training works well if the magnitudes of parameter updates are about 10^−2 to 10^−3 times the magnitude of the parameters.
References

Duchi, John C., Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr12.html#DuchiHS11.

Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936. doi: 10.1007/BF02288367.

Finesso, Lorenzo and Spreij, Peter. Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra and its Applications, 416(2):270–287, 2006. ISSN 0024-3795. doi: 10.1016/j.laa.2005.11.012. URL http://www.sciencedirect.com/science/article/pii/S0024379505005665.

Goyal, Priya, Dollár, Piotr, Girshick, Ross B., Noordhuis, Pieter, Wesolowski, Lukasz, Kyrola, Aapo, Tulloch, Andrew, Jia, Yangqing, and He, Kaiming. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Gupta, Maya R., Bengio, Samy, and Weston, Jason. Training highly multiclass classifiers. Journal of Machine Learning Research, 15:1461–1492, 2014. URL http://jmlr.org/papers/v15/gupta14a.html.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Lee, Daniel D. and Seung, H. Sebastian. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999.

Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In Dasgupta, Sanjoy and McAllester, David (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1310–1318, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/pascanu13.html.

Reddi, Sashank J., Kale, Satyen, and Kumar, Sanjiv. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, and Dean, Jeff. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017. URL https://openreview.net/pdf?id=B1ckMDqlg.

Tieleman, T. and Hinton, G. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6000–6010. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Zeiler, Matthew D. Adadelta: An adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-5701.
