Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
this instability and propose two remedies.

Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves.

2. A Brief Review of Adam

Algorithm 1 Adam (Kingma & Ba, 2015)
1: Inputs: initial point x0, step sizes {αt}_{t=1}^T, first moment decay β1, second moment decay β2, regularization constant ε
2: Initialize m0 = 0 and v0 = 0
3: for t = 1 to T do
4:   gt = ∇ft(xt−1)
5:   mt = β1 mt−1 + (1 − β1)gt
6:   vt = β2 vt−1 + (1 − β2)gt²
7:   m̂t = mt/(1 − β1^t)
8:   v̂t = vt/(1 − β2^t)
9:   xt = xt−1 − αt m̂t/(√v̂t + ε)
10: end for

We reproduce the pseudocode for the Adam optimizer in Algorithm 1 for reference (Kingma & Ba, 2015). The setup of the problem is as follows. Suppose we are trying to minimize the expected value of a noisy objective function f(x). At each step, we receive a stochastic realization ft, e.g. the loss computed on a random minibatch of data, and we compute the gradient gt of this function with respect to our previous parameters. We then update the exponential running averages of the first and second moments of the gradient mt and vt, compute bias-corrected versions m̂t and v̂t to account for the zero initialization, and finally make a parameter update to obtain a new iterate xt. This repeats for T steps, at which point we return the final iterate xT as our approximate solution.

The step size αt is often held constant over the course of training, but recent work in large-scale optimization suggests that performance can be improved on some problems through a linear ramp-up followed by some form of decay (Goyal et al., 2017; Vaswani et al., 2017). We use the latter with an inverse square root decay scheme in our experiments, finding it to yield more stable results.
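As a concrete reference point, here is a minimal NumPy sketch of the Adam update in Algorithm 1 (the function and variable names are ours, chosen for illustration; the defaults follow Kingma & Ba (2015)):

    import numpy as np

    def adam_step(x, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update (Algorithm 1). Returns the new iterate and state."""
        m = beta1 * m + (1 - beta1) * grad          # first-moment moving average
        v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment moving average
        m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return x, m, v

Note that m and v are both the same shape as x; these are the two auxiliary accumulators whose memory cost Adafactor aims to reduce.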
3. Factored Second Moment Estimation

Recent work has shown that for problems where vast quantities of data are available, e.g. language modeling and machine translation, task performance improves consistently as model size increases, even in the regime of models with several billions of parameters (Shazeer et al., 2017). As models continue to grow, the storage requirements of one or two auxiliary parameters per model parameter imposed by existing adaptive methods can be prohibitive, motivating the investigation of a low-memory alternative. In this section, we propose a novel approach in which model structure is exploited in order to reduce storage requirements without compromising empirical performance.

Suppose a subset of the model’s parameters are arranged in a matrix, e.g. for use in a linear transformation. We denote this subset by W ⊆ x with W ∈ R^{n×m}. Under standard practice, we would need to maintain an exponential moving average V ∈ R^{n×m} of the corresponding square gradients (∇W f(x))² for use in an adaptive update rule.

In cases where storing the full moving average is infeasible, we might instead seek to store moving averages of some low-rank matrices R ∈ R^{n×k} and S ∈ R^{k×m} with k ≪ n, m such that V ≈ RS at each step. We note that in general, moving averages of instantaneous factors of V may differ from instantaneous factors of the moving average, so standard techniques for low-rank approximation may not necessarily be applicable. We would also like these quantities to be fast to compute so that the approximation step does not become a bottleneck in the overall training procedure.

One common choice for low-rank approximation is to truncate the singular value decomposition at the top k singular values. This is known to give the optimal projection onto the space of rank-k matrices with respect to the Frobenius norm (Eckart & Young, 1936). While heavily tuned procedures exist for finding the top k singular values and vectors of a matrix, these quantities in general do not decompose over matrix addition, implying an incompatibility with exponential smoothing. Moreover, there is no guarantee that the entries of the approximation will be nonnegative, which is problematic given that we would like to scale the gradient by the componentwise inverse square root.

In search of a more suitable alternative, we turn to techniques from nonnegative matrix factorization. In addition to the Frobenius norm, another popular cost function in the literature is the generalized Kullback-Leibler divergence, also known as the I-divergence (Lee & Seung, 1999). For nonnegative scalar inputs, the I-divergence is given by the equation

d(p, q) = p log(p/q) − p + q,

with the conventions that 0/0 = 0, 0 log 0 = 0, and p/0 = ∞ for p > 0. It is easily seen that d(p, q) ≥ 0 with equality iff p = q by setting x = p/q in the standard inequality x log x ≥ x − 1.
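As a quick illustration (a minimal sketch; the function name is ours), the scalar I-divergence and its conventions translate directly into code:

    import math

    def i_divergence(p, q):
        """Generalized KL (I-)divergence d(p, q) = p*log(p/q) - p + q for scalars p, q >= 0."""
        if p == 0.0:
            return q                  # uses the convention 0 * log 0 = 0
        if q == 0.0:
            return math.inf           # p/0 = inf for p > 0
        return p * math.log(p / q) - p + q

    # d(p, q) >= 0 with equality iff p == q, e.g.:
    assert i_divergence(2.0, 2.0) == 0.0
    assert i_divergence(2.0, 1.0) > 0.0 and i_divergence(1.0, 2.0) > 0.0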
Under this cost function, we aim to minimize the total elementwise divergence subject to componentwise nonnegativity constraints:

minimize_{R ∈ R^{n×k}, S ∈ R^{k×m}}  Σ_{i=1}^{n} Σ_{j=1}^{m} d(Vij, [RS]ij)    (1)
subject to  Rij ≥ 0, Sij ≥ 0.

Solving this problem for general rank-k factors is nontrivial, requiring for instance the use of an alternating minimization procedure (Finesso & Spreij, 2006). In the special case of rank-1 factors, however, we can derive an analytic solution.

Lemma 1. The solution set of the optimization problem (1) when k = 1 consists of all feasible pairs (R, S) satisfying RS = V 1m 1n^⊤ V / (1n^⊤ V 1m), where 1ℓ = (1, . . . , 1)^⊤ ∈ R^ℓ.

Algorithm 2 Adam for a matrix parameter X with factored second moments and first moment decay parameter β1 = 0.
1: Inputs: initial point X0 ∈ R^{n×m}, step sizes {αt}_{t=1}^T, second moment decay β2, regularization constant ε
2: Initialize R0 = 0 and C0 = 0
3: for t = 1 to T do
4:   Gt = ∇ft(Xt−1)
5:   Rt = β2 Rt−1 + (1 − β2)(Gt²)1m
6:   Ct = β2 Ct−1 + (1 − β2)1n^⊤(Gt²)
7:   V̂t = (Rt Ct / 1n^⊤ Rt)/(1 − β2^t)
8:   Xt = Xt−1 − αt Gt/(√V̂t + ε)
9: end for
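The rank-1 solution in Lemma 1 only needs the row sums and column sums of V, which is exactly what Algorithm 2 tracks. A minimal NumPy sketch of one such factored update might look as follows (an illustration, not the reference implementation; names are ours, and the default eps value is our assumption, with epsilon added outside the square root as in Algorithm 2):

    import numpy as np

    def factored_adam_step(X, G, R, C, t, alpha, beta2=0.999, eps=1e-8):
        """One step of Algorithm 2: Adam for a matrix X with a factored second moment.

        R (length n) and C (length m) hold moving averages of the row and column
        sums of the squared gradients; their outer product divided by sum(R) is
        the rank-1 estimate of the full second-moment matrix V.
        """
        R = beta2 * R + (1 - beta2) * (G ** 2).sum(axis=1)    # row sums, n values
        C = beta2 * C + (1 - beta2) * (G ** 2).sum(axis=0)    # column sums, m values
        V_hat = np.outer(R, C) / R.sum() / (1 - beta2 ** t)   # bias-corrected rank-1 estimate
        X = X - alpha * G / (np.sqrt(V_hat) + eps)
        return X, R, C

Only R and C persist between steps (n + m values instead of nm), which is where the memory saving comes from; V_hat is materialized here only transiently for clarity.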
3.2. Experiments
We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases. See Table 2 (A) vs. (C) and (H) vs. (J).

We also tried simplified estimation schemes where the second-moment estimators for matrices were approximated by either the row means or the column means (but not their outer product). For this model, the results for the row-mean scheme were similar to baseline, but the results for the column-mean scheme were much worse. See Table 2 (D) and (E). We suspect that these results are due to the model’s use of a shared weight matrix used both to represent the token embeddings and to produce the output probabilities. Each row in this matrix corresponds to one token in the vocabulary. Rows associated with very frequent tokens tend to receive gradients of much larger magnitude than rows associated with very infrequent tokens.

4. No Momentum

Adam requires two persistent accumulators per parameter for the first and second moments of the gradients. In Section 3, we reduced the memory requirements of the second-moment accumulator. To remove the need for a first-moment accumulator, we simply turn momentum off by setting β1 = 0.

4.1. Experiments

For a step size schedule similar to the one used in Vaswani et al. (2017), which includes a warmup period, model quality is similar without and with momentum (BLEU = 23.6 vs. 23.4) – see Table 2 (A) vs. (B), second to last column. Without the warmup period, the model without momentum becomes more unstable (BLEU = 0.1 vs. 23.1) – see Table 2 (A) vs. (B), last column. We hypothesize that removing the momentum unmasks an underlying problem with the stability of Adam, which we will discuss in the next section.

5. A Problem with Adam: Out-of-Date Second Moment Estimator

Reddi et al. (2018) discuss non-convergence issues when using a fast decay of the second-moment estimator (low β2). We observe the same issues in our experiments – see Table 1, first result column. On the other hand, slow decay (high β2) causes training instability when we turn off the step size warmup – see Table 1, second result column.

Table 1. BLEU scores for Transformer machine translation models trained with slow (β2 = 0.999) and fast (β2 = 0.9) second-moment decay, with and without step size warm-up. Fast decay has convergence problems. Slow decay has stability problems. Excerpted from Table 2 rows (A), (G).

β2       With warm-up    No warm-up
0.999    25.6            0.1
0.9      18.4            15.6

We explain the instability as follows: A slow decay rate means that our second-moment estimator is based on gradients farther in the past. If the model is evolving rapidly, this could cause the estimates to have high error, leading to smaller-than-desired or (worse) larger-than-desired updates. To check whether this is happening, we observe the root-mean-square over all parameters x in a weight matrix or vector X for a given timestep t of the unscaled parameter update uxt = −gxt/√v̂xt. For brevity, we refer to this quantity as RMS(Ut):

RMS(Ut) = RMSx∈X(uxt) = √( Meanx∈X( (gxt)² / v̂xt ) ).
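In code, this diagnostic is a one-liner over the per-parameter gradients and second-moment estimates (an illustrative NumPy sketch; the helper name is ours):

    import numpy as np

    def rms_of_update(G, V_hat):
        """RMS(U_t) for a weight tensor: sqrt of the mean of (g_xt)^2 / v_hat_xt."""
        return np.sqrt(np.mean(G ** 2 / V_hat))

    # If V_hat tracks the squared gradients well, each ratio is near 1, so RMS(U_t) is near 1.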
If Adam is functioning as intended, for each individual parameter x, the value v̂xt should be close to (gxt)², since this is precisely what v̂xt is designed to measure. Thus, the ratio (gxt)²/v̂xt should be close to 1, as should the mean of many such values. So for a large weight matrix X, a value of RMS(Ut) which is far from 1 is a sign that the second-moment estimator is not doing its job well.

In Figure 1, we plot RMS(Ut) for one particular weight matrix in a Transformer machine translation model (Vaswani et al., 2017) for training runs with β2 = 0.9 and β2 = 0.999. With fast decay (red), RMS(Ut) stays close to 1 as expected, while with slow decay (blue), it fluctuates significantly. Values larger than 1 indicate larger-than-desired parameter updates.

Figure 1. With slow decay (β2 = 0.999), the second-moment estimator is out of date, leading to parameter updates that are larger or smaller than the intended value.

The fact that slow decay causes both larger-than-desired updates and training instability supports our hypothesis that the large updates are the cause of the instability, but does not prove it. One competing hypothesis is that the instability causes the larger-than-desired updates. We refute this particular competing hypothesis by noting that the RMS(Ut) values plotted in Figure 1 are for training runs with step size warmup, neither of which exhibited instability. In the next section, we further support our hypothesis by showing that we can cure the instability by clipping the larger-than-desired updates.

6. Update Clipping
To remove the larger-than-desired updates described in Section 5, we propose scaling down the updates on a weight vector or matrix X whenever RMS(Ut) exceeds a threshold value d. We define the clipped unscaled update Ût as:

Ût = Ut / max(1, RMS(Ut)/d).

The actual parameter update is then the product αt Ût of the step size and the clipped unscaled update, as in Algorithm 4.
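As a sketch, update clipping amounts to a single rescaling of the unscaled update (NumPy, with illustrative names; d = 1 is the threshold used in our experiments):

    import numpy as np

    def clip_update(U, d=1.0):
        """Scale down an unscaled update U so that its root-mean-square is at most d."""
        rms = np.sqrt(np.mean(U ** 2))
        return U / max(1.0, rms / d)

    # The parameter update is then alpha_t * clip_update(U_t): well-behaved updates
    # (RMS <= d) are left untouched and only the outliers are shrunk.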
6.1. Comparison to Gradient Clipping

Gradient clipping is a popular heuristic used for training neural networks in which the gradient is scaled down before an update if needed to ensure that its norm never exceeds some fixed threshold (Pascanu et al., 2013). For stochastic gradient descent, the update direction is exactly the gradient, so this also has the effect of putting an upper bound on the distance traveled in each step. While gradient clipping is also applied to adaptive methods in practice, the norm of the update direction may still exceed the user-imposed threshold due to the presence of additional per-parameter scaling factors. In update clipping, we cap the norm of the actual update rather than just the gradient.

6.2. Experiments

We added update clipping to the previously described fast-decay experiments. For the experiment without learning rate warmup, update clipping with d = 1 significantly ameliorated the instability problem – see Table 2 (A) vs. (H). With d = 2, the instability was not improved. Update clipping did not significantly affect the experiments with warmup (with no instability problems).

7. Increasing Decay Parameter

7.1. In Adam

We point out here that Adam already uses an increasing decay parameter if we rewrite the bias correction as a correction to β2. To do this, we define β̂2t = β2 (1 − β2^{t−1})/(1 − β2^t), and we compute v̂t directly in terms of v̂t−1 as follows:

v̂t = vt/(1 − β2^t)
   = (β2 vt−1 + (1 − β2)gt²)/(1 − β2^t)
   = (β2 (1 − β2^{t−1})/(1 − β2^t)) v̂t−1 + ((1 − β2)/(1 − β2^t)) gt²
   = β̂2t v̂t−1 + (((1 − β2^t) − (β2 − β2^t))/(1 − β2^t)) gt²
   = β̂2t v̂t−1 + (1 − β2 (1 − β2^{t−1})/(1 − β2^t)) gt²
   = β̂2t v̂t−1 + (1 − β̂2t) gt².

This, along with similar logic for β1, leads to the alternative formulation of Adam in Algorithm 3.

Algorithm 3 Equivalent formulation of Adam where bias adjustments have been replaced by decay-rate adjustments.
1: Inputs: initial point x0, step sizes {αt}_{t=1}^T, first moment decay β1, second moment decay β2, regularization constant ε
2: for t = 1 to T do
3:   gt = ∇ft(xt−1)
4:   β̂1t = β1 (1 − β1^{t−1})/(1 − β1^t)
5:   β̂2t = β2 (1 − β2^{t−1})/(1 − β2^t)
6:   m̂t = β̂1t m̂t−1 + (1 − β̂1t)gt
7:   v̂t = β̂2t v̂t−1 + (1 − β̂2t)gt²
8:   xt = xt−1 − αt m̂t/(√v̂t + ε)
9: end for

In our reformulation of Adam, the corrected decay parameter β̂2t = β2 (1 − β2^{t−1})/(1 − β2^t) starts at 0 when t = 1 and asymptotically approaches β2 for large values of t.
7.2. Proposed Alternative

Alternatively, we propose the family of schedules

β̂2t = 1 − 1/t^c,   t ≥ 1,

parameterized by a scalar c > 0 controlling the rate of increase.

By inspection, it is clear that this schedule starts at 0 for t = 1 and increases toward 1 as t tends to ∞. This allows us to benefit from the stability properties of a low β̂2t at the start of training while still realizing the gains in performance due to a high β̂2t as the run progresses.

Less obviously, this schedule also eliminates the need for bias correction. To see why, we begin by expanding the recursive definition of vt to arrive at

vt = Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j) gi².

Taking expectations of both sides, we have

E[vt] = E[ Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j) gi² ]
      = Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j) E[gi²]
      = Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j) E[gt²]
        + Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j) (E[gi²] − E[gt²]).

The first term is E[gt²] multiplied by the sum of the weights; the first condition we need is that this sum of weights equals 1, which follows by induction on t (with β̂21 = 0 as the base case):

Σ_{i=1}^{t} (1 − β̂2i) (Π_{j=i+1}^{t} β̂2j)
      = β̂2t Σ_{i=1}^{t−1} (1 − β̂2i) (Π_{j=i+1}^{t−1} β̂2j) + (1 − β̂2t)
      = β̂2t + (1 − β̂2t) = 1,

which completes the argument. We remark that this proof in fact goes through for any schedule for which β̂21 = 0.

The second condition (that the weight placed on older gradients vanishes, so that the second term becomes negligible) is more restrictive in comparison. For the proposed schedule, we would like it to be true that

lim_{t→∞} (1 − (1 − 1/i^c)) Π_{j=i+1}^{t} (1 − 1/j^c)
      = (1/i^c) (Π_{j=2}^{i} (1 − 1/j^c))^{−1} lim_{t→∞} Π_{j=2}^{t} (1 − 1/j^c) = 0

for all i ≥ 1. Using the standard result that for a sequence 0 ≤ an < 1, the infinite product Π_n (1 − an) converges to a nonzero value iff the series Σ_n an converges, we see that the limit above will be 0 iff the series Σ_{j=2}^{∞} 1/j^c diverges, which is only true for c ≤ 1. Hence the decay parameter must not increase too fast, as otherwise past gradients will maintain a weight bounded away from 0 for the full duration of training. In the special case where c = 1, we note that vt = Σ_{i=1}^{t} gi²/t reduces to a simple arithmetic moving average of the history of squared gradients.
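A short sketch makes the schedule and the c = 1 special case concrete (NumPy, illustrative names; the default c = 0.8 matches the exponent proposed in Algorithm 6, and with c = 1 the recursion reproduces the arithmetic mean of the squared gradients):

    import numpy as np

    def beta2_hat(t, c=0.8):
        """Proposed increasing decay schedule: beta2_hat_t = 1 - t**(-c), t >= 1."""
        return 1.0 - t ** (-c)

    rng = np.random.default_rng(0)
    grads = rng.normal(size=50)

    v = 0.0
    for t, g in enumerate(grads, start=1):
        b = beta2_hat(t, c=1.0)
        v = b * v + (1 - b) * g ** 2
    # With c = 1, v_t is exactly the running mean of g_1^2, ..., g_t^2.
    assert np.isclose(v, np.mean(grads ** 2))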
Table 2. BLEU scores for Transformer on WMT ’14 En → De translation task (higher is better). Each optimization scheme was tested with and without a warmup period. For the tests with warmup, st = min(10−6 · t, 1/√t). For the tests without warmup, st = min(10−2, 1/√t).

trices in the model. We test Adam and Adafactor with this “clever” embedding scheme, but also with two more naive schemes. In the first, we initialize the embedding parameters with standard deviation 1 and do not scale them in the computation. In the second, we initialize the embedding parameters with standard deviation 1/√dmodel, and do not scale them in the computation. For the Adam experiments, we use the hyperparameters and step size scheme from Vaswani et al. (2017). For the Adafactor experiments, we use our recommended hyperparameters listed in Algorithm 6. All models are trained for 50,000 steps with batch size 16,384 tokens (unlike the other experiments in this paper). Results are given in Table 3. Adafactor proves more resilient to the more naive parameter initialization and scaling schemes.
Algorithm 4 Adafactor for weight matrices.
1: Inputs: initial point X0 ∈ R^{n×m}, relative step sizes {ρt}_{t=1}^T, second moment decay {β̂2t}_{t=1}^T such that β̂21 = 0, regularization constants ε1 and ε2, clipping threshold d
2: for t = 1 to T do
3:   αt = max(ε2, RMS(Xt−1)) ρt
4:   Gt = ∇ft(Xt−1)
5:   Rt = β̂2t Rt−1 + (1 − β̂2t)(Gt² + ε1 1n 1m^⊤)1m
6:   Ct = β̂2t Ct−1 + (1 − β̂2t)1n^⊤(Gt² + ε1 1n 1m^⊤)
7:   V̂t = Rt Ct / 1n^⊤ Rt
8:   Ut = Gt/√V̂t
9:   Ût = Ut / max(1, RMS(Ut)/d)
10:  Xt = Xt−1 − αt Ût
11: end for

Algorithm 5 Adafactor for weight vectors.
1: Inputs: initial point X0 ∈ R^n, relative step sizes {ρt}_{t=1}^T, second moment decay {β̂2t}_{t=1}^T such that β̂21 = 0, regularization constants ε1 and ε2, clipping threshold d
2: for t = 1 to T do
3:   αt = max(ε2, RMS(Xt−1)) ρt
4:   Gt = ∇ft(Xt−1)
5:   V̂t = β̂2t V̂t−1 + (1 − β̂2t)(Gt² + ε1 1n)
6:   Ut = Gt/√V̂t
7:   Ût = Ut / max(1, RMS(Ut)/d)
8:   Xt = Xt−1 − αt Ût
9: end for

Algorithm 6 Proposed hyperparameters for Adafactor
1: ε1 = 10−30
2: ε2 = 10−3
3: d = 1
4: ρt = min(10−2, 1/√t)
5: β̂2t = 1 − t^{−0.8}
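Putting Algorithms 4 and 6 together, a compact NumPy sketch of one Adafactor step for a weight matrix might look as follows (illustrative only; the names are ours, and the defaults mirror the proposed hyperparameters):

    import numpy as np

    def adafactor_matrix_step(X, G, R, C, t, eps1=1e-30, eps2=1e-3, d=1.0):
        """One step of Algorithm 4 with the Algorithm 6 hyperparameters."""
        rho_t = min(1e-2, 1.0 / np.sqrt(t))                     # relative step size
        beta2_hat = 1.0 - t ** (-0.8)                           # increasing decay, 0 at t = 1
        alpha_t = max(eps2, np.sqrt(np.mean(X ** 2))) * rho_t   # scale by RMS of the parameters

        sq = G ** 2 + eps1                                      # regularized squared gradient
        R = beta2_hat * R + (1 - beta2_hat) * sq.sum(axis=1)    # row factor (n values)
        C = beta2_hat * C + (1 - beta2_hat) * sq.sum(axis=0)    # column factor (m values)
        V_hat = np.outer(R, C) / R.sum()                        # rank-1 second-moment estimate

        U = G / np.sqrt(V_hat)                                  # unscaled update
        U_hat = U / max(1.0, np.sqrt(np.mean(U ** 2)) / d)      # update clipping
        return X - alpha_t * U_hat, R, C

At t = 1 the schedule gives β̂2 = 0, so R and C can simply be initialized to zero. Only these n + m accumulator values persist between steps, and there is no first-moment accumulator, which is where the sublinear memory cost comes from.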
9. Experimental Setup

We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository.

Models were trained for 100,000 steps. Each training batch contained sentence pairs containing approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. These batches are about 8 times smaller than the ones used by Vaswani et al. (2017). This causes our results to be slightly worse, but significantly speeds up training times (less than two hours each on one Google TPU v2).

In one set of experiments, we followed a similar step size schedule as Vaswani et al. (2017) consisting of a linear warmup followed by inverse-square root decay, given by αt = 0.1 · min(10−6 · t, 1/√t). In order to test training stability, we ran a second set of experiments where the initial warmup was replaced by a flat learning rate: αt = 0.1 · min(10−2, 1/√t). For the experiments with relative step sizes, we used schedules ρt = min(10−6 · t, 1/√t) and ρt = min(10−2, 1/√t).
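For reference, the two step size schedules are simple functions of the step number t (a small illustrative sketch; the function names are ours):

    import numpy as np

    def alpha_with_warmup(t):
        """Linear warmup then inverse-square-root decay: 0.1 * min(1e-6 * t, 1/sqrt(t))."""
        return 0.1 * min(1e-6 * t, 1.0 / np.sqrt(t))

    def alpha_no_warmup(t):
        """Flat start then inverse-square-root decay: 0.1 * min(1e-2, 1/sqrt(t))."""
        return 0.1 * min(1e-2, 1.0 / np.sqrt(t))

    # The relative step size schedules rho_t are the same expressions without the 0.1 factor.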
In addition, we tested plain SGD with learning rate schemes equal to the step size schemes above, multiplied by various constants, since SGD also requires little (zero) additional memory cost.

9.1. Results

Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty α = 0.6. Higher scores are better. Note that the scores listed should not be compared to those in Vaswani et al. (2017), due to both our shorter training regime and various improvements in the open-source version of the model over the published version.

The schemes with warmup mostly achieved very similar results. Fast decay of the second-moment estimator (G) was significantly worse.

Without warmup, the baseline (A) becomes unstable. The instability is relieved by any of momentum (B), fast decay (G), variable decay (K), and update clipping (H). It is not clear whether relative step size has an effect on stability, since the step sizes used in the experiments are not directly comparable.

Rows (J) and (N) demonstrate algorithms with sub-linear additional memory requirements which attain comparable convergence and stability results to Adam with momentum. Results for SGD (Q) were poorer and less stable than Adam, and highly dependent on choice of learning rate.

10. Conclusion

On a popular machine translation task, we have demonstrated similar quality results to Adam, using a sublinear amount of extra space for accumulators. This should enable training of significantly larger models on the same memory-constrained hardware. We have also introduced update clipping, a potentially more-generally-useful technique for stabilizing adaptive gradient methods.

Code for running Adafactor is available in the open-source Tensor2Tensor library.

Acknowledgements

Thanks to Łukasz Kaiser, the Tensor2Tensor team and the open-source community for helping test and debug Adafactor. Also thanks to Geoffrey Hinton, who asserted that training works well if the magnitudes of parameter updates are about 10−2 to 10−3 times the magnitude of the parameters.
References

Duchi, John C., Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr12.html#DuchiHS11.

Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936. doi: 10.1007/BF02288367.

Finesso, Lorenzo and Spreij, Peter. Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra and its Applications, 416(2):270–287, 2006. ISSN 0024-3795. doi: 10.1016/j.laa.2005.11.012. URL http://www.sciencedirect.com/science/article/pii/S0024379505005665.

Goyal, Priya, Dollár, Piotr, Girshick, Ross B., Noordhuis, Pieter, Wesolowski, Lukasz, Kyrola, Aapo, Tulloch, Andrew, Jia, Yangqing, and He, Kaiming. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.

Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, and Dean, Jeff. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017. URL https://openreview.net/pdf?id=B1ckMDqlg.

Tieleman, T. and Hinton, G. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6000–6010. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012. URL http://dblp.uni-trier.de/db/journals/corr/corr1212.html#abs-1212-5701.