On Layer Normalization in the Transformer Architecture
Ruibin Xiong†* 1 2 Yunchang Yang* 3 Di He 4 5 Kai Zheng 4 Shuxin Zheng 5 Chen Xing 6 Huishuai Zhang 5
Yanyan Lan 1 2 Liwei Wang 4 3 Tie-Yan Liu 5
* Equal contribution. † Work done while interning at Microsoft Research Asia. 1 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences 2 University of Chinese Academy of Sciences 3 Center for Data Science, Peking University, Beijing Institute of Big Data Research 4 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University 5 Microsoft Research 6 College of Computer Science, Nankai University. Correspondence to: Shuxin Zheng <[email protected]>, Di He <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

1. Introduction

The Transformer (Vaswani et al., 2017) is one of the most commonly used neural network architectures in natural language processing. Layer normalization (Lei Ba et al., 2016) plays a key role in the Transformer's success. The originally designed Transformer places the layer normalization between the residual blocks, which is usually referred to as the Transformer with Post-Layer Normalization (Post-LN) (Wang et al., 2019). This architecture has achieved state-of-the-art performance in many tasks, including language modeling (Dai et al., 2019; Al-Rfou et al., 2018) and machine translation (Dehghani et al., 2018; Edunov et al., 2018). Unsupervised pre-trained models based on the Post-LN Transformer architecture also show impressive performance in many downstream tasks (Radford et al., 2019; Devlin et al., 2018; Yang et al., 2019b).

Despite its great success, people usually need to deal with the optimization of the Post-LN Transformer more carefully than with convolutional networks or other sequence-to-sequence models (Popel & Bojar, 2018). In particular, to train the model from scratch, any gradient-based optimization approach requires a learning rate warm-up stage (Vaswani et al., 2017; Liu et al., 2019a): the optimization starts with an extremely small learning rate, which is then gradually increased to a pre-defined maximum value within a pre-defined number of iterations. Such a warm-up stage not only slows down the optimization process but also brings more hyper-parameter tunings. Popel & Bojar (2018) have shown that the final model performance is quite sensitive to the value of the maximum learning rate and the number of warm-up iterations. Tuning such sensitive hyper-parameters is costly in training large-scale models, e.g., BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019b).
In this paper, we try to alleviate this problem by finding ways to safely remove the learning rate warm-up stage. As the warm-up stage happens in the first several iterations, we investigate the optimization behavior at initialization using mean field theory (Lee et al., 2017; Xiao et al., 2018; Yang et al., 2019a; Yang, 2019; Lee et al., 2019; Zhang et al., 2019). According to our theoretical analysis, when putting the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, without the warm-up stage, directly using a large learning rate on those parameters can make the optimization process unstable. Using a warm-up stage and training the model with small learning rates practically avoids this problem. Extensive experiments are provided to support our theoretical findings.

Our theory also shows that the layer normalization plays a crucial role in controlling the gradient scales. This motivates us to investigate whether there are other ways of positioning the layer normalization that lead to well-behaved gradients. In particular, we study another variant, the Transformer with Pre-Layer Normalization (Pre-LN) (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019). The Pre-LN Transformer puts the layer normalization inside the residual connection and is equipped with an additional final-layer normalization before prediction (please see Figure 1 for the differences between the two variants of the Transformer architecture). We show, both theoretically and empirically, that at initialization the gradients of the Pre-LN Transformer are well-behaved, without any exploding or vanishing.

Figure 1. (a) Post-LN Transformer layer; (b) Pre-LN Transformer layer.

Given that the gradients are well-behaved in the Pre-LN Transformer, it is natural to consider removing the learning rate warm-up stage during training. We conduct a variety of experiments, including IWSLT14 German-English translation, WMT14 English-German translation, and BERT pre-training tasks. We show that, in all tasks, the learning rate warm-up stage can be safely removed, and thus the number of hyper-parameters is reduced. Furthermore, we observe that the loss decays faster for the Pre-LN Transformer model. It can achieve comparable final performance while using much less training time. This is particularly important for training large-scale models on large-scale datasets.

Our contributions are summarized as follows:

2. Related work

Gradient descent-based methods (Kingma & Ba, 2014; Zeiler, 2012; Duchi et al., 2011; Tieleman & Hinton, 2012) are popularly used in optimizing deep neural networks. For convolutional neural networks and recurrent neural networks, a relatively large learning rate is usually set in the beginning and then decreased along with the optimization process (He et al., 2016; 2017; Sutskever et al., 2014; Gehring et al., 2017; He et al., 2019). The learning rate warm-up stage has only been shown to be essential in dealing with some very specific problems, e.g., large-batch training. Goyal et al. (2017); He et al. (2019); You et al. (2018) showed that a learning rate warm-up stage is preferred when training neural networks with extremely large batch sizes.

However, the learning rate warm-up stage is essential and critical when optimizing Transformer models in a majority of scenarios (Vaswani et al., 2017; Devlin et al., 2018; Dai et al., 2019; Radford et al., 2019; Lu et al., 2019). Popel & Bojar (2018) investigated the influence of different warm-up strategies on the optimization of the Post-LN Transformer model and found that without warm-up, or with relatively few warm-up iterations, the optimization diverges. The Pre-LN Transformer has been proposed in several recent works (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019) to alleviate some optimization issues when training deeper models, but the troublesome warm-up stage still remains in their training pipelines.

Liu et al. (2019a) claimed that the benefit of the warm-up stage comes from reducing the variance of the adaptive learning rate in the Adam optimizer (Kingma & Ba, 2014), and proposed to rectify the variance of the adaptive learning rate with a new variant of Adam called RAdam. However, we find that the learning rate warm-up stage helps considerably not only for Adam but also for other optimizers. This may indicate that Adam is not the prerequisite for the necessity of the warm-up stage. In a concurrent and independent work, Nguyen & Salazar (2019) also empirically observed that the Pre-LN Transformer can be trained without the learning rate warm-up stage. Our work provides a more comprehensive study regarding this, with a theoretical analysis.
3. Optimization for the Transformer

3.1. Transformer with Post-Layer Normalization

The Transformer architecture usually consists of stacked Transformer layers (Vaswani et al., 2017; Devlin et al., 2018), each of which takes a sequence of vectors as input and outputs a new sequence of vectors of the same shape. A Transformer layer has two sub-layers: the (multi-head) self-attention sub-layer and the position-wise feed-forward network sub-layer. Residual connections (He et al., 2016) and layer normalization (Lei Ba et al., 2016) are applied to both sub-layers individually. We first introduce each component of the Transformer layer and then present the entire architecture.
Self-attention sub-layer  An attention function can be formulated as querying an entry with key-value pairs (Vaswani et al., 2017). The self-attention sub-layer uses scaled dot-product attention, defined as $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^\top}{\sqrt{d}})V$, where $d$ is the dimensionality of the hidden representations, and $Q$ (Query), $K$ (Key), $V$ (Value) are specified as the hidden representations of the previous layer. The multi-head variant of the self-attention sub-layer is popularly used because it allows the model to jointly attend to information from different representation sub-spaces. It is defined as

$\text{Multi-head}(Q, K, V) = \text{Concat}(\text{head}_1, \cdots, \text{head}_H)W^O$,
$\text{head}_k = \text{Attention}(QW_k^Q, KW_k^K, VW_k^V)$,

where $W_k^Q \in \mathbb{R}^{d \times d_K}$, $W_k^K \in \mathbb{R}^{d \times d_K}$, $W_k^V \in \mathbb{R}^{d \times d_V}$, and $W^O \in \mathbb{R}^{Hd_V \times d}$ are projection parameter matrices, and $H$ is the number of heads. $d_K$ and $d_V$ are the dimensionalities of Key and Value. Without any confusion, given a sequence of vectors $(x_1, \ldots, x_n)$, we use $\text{MultiHeadAtt}(x_i, [x_1, x_2, \cdots, x_n])$ to denote the multi-head self-attention mechanism at position $i$, which considers the attention from $x_i$ to the entire sequence, i.e., $\text{MultiHeadAtt}(x_i, [x_1, x_2, \cdots, x_n]) = \text{Multi-head}(x_i, [x_1, \ldots, x_n], [x_1, \ldots, x_n])$.
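To make the definitions above concrete, the following is a minimal PyTorch sketch of scaled dot-product attention and the multi-head self-attention sub-layer; it is an illustrative reading of the formulas, not the authors' implementation. The names d_model and n_heads, the packing of the per-head projections $W_k^Q$, $W_k^K$, $W_k^V$ into single linear layers, and the per-head scaling by $\sqrt{d_K}$ are implementation choices assumed here.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V; here d is the size of the
    # last dimension of the queries (d_K when called per head).
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    return F.softmax(scores, dim=-1) @ v

class MultiHeadSelfAttention(nn.Module):
    """Multi-head(Q, K, V) = Concat(head_1, ..., head_H) W^O,
    head_k = Attention(Q W_k^Q, K W_k^K, V W_k^V)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads          # d_K = d_V = d / H
        # The H per-head projections are packed into one d_model x d_model matrix each.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)    # W^O

    def forward(self, x):
        # x: (batch, seq_len, d_model); self-attention takes Q = K = V = x.
        b, n, _ = x.shape
        def split(t):  # (b, n, d_model) -> (b, n_heads, n, d_head)
            return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        heads = scaled_dot_product_attention(q, k, v)
        heads = heads.transpose(1, 2).reshape(b, n, -1)   # Concat(head_1, ..., head_H)
        return self.w_o(heads)
```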
Position-wise FFN sub-layer  In addition to the self-attention sub-layer, each Transformer layer contains a fully connected network, which is applied to each position separately and identically. This sub-layer is a two-layer feed-forward network with a ReLU activation function. Given a sequence of vectors $h_1, \ldots, h_n$, the computation of the position-wise FFN sub-layer on any $h_i$ is defined as

$\text{FFN}(h_i) = \text{ReLU}(h_i W^1 + b^1)W^2 + b^2$,

where $W^1$, $W^2$, $b^1$ and $b^2$ are parameters.
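A corresponding sketch of the position-wise FFN sub-layer is given below; the hidden width d_ff is an illustrative hyper-parameter, not something fixed by the definition above.

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """FFN(h) = ReLU(h W^1 + b^1) W^2 + b^2, applied to every position identically."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W^1 and b^1
        self.w2 = nn.Linear(d_ff, d_model)   # W^2 and b^2
        self.relu = nn.ReLU()

    def forward(self, h):
        # h: (batch, seq_len, d_model). nn.Linear acts on the last dimension only,
        # so every position is transformed separately and identically.
        return self.w2(self.relu(self.w1(h)))
```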
Residual connection and layer normalization  Besides the two sub-layers described above, the residual connection and layer normalization are also key components of the Transformer. For any vector $v$, the layer normalization is computed as $\text{LayerNorm}(v) = \gamma \frac{v - \mu}{\sigma} + \beta$, in which $\mu$ and $\sigma$ are the mean and standard deviation of the elements in $v$, i.e., $\mu = \frac{1}{d}\sum_{k=1}^{d} v_k$ and $\sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (v_k - \mu)^2$. The scale $\gamma$ and bias vector $\beta$ are parameters.
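The layer normalization above can be written directly as a small module; the sketch below matches the formula up to a small epsilon added for numerical stability (an assumption, as the text does not mention it) and is equivalent in spirit to torch.nn.LayerNorm.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """LayerNorm(v) = gamma * (v - mu) / sigma + beta, with mu and sigma taken
    over the d elements of each vector v."""
    def __init__(self, d, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d))   # scale gamma
        self.beta = nn.Parameter(torch.zeros(d))   # bias vector beta
        self.eps = eps                             # stability term, not in the formula above

    def forward(self, v):
        mu = v.mean(dim=-1, keepdim=True)                    # mu = (1/d) sum_k v_k
        var = v.var(dim=-1, keepdim=True, unbiased=False)    # sigma^2 = (1/d) sum_k (v_k - mu)^2
        return self.gamma * (v - mu) / torch.sqrt(var + self.eps) + self.beta
```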
Different orders of the sub-layers, residual connection and layer normalization in a Transformer layer lead to variants of Transformer architectures. One of the original and most popularly used architectures for the Transformer and BERT (Vaswani et al., 2017; Devlin et al., 2018) follows "self-attention (FFN) sub-layer → residual connection → layer normalization", which we call the Transformer with Post-Layer Normalization (Post-LN Transformer), as illustrated in Figure 1.

Post-LN Transformer  Denote $x_{l,i}$ as the input of the $l$-th Transformer layer at position $i$, where $x_{l,i}$ is a real-valued vector of dimension $d$, $i = 1, 2, \ldots, n$, $l = 1, 2, \ldots, L$. Here $n$ is the length of the sequence and $L$ is the number of layers. For completeness, we define $x_{0,i}$ as the input embedding at position $i$, which is usually a combination of word embedding and positional embedding. The computations inside the $l$-th layer are composed of several steps, and we use superscripts on $x$ to denote the input (output) of the different steps, as in Table 1 (left), where $W^{1,l}$, $W^{2,l}$, $b^{1,l}$ and $b^{2,l}$ are parameters of the FFN sub-layer in the $l$-th layer.
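For orientation, the sketch below composes the sub-layers defined above into one Post-LN layer following the "sub-layer → residual connection → layer normalization" order and, for contrast, one Pre-LN layer as depicted in Figure 1(b). It reuses the illustrative MultiHeadSelfAttention and PositionWiseFFN modules from the earlier sketches and is not a reproduction of the exact step-by-step notation of Table 1.

```python
import torch.nn as nn

class PostLNTransformerLayer(nn.Module):
    """Figure 1(a): sub-layer -> residual addition -> layer normalization."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, n_heads)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # attention sub-layer, residual, then LN
        x = self.ln2(x + self.ffn(x))    # FFN sub-layer, residual, then LN
        return x

class PreLNTransformerLayer(nn.Module):
    """Figure 1(b): layer normalization moved inside the residual connection.
    The additional final-layer normalization before prediction (not shown here)
    is applied once on top of the last layer's output."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, n_heads)
        self.ffn = PositionWiseFFN(d_model, d_ff)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))   # normalize first, then sub-layer, then residual
        x = x + self.ffn(self.ln2(x))
        return x
```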
3.2. The learning rate warm-up stage

We are interested in the learning rate warm-up stage in the optimization of the Post-LN Transformer. Different from the optimization of many other architectures, in which the learning rate starts from a relatively large value and then decays (Bahdanau et al., 2017; Dauphin et al., 2017), a learning rate warm-up stage seems critical for the Post-LN Transformer (Popel & Bojar, 2018). We denote the learning rate of the $t$-th iteration as $lr(t)$ and the maximum learning rate during training as $lr_{max}$. Given a predefined time frame $T_{\text{warmup}}$, the learning rate scheduler for the first $T_{\text{warmup}}$ iterations (Vaswani et al., 2018) is defined as

$$lr(t) = \frac{t}{T_{\text{warmup}}} lr_{max}, \quad t \leq T_{\text{warmup}}. \qquad (1)$$

After this warm-up stage, the learning rate is set by classical learning rate schedulers, such as linear decay, inverse square-root decay, or forced decay at particular iterations. We conduct experiments to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
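As a small worked example of Eq. (1), the function below implements the linear warm-up followed by an inverse square-root decay, one of the classical schedulers mentioned above; the example constants lr_max = 1e-3 and T_warmup = 4000 are placeholders, not the settings used in the experiments.

```python
def warmup_lr(t, lr_max, t_warmup):
    """Eq. (1): lr(t) = (t / T_warmup) * lr_max for t <= T_warmup, then an
    inverse square-root decay so the schedule is continuous at t = T_warmup."""
    if t <= t_warmup:
        return lr_max * t / t_warmup
    return lr_max * (t_warmup / t) ** 0.5

# Example: warmup_lr(2000, 1e-3, 4000) == 5e-4, and the rate peaks at
# lr_max = 1e-3 exactly at iteration t = 4000 before decaying.
```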
Experimental setting  We conduct experiments on the IWSLT14 German-to-English (De-En) machine translation