On Layer Normalization in the Transformer Architecture

Ruibin Xiong*† 1 2, Yunchang Yang* 3, Di He 4 5, Kai Zheng 4, Shuxin Zheng 5, Chen Xing 6, Huishuai Zhang 5, Yanyan Lan 1 2, Liwei Wang 4 3, Tie-Yan Liu 5

*Equal contribution. †Work done while interning at Microsoft Research Asia. 1 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Center for Data Science, Peking University, Beijing Institute of Big Data Research; 4 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; 5 Microsoft Research; 6 College of Computer Science, Nankai University. Correspondence to: Shuxin Zheng <[email protected]>, Di He <[email protected]>. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down the optimization and brings more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

1. Introduction

The Transformer (Vaswani et al., 2017) is one of the most commonly used neural network architectures in natural language processing. Layer normalization (Lei Ba et al., 2016) plays a key role in the Transformer's success. The originally designed Transformer places the layer normalization between the residual blocks, which is usually referred to as the Transformer with Post-Layer Normalization (Post-LN) (Wang et al., 2019). This architecture has achieved state-of-the-art performance in many tasks including language modeling (Dai et al., 2019; Al-Rfou et al., 2018) and machine translation (Dehghani et al., 2018; Edunov et al., 2018). Unsupervised pre-trained models based on the Post-LN Transformer architecture also show impressive performance in many downstream tasks (Radford et al., 2019; Devlin et al., 2018; Yang et al., 2019b).

Despite its great success, people usually need to deal with the optimization of the Post-LN Transformer more carefully than that of convolutional networks or other sequence-to-sequence models (Popel & Bojar, 2018). In particular, to train the model from scratch, any gradient-based optimization approach requires a learning rate warm-up stage (Vaswani et al., 2017; Liu et al., 2019a): the optimization starts with an extremely small learning rate, which is then gradually increased to a pre-defined maximum value in a pre-defined number of iterations. Such a warm-up stage not only slows down the optimization process but also brings more hyper-parameter tuning. Popel & Bojar (2018) has shown that the final model performance is quite sensitive to the value of the maximum learning rate and the number of warm-up iterations. Tuning such sensitive hyper-parameters is costly when training large-scale models, e.g., BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019b).

In this paper, we try to alleviate this problem by finding ways to safely remove the learning rate warm-up stage. As the warm-up stage happens in the first several iterations, we investigate the optimization behavior at initialization using mean field theory (Lee et al., 2017; Xiao et al., 2018; Yang et al., 2019a; Yang, 2019; Lee et al., 2019; Zhang et al., 2019). According to our theoretical analysis, when putting the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large.

Therefore, without the warm-up stage, directly using a large learning rate for those parameters can make the optimization process unstable. Using a warm-up stage and training the model with small learning rates practically avoids this problem. Extensive experiments are provided to support our theoretical findings.

Our theory also shows that the layer normalization plays a crucial role in controlling the gradient scales. This motivates us to investigate whether there are other ways of positioning the layer normalization that lead to well-behaved gradients. In particular, we study another variant, the Transformer with Pre-Layer Normalization (Pre-LN) (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019). The Pre-LN Transformer puts the layer normalization inside the residual connection and adds an additional final-layer normalization before prediction (please see Figure 1 for the differences between the two variants of the Transformer architecture). We show both theoretically and empirically that at initialization, the gradients of the Pre-LN Transformer are well-behaved, without any exploding or vanishing.

[Figure 1. (a) Post-LN Transformer layer; (b) Pre-LN Transformer layer.]

Given that the gradients are well-behaved in the Pre-LN Transformer, it is natural to consider removing the learning rate warm-up stage during training. We conduct a variety of experiments, including IWSLT14 German-English translation, WMT14 English-German translation, and BERT pre-training tasks. We show that, in all tasks, the learning rate warm-up stage can be safely removed, and thus the number of hyper-parameters is reduced. Furthermore, we observe that the loss decays faster for the Pre-LN Transformer model. It can achieve comparable final performance while using much less training time. This is particularly important for training large-scale models on large-scale datasets.

Our contributions are summarized as follows:

• We investigate two Transformer variants, the Post-LN Transformer and the Pre-LN Transformer, using mean field theory. By studying the gradients at initialization, we provide evidence to show why the learning rate warm-up stage is essential in training the Post-LN Transformer.

• We are the first to show that the learning-rate warm-up stage can be removed for the Pre-LN Transformer, which eases the hyper-parameter tuning. We further show that by using proper learning rate schedulers, the training time can be largely reduced on a wide range of applications.

2. Related work

Gradient descent-based methods (Kingma & Ba, 2014; Zeiler, 2012; Duchi et al., 2011; Tieleman & Hinton, 2012) are popularly used in optimizing deep neural networks. For convolutional neural networks and recurrent neural networks, a relatively large learning rate is usually set in the beginning and then decreased along with the optimization process (He et al., 2016; 2017; Sutskever et al., 2014; Gehring et al., 2017; He et al., 2019). The learning rate warm-up stage has only been shown essential in dealing with some very specific problems, e.g., large-batch training. Goyal et al. (2017); He et al. (2019); You et al. (2018) showed that a learning rate warm-up stage is preferred when training neural networks with extremely large batch sizes.

However, the learning rate warm-up stage is essential and critical when optimizing Transformer models in a majority of scenarios (Vaswani et al., 2017; Devlin et al., 2018; Dai et al., 2019; Radford et al., 2019; Lu et al., 2019). Popel & Bojar (2018) investigated the influence of different warm-up strategies on the optimization of the Post-LN Transformer model and found that without warm-up, or with relatively few warm-up iterations, the optimization diverges. The Pre-LN Transformer has been proposed in several recent works (Baevski & Auli, 2018; Child et al., 2019; Wang et al., 2019) to alleviate some optimization issues when training deeper models, but the troublesome warm-up stage still remains in their training pipelines.

Liu et al. (2019a) claimed that the benefit of the warm-up stage comes from reducing the variance of the adaptive learning rate in the Adam optimizer (Kingma & Ba, 2014). They proposed to rectify the variance of the adaptive learning rate with a new variant of Adam called RAdam. However, we find that the learning rate warm-up stage helps not only Adam but also other optimizers quite a lot. This may indicate that Adam is not a prerequisite for the necessity of the warm-up stage. In a concurrent and independent work, Nguyen & Salazar (2019) also empirically observed that the Pre-LN Transformer can be trained without a learning rate warm-up stage. Our work provides a more comprehensive study of this phenomenon, together with a theoretical analysis.

3. Optimization for the Transformer

3.1. Transformer with Post-Layer Normalization

The Transformer architecture usually consists of stacked Transformer layers (Vaswani et al., 2017; Devlin et al., 2018), each of which takes a sequence of vectors as input and outputs a new sequence of vectors with the same shape. A Transformer layer has two sub-layers: the (multi-head) self-attention sub-layer and the position-wise feed-forward network sub-layer. Residual connections (He et al., 2016) and layer normalization (Lei Ba et al., 2016) are applied to both sub-layers individually. We first introduce each component of the Transformer layer and then present the entire architecture.

Self-attention sub-layer  An attention function can be formulated as querying an entry with key-value pairs (Vaswani et al., 2017). The self-attention sub-layer uses scaled dot-product attention, defined as Attention(Q, K, V) = softmax(QK^T / √d) V, where d is the dimensionality of the hidden representations and Q (Query), K (Key), V (Value) are specified as the hidden representations of the previous layer. The multi-head variant of the self-attention sub-layer is popularly used; it allows the model to jointly attend to information from different representation sub-spaces and is defined as

Multi-head(Q, K, V) = Concat(head_1, · · · , head_H) W^O,
head_k = Attention(Q W_k^Q, K W_k^K, V W_k^V),

where W_k^Q ∈ R^{d×d_K}, W_k^K ∈ R^{d×d_K}, W_k^V ∈ R^{d×d_V}, and W^O ∈ R^{H d_V × d} are projection parameter matrices, H is the number of heads, and d_K and d_V are the dimensionalities of Key and Value. Without any confusion, given a sequence of vectors (x_1, ..., x_n), we use MultiHeadAtt(x_i, [x_1, x_2, · · · , x_n]) to denote the multi-head self-attention mechanism at position i, which considers the attention from x_i to the entire sequence, i.e., MultiHeadAtt(x_i, [x_1, x_2, · · · , x_n]) = Multi-head(x_i, [x_1, ..., x_n], [x_1, ..., x_n]).

Position-wise FFN sub-layer  In addition to the self-attention sub-layer, each Transformer layer contains a fully connected network, which is applied to each position separately and identically. This sub-layer is a two-layer feed-forward network with a ReLU activation function. Given a sequence of vectors h_1, ..., h_n, the computation of the position-wise FFN sub-layer on any h_i is defined as

FFN(h_i) = ReLU(h_i W^1 + b^1) W^2 + b^2,

where W^1, W^2, b^1 and b^2 are parameters.

Residual connection and layer normalization  Besides the two sub-layers described above, the residual connection and layer normalization are also key components of the Transformer. For any vector v, the layer normalization is computed as LayerNorm(v) = γ (v − µ)/σ + β, in which µ and σ are the mean and standard deviation of the elements in v, i.e., µ = (1/d) Σ_{k=1}^d v_k and σ² = (1/d) Σ_{k=1}^d (v_k − µ)². The scale γ and bias vector β are parameters.

Different orders of the sub-layers, residual connection, and layer normalization in a Transformer layer lead to variants of Transformer architectures. One of the original and most popularly used architectures for the Transformer and BERT (Vaswani et al., 2017; Devlin et al., 2018) follows "self-attention (FFN) sub-layer → residual connection → layer normalization", which we call the Transformer with Post-Layer Normalization (Post-LN Transformer), as illustrated in Figure 1.

Post-LN Transformer  Denote by x_{l,i} the input of the l-th Transformer layer at position i, where x_{l,i} is a real-valued vector of dimension d, i = 1, 2, ..., n, l = 1, 2, ..., L. Here n is the length of the sequence and L is the number of layers. For completeness, we define x_{0,i} as the input embedding at position i, which is usually a combination of the word embedding and the positional embedding. The computations inside the l-th layer are composed of several steps, and we use super-scripts on x to denote the input (output) of the different steps, as in Table 1 (left), where W^{1,l}, W^{2,l}, b^{1,l} and b^{2,l} are parameters of the FFN sub-layer in the l-th layer.

3.2. The learning rate warm-up stage

We are interested in the learning rate warm-up stage in the optimization of the Post-LN Transformer. Different from the optimization of many other architectures, in which the learning rate starts from a relatively large value and then decays (Bahdanau et al., 2017; Dauphin et al., 2017), a learning rate warm-up stage seems critical for the Post-LN Transformer (Popel & Bojar, 2018). We denote the learning rate of the t-th iteration as lr(t) and the maximum learning rate during training as lr_max. Given a predefined time frame T_warmup, the learning rate scheduler for the first T_warmup iterations (Vaswani et al., 2018) is defined as

lr(t) = (t / T_warmup) · lr_max,   t ≤ T_warmup.   (1)

After this warm-up stage, the learning rate is set by classical learning rate schedulers, such as the linear decay, the inverse square-root decay, or forced decay at particular iterations. We conduct experiments to show that this learning rate warm-up stage is essential for training Post-LN Transformer models.
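To make Eq. (1) concrete, the short Python sketch below implements the warm-up schedule together with an inverse square-root decay afterwards; pairing the two is an assumption for illustration (the text lists the inverse square-root decay as one of several classical post-warm-up schedules), and the function name is ours.

    import math

    def lr_schedule(t, lr_max, t_warmup):
        """Learning rate at iteration t (t >= 1).

        Warm-up phase follows Eq. (1): lr(t) = (t / T_warmup) * lr_max.
        After warm-up, an inverse square-root decay is assumed here.
        """
        if t <= t_warmup:
            return lr_max * t / t_warmup
        return lr_max * math.sqrt(t_warmup / t)

    # Example with lr_max = 1e-3 and T_warmup = 4000 (the Adam setting used later).
    for step in (1, 2000, 4000, 16000, 64000):
        print(step, lr_schedule(step, lr_max=1e-3, t_warmup=4000))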

Table 1. Post-LN Transformer v.s. Pre-LN Transformer

Post-LN Transformer:
  x^{post,1}_{l,i} = MultiHeadAtt(x^{post}_{l,i}, [x^{post}_{l,1}, · · · , x^{post}_{l,n}])
  x^{post,2}_{l,i} = x^{post}_{l,i} + x^{post,1}_{l,i}
  x^{post,3}_{l,i} = LayerNorm(x^{post,2}_{l,i})
  x^{post,4}_{l,i} = ReLU(x^{post,3}_{l,i} W^{1,l} + b^{1,l}) W^{2,l} + b^{2,l}
  x^{post,5}_{l,i} = x^{post,3}_{l,i} + x^{post,4}_{l,i}
  x^{post}_{l+1,i} = LayerNorm(x^{post,5}_{l,i})

Pre-LN Transformer:
  x^{pre,1}_{l,i} = LayerNorm(x^{pre}_{l,i})
  x^{pre,2}_{l,i} = MultiHeadAtt(x^{pre,1}_{l,i}, [x^{pre,1}_{l,1}, · · · , x^{pre,1}_{l,n}])
  x^{pre,3}_{l,i} = x^{pre}_{l,i} + x^{pre,2}_{l,i}
  x^{pre,4}_{l,i} = LayerNorm(x^{pre,3}_{l,i})
  x^{pre,5}_{l,i} = ReLU(x^{pre,4}_{l,i} W^{1,l} + b^{1,l}) W^{2,l} + b^{2,l}
  x^{pre}_{l+1,i} = x^{pre,3}_{l,i} + x^{pre,5}_{l,i}
  Final LayerNorm: x^{pre}_{Final,i} ← LayerNorm(x^{pre}_{L+1,i})
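To make the two orderings in Table 1 concrete, the following is a minimal PyTorch sketch of one layer of each variant. PyTorch, the class and argument names, and the default sizes are illustrative assumptions rather than details from the paper; dropout, masking, and the decoder's cross-attention are omitted.

    import torch
    import torch.nn as nn

    class PostLNLayer(nn.Module):
        """Attention -> add -> LayerNorm -> FFN -> add -> LayerNorm (Table 1, left)."""
        def __init__(self, d_model=512, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):
            a, _ = self.attn(x, x, x)      # x^{post,1}
            x = self.ln1(x + a)            # x^{post,2} -> x^{post,3}
            f = self.ffn(x)                # x^{post,4}
            return self.ln2(x + f)         # x^{post,5} -> x^{post}_{l+1}

    class PreLNLayer(nn.Module):
        """LayerNorm -> Attention -> add, then LayerNorm -> FFN -> add (Table 1, right)."""
        def __init__(self, d_model=512, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

        def forward(self, x):
            h = self.ln1(x)                # x^{pre,1}
            a, _ = self.attn(h, h, h)      # x^{pre,2}
            x = x + a                      # x^{pre,3}
            f = self.ffn(self.ln2(x))      # x^{pre,4} -> x^{pre,5}
            return x + f                   # x^{pre}_{l+1}

    # A Pre-LN encoder additionally applies one final LayerNorm to x^{pre}_{L+1,i}.
    x = torch.randn(2, 10, 512)
    print(PostLNLayer()(x).shape, PreLNLayer()(x).shape)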

[Figure omitted: validation-loss and BLEU curves against training epochs, comparing lr_max ∈ {1e−3, 5e−4} (Adam) and {1e−3, 5e−3} (SGD), each with and without warm-up (T_warmup = 500 or 4000).]
(a) Loss/BLEU on the IWSLT14 De-En task (Adam); (b) Loss/BLEU on the IWSLT14 De-En task (SGD)

Figure 2. Performances of the models optimized by Adam and SGD on the IWSLT14 De-En task.

Experimental setting  We conduct experiments on the IWSLT14 German-to-English (De-En) machine translation task. We mainly investigate two aspects: whether the learning rate warm-up stage is essential, and whether the final model performance is sensitive to the value of T_warmup. To study the first aspect, we train the model with the Adam optimizer (Kingma & Ba, 2014) and the vanilla SGD optimizer (Ruder, 2016), respectively. For both optimizers, we check whether the warm-up stage can be removed. We follow Vaswani et al. (2017) to set the hyper-parameter β to (0.9, 0.98) in Adam. We also test different lr_max for both optimizers. For Adam, we set lr_max = 5e−4 or 1e−3, and for SGD, we set lr_max = 5e−3 or 1e−3. When the warm-up stage is used, we set T_warmup = 4000, as suggested by the original paper (Vaswani et al., 2017). To study the second aspect, we set T_warmup to 1/500/4000 ("1" refers to the no-warm-up setting) and use lr_max = 5e−4 or 1e−3 with Adam. For all experiments, the same inverse square-root learning rate scheduler is used after the warm-up stage. We use both the validation loss and BLEU (Papineni et al., 2002) as the evaluation measures of model performance.

Results and discussions  We record the model checkpoints for every epoch during training and calculate the validation loss and BLEU score. The performance of the models is plotted in Figure 2(a) and Figure 2(b). The x-axis is the epoch number and the y-axis is the BLEU score/validation loss. "w/o warm-up" indicates "without the warm-up stage" while "w/ warm-up" indicates "with the warm-up stage".

First, we can see that for both optimizers, the learning rate warm-up stage is essential. Without the warm-up stage, the BLEU score of the model trained with the Adam optimizer can only reach 8.45. As a comparison, the model trained using the warm-up stage can achieve around 34 BLEU. The same trend can also be observed on the validation loss curves. Although the performance of the model trained with SGD is significantly worse than that of Adam, we can still see similar phenomena: without the warm-up stage, the BLEU score is just above zero after 15 epochs.

Second, we can see that the optimization process is sensitive to the value of T_warmup, which means T_warmup is an important hyper-parameter when training the Post-LN Transformer. For example, when setting T_warmup = 500, the models learned with Adam achieve only 31.16 and 2.77 BLEU for lr_max = 5e−4 and 1e−3, respectively.

Such a warm-up stage has several disadvantages. First, its configuration significantly affects the final performance, so practitioners need careful hyper-parameter tuning, which is computationally expensive for large-scale NLP tasks. Second, the warm-up stage could slow down the optimization. Standard optimization algorithms usually start with a large learning rate for fast convergence.

However, when using the warm-up stage, the learning rate has to gradually increase from zero, which may make the training inefficient. Liu et al. (2019a) suggest that the warm-up stage plays a role in reducing the undesirably large variance of Adam in the early stage of model training. However, according to our results, the warm-up stage also helps the training of SGD. This suggests that the benefit of the warm-up stage may not be specific to a particular optimizer.

3.3. Understanding the Transformer at initialization

We can see that the Post-LN Transformer cannot be trained with a large learning rate from scratch. This motivates us to investigate what happens at model initialization. We first introduce the parameter initialization setting for our theoretical analysis and then present our theoretical findings.

Notations  We denote L(·) as the loss function of one position and L̃(·) as the loss function of the whole sequence, ‖·‖_2 and ‖·‖_F as the l_2 norm (spectral norm) and the Frobenius norm, LN(x) as the standard layer normalization with scale γ = 1 and bias β = 0, and J_LN(x) = ∂LN(x)/∂x as the Jacobian matrix of LN(x). Let O(·) denote the standard Big-O notation that suppresses multiplicative constants.

Parameter Initialization  The parameter matrices in each Transformer layer are usually initialized by the Xavier initialization (Glorot & Bengio, 2010). Given a matrix of size n_in × n_out, the Xavier initialization sets the value of each element by independently sampling from the Gaussian distribution N(0, 2/(n_in + n_out)). The bias vectors are usually initialized as zero vectors. The scale γ in the layer normalization is set to one.

For the theoretical analysis, we study a simpler setting. First, we focus on single-head attention instead of the multi-head variant, and for all layers we set the shape of W^{Q,l}, W^{K,l}, W^{V,l}, W^{1,l}, W^{2,l} to be d × d. Second, we initialize the parameter matrices in the self-attention sub-layer, W^{Q,l} and W^{K,l}, to be zero matrices. In this setting, the attention is a uniform distribution at initialization and MultiHeadAtt(x^1_{l,i}, [x^1_{l,1}, x^1_{l,2}, · · · , x^1_{l,n}]) can be simplified as (1/n) Σ_{j=1}^n x_{l,j} W^{V,l}. Third, we assume the input vectors are also sampled from the same Gaussian distribution. This is reasonable since the inputs are linear combinations of word embeddings and learnable positional embeddings, both of which are initialized by Gaussian distributions.

Post-LN Transformer v.s. Pre-LN Transformer  We compare the Post-LN Transformer with another variant of the Transformer architecture, the Transformer with Pre-Layer Normalization (Pre-LN). The Pre-LN Transformer was implemented in several systems (Vaswani et al., 2018; Klein et al., 2018; Liu et al., 2019b). Wang et al. (2019) suggested that the Pre-LN Transformer outperforms the Post-LN Transformer when the number of layers increases. Different from the Post-LN Transformer, which puts the layer normalization between the residual blocks, the Pre-LN Transformer puts the layer normalization inside the residual connection and places it before all other non-linear transformations. Additionally, the Pre-LN Transformer uses a final layer normalization right before the prediction. We provide the mathematical formulations and visualizations of the Post-LN/Pre-LN Transformer in Table 1 and Figure 1.

For both architectures, each x_{L,i} passes through a softmax layer to produce a distribution over the dictionary V. The loss function is defined on the softmax distribution. For example, in sequence prediction, the loss function is defined as L(x^{post}_{L+1,i}) = −log(softmax_{y_i}(W^{emb} x^{post}_{L+1,i})) for the Post-LN Transformer and L(x^{pre}_{Final,i}) = −log(softmax_{y_i}(W^{emb} x^{pre}_{Final,i})) for the Pre-LN Transformer, where softmax_{y_i} is the probability of the ground-truth token y_i given by the softmax distribution and W^{emb} is the word embedding matrix. The loss of the whole sequence is an average of the loss over positions. Without loss of generality, we assume that all the derivatives are bounded. We introduce the following concentration property of random variables, which will be used in the theorem.

Definition 1. A random variable Z ≥ 0 is called (ε, δ)-bounded if with probability at least 1 − δ, (Z − EZ)/EZ ≤ ε, where ε > 0 and 0 < δ < 1.

Intuitively, if the random variable Z is (ε, δ)-bounded, then with high probability its realization will not get too far away from its expectation. For example, if Y is a d-dimensional standard Gaussian random vector, then Z = ‖Y‖²_2 is (ε, δ)-bounded with δ = exp(−dε²/8), 0 < ε < 1 (see the supplementary material for details). As the parameter matrices in the self-attention sub-layers and the FFN sub-layers are initialized by Gaussian distributions, if the norm of the hidden states in the Transformer satisfies the concentration condition above, we have the following theorem to characterize the scale of the gradients.

Theorem 1 (Gradients of the last layer in the Transformer). Assume that ‖x^{post,5}_{L,i}‖²_2 and ‖x^{pre}_{L+1,i}‖²_2 are (ε, δ)-bounded for all i, where ε and δ = δ(ε) are small numbers. Then with probability at least 0.99 − δ − ε/(0.9 + ε), for the Post-LN Transformer with L layers, the gradient of the parameters of the last layer satisfies

‖∂L̃/∂W^{2,L}‖_F ≤ O(d √(ln d)),

and for the Pre-LN Transformer with L layers,

‖∂L̃/∂W^{2,L}‖_F ≤ O(d √(ln d / L)).
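The Gaussian example after Definition 1 is easy to check numerically. The NumPy sketch below estimates how often Z = ‖Y‖²_2 deviates from EZ = d by more than a relative ε (a two-sided check is used here); the dimension, ε, and sample count are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, trials, eps = 512, 10000, 0.1

    # Z = ||Y||_2^2 for a d-dimensional standard Gaussian Y, so E[Z] = d.
    Z = (rng.standard_normal((trials, d)) ** 2).sum(axis=1)
    rel_dev = np.abs(Z - d) / d
    print("empirical P(|Z - EZ| / EZ > eps):", (rel_dev > eps).mean())
    print("delta bound exp(-d * eps^2 / 8): ", np.exp(-d * eps ** 2 / 8))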

From Theorem 1, we can see that for the Post-LN Transformer, the scale of the gradients to the last FFN layer is of order O(d √(ln d)), which is independent of L. For the Pre-LN Transformer, the scale of the gradients is much smaller. We first study the forward propagation of the Post-LN Transformer and the Pre-LN Transformer. Lemma 1 serves as a basic tool to prove the main theorem and the other lemmas.

Lemma 1. If X ∈ R^d is a Gaussian vector, X ∼ N(0, σ²I_d), then E(‖ReLU(X)‖²_2) = (1/2) σ² d.

Based on Lemma 1, we have the following lemma to estimate the scale of the hidden states in different layers for the Post-LN Transformer and the Pre-LN Transformer.

Lemma 2. At initialization, for the Post-LN Transformer, E(‖x^{post,5}_{l,i}‖²_2) = (3/2) d for all l > 0 and i. For the Pre-LN Transformer, (1 + l/2) d ≤ E(‖x^{pre}_{l,i}‖²_2) ≤ (1 + 3l/2) d for all l > 0 and i. Expectations are taken over the input and the randomness of the initialization.

Lemma 2 studies the expected norm of the hidden states in both the Post-LN and the Pre-LN Transformer. It is obvious that in the Post-LN Transformer the norm of x^{post}_{l,i} is √d, and thus we study the norm of x^{post,5}_{l,i} instead. As we can see from Lemma 2, the scale of the hidden states in the Post-LN Transformer stays the same in expectation, while the scale of the hidden states in the Pre-LN Transformer grows linearly with the depth. The next lemma shows that the scale of the hidden states is closely related to the scale of the gradients in architectures using layer normalization.

Lemma 3. For x ∈ R^d, we have ‖J_LN(x)‖_2 = O(√d / ‖x‖_2), in which J_LN(x) = ∂LN(x)/∂x.

The proofs of Lemma 1, Lemma 2, Lemma 3, and Theorem 1 can be found in the supplementary material. The main idea is that the layer normalization normalizes the gradients. In the Post-LN Transformer, the scale of the inputs to the layer normalization is independent of L, and thus the gradients of the parameters in the last layer are independent of L. In the Pre-LN Transformer, the scale of the input to the final layer normalization is linear in L, and thus the gradients of all parameters will be normalized by √L.

Extended theory to other layers/parameters  We have provided a formal proof of the gradient bound for the last FFN sub-layer as above. In order to fully understand the optimization, we also make some preliminary analysis for other layers and other parameters. Our main result is that the gradient norm in the Post-LN Transformer is large for the parameters near the output and is likely to decay as the layer index l decreases. On the contrary, the gradient norm in the Pre-LN Transformer is likely to stay the same for any layer l. All the preliminary theoretical results are provided in the supplementary material.

3.4. Empirical verification of the theory and discussion

As our theory is derived based on several simplifications of the problem, we conduct experiments to study whether our theoretical insights are consistent with what we observe in real scenarios. The general model and training configurations exactly follow Section 3.2. The experiments are repeated ten times using different random seeds.

On the concentration property  Given an initialized model, we record the hidden states in the Post-LN/Pre-LN Transformer across batches and find that the norm of the hidden states satisfies the property (it is (0.1, 0.125)-bounded).

On Theorem 1  Theorem 1 suggests that for any size of the Post-LN Transformer, the scale of the gradient norm in the last FFN sub-layer remains the same. On the contrary, that of the Pre-LN Transformer decreases as the size of the model grows. We calculate and record the gradient norm in the last FFN sub-layer in 6-6/8-8/10-10/12-12/14-14 Post-LN/Pre-LN Transformer models at initialization. The results are plotted in Figures 3(c) and 3(d). The x-axis is the size of the model, and the y-axis is the value of the gradient norm of W^2 in the final FFN sub-layer. The figures show that when the number of layers grows, the gradient norm remains roughly constant in the Post-LN Transformer (around 1.6) and decreases in the Pre-LN Transformer. This observation is consistent with our theory.

On the extended theory  We calculate the gradient norm of each parameter matrix in the 6-6 Post-LN/Pre-LN Transformer. We record the gradient of each parameter for different mini-batches. For the elements in a parameter matrix, we calculate their expected gradients and use the Frobenius norm of those values as the scale of the expected gradient of the matrix. Figures 3(a) and 3(b) show those statistics for the FFN sub-layers. The x-axis indexes different Transformer layers. It can be seen from the figure that the scale of the expected gradients grows along with the layer index for the Post-LN Transformer. On the contrary, the scale stays almost the same for different layers in the Pre-LN Transformer. These observations are consistent with our theoretical findings.
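The mechanism behind these layer-wise gradient scales is Lemma 3: the larger the input to a layer normalization, the smaller its Jacobian. The NumPy sketch below builds the Jacobian of LN (with γ = 1, β = 0) by central finite differences and compares its spectral norm with √d/‖x‖_2; the dimension and input scales are arbitrary illustrative choices.

    import numpy as np

    def layer_norm(x):
        # LN with scale gamma = 1 and bias beta = 0, as in the theoretical setting.
        return (x - x.mean()) / x.std()

    def jacobian(f, x, h=1e-5):
        # Central finite-difference Jacobian of f at x.
        d = x.size
        J = np.zeros((d, d))
        for j in range(d):
            e = np.zeros(d); e[j] = h
            J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
        return J

    rng = np.random.default_rng(0)
    d = 64
    for scale in (1.0, 4.0, 16.0):
        x = scale * rng.standard_normal(d)
        spec = np.linalg.svd(jacobian(layer_norm, x), compute_uv=False)[0]
        print(f"||x||_2 = {np.linalg.norm(x):7.2f}   ||J_LN(x)||_2 = {spec:.4f}   "
              f"sqrt(d)/||x||_2 = {np.sqrt(d) / np.linalg.norm(x):.4f}")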


[Figure omitted: gradient-expectation plots at initialization and after warm-up for Pre-LN and Post-LN models.]
(a) W^1 in the FFN sub-layers; (b) W^2 in the FFN sub-layers; (c) Pre-LN Transformer; (d) Post-LN Transformer

Figure 3. The norm of gradients of (1) different layers in the 6-6 Transformer (a, b) and (2) W^{2,L} in different sizes of the Transformer (c, d).

The critical warm-up stage for the Post-LN Transformer  Given the analysis above, we hypothesize that the gradient scale is one of the reasons that the Post-LN Transformer needs a careful learning rate schedule. Since the gradients are large for some layers, using a large learning rate without warm-up may make the training unstable.

To verify this argument, first, we study the gradient statistics for the Post-LN Transformer after the warm-up stage with Adam. It can be seen from Figures 3(a) and 3(b) that the scale of the gradients is then very small, and the model can be trained with large learning rates. Second, we conduct an experiment to train the Post-LN Transformer from scratch using a fixed small learning rate, i.e., 1e−4, to verify whether using small-step updates mitigates the issue. The details are provided in the supplementary material. In general, using a very small and fixed learning rate can mitigate the problem and optimize the Post-LN Transformer to a certain extent, but the convergence is significantly slower. Both experiments above support our claim.

4. Experiments

We found in the previous section that the gradients at initialization for the Pre-LN Transformer are well-behaved. Given this observation, we deduce that the learning rate warm-up stage can be safely removed when training the Pre-LN Transformer. In this section, we empirically verify this on two main tasks in NLP, machine translation and unsupervised pre-training.

4.1. Experiment Settings

Machine Translation  We conduct our experiments on two widely used tasks: the IWSLT14 German-to-English (De-En) task and the WMT14 English-to-German (En-De) task. For the IWSLT14 De-En task, we use the same model configuration as in Section 3. For the WMT14 En-De task, we use the Transformer base setting. More details can be found in the supplementary material.

For training the Pre-LN Transformer, we remove the learning rate warm-up stage. On the IWSLT14 De-En task, we set the initial learning rate to 5e−4 and decay the learning rate at the 8-th epoch by 0.1. On the WMT14 En-De task, we run two experiments in which the initial learning rates are set to 7e−4 and 1.5e−3, respectively. Both learning rates are decayed at the 6-th epoch, followed by the inverse square-root learning rate scheduler.

We train the Post-LN Transformer using the learning rate warm-up stage as the baseline. In both the IWSLT14 De-En task and the WMT14 En-De task, we set the length of the warm-up stage to 4000 following Vaswani et al. (2017) and then use the inverse square-root learning rate scheduler. For all the experiments above, we use the Adam optimizer and set the hyper-parameter β to (0.9, 0.98). We set lr_max to the same value as the initial learning rate of the Pre-LN Transformer in each corresponding experiment. Since Liu et al. (2019a) suggest that the learning rate warm-up stage can be removed using RAdam, we also try this optimizer on the IWSLT14 De-En task. We use the linear learning rate decay suggested by Liu et al. (2019a) and keep all other hyper-parameters the same as in the other experiments.

Unsupervised Pre-training (BERT)  We follow Devlin et al. (2018) to use the English Wikipedia corpus and BookCorpus for pre-training. As the dataset BookCorpus (Zhu et al., 2015) is no longer freely distributed, we follow the suggestions from Devlin et al. (2018) to crawl and collect BookCorpus on our own. The concatenation of the two datasets contains roughly 3.4B words in total, which is comparable with the data corpus used in Devlin et al. (2018). We randomly split documents into one training set and one validation set. The training-validation ratio for pre-training is 199:1.

We use the base model configuration in our experiments. Similar to the translation tasks, we train the Pre-LN BERT without the warm-up stage and compare it with the Post-LN BERT. We follow the same hyper-parameter configuration as Devlin et al. (2018) to train the Post-LN BERT, using 10k warm-up steps with lr_max = 1e−4. For the Pre-LN BERT, we use linear learning rate decay starting from 3e−4 without the warm-up stage. We tried to use a larger learning rate (such as 3e−4) for the Post-LN BERT but found that the optimization diverged.
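For concreteness, the snippet below sketches the two IWSLT14 schedules described above with PyTorch's LambdaLR. PyTorch itself, the stand-in model, and the assumed number of updates per epoch are illustrative assumptions rather than details from the paper.

    import torch

    model = torch.nn.Linear(512, 512)  # stand-in for the Transformer; illustration only

    # Post-LN baseline: 4000 warm-up steps, then inverse square-root decay.
    post_opt = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
    post_sched = torch.optim.lr_scheduler.LambdaLR(
        post_opt,
        lr_lambda=lambda step: min((step + 1) / 4000, (4000 / (step + 1)) ** 0.5),
    )

    # Pre-LN on IWSLT14 De-En: no warm-up, start at 5e-4 and decay by 0.1 at the 8-th
    # epoch; roughly 1400 updates per epoch is assumed here purely for illustration.
    pre_opt = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
    pre_sched = torch.optim.lr_scheduler.LambdaLR(
        pre_opt, lr_lambda=lambda step: 1.0 if step < 8 * 1400 else 0.1
    )

    for _ in range(10):  # one scheduler step per parameter update
        post_opt.step(); post_sched.step()
        pre_opt.step(); pre_sched.step()
    print(post_sched.get_last_lr(), pre_sched.get_last_lr())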

[Figure omitted: validation-loss and BLEU curves against training epochs, comparing Post-LN (Adam w/ warm-up, RAdam w/o warm-up) with Pre-LN (Adam/RAdam w/o warm-up) on IWSLT14, and lr_max = 7e−4/1.5e−3 with and without warm-up on WMT14.]
(a) Validation Loss (IWSLT); (b) BLEU (IWSLT); (c) Validation Loss (WMT); (d) BLEU (WMT)

Figure 4. Performances of the models on the IWSLT14 De-En task and WMT14 En-De task.

[Figure omitted: pre-training validation loss and downstream accuracy against pre-training steps for the Post-LN and Pre-LN models.]
(a) Validation Loss on BERT; (b) Accuracy on MRPC; (c) Accuracy on RTE

Figure 5. Performances of the models on unsupervised pre-training (BERT) and downstream tasks.

4.2. Experiment Results

Machine Translation  We record the model checkpoints for every epoch during training and calculate the validation loss and BLEU score. The performance of the models at different checkpoints is plotted in Figures 4(a)-4(d).

First, as we can see from the figure, the learning rate warm-up stage is no longer critical for training the Pre-LN Transformer, and the performance of the learned model is competitive. For example, on the IWSLT14 De-En task, the BLEU score and validation loss of the Pre-LN Transformer can reach around 34 and 4, which are comparable with the performance of the Post-LN Transformer.

Second, the Pre-LN Transformer converges faster than the Post-LN Transformer using the same lr_max. On the IWSLT14 De-En task, the 9-th checkpoint of the Pre-LN Transformer achieves nearly the same performance (validation loss/BLEU score) as the 15-th checkpoint of the Post-LN Transformer. Similar observations can be made on the WMT14 En-De task.

Third, compared with RAdam, we find that the change of the position of layer normalization "dominates" the change of the optimizer. According to our experiments on the IWSLT14 De-En task, although RAdam trains the Post-LN Transformer well without the warm-up stage, it makes little difference compared with Adam when training the Pre-LN Transformer.

Unsupervised Pre-training (BERT)  We record the validation loss of the model checkpoints and plot them in Figure 5(a). Similar to the machine translation tasks, the learning rate warm-up stage can be removed for the Pre-LN model, and the Pre-LN model can be trained faster. For example, the Pre-LN model reaches the validation loss of about 1.69 at around 500k updates, whereas the Post-LN model needs about 700k updates to reach a similar validation loss, which suggests a roughly 40% speed-up. Note that T_warmup (10k) is far less than this acceleration (200k), which suggests that the Pre-LN Transformer is easier to optimize using larger learning rates. We also evaluate different model checkpoints on the downstream tasks MRPC and RTE (more details can be found in the supplementary material). The experimental results are plotted in Figures 5(b) and 5(c). We can see that the Pre-LN model also converges faster on the downstream tasks.

In summary, all the experiments on different tasks show that training the Pre-LN Transformer does not rely on the learning rate warm-up stage, and the Pre-LN Transformer can be trained much faster than the Post-LN Transformer.

5. Conclusion and Future Work

In this paper, we study why the learning rate warm-up stage is important in training the Transformer and show that the location of layer normalization matters. We show that in the original Transformer, which places the layer normalization outside the residual blocks, the expected gradients of the parameters near the output layer are large at initialization. This leads to unstable training when a large learning rate is used. We further show that the Transformer which places the layer normalization inside the residual blocks can be trained without the warm-up stage and converges much faster. In the future, we will investigate other strategies for positioning the layer normalization and understand the optimization of the Transformer from a theoretical perspective.

References

Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.

Baevski, A. and Auli, M. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. 2017.

Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. The fifth PASCAL recognizing textual entailment challenge. 2009.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pp. 933-941, 2017.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dolan, W. B. and Brockett, C. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381, 2018.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243-1252, 2017.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249-256, 2010.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961-2969, 2017.

He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558-567, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Klein, G., Kim, Y., Deng, Y., Nguyen, V., Senellart, J., and Rush, A. Opennmt: Neural machine translation toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), volume 1, pp. 177-184, 2018.

Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177-180, 2007.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Lei Ba, J., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019a.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019b.

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762, 2019.

Nguyen, T. Q. and Salazar, J. Transformers without tears: Improving the normalization of self-attention. arXiv preprint arXiv:1910.05895, 2019.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311-318. Association for Computational Linguistics, 2002.

Popel, M. and Bojar, O. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43-70, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In ACL, 2016.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104-3112, 2014.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998-6008, 2017.

Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.

Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787, 2019.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pp. 5389-5398, 2018.

Yang, G. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., and Schoenholz, S. S. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019a.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019b.

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, pp. 1. ACM, 2018.

Zeiler, M. D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.

A. Experimental Settings

A.1. Machine Translation

Experiment on Section 3  The training/validation/test sets of the IWSLT14 German-to-English (De-En) task contain about 153K/7K/7K sentence pairs, respectively. We use a vocabulary of 10K tokens based on a joint source and target byte pair encoding (BPE) (Sennrich et al., 2015). All of our experiments use a Transformer architecture with a 6-layer encoder and 6-layer decoder. The size of the embedding is set to 512, the sizes of the hidden nodes in the attention sub-layer and the position-wise feed-forward network sub-layer are set to 512 and 1024, and the number of heads is set to 4. Label-smoothed cross entropy is used as the objective function with ε = 0.1 (Szegedy et al., 2016), and we apply dropout with a ratio of 0.1. The batch size is set to 4096 tokens. When we decode translation results from the model during inference, we set the beam size to 5 and the length penalty to 1.2.

Experiment on Section 4  The configuration of the IWSLT14 De-En task is the same as in Section 3.¹ For the WMT14 En-De task, we replicate the setup of Vaswani et al. (2017), which consists of about 4.5M training parallel sentence pairs and uses a 37K vocabulary based on a joint source and target BPE. Newstest2013 is used as the validation set, and Newstest2014 is used as the test set. One of the basic configurations of the Transformer architecture is the base setting, which consists of a 6-layer encoder and 6-layer decoder. The sizes of the hidden nodes and embeddings are set to 512. The number of heads is 8. Label-smoothed cross entropy is used as the objective function with ε = 0.1. The batch size is set to 8192 tokens per GPU on 16 NVIDIA Tesla P40 GPUs.

A.2. Unsupervised Pretraining

We follow Devlin et al. (2018) to use the English Wikipedia corpus and BookCorpus for the pre-training. As the dataset BookCorpus (Zhu et al., 2015) is no longer freely distributed, we follow the suggestions from Devlin et al. (2018) to crawl and collect BookCorpus² on our own. The concatenation of the two datasets includes roughly 3.4B words in total, which is comparable with the data corpus used in Devlin et al. (2018). We first segment documents into sentences with Spacy³; then, we normalize, lower-case, and tokenize texts using Moses (Koehn et al., 2007) and apply BPE (Sennrich et al., 2016). We randomly split documents into one training set and one validation set. The training-validation ratio for pre-training is 199:1. All experiments are conducted on 32 NVIDIA Tesla P40 GPUs.

The base model in Devlin et al. (2018) consists of 12 Transformer layers. The size of the hidden nodes and embeddings is set to 768, and the number of heads is set to 12.

A.3. GLUE Dataset

MRPC  The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent; the task is to predict the equivalence. The performance is evaluated by accuracy.

RTE  The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges (Bentivogli et al., 2009). The task is to predict whether the sentences in a sentence pair form an entailment. The performance is evaluated by accuracy.

Fine-tuning on GLUE tasks  We use the validation set for evaluation. To fine-tune the models, following Devlin et al. (2018); Liu et al. (2019b), we search the optimization hyper-parameters in a space including different batch sizes (16/32), learning rates (1e−5 - 1e−4), and numbers of epochs (3-8). We find that the validation accuracy is sensitive to random seeds, so we repeat the fine-tuning on each task 6 times using different random seeds and compute the 95% confidence interval of the validation accuracy.

¹ The Pre-LN Transformer can get state-of-the-art performance (35.5 test BLEU) on the IWSLT14 De-En task by setting the initial learning rate to 7.5e−4 and decaying it at 8000 update steps, followed by the inverse square-root learning rate scheduler. The dropout is set to 0.3, the attention dropout to 0.1, and the batch size to 8192.
² https://www.smashwords.com
³ https://spacy.io

B. Proof of Lemma 1

Proof. Denote X = (X_1, X_2, ..., X_d), in which the X_i are i.i.d. Gaussian random variables with distribution N(0, σ²). Denote ρ_X(x) as the probability density function of X_1. Then E(‖ReLU(X)‖²_2) = Σ_{i=1}^d E[ReLU(X_i)²] = Σ_{i=1}^d E[ReLU(X_i)² | X_i ≥ 0] P(X_i ≥ 0) = (d/2) E[ReLU(X_1)² | X_1 ≥ 0] = (d/2) E[X_1² | X_1 ≥ 0] = (d/2) ∫_{−∞}^{+∞} x² ρ_{X|X>0}(x) dx = (d/2) ∫_0^{+∞} x² · 2ρ_X(x) dx = (1/2) σ² d.
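As a quick numerical sanity check of Lemma 1, the NumPy snippet below estimates E(‖ReLU(X)‖²_2) by Monte Carlo and compares it with σ²d/2; the dimension, σ, and sample count are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma, trials = 256, 2.0, 20000

    X = sigma * rng.standard_normal((trials, d))
    empirical = (np.maximum(X, 0.0) ** 2).sum(axis=1).mean()
    print("Monte-Carlo E||ReLU(X)||^2:", empirical, "  sigma^2 * d / 2:", 0.5 * sigma ** 2 * d)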

output xpost,1
l,i , · · · , xpost,5
l,i for l > 0. Using Xavier initial- Pre-LN Transformer.
ization, the elements in W V,l are i.i.d. Gaussian random
variables sampled from N (0, 1/d). Since kxpost 2 E(kxpre,3 k22 ) =E(kxpre 2 pre,2 2
l,i k2 ) + E(kxl,i k2 )
l,i k2 = d by l,i
the definition of Layer Normalization when l > 0, we have + 2E(xpre,2 xpre )
>
(14)
l,i l,i

E(kxpost,2
l,i k22 ) =E(kxpost 2 post,1 2
l,i k2 ) + E(kxl,i k2 ) =E(kxpre 2 pre,2 2
l,i k2 ) + E(kxl,i k2 )
> n
+ 2E(xpost,1
l,i xpost
l,i ) (2) +
2 X pre,1 V,l pre >
E( x W xl,i ) (15)
n j=1 l,j
=E(kxpost 2 post,1 2
l,i k2 ) + E(kxl,i k2 )
n
2 X post V,l post > =E(kxpre 2 pre,2 2
l,i k2 ) + E(kxl,i k2 ) (16)
+ E( x W xl,i ) (3) n
n j=1 l,j 1 X
=E(kxpre 2
l,i k2 ) + E(k xpre,1
l,i k22 ) (17)
n
=E(kxpost 2
l,i k2 ) + E(kxpost,1
l,i k22 ) (4) i=1
n
1 X post 2 It is easy to see that we have E(kxpre 2 pre,3 2
=E(kxpost l,i k2 ) ≤ E(kxl,i k2 ) ≤
2
l,i k2 ) + E(k x k ) (5)
n i=1 l,i 2 E(kxpre k 2
) + d. And similar to (10)-(12),
l,i 2
≤2d (6)
E(kxpre 2 pre,3 2
l+1,i k2 ) =E(kxl,i k2 ) + E(kxpre,5
l,i k22 )
and E(kxpost,2 k22 ) = E(kxpost 2 post,1 2
l,i k2 ) + E(kxl,i k2 ) = >
l,i
post 2 1
Pn post 2 post 2 + 2E(xpre,3
l,i xpre,5
l,i ) (18)
E(kxl,i k2 ) + E(k n i=1 xl,i k2 ) ≥ E(kxl,i k2 ) = d.
=E(kxpre,3
l,i k22 ) + E(kxpre,5
l,i k22 ) (19)
Similarly, we have kxpost,3
l,i k22 = d by the definition of Layer 1
Normalization. Again, for the ReLU activation function, =E(kxpre,3
l,i k22 ) + d (20)
2
the elements in W 1,l and W 2,l are i.i.d. Gaussian random
variables sampled from N (0, 1/d). According to Lemma 1,
we have Combining both, we have E(kxpre 2 1
l,i k2 ) + 2 d ≤
pre 2 pre 2 3
E(kxl+1,i k2 ) ≤ E(kxl,i k2 ) + 2 d. Then we have (1 +
E(kxpost,4 k22 ) =E(kReLU(xpost,3 W 1,l )W 2,l k22 ) (7) l pre 2 3l
l,i l,i 2 )d ≤ E(kxl,i k2 ) ≤ (1 + 2 )d by induction.
=E(E(E(kReLU(xpost,3 l,i W 1,l )W 2,l k22
|xpost,3
l,i , W 1,l )|xpost,3
l,i )) (8)
post,3 1,l 2 post,3
=E(E(kReLU(xl,i W )k2 |xl,i )) D. Proof of Lemma 3
(9) The proof of Lemma 3 is based on Lemma 4.1:
1 d Lemma 4. Let α ∈ Rd be a vector such that kαk2 = 1, then
=E( kxpost,3
l,i k22 ) = (10)
2 2 the eigenvalue of I − α> α is either 1 or 0.

Based on this, we can estimate the scale of E(kxpost,5


l,i k22 )
Proof. Let {e1 , ..., ed } be unit vectors such that e1 = α and
as follows.
ei ⊥ej for all (i, j). Then we have e1 (I − α> α) = e1 −
E(kxpost,5 k22 ) =E(kxpost,3 k22 ) + E(kxpost,4 k22 ) e1 α> α = e1 − α = 0 and ei (I − α> α) = ei − ei α> α = ei
l,i l,i l,i
for i 6= 1. So ei are all the eigenvectors of I − α> α, and
>
+ 2E(xpost,3
l,i xpost,4
l,i ) (11) their corresponding eigenvalues are (0, 1, 1, ..., 1). Hence
we complete our proof.
=E(kxpost,3
l,i k22 ) + E(kxpost,4
l,i k22 )
n
2 X >
+ E( ReLU(xpost,3
l,j W 1,l )W 2,l xpost,3
l,i ) Proof of Lemma 3. Denote y = x(I − d1 1> 1), where 1 =
n j=1
(1, 1, ..., 1) ∈ Rd , then the layer normalization can be rewrit-
(12) ten as
d 3 yi
=E(kxpost,3
l,i k22 ) + E(kxpost,4
l,i k22 ) = d + = d LN(x)i = q P (21)
2 2 1 d 2
(13) d j=1 yj

Using similar technique we can bound E(kxpre 2


l,i k2 ) for the We explicitly calculate the Jacobian of layer normalization
On Layer Normalization in the Transformer Architecture

We explicitly calculate the Jacobian of layer normalization as

$$\frac{\partial \mathrm{LN}(x)_i}{\partial y_j} = \frac{\partial}{\partial y_j}\Bigg(\frac{y_i}{\sqrt{\frac{1}{d}\sum_{k=1}^{d} y_k^2}}\Bigg) \qquad (22)$$
$$= \frac{\delta_{ij}\sqrt{\frac{1}{d}\sum_{k=1}^{d} y_k^2} - y_i\frac{1}{\sqrt{d}}\frac{y_j}{\sqrt{\sum_{k=1}^{d} y_k^2}}}{\frac{1}{d}\sum_{k=1}^{d} y_k^2} \qquad (23)$$
$$= \sqrt{d}\,\frac{\delta_{ij}\|y\|_2^2 - y_i y_j}{\|y\|_2^3} = \frac{\sqrt{d}}{\|y\|_2}\Big(\delta_{ij} - \frac{y_i y_j}{\|y\|_2^2}\Big) \qquad (24)$$

where $\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ when $i \neq j$. In matrix form,

$$\frac{\partial \mathrm{LN}(x)}{\partial y} = \frac{\sqrt{d}}{\|y\|_2}\Big(I - \frac{y^\top y}{\|y\|_2^2}\Big) \qquad (25)$$

and

$$J_{LN}(x) = \frac{\partial \mathrm{LN}(x)}{\partial x} \qquad (26)$$
$$= \frac{\partial \mathrm{LN}(x)}{\partial y}\frac{\partial y}{\partial x} \qquad (27)$$
$$= \frac{\sqrt{d}}{\|y\|_2}\Big(I - \frac{y^\top y}{\|y\|_2^2}\Big)\Big(I - \frac{1}{d}\mathbf{1}^\top\mathbf{1}\Big). \qquad (28)$$

Since the eigenvalues of the matrices $\big(I - \frac{y^\top y}{\|y\|_2^2}\big)$ and $\big(I - \frac{1}{d}\mathbf{1}^\top\mathbf{1}\big)$ are either 1 or 0 (by Lemma 4), we have $\big\|I - \frac{y^\top y}{\|y\|_2^2}\big\|_2 = O(1)$ and $\big\|I - \frac{1}{d}\mathbf{1}^\top\mathbf{1}\big\|_2 = O(1)$. So the spectral norm of $J_{LN}(x)$ is

$$\|J_{LN}(x)\|_2 = O\Big(\frac{\sqrt{d}}{\|y\|_2}\Big) = O\Big(\frac{\sqrt{d}}{\|x\|_2}\Big) \qquad (29)$$

E. Proof of Theorem 1

The proof of Theorem 1 is based on Lemma 5 below:

Lemma 5. Let $Y$ be a random variable that is never larger than $B$. Then for all $a < B$,

$$\Pr[Y \le a] \le \frac{E[B - Y]}{B - a} \qquad (30)$$

Proof. Let $X = B - Y$. Then $X \ge 0$, and Markov's inequality tells us that

$$\Pr[X \ge B - a] \le \frac{E[X]}{B - a} \qquad (31)$$

Hence

$$\Pr[Y \le a] \le \frac{E[B - Y]}{B - a} \qquad (32)$$

Proof of Theorem 1. We prove Theorem 1 by estimating each element of the gradient matrix. Namely, we will analyze $\frac{\partial \tilde{L}}{\partial W^{2,L}_{pq}}$ for $p, q \in \{1, \ldots, d\}$. The loss of the Post-LN Transformer can be written as

$$\tilde{L}(x_{L+1,1}^{post}, \ldots, x_{L+1,n}^{post}) = \frac{1}{n}\sum_{i=1}^{n} L(x_{L+1,i}^{post}) \qquad (33)$$

Through back propagation, for each $i \in \{1, 2, \ldots, n\}$ the gradient of $L(x_{L+1,i}^{post})$ with respect to the last layer's parameter $W^{2,L}$ in the Post-LN setting can be written as:

$$\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}} = \frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}}\frac{\partial x_{L+1,i}^{post}}{\partial x_{L,i}^{post,5}}\frac{\partial x_{L,i}^{post,5}}{\partial x_{L,i}^{post,4}}\frac{\partial x_{L,i}^{post,4}}{\partial W^{2,L}_{pq}} \qquad (34)$$
$$= \frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}} J_{LN}(x_{L,i}^{post,5}) \frac{\partial x_{L,i}^{post,4}}{\partial W^{2,L}_{pq}} \qquad (35)$$
$$= \frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}} J_{LN}(x_{L,i}^{post,5}) \big(0, 0, \ldots, [\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p, \ldots, 0\big)^\top \qquad (36)$$

Here $[\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p$ means the $p$-th element of $\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})$. So the absolute value of $\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}$ can be bounded by

$$\Big|\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\Big| \le \Big\|\frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}}\Big\|_2 \|J_{LN}(x_{L,i}^{post,5})\|_2 \big\|\big(0, 0, \ldots, [\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p, \ldots, 0\big)^\top\big\|_2 \qquad (37)$$
$$= \Big\|\frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}}\Big\|_2 \|J_{LN}(x_{L,i}^{post,5})\|_2 \,\big|[\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p\big| \qquad (38)$$

which implies

$$\Big|\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\Big|^2 \le \Big\|\frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}}\Big\|_2^2 \|J_{LN}(x_{L,i}^{post,5})\|_2^2 \,\big|[\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p\big|^2 \qquad (39)$$

Since all the derivatives are bounded, we have $\big\|\frac{\partial L(x_{L+1,i}^{post})}{\partial x_{L+1,i}^{post}}\big\|_2^2 = O(1)$. So

$$\Big|\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\Big|^2 = O\big(\|J_{LN}(x_{L,i}^{post,5})\|_2^2 \,\big|[\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p\big|^2\big) \qquad (40)$$
Since $\|x_{L,i}^{post,3}\|_2^2 = d$, $[x_{L,i}^{post,3}W^{1,L}]_p$ has distribution $N(0, 1)$, and using the Chernoff bound we have

$$\Pr\big[|[x_{L,i}^{post,3}W^{1,L}]_p| \ge a_0\big] \le \exp\Big(-\frac{a_0^2}{2}\Big).$$

So

$$\Pr\big[\mathrm{ReLU}([x_{L,i}^{post,3}W^{1,L}]_p)^2 \ge 2\ln 100d\big] \le \frac{0.01}{d}.$$

Thus with probability at least 0.99, for all $p = 1, 2, \ldots, d$ we have $\mathrm{ReLU}([x_{L,i}^{post,3}W^{1,L}]_p)^2 \le 2\ln 100d$.

Since with probability $1 - \delta(\epsilon)$ we have $\frac{|\|x_{L,i}^{post,5}\|_2^2 - E\|x_{L,i}^{post,5}\|_2^2|}{E\|x_{L,i}^{post,5}\|_2^2} \le \epsilon$, it holds that $\|x_{L,i}^{post,5}\|_2^2 \le (1 + \epsilon)E\|x_{L,i}^{post,5}\|_2^2$. Using Lemma 5, we have

$$\Pr\big[\|x_{L,i}^{post,5}\|_2^2 \le \alpha_0 E\|x_{L,i}^{post,5}\|_2^2\big] \qquad (41)$$
$$\le \frac{(1 + \epsilon)E\|x_{L,i}^{post,5}\|_2^2 - E\|x_{L,i}^{post,5}\|_2^2}{(1 + \epsilon - \alpha_0)E\|x_{L,i}^{post,5}\|_2^2} \qquad (42)$$
$$= \frac{\epsilon}{1 + \epsilon - \alpha_0} \qquad (43)$$

for an arbitrary constant $\alpha_0 > 0$, which equals

$$\Pr\big[\|x_{L,i}^{post,5}\|_2^2 \ge \alpha_0 E\|x_{L,i}^{post,5}\|_2^2\big] \ge 1 - \frac{\epsilon}{1 + \epsilon - \alpha_0} \qquad (44)$$

So according to the union bound, with probability at least $0.99 - \delta(\epsilon) - \frac{\epsilon}{1 + \epsilon - \alpha_0}$ we have $\big|\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\big|^2 = O\big(\|J_{LN}(x_{L,i}^{post,5})\|_2^2 |[\mathrm{ReLU}(x_{L,i}^{post,3}W^{1,L})]_p|^2\big) \le O\big(\frac{2d\ln 100d}{\|x_{L,i}^{post,5}\|_2^2}\big) \le O\big(\frac{2d\ln 100d}{\alpha_0 E\|x_{L,i}^{post,5}\|_2^2}\big) = O\big(\frac{\ln d}{\alpha_0}\big)$. So we have

$$\Big|\frac{\partial \tilde{L}}{\partial W^{2,L}_{pq}}\Big|^2 = \Big|\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\Big|^2 \qquad (45)$$
$$\le \frac{1}{n}\sum_{i=1}^{n}\Big|\frac{\partial L(x_{L+1,i}^{post})}{\partial W^{2,L}_{pq}}\Big|^2 = O\Big(\frac{\ln d}{\alpha_0}\Big) \qquad (46)$$

and

$$\Big\|\frac{\partial \tilde{L}}{\partial W^{2,L}}\Big\|_F = \sqrt{\sum_{p,q=1}^{d}\Big|\frac{\partial \tilde{L}}{\partial W^{2,L}_{pq}}\Big|^2} = O\Bigg(\sqrt{\frac{d^2\ln d}{\alpha_0}}\Bigg).$$

The loss of the Pre-LN Transformer can be written as

$$\tilde{L}(x_{Final,1}^{pre}, \ldots, x_{Final,n}^{pre}) = \frac{1}{n}\sum_{i=1}^{n} L(x_{Final,i}^{pre}) \qquad (47)$$

Using the same technique, in the Pre-LN setting the gradient of $L(x_{Final,i}^{pre})$ with respect to the last layer's parameter $W^{2,L}$ can be written as

$$\frac{\partial L(x_{Final,i}^{pre})}{\partial W^{2,L}_{pq}} = \frac{\partial L(x_{Final,i}^{pre})}{\partial x_{Final,i}^{pre}}\frac{\partial x_{Final,i}^{pre}}{\partial x_{L+1,i}^{pre}}\frac{\partial x_{L+1,i}^{pre}}{\partial x_{L,i}^{pre,5}}\frac{\partial x_{L,i}^{pre,5}}{\partial W^{2,L}_{pq}} \qquad (48)$$
$$= \frac{\partial L(x_{Final,i}^{pre})}{\partial x_{Final,i}^{pre}} J_{LN}(x_{L+1,i}^{pre})\big(0, 0, \ldots, [\mathrm{ReLU}(x_{L,i}^{pre,4}W^{1,L})]_p, \ldots, 0\big)^\top \qquad (49)$$

So the absolute value of each component of the gradient is bounded by

$$\Big|\frac{\partial L(x_{Final,i}^{pre})}{\partial W^{2,L}_{pq}}\Big| \le \Big\|\frac{\partial L(x_{Final,i}^{pre})}{\partial x_{Final,i}^{pre}}\Big\|_2 \|J_{LN}(x_{L+1,i}^{pre})\|_2 \big\|\big(0, 0, \ldots, [\mathrm{ReLU}(x_{L,i}^{pre,4}W^{1,L})]_p, \ldots, 0\big)\big\|_2 \qquad (50)$$
$$= \Big\|\frac{\partial L(x_{Final,i}^{pre})}{\partial x_{Final,i}^{pre}}\Big\|_2 \|J_{LN}(x_{L+1,i}^{pre})\|_2 \,\big|[\mathrm{ReLU}(x_{L,i}^{pre,4}W^{1,L})]_p\big| \qquad (51)$$

Since $\|x_{L,i}^{pre,4}\|_2^2 = d$ and $[x_{L,i}^{pre,4}W^{1,L}]_p$ obeys the distribution $N(0, 1)$, using the Chernoff bound we have

$$\Pr\big[|[x_{L,i}^{pre,4}W^{1,L}]_p| \ge a_0\big] \le \exp\Big(-\frac{a_0^2}{2}\Big).$$

So

$$\Pr\big[\mathrm{ReLU}([x_{L,i}^{pre,4}W^{1,L}]_p)^2 \ge 2\ln 100d\big] \le \frac{0.01}{d}.$$

So with probability at least 0.99, for all $p = 1, 2, \ldots, d$ we have $\mathrm{ReLU}([x_{L,i}^{pre,4}W^{1,L}]_p)^2 \le 2\ln 100d$.

Since with probability $1 - \delta(\epsilon)$ we have $\frac{|\|x_{L+1,i}^{pre}\|_2^2 - E\|x_{L+1,i}^{pre}\|_2^2|}{E\|x_{L+1,i}^{pre}\|_2^2} \le \epsilon$, it holds that $\|x_{L+1,i}^{pre}\|_2^2 \le (1 + \epsilon)E\|x_{L+1,i}^{pre}\|_2^2$. Using Lemma 5, we have

$$\Pr\big[\|x_{L+1,i}^{pre}\|_2^2 \le \alpha_0 E\|x_{L+1,i}^{pre}\|_2^2\big] \qquad (52)$$
$$\le \frac{(1 + \epsilon)E\|x_{L+1,i}^{pre}\|_2^2 - E\|x_{L+1,i}^{pre}\|_2^2}{(1 + \epsilon - \alpha_0)E\|x_{L+1,i}^{pre}\|_2^2} \qquad (53)$$
$$= \frac{\epsilon}{1 + \epsilon - \alpha_0} \qquad (54)$$

which equals

$$\Pr\big[\|x_{L+1,i}^{pre}\|_2^2 \ge \alpha_0 E\|x_{L+1,i}^{pre}\|_2^2\big] \ge 1 - \frac{\epsilon}{1 + \epsilon - \alpha_0} \qquad (55)$$
According to the union bound, with probability at least $0.99 - \delta(\epsilon) - \frac{\epsilon}{1 + \epsilon - \alpha_0}$ we have $\big|\frac{\partial L(x_{Final,i}^{pre})}{\partial W^{2,L}_{pq}}\big|^2 = O\big(\|J_{LN}(x_{L+1,i}^{pre})\|_2^2 |[\mathrm{ReLU}(x_{L,i}^{pre,4}W^{1,L})]_p|^2\big) \le O\big(\frac{2d\ln 100d}{\|x_{L+1,i}^{pre}\|_2^2}\big) \le O\big(\frac{2d\ln 100d}{\alpha_0 E\|x_{L+1,i}^{pre}\|_2^2}\big) = O\big(\frac{\ln d}{\alpha_0 L}\big)$. So we have

$$\Big|\frac{\partial \tilde{L}}{\partial W^{2,L}_{pq}}\Big|^2 = \Big|\frac{1}{n}\sum_{i=1}^{n}\frac{\partial L(x_{Final,i}^{pre})}{\partial W^{2,L}_{pq}}\Big|^2 = O\Big(\frac{\ln d}{\alpha_0 L}\Big) \qquad (56)$$

Thus $\big\|\frac{\partial \tilde{L}}{\partial W^{2,L}}\big\|_F = \sqrt{\sum_{p,q=1}^{d}\big|\frac{\partial \tilde{L}}{\partial W^{2,L}_{pq}}\big|^2} \le O\Big(\sqrt{\frac{d^2\ln d}{\alpha_0 L}}\Big)$.

Take $\alpha_0 = \frac{1}{10}$. We have that with probability at least $0.99 - \delta(\epsilon) - \frac{\epsilon}{0.9 + \epsilon}$, for the Post-LN Transformer we have $\big\|\frac{\partial \tilde{L}}{\partial W^{2,L}}\big\|_F \le O(d\sqrt{\ln d})$, and for the Pre-LN Transformer we have $\big\|\frac{\partial \tilde{L}}{\partial W^{2,L}}\big\|_F \le O\big(d\sqrt{\frac{\ln d}{L}}\big)$.

F. Extension to other layers

For simplicity, we denote $x_l = \mathrm{Concat}(x_{l,1}, \ldots, x_{l,n}) \in \mathbb{R}^{nd}$ and $x_l^k = \mathrm{Concat}(x_{l,1}^k, \ldots, x_{l,n}^k) \in \mathbb{R}^{nd}$ for $k \in \{1, 2, 3, 4, 5\}$. Then in the Post-LN Transformer, the gradient of the parameters in the $l$-th layer (take $W^{2,l}$ as an example) can be written as

$$\frac{\partial \tilde{L}}{\partial W^{2,l}} = \frac{\partial \tilde{L}}{\partial x_{L+1}^{post}}\Bigg(\prod_{j=l+1}^{L}\frac{\partial x_{j+1}^{post}}{\partial x_j^{post}}\Bigg)\frac{\partial x_{l+1}^{post}}{\partial W^{2,l}},$$

where

$$\frac{\partial x_{j+1}^{post}}{\partial x_j^{post}} = \frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}}\frac{\partial x_j^{post,5}}{\partial x_j^{post,3}}\frac{\partial x_j^{post,3}}{\partial x_j^{post,2}}\frac{\partial x_j^{post,2}}{\partial x_j^{post}}.$$

The Jacobian matrices of the Post-LN Transformer layers are:

$$\frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}} = \operatorname{diag}\big(J_{LN}(x_{j,1}^{post,5}), \ldots, J_{LN}(x_{j,n}^{post,5})\big) \qquad (57)$$

$$\frac{\partial x_j^{post,5}}{\partial x_j^{post,3}} = \begin{pmatrix} I & & \\ & \ddots & \\ & & I \end{pmatrix} + \begin{pmatrix} W^{2,j} & & \\ & \ddots & \\ & & W^{2,j} \end{pmatrix}\begin{pmatrix} J_j^1 W^{1,j} & & \\ & \ddots & \\ & & J_j^n W^{1,j} \end{pmatrix} \qquad (58)$$

where

$$J_j^i = \operatorname{diag}\Big(\sigma'\big(x_{j,i}^{post,3} w_1^{1,j}\big), \ldots, \sigma'\big(x_{j,i}^{post,3} w_d^{1,j}\big)\Big) \in \mathbb{R}^{d\times d}$$

$$\frac{\partial x_j^{post,3}}{\partial x_j^{post,2}} = \operatorname{diag}\big(J_{LN}(x_{j,1}^{post,2}), \ldots, J_{LN}(x_{j,n}^{post,2})\big) \qquad (59)$$

$$\frac{\partial x_j^{post,2}}{\partial x_j^{post}} = \begin{pmatrix} I & & \\ & \ddots & \\ & & I \end{pmatrix} + \begin{pmatrix} \frac{1}{n}W^{V,j} & \cdots & \frac{1}{n}W^{V,j} \\ \vdots & \ddots & \vdots \\ \frac{1}{n}W^{V,j} & \cdots & \frac{1}{n}W^{V,j} \end{pmatrix} \qquad (60)$$

Using Hölder's inequality, we have

$$E\Big\|\frac{\partial x_{j+1}^{post}}{\partial x_j^{post}}\Big\|_2 \le E\Bigg[\Big\|\frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}}\Big\|_2\Big\|\frac{\partial x_j^{post,5}}{\partial x_j^{post,3}}\Big\|_2\Big\|\frac{\partial x_j^{post,3}}{\partial x_j^{post,2}}\Big\|_2\Big\|\frac{\partial x_j^{post,2}}{\partial x_j^{post}}\Big\|_2\Bigg] \qquad (61)$$
$$\le \sqrt{E\Bigg[\Big\|\frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}}\Big\|_2^2\Bigg]}\sqrt{E\Bigg[\Big\|\frac{\partial x_j^{post,5}}{\partial x_j^{post,3}}\Big\|_2^2\Big\|\frac{\partial x_j^{post,3}}{\partial x_j^{post,2}}\Big\|_2^2\Big\|\frac{\partial x_j^{post,2}}{\partial x_j^{post}}\Big\|_2^2\Bigg]} \qquad (62)$$

Since $\frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}} = \operatorname{diag}\big(J_{LN}(x_{j,1}^{post,5}), \ldots, J_{LN}(x_{j,n}^{post,5})\big)$, we have $\sqrt{E\big[\|\frac{\partial x_{j+1}^{post}}{\partial x_j^{post,5}}\|_2^2\big]} \approx \sqrt{E\big[\frac{d}{\|x_{j,1}^{post,5}\|_2^2}\big]} \approx \sqrt{\frac{2}{3}}$ when $\|x_{j,1}^{post,5}\|_2^2$ concentrates around its expectation $E\|x_{j,1}^{post,5}\|_2^2$, which equals $\frac{3}{2}d$ according to Lemma 2. Therefore, when we estimate the norm of $\frac{\partial \tilde{L}}{\partial W^{2,l}}$ for the Post-LN Transformer, there exists a term $O\big((\frac{2}{3})^{(L-l)/2}\big)$, which exponentially decreases as $l$ goes smaller.
Similarly, in the Pre-LN Transformer, the gradient can be written as

$$\frac{\partial \tilde{L}}{\partial W^{2,l}} = \frac{\partial \tilde{L}}{\partial x_{Final}^{pre}}\frac{\partial x_{Final}^{pre}}{\partial x_{L+1}^{pre}}\Bigg(\prod_{j=l+1}^{L}\frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre}}\Bigg)\frac{\partial x_{l+1}^{pre}}{\partial W^{2,l}},$$

where

$$\frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre}} = \frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre,3}}\frac{\partial x_j^{pre,3}}{\partial x_j^{pre}}.$$

The Jacobian matrices of the Pre-LN Transformer layers are:

$$\frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre,3}} = \operatorname{diag}(I, \ldots, I) + \operatorname{diag}\big(W^{2,j}, \ldots, W^{2,j}\big)\operatorname{diag}\big(J_1^{(h')}W^{1,j}, \ldots, J_n^{(h')}W^{1,j}\big)\operatorname{diag}\big(J_{LN}(x_{j,1}^{pre,3}), \ldots, J_{LN}(x_{j,n}^{pre,3})\big) \qquad (63)$$

$$\frac{\partial x_j^{pre,3}}{\partial x_j^{pre}} = \operatorname{diag}(I, \ldots, I) + \begin{pmatrix} \frac{1}{n}W^{V,j} & \cdots & \frac{1}{n}W^{V,j} \\ \vdots & \ddots & \vdots \\ \frac{1}{n}W^{V,j} & \cdots & \frac{1}{n}W^{V,j} \end{pmatrix}\operatorname{diag}\big(J_{LN}(x_{j,1}^{pre}), \ldots, J_{LN}(x_{j,n}^{pre})\big) \qquad (64)$$

If $l$ is sufficiently large, the norms of $J_{LN}(x_{j,i}^{pre})$ and $J_{LN}(x_{j,i}^{pre,3})$ are very small (of order $O(\frac{1}{\sqrt{j}})$) as $j$ is between $l + 1$ and $L$, which means the eigenvalues of the matrices $\frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre,3}}$ and $\frac{\partial x_j^{pre,3}}{\partial x_j^{pre}}$ are close to 1. Then we can see that $E\|\frac{\partial x_{j+1}^{pre}}{\partial x_j^{pre,3}}\|_2$ and $E\|\frac{\partial x_j^{pre,3}}{\partial x_j^{pre}}\|_2$ are nearly 1, and the norm of $\frac{\partial \tilde{L}}{\partial W^{2,l}}$ for the Pre-LN Transformer is independent of $l$ when $l$ is large.

G. Examples of $(\epsilon, \delta)$-bounded random variables

In this section we give an example of an $(\epsilon, \delta)$-bounded random variable. This example comes from Example 2.5 in (Wainwright, 2019), and we give a short description below.

If $Z = (Z_1, \ldots, Z_n)$ is a Gaussian vector with distribution $N(0, I_n)$, then $Y = \|Z\|_2^2 = \sum_{k=1}^{n} Z_k^2$ has distribution $\chi_n^2$, and $EY = \sum_{k=1}^{n} EZ_k^2 = n$.

A random variable $X$ with mean $\mu = E[X]$ is called sub-exponential if there are non-negative parameters $(\nu, \alpha)$ such that $E[\exp(\lambda(X - \mu))] \le \exp\big(\frac{\nu^2\lambda^2}{2}\big)$ for all $|\lambda| < \frac{1}{\alpha}$. The next proposition comes from Proposition 2.2 in (Wainwright, 2019).

Proposition 1 (Sub-exponential tail bound). Suppose that $X$ is sub-exponential with parameters $(\nu, \alpha)$. Then

$$P[X - \mu \ge t] \le \begin{cases} \exp\big(-\frac{t^2}{2\nu^2}\big) & \text{if } 0 \le t \le \frac{\nu^2}{\alpha}, \\ \exp\big(-\frac{t}{2\alpha}\big) & \text{for } t > \frac{\nu^2}{\alpha} \end{cases} \qquad (65)$$

and from Example 2.5 in (Wainwright, 2019), the $\chi^2$ variable $Y$ is sub-exponential with parameters $(\nu, \alpha) = (2\sqrt{n}, 4)$. So we can derive the one-sided bound

$$P[Y - n \ge n\epsilon] \le \exp(-n\epsilon^2/8), \quad \text{for all } \epsilon \in (0, 1) \qquad (66)$$

So $Y$ is $(\epsilon, \delta)$-bounded with $\epsilon \in (0, 1)$ and $\delta = \exp(-n\epsilon^2/8)$.

H. Small learning rate experiment

Theoretically, we find that the gradients of the parameters near the output layers are very large for the Post-LN Transformer, and suggest that using large learning rates on those parameters makes the training unstable. To verify whether using small-step updates mitigates the issue, we use a very small but fixed learning rate and check whether it can optimize the Post-LN Transformer (without the learning rate warm-up step) to a certain extent. In detail, we use a fixed learning rate of 1e-4 at the beginning of the optimization, which is much smaller than the lr_max = 1e-3 in the paper. Please note that as the learning rates during training are small, the training converges slowly, and this setting is not very practical in real large-scale tasks. We plot the validation curve together with other baseline approaches in Figure 6. We can see from the figure that the validation loss (pink curve) is around 4.3 after 27 epochs. This loss is much lower than that of the Post-LN Transformer trained using a large learning rate (blue curve), but it is still worse than the SOTA performance (green curve).
Figure 6. Performances of the models on the IWSLT14 De-En task. (Two panels: validation loss and BLEU versus training epochs; curves compare lr_max = 1e-3 and lr_max = 5e-4, each without warm-up and with T_warmup = 500 or T_warmup = 4000, together with the fixed lr_max = 1e-4 run without warm-up.)
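For readers who want to see the schedules being compared in Figure 6 side by side, the sketch below is a minimal illustration of ours, not the exact configuration of the experiments: a linear warm-up to lr_max followed by inverse-square-root decay, which is one common choice for Post-LN Transformers, next to the small fixed learning rate of 1e-4 probed in this section. The exact decay rule used in the reported runs may differ.

```python
def lr_warmup_inverse_sqrt(step, lr_max=1e-3, warmup=4000):
    """Linear warm-up to lr_max, then inverse-square-root decay.
    A common schedule for Post-LN Transformers; exact variants differ across codebases."""
    if step < warmup:
        return lr_max * step / warmup
    return lr_max * (warmup / step) ** 0.5

def lr_fixed(step, lr=1e-4):
    """The small, fixed learning rate probed in this section (no warm-up, no decay)."""
    return lr

for step in (1, 100, 500, 4000, 40000):
    print(step, round(lr_warmup_inverse_sqrt(step), 6), lr_fixed(step))
```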
