
Exploring Low Rank Training of Deep Neural Networks

Siddhartha Rao Kamalakara *1 2, Acyr Locatelli *2, Bharat Venkitesh *1, Jimmy Ba 3, Yarin Gal 4, Aidan N. Gomez 1 2 4

* Equal contribution. 1 Cohere, Inc., Toronto. 2 FOR.ai. 3 Department of Computer Science, University of Toronto, Toronto, Canada. 4 Department of Computer Science, University of Oxford, United Kingdom. Correspondence to: Siddhartha Rao Kamalakara <[email protected]>.

arXiv:2209.13569v1 [cs.LG] 27 Sep 2022

Abstract

Training deep neural networks in low rank, i.e. with factorised layers, is of particular interest to the community: it offers efficiency over unfactorised training in terms of both memory consumption and training time. Prior work has focused on low rank approximations of pre-trained networks and training in low rank space with additional objectives, offering various ad hoc explanations for chosen practice. We analyse techniques that work well in practice, and through extensive ablations on models such as GPT-2 we provide evidence falsifying common beliefs in the field, hinting in the process at exciting research opportunities that still need answering.

1. Introduction

Recent developments in training very large vision and language models (Brown et al., 2020; Fedus et al., 2021; Dosovitskiy et al., 2020) have led to an increasing need for efficient training paradigms. Low rank matrix factorisation of layers in a deep neural network can offer significant training speedups (up to 2x) and consumes less memory when compared to its unfactorised counterpart. While matrix factorisation has been studied extensively in the context of linear networks and their applications to matrix sensing and matrix completion problems, the effects of factorised layers on optimisation are non-trivial. Hence, prior work in this space predominantly focused on low-rank training with additional training objectives, or involved computing factorised approximations post-training. There has been limited prior work that focused on training dynamics for low rank deep neural networks.

Our contributions: we examine the recent developments in training low rank networks and question existing beliefs about why techniques like singular value decomposition (SVD) based initialisation and modified L2 regularisation are effective. We start with SVD based initialisation techniques, which have been found to be effective in both the low-rank and sparsity literature (Lee et al., 2019). We look to random matrix theory to formally define the distribution of singular values at initialisation in modern neural networks and challenge prior assumptions on their importance. We reveal novel empirical insights about the dynamics of singular values during training of an L2 regularised network and present a hypothesis about why L2 regularisation on the re-composed matrix works better than L2 regularisation on its factors. We also investigate currently held beliefs about effective step size and its correlation with performance. Moreover, we analyse and present experiments with pre-training as a strategy to train better performing low-rank networks. We present a wide array of experiments to support our arguments and to demonstrate the effectiveness and practicality of training low-rank neural networks.

[Figure 1. TPU compute hours vs performance of GPT-2 on LM1B as the model is scaled up. Each point on the line corresponds to a different model size, starting from 1024 hidden dimensions (on the top left) to 2560 (in the bottom right) with increments of 256.]

2. Background

Most works in the low rank space that focus on efficiency and speedups looked at post-hoc approximation of trained networks. (Yu et al., 2017) took an SVD-free approach to reconstruct feature maps by minimising an objective that imposes a sparse low rank structure.
(Jaderberg et al., 2014) also considered a trained network upon which a low rank structure is imposed through filter and data reconstruction objectives. (Tai et al., 2016) focused on low rank training of CNNs from scratch; they proposed a horizontal and vertical filter decomposition of a convolutional kernel and reproject into orthogonal vectors at every step. One of the reasons why prior work has focused on post-training low rank approximations is that training dynamics of neural networks are poorly understood. Moreover, it has been found that naively training in the low rank space from scratch suffers a gap in performance (Section 4). To resolve this to an extent, many recent attempts have been made to understand the implicit bias of gradient descent (GD) in matrix factorisation in both linear and non-linear networks. (Arora et al., 2019) investigated the behaviour of GD in deep linear networks and found that as the depth of factorisation increases, GD tends to find low rank solutions. They also present evidence for the hypothesis that the language of norms such as nuclear norm, Frobenius norm, etc., may not be enough to describe the behaviour of GD. (Martin & Mahoney, 2018) presented an empirical analysis of commonly used architectures and characterised the dynamics of GD in deep non-linear networks in terms of Empirical Spectral Distributions (ESD) and phases of training. They define a set of rank measures, which we use in our work to analyse low rank training juxtaposed with analysis on unfactored training. (Wang et al., 2021) used low rank training with unfactorised pretraining in the context of efficient communication in a distributed setting. (Khodak et al., 2021) proposed a low rank training procedure by investigating initialisation and regularisation in factorised layers. They analysed SVD based initialisation (Spectral Initialisation) and properties of L2 regularisation, which we study independently in our work. They conjecture that there is an interplay between normalisation and weight decay and formalise this behaviour through factorised update equations.

3. Low Rank Training

In this section, we present the formulation we choose for factorising layers. We discuss and critique the assumptions and conjectures associated with the low rank formulation in the context of SVD initialisation and L2 regularisation.

3.1. Factorisation

In all our experiments and analyses, we factorise a weight matrix W at each layer into two components U and V such that W = U V^T.

We focus on a factorisation depth of 2, taking into consideration memory-speedup tradeoffs: as the depth of factorisation at each layer increases, more activations need to be stored in memory for backpropagation. A depth of two provides speedups across all our experiments while ensuring minimal activation memory overhead.

Consider the difference between the vanilla gradient descent update (unfactorised), W_{t+1} = W_t − α ∇_W, and the update performed in the factorised setting:

W_{t+1} = U_{t+1} V_{t+1}^T
        = (U_t − α ∇_U)(V_t − α ∇_V)^T
        = W_t − α (∇_{W_t} V_t V_t^T + U_t U_t^T ∇_{W_t}) + α^2 ∇_{W_t} W_t^T ∇_{W_t},    (1)

where we write ∇_t for the bracketed first-order term.

(Khodak et al., 2021) extend the update equation above to normalised layers. Most modern architectures rely on normalisation layers to train networks that generalise well. This includes batch normalisation (Ioffe & Szegedy, 2015) in ResNets and layer normalisation (Ba et al., 2016) in Transformers. We refer the reader to (Khodak et al., 2021) for a more detailed discussion on the type and role of normalisation in factorised layers and use their formulation of the normalised update equation, which is given by

ŵ_{t+1} = ŵ_t − (α / ‖W‖_F^2) (I_{mn} − ŵ_t ŵ_t^T) vec(∇̂_t) + O(α^2),    (2)

where ∇̂_t is ∇_t with gradients taken with respect to the normalised weight matrix Ŵ = W / ‖W‖_F and ŵ = vec(Ŵ).

We see that gradient descent in the factorised setting does not perfectly align with the vanilla gradient descent update. In the subsequent sections, we empirically explore and work to overcome the implicit biases of this factorised update so that we can make low rank training an effective and efficient training method.
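To make the mismatch in Equation 1 concrete, the following NumPy sketch (ours, not from the paper; the shapes and step size are arbitrary) takes one gradient step on the factors and checks that the recomposed matrix follows Equation 1 rather than the vanilla update:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, alpha = 64, 32, 8, 0.1

U = rng.normal(size=(m, r))
V = rng.normal(size=(n, r))
W = U @ V.T
grad_W = rng.normal(size=(m, n))      # stand-in for dL/dW at this step

# Chain rule: gradients with respect to the factors of W = U V^T
grad_U = grad_W @ V                   # dL/dU
grad_V = grad_W.T @ U                 # dL/dV

# One step on the factors, then recompose
W_factorised = (U - alpha * grad_U) @ (V - alpha * grad_V).T

# Equation 1: first-order term nabla_t plus the alpha^2 correction
nabla_t = grad_W @ V @ V.T + U @ U.T @ grad_W
W_eq1 = W - alpha * nabla_t + alpha**2 * grad_W @ W.T @ grad_W

# The factorised step matches Eq. 1 exactly, but differs from the vanilla step
W_vanilla = W - alpha * grad_W
print(np.allclose(W_factorised, W_eq1))            # True
print(np.linalg.norm(W_factorised - W_vanilla))    # non-zero gap
```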
3.1.1. Fully Connected Layer

Let W ∈ R^{m×n} be the weight matrix of a fully-connected layer. We factorise W as W = U V^T with U ∈ R^{m×r} and V^T ∈ R^{r×n}, where 0 < r ≤ min(m, n). At inference, when r < mn/(m+n), factorising the fully connected weight matrix reduces both the memory footprint and the floating point operations (flops) from O(mn) to O(mr + rn). For training, the memory requirements change from O(mn + n) to O(mr + rn + n + r), as we need to store the intermediate activations for backpropagation.
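As a quick illustration of the crossover point r < mn/(m+n), a small helper (ours, purely illustrative) that reports the parameter and flop ratios of a factorised fully connected layer:

```python
def dense_factorisation_savings(m: int, n: int, r: int) -> dict:
    """Compare an m x n dense layer with its rank-r factorisation U (m x r), V^T (r x n)."""
    breakeven = m * n / (m + n)            # factorising helps only when r is below this
    return {
        "breakeven_rank": breakeven,
        "saves": r < breakeven,
        "param_ratio": (m * r + r * n) / (m * n),
        "flop_ratio": (m * r + r * n) / (m * n),   # two thin matmuls vs one full matmul
    }

# Example: a 1024 x 4096 weight at roughly 62.5% of min(m, n)
print(dense_factorisation_savings(1024, 4096, int(0.625 * 1024)))
```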
3.1.2. Convolutional Layer

We factorise convolution kernels in a way that supports rewriting the single convolution as two convolutions. We choose to factorise the convolutional kernel W ∈ R^{h×w×c_in×c_out} as W = U V^T with U ∈ R^{h×w×c_in×r} and V^T ∈ R^{1×1×r×c_out}, where h, w represent the kernel height and width respectively, c_in and c_out represent the number of input and output channels respectively, and r represents the rank of the decomposition. In the low-rank decomposition, r ≤ min(h × w × c_in, c_out). This leads to a reduction in flops from O(h w c_in c_out) to O(h w c_in r + r c_out).
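The paper's implementation is in TensorFlow; as a framework-agnostic illustration of the same decomposition, here is a PyTorch-style sketch (ours; the module name and defaults are hypothetical) that rewrites one convolution as an h x w convolution into r channels followed by a 1 x 1 convolution:

```python
import torch
from torch import nn

class LowRankConv2d(nn.Module):
    """Sketch of a factorised convolution: an h x w conv to r channels,
    followed by a 1 x 1 conv to c_out channels (illustrative, not the paper's code)."""

    def __init__(self, c_in, c_out, kernel_size, rank, stride=1, padding=0):
        super().__init__()
        # U: spatial convolution into the low-rank space
        self.u = nn.Conv2d(c_in, rank, kernel_size, stride=stride,
                           padding=padding, bias=False)
        # V^T: pointwise (1x1) convolution back to the output channels
        self.v = nn.Conv2d(rank, c_out, kernel_size=1, bias=True)

    def forward(self, x):
        return self.v(self.u(x))

x = torch.randn(8, 64, 32, 32)
layer = LowRankConv2d(c_in=64, c_out=128, kernel_size=3, rank=32, padding=1)
print(layer(x).shape)  # torch.Size([8, 128, 32, 32])
```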
3.2. Spectral Initialisation

(Khodak et al., 2021) investigated the usefulness of spectral initialisation in low rank formulations of deep learning architectures and proposed a few hypotheses for why it works. We use the same truncated SVD initialisation scheme, which is defined as follows:

SVD_r(W) = Û_{:r} Σ_r V̂_{:r}^T,    (3)
U = Û_{:r} Σ_r^{1/2},
V = V̂_{:r} Σ_r^{1/2},

where W is a matrix of shape N × M, U of shape N × r, V of shape M × r, Σ is the diagonal matrix of singular values and r is the rank we choose for the factorisation. We note that U and V are rectangular matrices unless specified otherwise.
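A minimal NumPy sketch of Equation 3 (ours; the function names are illustrative), including the "spectral ones" variant used in the ablation of Section 4.1:

```python
import numpy as np

def spectral_init(W: np.ndarray, r: int):
    """Truncated-SVD (spectral) initialisation of the factors of W ≈ U V^T (Eq. 3)."""
    U_hat, s, Vt_hat = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    U = U_hat[:, :r] * sqrt_s          # N x r
    V = Vt_hat[:r, :].T * sqrt_s       # M x r
    return U, V

def spectral_ones_init(W: np.ndarray, r: int):
    """Ablation from Section 4.1: keep the singular vectors, set singular values to one."""
    U_hat, _, Vt_hat = np.linalg.svd(W, full_matrices=False)
    return U_hat[:, :r], Vt_hat[:r, :].T

W = np.random.default_rng(0).normal(size=(256, 128)) / np.sqrt(128)
U, V = spectral_init(W, r=32)
print(np.linalg.norm(W - U @ V.T))  # error of the best rank-32 approximation
```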
(Khodak et al., 2021) analysed SVD based initialisation in the context of the update Equation 1 and provide two hypotheses for why this technique works, both of which we disprove.

• U_0 U_0^T = V_0 V_0^T = Σ_r. In the low rank context, U and V are rectangular matrices obtained from truncated SVD, which makes U and V column-wise orthogonal matrices. Therefore, U U^T and V V^T cannot be equal to Σ_r, and the ∇_{W_t} V_t V_t^T + U_t U_t^T ∇_{W_t} terms in Equation 1 cannot be simplified.

• The singular values of a Gaussian ensemble of scale 1/√n are roughly distributed around 1. We look to Marchenko-Pastur theory (described in Appendix A.1) to understand the distribution of singular values of a Gaussian ensemble matrix of size N × M, which states that the distribution of singular values depends on the scale of the random initialisation σ^2 and the size ratio N/M of the layer.

We believe that spectral initialisation works for reasons other than the ones stated in prior work. In Section 4.1, we present an ablation experiment that hints at why this initialisation scheme performs better.

3.3. L2 Regularisation

Many architectures rely on L2 regularisation for better generalisation. The straightforward approach to impose L2 regularisation in a factorised network is to apply the Frobenius norm penalty to the factors U and V, that is, (λ/2)(‖U‖_F^2 + ‖V‖_F^2). (Srebro & Shraibman, 2005) showed that this penalty actually minimises the nuclear norm of the recomposed matrix U V^T.

To address this, (Khodak et al., 2021) propose penalising the Frobenius norm of the recomposed matrix U V^T, which they refer to as Frobenius decay. They argue that Frobenius decay helps keep the effective step size high throughout training, where the effective step size is the term η/‖W‖_F^2 in Equation 2. We show, through an ablation study, that effective step size is an inadequate argument to justify the effectiveness of Frobenius decay over L2 regularisation. We point out that the dynamics of low-rank training with L2 regularisation cannot be understood by only considering the normalised update Equation 2: this ignores the ηλ ≈ O(η^2) terms arising from the Frobenius norm penalty, which have a non-trivial impact on the optimisation. We find that the effectiveness of Frobenius decay over L2 regularisation can be better explained by examining the effective rank of the network. We use the rank measure proposed in (Martin & Mahoney, 2018), which defines the effective rank of a matrix W to be ‖W‖_* / ‖W‖_op, that is, the ratio between the nuclear norm and the operator norm. In our case, we are interested in the effective rank of U V^T.
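For reference, a short sketch (ours) of the two penalties on a single factorised layer, as they would enter the training loss, together with the effective rank measure used above:

```python
import numpy as np

def l2_on_factors(U: np.ndarray, V: np.ndarray, lam: float) -> float:
    """Naive L2 regularisation of the factors: (lam / 2) (||U||_F^2 + ||V||_F^2)."""
    return 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

def frobenius_decay(U: np.ndarray, V: np.ndarray, lam: float) -> float:
    """Frobenius decay (Khodak et al., 2021): penalise the recomposed matrix U V^T."""
    return 0.5 * lam * np.sum((U @ V.T) ** 2)

def effective_rank(W: np.ndarray) -> float:
    """Effective rank of Martin & Mahoney (2018): nuclear norm / operator norm."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(s.sum() / s.max())
```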
3.4. Pre-training

The initial stages of training are widely believed to be important for good performance in neural networks (Achille et al., 2017; Frankle et al., 2019a). This motivates us to explore training for a fraction of the total training steps in the unfactorised space before switching to low rank substitutions of these unfactorised layers. We apply the truncated SVD scheme described in Equation 3 to the partially trained weights to obtain the factors of the layer. Section 4.3 describes the impact of pre-training on performance across our vision and language experiments and analyses the nature of the solutions found with pre-training when compared to solutions found by low rank networks trained from scratch (Evci et al., 2019; Frankle et al., 2019b).
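A sketch of this switch-over (ours; the surrounding training functions are hypothetical placeholders), applying the truncated SVD of Equation 3 to the partially trained weight before continuing in the factorised space:

```python
import numpy as np

def factorise_pretrained(W: np.ndarray, r: int):
    """Split a partially trained weight W into factors (U, V) via truncated SVD (Eq. 3)."""
    U_hat, s, Vt_hat = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    return U_hat[:, :r] * sqrt_s, Vt_hat[:r, :].T * sqrt_s

# Outline of the schedule used in Section 4.3 (placeholder function names):
# W = pretrain_unfactorised(num_steps=40_000)
# U, V = factorise_pretrained(W, r)
# train_factorised(U, V, num_steps=200_000)
```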
4. Experiments and Results

We conduct extensive experiments on both vision and language models. For vision models, we use a Wide-ResNet-28 (Zagoruyko & Komodakis, 2016) on CIFAR-100 and a ResNet-50 (He et al., 2015) on the ImageNet dataset. For the language modelling task, we conduct experiments on the One Billion Word Benchmark dataset (LM1B) (Chelba et al., 2013) and use the GPT-2 (Radford et al., 2019) architecture. Details on our complete experimental setup can be found in Appendix A.2. In the following sections, we compare different initialisation schemes and study the effects of L2 regularisation and Frobenius decay. Finally, we demonstrate the effectiveness of pre-training and analyse the nature of the solutions it finds.

4.1. Initialisation

We show that spectral initialisation offers equivalent performance when compared to traditional initialisation schemes. Then, we show empirically that the singular values do not play a major role in improving performance and that it is the direction of the singular vectors that matters. This finding is in contrast with prior beliefs (Khodak et al., 2021) about the role of singular values in retaining the scale of initialisation. We establish this by setting the singular values to ones in Equation 3. Tables 2, 3 and 4 compare the results across initialisation schemes on CIFAR-100, ImageNet and LM1B respectively. We observe that spectral ones leads to better accuracy on CIFAR-100, lower perplexity on LM1B and commensurate performance on ImageNet.

4.2. L2 Regularisation

We investigate the effective step size hypothesis by training two networks, one with learning rate η and the other with η/2. The effective step sizes of these networks are η/‖W‖_F^2 and η/(2‖W‖_F^2) respectively, based on Equation 2. If the hypothesis that a higher effective step size leads to better performance were true, halving the effective step size should lead to lower performance; instead, we find that η/2 leads to models that are at least as good as models trained with learning rate η.

Tables 5, 6 and 7 compare the impact of effective step size on performance across CIFAR-100, ImageNet and LM1B respectively. Analysing the evolution of singular values in networks trained with L2 regularisation and Frobenius decay revealed that singular values are disproportionately affected in the case of L2 regularisation. We observe a "rich get richer, poor get poorer" phenomenon in L2 regularised networks, which causes the effective rank ‖U V^T‖_* / ‖U V^T‖_op of the network to drop because of the disproportionate increase in the operator norm of each layer. We report the averaged (across layers) effective rank at the end of training for our experiments in Table 1.

Model         Dataset     Frobenius decay   L2
WRN           CIFAR-100   39.87             16.4
ResNet-50     ImageNet    68.72             58.00
Transformer   LM1B        206.93            205.70

Table 1. Effective rank measures for different models
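The layer-averaged effective rank reported in Table 1 can be computed directly from the factors; a sketch (ours), reusing the measure from Section 3.3 and assuming the factorised weights are available as (U, V) pairs:

```python
import numpy as np

def effective_rank(W: np.ndarray) -> float:
    s = np.linalg.svd(W, compute_uv=False)
    return float(s.sum() / s.max())          # ||W||_* / ||W||_op

def mean_effective_rank(factorised_layers) -> float:
    """Average effective rank of U V^T over all factorised layers of a network."""
    return float(np.mean([effective_rank(U @ V.T) for U, V in factorised_layers]))
```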
4.3. Pre-training

We investigate pre-training networks for a fraction of the total training steps and observe that this leads to significantly improved performance in our language model experiments, as shown in Figures 1 and 3, when we scale up the model. We pre-train in the unfactorised space for 40,000 steps and continue training in the factorised space for 200,000 steps. We combine pre-training with the techniques mentioned above, namely Frobenius decay and resuming with decompositions obtained from Spectral and Spectral ones initialisation, as described in Section 3.4. We find that pre-training does not offer improved performance compared to low-rank networks trained from scratch in our vision experiments, as shown in Tables 8 and 9. Furthermore, we notice that the solutions found with pre-training are closer in parameter space to their corresponding baseline (unfactorised) models. We demonstrate this by performing linear interpolation, shown in Figures 2, 4 and 5, between pre-trained and baseline weights using the following equation: θ = (1 − t) θ_b + t θ_l for t ∈ [0.0, 1.0] with increments of 0.1, where t is the interpolation coefficient, θ_b is the parameter from the baseline model and θ_l is the parameter from the low rank model with pre-training.

[Figure 2. Comparison of interpolation of low rank and pre-trained networks for ResNet-50 on ImageNet with a rank of 50%.]
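The interpolation itself is simple to reproduce; a sketch (ours), assuming the low rank factors have already been recomposed so that both parameter lists have matching shapes, and with the model loading and evaluation helpers left as hypothetical placeholders:

```python
import numpy as np

def interpolate_params(theta_b, theta_l, t: float):
    """theta = (1 - t) * theta_b + t * theta_l, applied parameter-wise.
    theta_b: baseline (unfactorised) parameters; theta_l: low rank model with pre-training."""
    return [(1.0 - t) * b + t * l for b, l in zip(theta_b, theta_l)]

# for t in np.linspace(0.0, 1.0, 11):            # t = 0.0, 0.1, ..., 1.0
#     params = interpolate_params(theta_b, theta_l, t)
#     evaluate(load_model(params))               # hypothetical helpers
```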
5. Conclusion

We demonstrated empirically that Spectral initialisation and L2 regularisation on U V^T improve low-rank training but are poorly understood. We presented singular value analyses and ablation studies that act as counter-examples to prior beliefs about why these techniques work. We hope to put forth the theoretical reasons behind the effectiveness of these techniques in a future work. Additionally, we demonstrated pretraining as an effective strategy to improve low-rank performance and presented insights on the nature of solutions found by networks with pretraining.

References

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep neural networks. CoRR, abs/1711.08856, 2017. URL http://arxiv.org/abs/1711.08856.

Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization, 2019.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization, 2016.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., and Koehn, P. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020. URL https://arxiv.org/abs/2010.11929.

Evci, U., Pedregosa, F., Gomez, A. N., and Elsen, E. The difficulty of training sparse neural networks. CoRR, abs/1906.10732, 2019. URL http://arxiv.org/abs/1906.10732.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.

Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019a. URL http://arxiv.org/abs/1903.01611.

Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. CoRR, abs/1912.05671, 2019b. URL http://arxiv.org/abs/1912.05671.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions, 2014.

Khodak, M., Tenenholtz, N. A., Mackey, L., and Fusi, N. Initialization and regularization of factorized neural layers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=KTlJT1nof6d.

Lee, N., Ajanthan, T., Gould, S., and Torr, P. H. S. A signal propagation perspective for pruning neural networks at initialization. CoRR, abs/1906.06307, 2019. URL http://arxiv.org/abs/1906.06307.

Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Srebro, N. and Shraibman, A. Rank, trace-norm and max-norm. In Auer, P. and Meir, R. (eds.), Learning Theory, pp. 545–560, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31892-7.

Tai, C., Xiao, T., Zhang, Y., Wang, X., and E, W. Convolutional neural networks with low-rank regularization, 2016.

Wang, H., Agarwal, S., and Papailiopoulos, D. Pufferfish: Communication-efficient models at no extra cost, 2021.

Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. pp. 67–76, 2017. doi: 10.1109/CVPR.2017.15.

Zagoruyko, S. and Komodakis, N. Wide residual networks. CoRR, abs/1605.07146, 2016. URL http://arxiv.org/abs/1605.07146.
A. Appendix

A.1. Marchenko-Pastur Theory

Marchenko-Pastur (MP) theory defines the distribution of singular values of Gaussian random matrices in the infinite limit, but is applicable to finite matrices with very reasonable error bounds. MP theory defines the distribution as:

ρ(λ) = (N / (2πσ^2 M)) · √((λ^+ − λ)(λ − λ^−)) / λ   if λ ∈ [λ^−, λ^+], and 0 otherwise,    (4)

λ^± = σ^2 (1 ± √(M/N))^2.    (5)
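A small NumPy check (ours; the layer shape and scale are arbitrary choices) that the empirical spectrum of a random Gaussian layer respects the Marchenko-Pastur support of Equations 4 and 5:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4096, 1024
sigma = 1.0 / np.sqrt(N)                 # the 1/sqrt(n) initialisation scale from Section 3.2

W = sigma * rng.normal(size=(N, M))
eigs = np.linalg.svd(W, compute_uv=False) ** 2 / N    # eigenvalues of W^T W / N

lam_minus = sigma**2 * (1 - np.sqrt(M / N)) ** 2
lam_plus = sigma**2 * (1 + np.sqrt(M / N)) ** 2

# The empirical spectral edges sit close to the MP bounds [lambda-, lambda+]
print(eigs.min(), lam_minus)
print(eigs.max(), lam_plus)
```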

A.2. Experiment Details

For the language modelling task, we conduct our experiments on the One Billion Word Benchmark dataset (LM1B) (Chelba et al., 2013) and use the following setup: the input sequence length is fixed at 256 tokens for training and 1152 tokens for evaluation, the vocabulary is limited to 32K subwords, and all models are trained for 240K steps. We implemented the transformer language model in TensorFlow and run all our experiments on Cloud TPUs. To save compute and memory, we combine the query, key and value generation into one weight matrix. For each transformer layer, we decompose three matrix operations: the Q, K, V generation and the two fully connected layers. We skip factorising the output projection layer and the combiner layer that combines the outputs of attention (this is a square matrix and we see memory and compute benefits only for very small ranks). For all transformer runs, we choose a rank of 62.5% and halve the baseline learning rate. For pre-training, we train unfactorised for 40K steps, then switch to low rank factorised training for the remaining 200K steps, halving the learning rate.

For the image classification task, we conduct experiments with CIFAR-100 and ImageNet. For CIFAR-100, we use the standard training/test split with a simple augmentation scheme (random crops and horizontal flips). We train a WideResNet-28 (Zagoruyko & Komodakis, 2016) for 200 epochs with SGD with momentum (0.9) and a batch size of 128. For regularisation, we use a weight decay coefficient of 5e-4 and no dropout. For the low rank training runs, we factorised every convolutional layer other than the first according to the factorisation scheme described above and the chosen rank. For ImageNet experiments, we use a standard ResNet-50 architecture and train on a TPU v2-8 with a per-core batch size of 128, following the same hyperparameters and learning rate schedule described in (He et al., 2015).

A.3. Initialization Results

Rank             Initialisation   Accuracy
Baseline (N/A)   He               81.08
0.1              He               77.94
                 spectral         79.84
                 spectral ones    79.07
0.2              He               80.37
                 spectral         81.35
                 spectral ones    81.27
0.3              He               80.87
                 spectral         81.53
                 spectral ones    81.61

Table 2. Initialization results of Wide ResNets on CIFAR-100

Rank             Initialisation   Top-1   Top-5
Baseline (N/A)   He               76.39   93.21
0.3              He               75.26   92.56
                 spectral         75.77   92.87
                 spectral ones    75.71   92.82
0.5              He               75.97   92.84
                 spectral         76.13   93.09
                 spectral ones    75.98   92.97

Table 3. Initialization results of ResNet on ImageNet

Rank             Initialisation   Perplexity
Baseline (N/A)   He               37.67
0.62             He               39.6
                 spectral         38.78
                 spectral ones    38.47

Table 4. Initialization results of Transformers on LM1B

A.4. Regularization Results

Rank   Regularisation    lr scaling   Accuracy
0.1    L2                0.5          73.12
       L2                1.0          72.59
       Frobenius Decay   0.5          79.84
       Frobenius Decay   1.0          79.79
0.2    L2                0.5          78.22
       L2                1.0          77.56
       Frobenius Decay   0.5          81.35
       Frobenius Decay   1.0          81.61

Table 5. Comparison between Frobenius Decay and L2 regularisation on CIFAR-100
Rank   Regularisation    lr scaling   Top-1   Top-5
0.3    L2                0.5          75.11   92.42
       L2                1.0          74.9    92.24
       Frobenius Decay   0.5          75.22   92.49
       Frobenius Decay   1.0          75.77   92.87
0.5    L2                0.5          75.04   92.36
       L2                1.0          74.83   92.25
       Frobenius Decay   0.5          75.97   92.85
       Frobenius Decay   1.0          76.13   93.09

Table 6. Comparison between Frobenius Decay and L2 regularisation on ImageNet

Rank   Regularisation    lr scaling   Perplexity
0.62   L2                0.5          38.87
       L2                1.0          39.01
       Frobenius Decay   0.5          38.78
       Frobenius Decay   1.0          39.2

Table 7. Comparison between Frobenius Decay and L2 regularisation on LM1B

[Figure 3. Total parameters vs performance of GPT-2 on LM1B as the model is scaled up. Each point on the line corresponds to a different model size, starting from 1024 hidden dimensions (on the top left) to 2560 (in the bottom right) with increments of 256.]

A.5. Pre-training Results

Rank   Pre-training Epochs   Accuracy
0.2    0                     81.35
       15                    81.33
       30                    81.56
       40                    81.53
       50                    81.39
       75                    81.53
0.3    0                     81.53
       15                    81.73
       30                    81.51
       40                    81.67
       50                    82.0
       75                    81.44

Table 8. Pre-training results for Wide ResNets on CIFAR-100

[Figure 4. Comparison of interpolation of low rank and pre-trained networks for WideResNet-28 on CIFAR-100 with a rank of 30%.]

Rank   # Pretrain Epochs   Top-1   Top-5
0.5    5                   76.07   92.88
       10                  75.96   93.04
       15                  76.12   92.96
       20                  76.08   92.94
       25                  76.15   93.00
       30                  76.05   92.9
       35                  76.24   93.06
       40                  76.21   93.09
       45                  76.29   93.12

Table 9. Pre-training results for ResNet50 on ImageNet

[Figure 5. Comparison of interpolation of low rank and pretrained networks for the transformer LM.]
