0% found this document useful (0 votes)

11 views

Maestro: Uncovering Low-Rank Structures Via Trainable Decomposition

1. The document proposes MAESTRO, a framework for training deep neural networks to have low-rank structures. It does this by building the low-rank structure directly into the training process through an extension of Ordered Dropout, rather than regularly applying decompositions like SVD. 2. MAESTRO imposes an importance ordering on the decomposed DNN structure by sampling during training. This allows it to recover SVD and PCA decompositions under certain conditions. 3. The authors apply MAESTRO to DNNs and show it enables extracting lower footprint models that preserve performance while allowing graceful accuracy-latency tradeoffs for different devices.

Uploaded by

Yuriy Kochura

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views

Maestro: Uncovering Low-Rank Structures Via Trainable Decomposition

Uploaded by

Yuriy Kochura

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

Maestro: Uncovering Low-Rank Structures via

Trainable Decomposition

Samuel Horvath Stefanos Laskaridis

MBZUAI Brave Software
[email protected] [email protected]
arXiv:2308.14929v1 [cs.LG] 28 Aug 2023

Shashank Rajput Hongyi Wang

University of Wisconsin-Madison CMU
[email protected] [email protected]

Abstract
Deep Neural Networks (DNNs) have been a large driver and enabler for AI break-
throughs in recent years. These models have been getting larger in their attempt to
become more accurate and tackle new upcoming use-cases, including AR/VR and
intelligent assistants. However, the training process of such large models is a costly
and time-consuming process, which typically yields a single model to fit all targets.
To mitigate this, various techniques have been proposed in the literature, including
pruning, sparsification or quantization of the model weights and updates. While
able to achieve high compression rates, they often incur computational overheads
or accuracy penalties. Alternatively, factorization methods have been leveraged
to incorporate low-rank compression in the training process. Similarly, such tech-
niques (e.g., SVD) frequently rely on the computationally expensive decomposition
of layers and are potentially sub-optimal for non-linear models, such as DNNs.
In this work, we take a further step in designing efficient low-rank models and
propose M AESTRO, a framework for trainable low-rank layers. Instead of regularly
applying a priori decompositions such as SVD, the low-rank structure is built
into the training process through a generalized variant of Ordered Dropout. This
method imposes an importance ordering via sampling on the decomposed DNN
structure. Our theoretical analysis demonstrates that our method recovers the SVD
decomposition of linear mapping on uniformly distributed data and PCA for linear
autoencoders. We further apply our technique on DNNs and empirically illustrate
that M AESTRO enables the extraction of lower footprint models that preserve
model performance while allowing for graceful accuracy-latency tradeoff for the
deployment to devices of different capabilities.

1 Introduction
Deep Learning has been experiencing an unprecedented uptake, with models achieving a
(super-)human level of performance in several tasks across modalities, giving birth to even more in-
telligent assistants and next-gen visual perception and generation systems. However, the price of this
performance is that models are getting significantly larger, with training and deployment becoming
increasingly costly. Therefore, techniques from Efficient ML become evermore relevant [27], and a
requirement for deployment in constrained devices, such as smartphones or IoT devices.
Typical techniques to compress the network involve i) quantization, i.e., reducing precision of the
model [52] or communicated updates [45, 2], ii) pruning the model during training, e.g., through
Lottery Ticket Hypothesis (LTH) [11], iii) sparsification of the network representation and updates,

Preprint. Under review.

i.e., dropping the subset of coordinates [49, 3] or iv) low-rank approximation [53, 9], i.e. keeping the
most relevant ranks of the decomposed network. Despite the benefits during deployment, that is a
lower footprint model, in many cases, the overhead during training time or the accuracy degradation
can be non-negligible. Moreover, many techniques can introduce mutliple hyperparameters or the
need to fine-tune to recover the lost accuracy.
In this work, we focus on training low-rank factorized models. Specifically, we pinpoint the challenges
of techniques [53, 54] when decomposing the parameters of each layer in low-rank space and the need
to find the optimal ranks for each one at training time. To solve this, we adopt and non-trivially extend
the Ordered Dropout technique from [17] and apply it to find progressively the optimal decomposition
for each layer of a network while training (Fig. 1). Critical differences to prior work include i) the
non-uniformity of the search space (i.e. we allow for different ranks per layer), ii) the trainable aspect
of the decomposition to reflect the data distribution, and iii) the gains to training and deployment time
without sacrificing accuracy. Nevertheless, we also provide a latency-accuracy trade-off mechanism
to deploy the network on even more constrained devices.
Original Mapping Factorized Mapping Ordered Representation Low-rank Approximation

U:k
A U V⊤ U V⊤ m×k V:k⊤
m×n m×r r×n m×r r×n
k×n
Factorization Ordered Dropout Pruning

selected rank b
layer 
i ∈ [1,d ]
Sample
(i, b) ∼ {{(i, b)}rb=1
i
}di=1 maximal rank ri=k
layer 
i+1

Figure 1: M AESTRO’s construction. To obtain low-rank approximation, the given linear map is decomposed
and trained with ordered dropout to obtain an ordered representation that can be efficiently pruned.

Our contributions can be summarized as follows:

• We propose M AESTRO, a novel layer decomposition technique that enables learning low-rank layers
in a progressive manner while training. We novelly fuse layer factorization and an extended variant
of the ordered dropout, by embedding OD directly into the factorized weights. By decomposing
layers and training on stochastically sampled low-rank models, we apply ordered importance
decomposed representation of each layer. We combine this with a hierarchical group-lasso
term [64] in the loss function to zero out redundant ranks and progressively shrink the rank space.
This way, we enable computationally efficient training achieved by the proposed decomposition
without relying on inexact and potentially computationally expensive decompositions such as SVD.
• M AESTRO is a theoretically motivated approach that embeds decomposition into training. First,
we show that our new objective is able to recover i) the SVD of the target linear mapping for the
particular case of uniform data distribution and ii) the Principal Component Analysis (PCA) of the
data in the case of identity mapping.
• As M AESTRO’s decomposition is part of the training procedure, it also accounts for data distribution
and the target function, contrary to SVD, which operates directly on learned weights. We show
that this problem already arises for a simple linear model and empirically generalize our results in
the case of DNNs, by applying our method to different types of layers (including fully-connected,
convolutional, and attention) spanning across three datasets and modalities. We illustrate that our
technique achieves better results than SVD-based baselines at a lower cost.

2 Related work
The topic of Efficient ML has received a lot of attention throughout the past decade as networks
have been getting increasingly computationally expensive. Towards this end, we distinguish between
training and deployment time, with the latter having a more significant impact and thus amortizes
the potential overhead during training. Nevertheless, with the advent of Federated Learning [36],
efficient training becomes increasingly relevant to remain tractable.
Efficient inference. For efficient deployment, there have been proposed various techniques that either
optimize the architecture of the DNN in a hand-crafted [19] or automated manner (i.e. NAS) [50],
they remove redundant computation by means of pruning parts of the network [12, 6, 11, 48, 30, 55,

2
21, 55, 65, 15, 59, 33, 62] or utilise low-precision representation [52] of the neurons and activations.
Closer to our method, there have been techniques leveraging low-rank approximation (e.g. SVD)
for efficient inference [58, 43, 22, 56, 9]. Last, there is a category of techniques that dynamically
resize the network at runtime for compute, memory or energy efficiency, based on early-exiting [26]
or dynamic-width [63] and leverage the accuracy-latency tradeoff.
Efficient training. On the other hand, techniques for efficient training become very relevant nowadays
when scaling DNNs sizes [20] or deploying to embedded devices [32], and oftentimes offer additional
gains at deployment time. Towards this goal, there have been employed methods where part of
the network is masked [46] or dropped [1, 5] during training, with the goal of minimizing the
training footprint. Similarly to early-exiting, there have been proposed multi-exit variants for efficient
training [24, 34], and the same applies for width-based scaling [17, 8]. Last but not least, in the era of
transformers and LLMs, where networks have scaled exponentially in size, PEFT-based techniques,
such as adapter-based fine-tuning [18] (such as LoRA [20]), become increasingly important and make
an important differentiator for tackling downstream tasks.
Learning ordered representation. Originally, Ordered Dropout (OD) was proposed as a mechanism
for importance-based pruning for the easy extraction of sub-networks devised to allow for heteroge-
neous federated training [17]. The earlier work that aims to learn ordered representation includes a
similar technique to OD—Nested Dropout, which proposed a similar construction, applied to the
representation layer in autoencoders [42] to enforce identifiability of the learned representation or
the last layer of the feature extractor [16] to learn an ordered set of features for transfer learning.
We leverage and non-trivially extend OD in our technique as a means to order ranks in terms of
importance in a nested manner during training of a decomposed network that is progressively shrunk
as redundant ranks converge to 0. Ranks selection is ensured through hierarchical group lasso penalty,
as described in Sec. 3.3. Moreover, contrary to [17], which assumed a uniform width, our formulation
allows for heterogeneous ranks per layer. Last, we leverage the ordered representation of ranks at
inference time to further compress the model, allowing a graceful degradation of performance as a
mechanism for the accuracy-latency trade-off.

3 M AESTRO
In this work, we focus on low-rank models as a technique to reduce the computational complexity and
memory requirements of the neural network model. The main challenge that we face is the selection
of the optimal rank or the trade-off between the efficiency and the rank for the given layer represented
by linear mapping. Therefore, we devise an importance-based training technique, M AESTRO, which
not only learns a mapping between features and responses, but also learns the decomposition of the
trained network. This is achieved by factorizing all the layers in the network.

3.1 Formulation

Low-rank approximation. Our inspiration comes from the low-rank matrix approximation of a
matrix A ∈ Rm×n . For simplicity, we assume that A has rank r = min{m, n} with k ≤ r distinct
non-zero singular values σ̃1 > σ̃2 > . . . > σ̃k > 0, with corresponding left and right singular vectors
ũ1 , ũ2 , . . . , ũk ∈ Rm and ṽ1 , ṽ2 , . . . , ṽk ∈ Rn , respectively. For such a matrix, we can rewrite its
best l-rank approximation as the following minimization problem
2
X l
min ui vi⊤ − A (1)

U ∈Rm×l ,V ∈Rn×l
i=1 F
where ci denotes the i-th row of matrix C and ∥·∥F denotes Frobenius norm. We note that Prob-
lem (1) is non-convex and non-smooth. However, [60] showed that the randomly initialized gradient
descent algorithm solves this problem in polynomial time. In this work, we consider the best rank
approximation across all the ranks that leads us to the following objective
r
1 X U:b V:b⊤ − A 2 ,

min F (2)
U ∈R m×r ,V ∈Rn×r r
b=1
where C:b denotes the first b columns of matrix C. This objective, up to scaling, recovers SVD
of A exactly, and for the case of distinct non-zero singular values, the solution is, up to scaling,
unique [17]. This formulation, however, does not account for the data distribution, i.e., it cannot tailor
the decomposition to capture specific structures that appear in the dataset.

3
Data-dependent low-rank approximation. Therefore, the next step of our construction is to
extend this problem formulation with data that can further improve compression, reconstruction, and
generalization, and incorporate domain knowledge. We assume that data comes from the distribution
x ∼ X centered around zero, i.e., Ex∼X [x] = 0.1 , and the response is given by y = Ax. In this
particular case, we can write the training loss as " #
r
X 1 ⊤
2
min Ex,y∼X U:b V:b x − y . (3)
U ∈Rm×r ,V ∈Rn×r r
b=1
It is important to note that the introduced problem formulation (3) is the same as the Ordered Dropout
formulation of [17] for the neural network with a single hidden layer and no activations, and it can be
solved using stochastic algorithms by sampling from the data distribution X (subsampling) and rank
distribution D. However, there is an important distinction when we apply M AESTRO for deep neural
networks. While FjORD applies uniform dropout across the width of the network for each layer, we
propose to decompose each layer independently to uncover its – potentially different – optimal rank
for deployment. We discuss details in the next paragraph.
DNN low-rank approximation. For Deep Neural Networks (DNNs), we seek to uncover the optimal
ranks for a set of d linear mappings W 1 ∈ Rm1 ×n1 , . . . , W d ∈ Rmd ×nd , where W i ’s are model
parameters and d is model depth, e.g., weights corresponding to linear layers2 , by decomposing
⊤
them as W i = U i V i . We discuss how these are selected in the next section. To decompose the
network, we aim to minimize the following objective:
" d Xri
#
1 X
1 1 ⊤
i i ⊤
d d ⊤
o
Ex,y∼X Pd l(h(U V , . . . , U:b V:b , . . . , U V , W , x), y) , (4)
i=1 ri i=1 b=1

where ri = min{mi , ni }, l is a loss function, h is a DNN, and W o are the other weights that we do
not decompose. We note that our formulation aims to decompose each layer, while decompositions
across layers do not directly interact. The motivation for this approach is to uncover low-rank
structures within each layer that are not affected by inaccuracies from other layers due to multiple
low-rank approximations.

3.2 Layer factorization

The following subsections discuss how model factorization is implemented for different model
architectures.
FC layers. A 2-layer fully connected (FC) neural network can be expressed as f (x) =
σ(σ(xW1 )W2 ), where W s are weight matrices of each FC layer, and σ(·) is any arbitrary acti-
vation function, e.g., ReLU. The weight matrix W can be factorized as U V ⊤ .
CNN layers. For a convolution layer with dimension, W ∈ Rm×n×k×k where m and n are the
number of input and output channels, and k is the size of the convolution filters. Instead of directly
factorizing the 4D weight of a convolution layer, we factorize the unrolled 2D matrix. Unrolling the
2
4D tensor W leads to a 2D matrix with shape Wunrolled ∈ Rmk ×n , where each column represents the
weight of a vectorized convolution filter. Factorization can then be conducted on the unrolled 2D
matrix; see [53] for details.
Transformers. A Transformer layer consists of a stack of encoders and decoders [51]. The encoder
and decoder contain three main building blocks: the multi-head attention layer, position-wise feed-
forward networks (FFN), and positional encoding. We factorize all trainable weight matrices in the
multi-head attention (HMA) and the FFN layers. The FFN layer factorization can directly adopt the
strategy from the FC factorization. A p-head attention layer learns p attention mechanisms on the
key, value, and query (K, V, Q) of each input token:
MHA(Q, K, V ) = Concat(head1 , . . . , headp )W O .
Each head performs the computation of:
(i) (i)⊤
!
(i) (i) (i) QWQ WK K ⊤ (i)
headi = Attention(QWQ , KWK , V WV ) = softmax p V WV .
d/p
1
We make this assumption for simplicity. It can be simply overcome by adding a bias term into the model.
2
We can apply our decomposition on different types of layers, such as Linear, Convolutional and Transformers as shown in Sec. 3.2.

4
(i) (i) (i)
where d is the hidden dimension. The trainable weights WQ , WK , WV , i ∈ {1, 2, . . . , p} can
be factorized by simply decomposing all learnable weights W · in an attention layer and obtaining
U ·V ⊤ · [51].

3.3 Training techniques

Having defined the decomposition of typical layers found in DNNs, we move to formulate the
training procedure of our method, formally described in Algorithm 1. Training the model comprises
an iterative process of propagating forward on the model by sampling a rank bi per decomposed layer
i up to maximal rank ri (line 3). We calculate the loss, which integrates an additional hierarchical
group lasso component (lines 4) and backpropagate on the sampled decomposed model (line 5). At
the end of each epoch, we progressively shrink the network by updating the maximal rank ri , based
on an importance threshold εps (line 11). We provide more details about each component below.

Algorithm 1: M AESTRO (Training Process)

Input: epochs E, dataset D, model h parametrized by U 1 ∈ Rm1 ×r1 ,
V 1 ∈ Rn1 ×r1 , . . . , U d ∈ Rmd ×rd , V d ∈ Rnd ×rd , W o , and hyperparameters λgl , εps
1 for t ← 0 to E − 1 do // Epochs
2 for (x, y) ∈ D do // Iterate over dataset
d
Sample (i, b) ∼ {(i, b)}rb=1
i
3
i=1
;
⊤ ⊤ ⊤
L = l(h(U 1 V 1 , . . . , U:bi V:bi , . . . , U d V d , W o , x), y)+

4

+λgl di=1 rb=1

P
P i i i
Ub: + Vb: // compute loss
5 L.backward() // Update weights
6 end
7 for i ← 1 to d do
8 for b ← 1 to ri do
9 // rank
importance
thresholding
10 if Vb:i Ub:i ≤ εps then
11 ri = b − 1 // progressive shrinking
12 break
13 end
14 end
15 end
16 end

Efficient training via sampling. In Sec. 4, we show that for the linear case (3), the optimal solution
corresponds to PCA over the linearly transformed dataset. This means that the obtained solution
contains orthoghonal directions. This property is beneficial because it directly implies that when we
employ gradient-based optimization, not only is the gradient zero at the optimum, but the gradient
with respect to each summand in Equation (3) is also zero. The same property is directly implied
by overparametrization [35] or strong growth condition [44]. As a consequence, this enables us to
sample only one summand at a time and obtain the same quality solution. When considering (4)
as an extension to (3), it is unclear whether this property still holds, which would also imply that
the set of stationary points of (3) is a subset of stationary points of the original objective without
decomposition. However, in the experiments, we observed that sampling is sufficient to converge to a
good-quality solution. If this only holds approximately, we one could leverage fine-tuning to recover
the loss in performance.
Efficient rank extraction via hierarchical group-lasso. By definition, (3) leads to an ordered set
of ranks for each layer. This ordered structure enables efficient rank extraction and selection. To
effectively eliminate unimportant ranks while retaining the important ones, thus leading to a more
efficient model, we consider Hierarchical Group Lasso (HGL) [31] in the form
Xd Xri
i i
λgl Ub: + Vb: , (5)
i=1 b=1
where Cb: denotes the matrix that contains all the columns of C except for the first b − 1 columns.
Progressive shrinking. HGL encourages that unimportant ranks become zero and can be effectively
i i
removed
i from the model. To account for this, for each layer we remove Vb: and Ub: (i.e., set ri = b−1)
i
if Vb: Ub: ≤ εps , where εps is a pre-selected threshold – and a hyperparameter of our method.

5
k = ... (p = ...)
102 1 (0.2)
2 (0.3) 1.0
3 (0.5)
101 4 (0.7) k = ... (p = ...)
5 (0.8) 0.8 1 (0.1)
6 (1.0) 2 (0.2)
100

Singular Values
Ak||2F
3 (0.3)
0.6 4 (0.4)
5 (0.5)
10 1

|| uivi
6 (0.6)
7 (0.7)
i=1
k
0.4 8 (0.8)
10 2 9 (0.9)
10 (1.0)
0.2
10 3

0.0
0 250 500 750 1000 1250 1500 1750 2000 0 250 500 750 1000 1250 1500 1750 2000
Iterations Iterations
(a) Verification that M AESTRO recovers SVD for lin- (b) Verification that M AESTRO recovers PCA for iden-
ear mapping with uniform data. The plot displays the tity mapping. The plot displays the estimates of singu-
L2 distance between the best rank k and M AESTRO’s lar values. The data distribution has only 3 directions.
approximation of mapping A. The target matrix was It is expected that the top 3 ranks will converge to
randomly generated 9 × 6 matrix with rank 3. p and value one and the rest to zero. p and k stand for rela-
k stand for relative and actual rank, respectively. tive and actual rank, respectively.
Figure 2: Empirical showcase of theoretical properties of the M AESTRO’s formulation.
Initialization. Initialization is a key component of the training procedure [13, 37]. To adopt the
best practices from standard non-factorized training, we follow a similar approach to [23, 53], where
we first initialize the non-factorized model using standard initialization. For initializing factorized
layers, we use the Singular Value Decomposition (SVD) of the non-factorized initialization – in a
full-rank form – to ensure that the resulting product matrix is the same as the original parameter
decomposition. In addition, SVD is an optimal decomposition for the linear case with uniform data.
However, in contrast with the adaptive baseline method [54] we only decompose once, rather than on
every training iteration.
3.4 Train-once, deploy-everywhere

Up until now, we have described how our method works for training low-rank models, which yield
computational, memory, network, and energy [57] bandwidth benefits during training. At deployment
time, one can directly deploy the final model (rank ri for each layer) on the device, which we
acquire from performing a threshold sweep of εps over the effective range of rank importance across
layers. However, in case we want to run on even more constrained devices, such as mobile [4] or
embedded [4] systems, the learned decomposition also gives us the flexibility to further compress
the model in a straightforward manner, effectively trading off accuracy for a smaller model footprint.
Inspired by [61], we propose to use greedy search. We begin with the current model and compare
model performance across various low-rank models, each created by removing a certain percentage
of ranks from each layer. We then eliminate the ranks that cause the least decrease in performance.
This process is iterated until we reach the desired size or accuracy constraint. To make this approach
efficient, we estimate the loss using a single mini-batch with a large batch size, for example, 2048.
This also avoids issues with BatchNorm layers; see [61] for details.
In summary, M AESTRO comprises a technique for trainable low-rank approximation during training
time that progressively compresses the model, reflecting the data distribution, and a method that
enables a graceful trade-off between accuracy and latency for embedded deployment, by selecting
the most important parts of the network. We validate these claims in Sec. 5.2 and 5.5, respectively.

4 Theoretical guarantees
In this section, we further investigate the theoretical properties of M AESTRO for the linear mappings,
i.e., the setup of the problem formulation (3).
Theorem 4.1 (Informal). Let A = Ũ Σ̃Ṽ ⊤ be a SVD decomposition of A. Then, the minimization
problem (3) is equivalent to PCA applied to the transformed dataset x → Σ̃Ṽ ⊤ x, x ∼ X projected
on the column space of Ũ .

The formal statement can be found in Appendix D. Theorem 4.1 shows that M AESTRO can adapt to
data distribution by directly operating on data x ∼ X and also to the target mapping by projecting
data to its right singular vectors scaled by singular values. In particular, we show that in the special
case, when X is the uniform distribution on the unit ball, (3), i.e., M AESTRO, exactly recovers
truncated SVD of A, which is consistent with the prior results [17]. In the case A is the identity, it is
straightforward to see that M AESTRO is equivalent to PCA. We can see that M AESTRO can efficiently

6
extract low-rank solutions by filtering out directions corresponding to the null space of the target
mapping A and directions with no data. We also numerically verify both of the special cases–PCA
and SVD, by minimizing (3) using stochastic gradient descent (SGD) with D being the uniform
distribution. These preliminary experiments are provided in Fig. 2a and 2b.
We showed that M AESTRO could recover SVD in a particular case of the linear model and the uniform
data distribution on the unit ball. We note that in this case, SVD is optimal, and we cannot acquire
better decomposition. Therefore, it is desired that M AESTRO is equivalent to SVD in this scenario. In
the more general setting, we argue that M AESTRO decomposition should be preferable to SVD due to
the following reasons:
• M AESTRO formulation is directly built into the training and tailored to obtain the best low-rank
decomposition, while SVD relies on linearity assumption.
• SVD does not account for data, and even in the linear NN case, the learned singular vectors might
exhibit wrong ordering. We demonstrate this issue using a simple example where we take matrix
A with rank 3. We construct the dataset X in such a way that the third singular vector is the most
important, the second one is the second, and the first is the third most important direction. Clearly,
SVD does not look at data. Therefore, it cannot capture this phenomenon. We showcase that
M AESTRO learns the correct order; see Fig. 5 of the Appendix.
• Pre-factorizing models allow us to apply hierarchical group-lasso penalty [64] for decomposed
weights to directly regularize the rank of different layers.
• SVD is computationally expensive and can only run rarely, while M AESTRO is directly built into
the training and, therefore, does not require extra computations. In addition, M AESTRO supports
rank sampling so training can be made computationally efficient.

5 Experiments
We start this section by describing the setup of our experiments, including the models, datasets and
baselines with which we compare M AESTRO. We then compare M AESTRO against the baselines on
accuracy and MAC and discuss the results. Subsequently, we analyze the behaviour of our system
in-depth and provide additional insights on the performance of our technique, along with an ablation
study and sensitivity analysis to specific hyperparameters. Finally, we showcase the performance
of models upon deployment and how we can derive a smaller footprint model with some accuracy
trade-off, without the need to fine-tune.

5.1 Experimental setup

Models & datasets. The datasets and models considered in our experiments span across four datasets,
concisely presented along with the associated models on Tab. 1. We have implemented our solution
with PyTorch [38](v1.13.0) trained our models on NVidia A100 (40G) GPUs. Details for the learning
tasks and hyperparameters used are presented in the Appendix.
Baselines. We have selected various baselines Table 1: Datasets and models for evaluation. The net-
from the literature that we believe are closest to work footprints depict the vanilla variants of the models.
aspects of our system. On the pruning front, we Dataset Model # GMACs # Params (M) Task
compare with the IMP [40] and RareGems [48] MNIST LeNet 2e 0.04 Image classification −4

0.56 11.18 Image classification

techniques, themselves based on the LTH [11]. CIFAR10
CIFAR10
ResNet-18
VGG-19 0.40 20.00 Image classification
5.19 53.9 Image classification
On the quantization front, we compare with TinyImageNet
Multi30k
ResNet-50
6-layer Transformer 1.37 48.98 Translation (en-ge)
XNOR-Net [41]. With respect to low-rank meth-
ods, we compare with Spectral Initialisation [23], Pufferfish [53] and Cuttlefish [54].

5.2 Performance comparison

We start off by comparing M AESTRO with various baselines from the literature across different
datasets and types of models3 . Results are depicted in Tab. 2 and 3, while additional performance
points of M AESTRO for different model footprints are presented in the Appendix F.2 and F.3.
Comparisons with low-rank methods. The low-rank methods we are comparing against are
Pufferfish [53] and Cuttlefish [54]. These methods try to reduce training and inference runtime while
3
The operating points we select for M AESTRO are the closest lower to the respective baseline in terms of footprint. Where the result is not present in the Tab. 2, we
provide the λgp value so that it can be referenced from the Appendix, Tab. 11, 12.

7
Table 2: Maestro vs. baselines on CIFAR10. Table 3: Maestro vs. baselines on Multi30k.
Variant Model Acc. (%) GMACs Params. (M ) Variant Model Perplexity GMACs Params. (M )
Non-factorized ResNet-18 93.86±0.20 0.56 11.17 Non-factorized Transformer 9.85±0.10 1.370 53.90
Pufferfish ResNet-18 94.17 0.22 3.336 Pufferfish∗ Transformer 7.34±0.12 0.996 26.70
Cuttlefish ResNet-18 93.47 0.3 3.108 M AESTRO† Transformer 6.90±0.07 0.248±0.0032 13.80±0.113
IMP ResNet-18 92.12 - 0.154 ∗ † i
RareGems ResNet-18 92.83 - 0.076 Results from original work; tuned λgp from {2 /100; i ∈ 0, . . . , 9}
XNOR-Net ResNet-18 90.06 - 0.349†
M AESTRO†
ResNet-18 94.19±0.07 0.39±0.00 4.08±0.02
Table 4: Ablation study for ResNet18 on CIFAR10
(λgp = 16e−6 )
M AESTRO†
Variant Acc. (%) GMACs Params. (M )
ResNet-18 93.86±0.11 0.15±0.00 1.23±0.00
(λgp = 64e−6 ) M AESTRO 94.19±0.39 0.39±0.0008 4.08±0.020
Non-factorized VGG-19 92.94±0.17 0.40 20.56 w/out GL 94.04±0.10 0.56±0.0000 11.2±0.000
Pufferfish VGG-19 92.69 0.29 8.37 w/out PS 94.12±0.36 0.39±0.0010 4.09±0.027
Cuttlefish VGG-19 93.39 0.15 2.36 w/ full-training 94.05±0.32 0.39±0.0004 4.09±0.032
RareGems VGG-19 86.28 - 5.04
IMP VGG-19 92.86 - 5.04
XNOR-Net VGG-19 88.94 - 0.64†
∗
Spectral Init. VGG-19 83.27 - ≈ 0.4
M AESTRO†
VGG-19 93.10±0.10 0.13±0.00 2.20±0.03
(λgp = 32e−6 )
M AESTRO†
VGG-19 88.53±0.13 0.03±0.00 0.35±0.00
(λgp = 512e−6 )
∗
Results from original work; †: XNOR-Net employs binary weights and
activations; although the overall #trainable parameters remain the same as
the vanilla network, each model weight is quantized from 32-bit to 1-bit.
Therefore, we report a compression rate of 3.125%(1/32).
preserving model accuracy by leveraging low-rank approximations. For ResNet-18, we achieve
94.19±0.07% for 4.08M parameters and 93.97±0.25% for 2.19M parameters compared to the
94.17% of Pufferfish at 3.3M parameters. For VGG-19, we achieve +0.41pp (percentage points)
higher accuracy compared to Pufferfish and -0.29pp to Cuttlefish at 44.8% and 93.2% of the sizes,
respectively. Finally, comparing with the spectral initialization [23] for VGG-19, we achieve +5.26pp
higher accuracy for 87.5% of parameter size. Detailed results are shown in Tab. 2. This performance
benefits also apply in the case of Transformers (Tab. 3), where M AESTRO performs 6% better in terms
of perplexity at 25% of the cost (MACs) and 51.7% of the size (parameters) compared to Pufferfish.
Comparisons with pruning methods. The next family of baselines is related to the LTH [11].
Specifically, we compare against IMP [40] and witness from Tab. 2 that M AESTRO can achieve
+1.25pp (λgp = 128e−6 ) and +0.24pp (λgp = 32e−6 ) higher accuracy for ResNet-18 and VGG-19
respectively. Although we cannot scale to the size that RareGems [48] for ResNet-18, the sparsity
that they achieve is unstructured, which most modern hardware cannot take advantage of. In contrast,
our technique performs ordered structured sparsity, compatibly with most computation targets. On
the other hand, for VGG-19, we achieve +6.82pp higher accuracy at 43.6% of the footprint.
Comparisons with quantized models. We also compare against XNOR-Net [41], which binarizes
the network to achieve efficient inference. Training continues to happen in full precision, and
inference performance is really dependent on the operation implementation of the target hardware.
Nonetheless, assuming a compression rate of 3.125%, for the same size of network, we achieve
+1.08pp (λgp = 512e−6 ) and +2.18pp (λgp = 256e−6 ) higher accuracy on ResNet-18 and VGG-19.
5.3 Training behaviour of M AESTRO
Having shown the relative performance of our framework to selected baselines, we now move to
investigate how our method behaves, with respect to its convergence and low-rank approximations.
Model and rank convergence. In Fig. 3, we present the training dynamics for M AESTRO. Fig. 3a
illustrates the evolution of total rank throughout the training steps. We observe that the ranks are
pruned incrementally. This aligns with the observations made during Pufferfish [53] training, where
the authors suggest warm-start training with full precision to enhance the final model performance.
In our situation, we do not need to integrate this heuristic because M AESTRO automatically prunes
rank. Fig. 3b reveals the ranks across layers after training. We notice an intriguing phenomenon: the
ranks are nested for increasing λgl . This could imply apart from a natural order of ranks within each
layer, a global order. We briefly examine this captivating occurrence in the following section, and we
plan to investigate it more thoroughly in future work, as we believe this might contribute to a superior
rank selection and sampling process. Lastly, Fig. 3c depicts the progression of training loss. We find
that our hypothesis, that sampling does not adversely impact training, is also supported empirically.
5.4 Ablation study
In this section, we examine the impact of each component on the performance of M AESTRO.
Specifically, we run variants of our method i) without the hierarchical group lasso regularization

8
500 Full Rank 100
4000 Maestro: = 4e-06
Maestro: = 8e-06
Maestro: = 4e-06 400 Maestro: = 1.6e-05
Maestro: = 8e-06 Maestro: = 3.2e-05
3000 Maestro: = 1.6e-05 Maestro: = 6.4e-05 10 1

Total Rank
300

Train Loss
Maestro: = 3.2e-05 Maestro: = 0.000128

Rank
Maestro: = 6.4e-05 Maestro: = 0.000256
2000 Maestro: = 0.000128 Maestro: = 0.000512
10 2
Maestro: = 0.000256 200 Maestro: = 0.001024
Maestro: = 0.000512
1000 Maestro: = 0.001024
100
10 3

0 0
0 50 100 150 200 250 300 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 0 50 100 150 200 250 300
Epochs Layer Index Epochs

(a) Total rank ( di=1 ri ). (b) Ranks ri ’s after training. (c) Convergence for λgl = 0.
P

Figure 3: Training dynamics of M AESTRO for ResNet18 on CIFAR10.

Maestro
SVD 90
90 93
80
85 92 70
Test Accuracy

Test Accuracy

Test Accuracy
80 91 60
75 50
90 40
70 30
89 Pruned Maestro Model
65 Maestro (With Hierarchical Regularization) 20 Maestro models with varying
1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
MACS 1e8 MACS 1e8 MACS 1e8

(a) M AESTRO vs. SVD. (b) Varying HGL. (c) Nested M AESTRO.
Figure 4: Accuracy-latency trade-off of M AESTRO under different settings for VGG19 on CIFAR10.
(HGL), ii) without progressive shrinking (PS). Additionally, we integrate iii) an extra full low-rank
pass (b = ri ) into the training at each step to assess whether extra sampling would be beneficial.
The results are displayed in Tab. 4. As anticipated, our findings confirm that neither inclusion of
hierarchical group lasso with a tuned λgl nor progressive shrinking impair the final performance,
but they do significantly enhance the efficiency of M AESTRO. Moreover, sampling more ranks at
each training step does not improve the final performance, and, in fact, it hampers training efficiency,
making it approximately twice as computationally demanding.
5.5 Accuracy-latency trade-off at training and deployment time
In Fig. 4, we illustrate various approaches to balance latency (proxied through MACs operations)
and accuracy in model training and deployment. Fig. 4a demonstrates how M AESTRO (λgl = 0) can
be pruned effectively for deployment using the greedy search method discussed in Section 3.4. We
contrast this with the greedy pruning of a non-factorized model that has been factorized using SVD.
We reveal that this straightforward baseline does not measure up to the learned decomposition of
M AESTRO and results in a significant performance decrease. Next, Fig. 4b portrays the final accuracy
and the number of model parameters for varying hierarchical group lasso penalties. This leads to the
optimal latency-accuracy balance for both training and inference. However, it’s crucial to point out
that each model was trained individually, while greedy pruning only necessitates a single training
cycle. Lastly, we delve into the observation of nested ranks across increasing λgl . Fig. 4c displays the
performance of M AESTRO (λgl = 0) across different ranks selected by smaller models M AESTRO
(λgl > 0). Intriguingly, we observe that M AESTRO (λgl = 0) performs very well—for instance, we
can decrease its operations in half (and parameters by 10×) and still maintain an accuracy of 87.7%
without fine-tuning, just by reusing rank structure from independent runs. As aforementioned, we
intend to further explore this in the future.

6 Conclusion and future work

In this work, we have presented M AESTRO, a method for trainable low-rank approximation of DNNs
that leverages progressive shrinking by applying a generalized variant of Ordered Dropout to the
factorized weights. We have shown the theoretical guarantees of our work in the case of linear
models and empirically demonstrated its performance across different types of models, datasets, and
modalities. Our evaluation has demonstrated that M AESTRO outperforms competitive compression
methods at a lower cost. In the future, we plan to expand our technique to encompass more advanced
sampling techniques and apply it to different distributed learning scenarios, such as Federated
Learning, where data are natively non-independent or identically distributed (non-IID).

9
References
[1] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. Fedrolex: Model-heterogeneous federated learning
with rolling sub-model extraction. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun
Cho, editors, Advances in Neural Information Processing Systems, 2022.
[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-
efficient sgd via gradient quantization and encoding. Advances in neural information processing systems,
30, 2017.
[3] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli.
The convergence of sparsified gradient methods. arXiv preprint arXiv:1809.10505, 2018.
[4] Mario Almeida, Stefanos Laskaridis, Abhinav Mehrotra, Lukasz Dudziak, Ilias Leontiadis, and Nicholas D
Lane. Smart at what cost? characterising mobile deep neural networks in the wild. In Proceedings of the
21st ACM Internet Measurement Conference, pages 658–672, 2021.
[5] Sebastian Caldas, Jakub Konečný, Brendan McMahan, and Ameet Talwalkar. Expanding the reach of
federated learning by reducing client resource requirements, 2019.
[6] Miguel A Carreira-Perpinán and Yerlan Idelbayev. “learning-compression” algorithms for neural net
pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
8532–8541, 2018.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009.
[8] Enmao Diao, Jie Ding, and Vahid Tarokh. Hetero{fl}: Computation and communication efficient federated
learning for heterogeneous clients. In International Conference on Learning Representations, 2021.
[9] Łukasz Dudziak, Mohamed S Abdelfattah, Ravichander Vipperla, Stefanos Laskaridis, and Nicholas D
Lane. Shrinkml: End-to-end asr model compression using reinforcement learning. INTERSPEECH, 2019.
[10] D. Elliott, S. Frank, K. Sima’an, and L. Specia. Multi30k: Multilingual english-german image descriptions.
pages 70–74, 2016.
[11] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural
networks. In International Conference on Learning Representations, 2019.
[12] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with
pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international conference
on computer vision, pages 1026–1034, 2015.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In
Proceedings of the IEEE international conference on computer vision, pages 1389–1397, 2017.
[16] Samuel Horváth, Aaron Klein, Peter Richtárik, and Cédric Archambeau. Hyperparameter transfer learning
with adaptive complexity. In International Conference on Artificial Intelligence and Statistics, pages
1378–1386. PMLR, 2021.
[17] Samuel Horváth, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas
Lane. FjORD: Fair and accurate federated learning under heterogeneous targets with ordered dropout.
Advances in Neural Information Processing Systems, 34:12876–12889, 2021.
[18] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges-
mundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International
Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
[19] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision
applications. arXiv preprint arXiv:1704.04861, 2017.
[20] Edward Hu, Yelong Shen, Phil Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Lu Wang, and Weizhu Chen. Lora:
Low-rank adaptation of large language models, 2021.
[21] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron
pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
[22] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with
low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

10
[23] Mikhail Khodak, Neil Tenenholtz, Lester Mackey, and Nicolo Fusi. Initialization and regularization of
factorized neural layers. arXiv preprint arXiv:2105.01029, 2021.
[24] Minjae Kim, Sangyoon Yu, Suhyun Kim, and Soo-Mook Moon. DepthFL : Depthwise federated learning
for heterogeneous clients. In The Eleventh International Conference on Learning Representations, 2023.
[25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[26] Stefanos Laskaridis, Alexandros Kouris, and Nicholas D Lane. Adaptive inference through early-exit
networks: Design, challenges and directions. In Proceedings of the 5th International Workshop on
Embedded and Mobile Deep Learning, pages 1–6, 2021.
[27] Stefanos Laskaridis, Stylianos I Venieris, Alexandros Kouris, Rui Li, and Nicholas D Lane. The future of
consumer edge-ai computing. arXiv preprint arXiv:2210.10514, 2022.
[28] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
[29] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online].
Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[30] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient
convnets. arXiv preprint arXiv:1608.08710, 2016.
[31] Michael Lim and Trevor Hastie. Learning interactions via hierarchical group-lasso regularization. Journal
of Computational and Graphical Statistics, 24(3):627–654, 2015.
[32] Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-device training
under 256kb memory. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[33] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network
pruning. arXiv preprint arXiv:1810.05270, 2018.
[34] Zicheng Liu, Da Li, Javier Fernandez-Marques, Stefanos Laskaridis, Yan Gao, Łukasz Dudziak, Stan Z Li,
Shell Xu Hu, and Timothy Hospedales. Federated learning for inference at anytime and anywhere. arXiv
preprint arXiv:2212.04084, 2022.
[35] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness
of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages
3325–3334. PMLR, 2018.
[36] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and
statistics, pages 1273–1282. PMLR, 2017.
[37] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming
Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[39] David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel
Rothchild, David R So, Maud Texier, and Jeff Dean. The carbon footprint of machine learning training
will plateau, then shrink. Computer, 55(7):18–28, 2022.
[40] Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, and Gintare Karolina
Dziugaite. Unmasking the lottery ticket hypothesis: What’s encoded in a winning ticket’s mask? In The
Eleventh International Conference on Learning Representations, 2023.
[41] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classi-
fication using binary convolutional neural networks. In Computer Vision–ECCV 2016: 14th European
Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 525–542.
Springer, 2016.
[42] Oren Rippel, Michael Gelbart, and Ryan Adams. Learning Ordered Representations with Nested Dropout.
In International Conference on Machine Learning (ICML), pages 1746–1754, 2014.
[43] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank
matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE
international conference on acoustics, speech and signal processing, pages 6655–6659. IEEE, 2013.
[44] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth
condition. arXiv preprint arXiv:1308.6370, 2013.
[45] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its
application to data-parallel distributed training of speech dnns. In Fifteenth annual conference of the
international speech communication association, 2014.
[46] Hakim Sidahmed, Zheng Xu, Ankush Garg, Yuan Cao, and Mingqing Chen. Efficient and private federated
learning with partially trainable networks. arXiv preprint arXiv:2110.03450, 2021.

11
[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. In International Conference on Learning Representations, 2015.
[48] Kartik Sreenivasan, Jy yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing,
Kangwook Lee, and Dimitris Papailiopoulos. Rare gems: Finding lottery tickets at initialization. In Alice H.
Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information
Processing Systems, 2022.
[49] Ananda Theertha Suresh, X Yu Felix, Sanjiv Kumar, and H Brendan McMahan. Distributed mean estimation
with limited communication. In International Conference on Machine Learning, pages 3329–3337. PMLR,
2017.
[50] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In
International conference on machine learning, pages 6105–6114. PMLR, 2019.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing
systems, pages 5998–6008, 2017.
[52] Erwei Wang, James J Davis, Ruizhe Zhao, Ho-Cheung Ng, Xinyu Niu, Wayne Luk, Peter YK Cheung,
and George A Constantinides. Deep Neural Network Approximation for Custom Hardware: Where we’ve
been, where we’re going. ACM Computing Surveys (CSUR), 52(2):1–39, 2019.
[53] Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. Pufferfish: communication-efficient models
at no extra cost. Proceedings of Machine Learning and Systems, 3:365–386, 2021.
[54] Hongyi Wang, Saurabh Agarwal, Yoshiki Tanaka, Eric P Xing, Dimitris Papailiopoulos, et al. Cuttlefish:
Low-rank model training without all the tuning. arXiv preprint arXiv:2305.02538, 2023.
[55] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep
neural networks. Advances in neural information processing systems, 29, 2016.
[56] Simon Wiesler, Alexander Richard, Ralf Schlüter, and Hermann Ney. Mean-normalized stochastic gradient
for large-scale deep learning. In 2014 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 180–184. IEEE, 2014.
[57] Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria
Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. Sustainable ai: Environmental implications, challenges
and opportunities. Proceedings of Machine Learning and Systems, 4:795–813, 2022.
[58] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular
value decomposition. In Interspeech, pages 2365–2369, 2013.
[59] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks
using energy-aware pruning. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 5687–5695, 2017.
[60] Tian Ye and Simon S Du. Global convergence of gradient descent for asymmetric low-rank matrix
factorization. Advances in Neural Information Processing Systems, 34:1429–1439, 2021.
[61] Jiahui Yu and Thomas Huang. Autoslim: Towards one-shot architecture search for channel numbers. arXiv
preprint arXiv:1903.11728, 2019.
[62] Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In
Proceedings of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019.
[63] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. In
International Conference on Learning Representations, 2019.
[64] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[65] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model
compression. arXiv preprint arXiv:1710.01878, 2017.

12
Appendix
Contents of the Appendix
A Broader impact 13

B Limitations 13

C Extended Background 14

D Theoretical Properties of Low-Rank Layers 14

E Experimental setup 16
E.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
E.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
E.3 Hyperparameter selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
E.4 Deciding against decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

F Extended evaluation 18
F.1 M AESTRO recovers correct ordering . . . . . . . . . . . . . . . . . . . . . . . . . 18
F.2 Training behaviour of M AESTRO . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
F.3 Model size-accuracy trade-off at training and deployment time . . . . . . . . . . . 19

A Broader impact

The goal of our work is to make the training and deployment of DNNs more efficient, affecting the
total computation, memory and bandwidth of systems, as well as the energy they require to run the
respective tasks. DNN model training requires significant amounts of energy, whether in a data center
or at the edge [57, 39]. However, such techniques should not be used as an excuse to make data
centers less green, but rather as a complementary measure to further reduce the carbon footprint of
Deep Learning.
Additionally, as our technique involves a training-aware methodology for progressively selecting
ranks, it depends on the quality of data used in training. Deploying the model in the wild for various
downstream tasks may result in behavior different from the intended outcomes. Therefore, it should
be thoroughly tested before deployment to ensure it adheres to the required Service Level Objectives
(SLOs), especially in performance-critical use cases, such as self-driving vehicles or UAV navigation.

B Limitations

In this work, we have proposed a method for trainable low-rank approximation of DNNs that provides
performance benefits for both training and inference times. While we suggest that this could have
repercussions on the energy consumption of these tasks, we have not yet evaluated this hypothesis
experimentally across different devices, be they data center-grade or at the edge.
Additionally, we have applied our technique to CNN and Transformer models spanning across vision
and NLP tasks. While we anticipate generalization to any type of network, it remains to be seen
whether our techniques can also be applied to alternative types of layers, such as recurrent ones, and
the benefits they may bring.
Although we have provided a thorough investigation of the behaviour of our proposed system,
the only way we can control the end footprint of the model during training is via the λgl and εps
hyperparameters. However, there is no guarantee about the final footprint of the model. If we are
willing to sacrifise accuracy, then the technique illustrated in Sec. 3.4 and evaluated in Sec. 5.5 is a
start. More robust ways of globally ranking per-layer importances are left as future work.
Lastly, our sampling method during training is uniform up to the maximum rank during progressive
shrinking. Although this method has proven effective, alternative sampling methods could potentially

13
accelerate rank exploration, thereby hastening the shrinking and convergence of the network during
training.

C Extended Background
Ordered Dropout. Ordered Dropout is a technique of importance-based, nested and ordered pruning
that works along the indices of a layer’s parameters (neurons, filters, etc.) Introduced by [17],
the authors describe a training technique where a layer’s width is discretised in |P | values, where
P = {s1 , s2 , . . . , s|P | }, and at each
training step, they sample p ∼ UP to get a specific subnetwork,
extracted by selecting the first p ∗ Kl − 1 neurons per layer and dropping the rest. In contrast to
our work, sampling is happening directly on model parameters (rather than ranks) and is uniform
across layers (i.e. a single p-value is set). Nested-ness refers to the fact that larger p-value models
include the parameters of lower p-values and importance-based pruning means that via stochastic
sampling, the right-most (in terms of index) parameters train on progressively less data due to the
probability of sampling and nestedness (i.e. all data pass from the parameters of minimal subnetwork,
less pass the higher the p-value).
D Theoretical Properties of Low-Rank Layers
In this section, we show that for the case of linear mappings, i.e., the problem formulation discussed in
(3), M AESTRO acts as PCA applied to the original dataset X projected onto the space weighted by the
corresponding singular values. Before proceeding with the theorem, we first recall the assumptions
and notations introduced in the main paper.
We denote C:b as the first b columns of matrix C, C:a,:b denotes the first a rows, and b columns of a
matrix C, a+1 : denotes the all the columns/rows from index a+1, : denotes the all the columns/rows,
and for vectors, we use a single subscript. As discussed in the main paper, we reformulate the original
least squares problems to the following decomposition problem
h h 2 ii
min Ex,y∼X Eb∼D U:b V:b⊤ x − y , (6)
U ∈Rm×r n×r
,V ∈R

where D is a distribution that samples b ∈ {1, 2, . . . , r} with probability pb > 0 and we assume that
y is linked with x through linear map A, i.e., y = Ax.
Theorem D.1. Let A = Ũ Σ̃Ṽ ⊤ be a SVD decomposition of A. Then, the minimization problem (6)
is equivalent to PCA applied to the transformed dataset x → Σ̃Ṽ ⊤ x, x ∼ X projected on the column
space of Ũ . Concretely, we can first solve
2
⊤ ⊤

min Ez∼X Eb∼D U:b V:b − I Σ̃Ṽ x , (7)

U ∈Rm×r ,V ∈Rn×r

and then we can obtain the solutions of (6) using U ⋆ = Ũ ⊤ Ū , V ⋆ = Ṽ ⊤ V̄ , where Ū , V̄ belong to
the set of optimal solutions of problem (7).zx
In the particular case, where X is a uniform distribution on the unit ball, (6) recovers the best rank
approximation of A across all ranks, i.e., up to the scale of U and V recovers truncated SVD. In the
case, A is identity, (6) leads to standard PCA decomposition.

Proof. From the assumptions that y = Ax and A = Ũ Σ̃Ṽ ⊤ , we can rewrite (6) as
2
⊤ ⊤
min Ex∼X Eb∼D U:b V:b − Ũ Σ̃Ṽ x .

U ∈Rm×r ,V ∈Rn×r

Since Ũ is orthogonal, we have ∥z∥ = ∥Ũ ⊤ z∥. Therefore, the above problem is equivalent to
2
min Ex∼X Eb∼D Ũ ⊤ U:b V:b⊤ − Σ̃Ṽ ⊤ x ,

U ∈Rm×r ,V ∈Rn×r

14
which is also equivalent to
2
⊤ ⊤
min Ex∼X Eb∼D U:b V:b − Σ̃Ṽ x

U ∈Rm×r ,V ∈Rn×r

after reparametrization. The next step involves injecting identity in the form Ṽ Ṽ ⊤ as that leads to the
equivalent reformulation
2
⊤ ⊤
min E E U V Ṽ − Σ̃ Ṽ x .

x∼X b∼D :b :b
U ∈Rm×r n×r
,V ∈R

As for the previous case, we can reparametrise the problem to obtain

2
min Ex∼X Eb∼D U:b V:b⊤ − Σ̃ Ṽ ⊤ x .

m×r
U ∈R n×r
,V ∈R

Let k = rank(Σ̃) = rank(A) ≤ r and z = Ṽ ⊤ x. Furthermore, let g = Σ̃z for any z ∈ Rn , then
gk+1: = ⃗0. This, combined with the nested structure of the optimization problem, implies that the
optimal solution for U has to be of the form ui,k+1: = ⃗0 for all interesting (non-zero mapping)
directions, i.e., there exists x ∈ X such that vi⊤ Ṽ ⊤ x ̸= 0. These are the only interesting solutions
since the case where for all x ∈ X : vi⊤ Ṽ ⊤ x = 0 yields zero mapping on X , which is not of interest
and could be dropped, e.g., using group lasso penalty discussed in the main part. Therefore, to solve
the original problem, we could first solve the following problem
2
⊤
min Ez∼X Eb∼D U:k,:b V:b − Σ̃:k,: z

U ∈R k×r n×r
,V ∈R

and then reconstruct the corresponding solution of the original problem by appending zeros to the
resulting matrix U . By a similar argument, we can argue that for all non-zero mapping directions, it
has to be the case that vi,k+1: = ⃗0. Therefore, solving the original minimization reduces to
2
⊤
min Ez∼X Eb∼D U:b V:b − Σ̃:k,:k z:k .

U ∈Rk×r ,V ∈Rk×r

This can be further simplified using reparametrization V ⊤ → V ⊤ Σ̃−1

:k,:k , which leads to
2
Ez∼X Eb∼D U:b V:b⊤ − Ik Σ̃:k,:k z:k

min , (8)

U ∈Rk×r k×r
,V ∈R

where Ik is k × k identity. If X is centred around zero, then Σ̃:k,:k z:k is also centred around zero,
and the above problem is up to scaling equivalent to PCA of Σ̃:k,:k z:k as shown by Rippel et al. [42].
Since Σ̃ is a diagonal matrix with only k × k non-zero upper left sub-matrix, therefore, PCA on
Σ̃:k,:k z:k is equivalent to PCA on Σ̃z by appending zeros to the obtained principal component vectors.
Thus, we can write an equivalent formulation
2
⊤ ⊤

min E E U V − I Σ̃ Ṽ x .

z∼X b∼D :b :b
U ∈R m×r,V ∈Rn×r

Furthermore, let Ū , V̄ belong to the set of optimal solutions of problem (7). Then U ⋆ = Ũ ⊤ Ū , V ⋆ =
Ṽ ⊤ V̄ belong to the set of optimal solutions of problem (6). This can be proved by reversing our
construction and ignoring scaling since (7) is scaling invariant.
For the case X is a uniform distribution on the unit ball, we have Σ̃:k,:k z:k is a k-dimensional ellipsoid
k
with principal axes being standard basis vectors {ei }i=1 , where the length of the axes is given by
ordered singular values, i.e., the first basis vector corresponds to the largest singular vector. Therefore,
its principal component vectors correspond to the basis vectors. Following our construction, one can
see that the solution to the original problems leads to truncated SVD up to the scaling factor.
For the case A is an identity, we have k = r = m = m, Σ̃ is an identity, and Ũ = Ṽ . Under this
setting, the principal component vectors obtained from (8) corresponds to principal component vectors
of X in basis given by columns of Ũ . Similarly, as in the previous case, reversing the transformations
to return back to the original problem, we conclude that the optimal solution of the original problem
corresponds to principal component vectors of X since we reverse the transformation by Ũ ⊤ .

15
E Experimental setup

E.1 Datasets

MNIST. The MNIST dataset [29] is a database of 28×28 greyscale handwritten digits, with a training
set of 60k examples and a test set of 10k samples.
CIFAR-10. The CIFAR10 dataset [25] is a computer vision dataset that consists of 32×32 RGB
images classified into 10 labels. It is split into 50k training images and 10k test images which are
balanced across labels.
WMT16. The WMT dataset from statmt is machine translation dataset, spanning news commentaries
and parliament proceedings, that aims to investigate the applicability of machine translation techniques
when translating between language pairs. Specifically, we focus on the task of German-English
language translation of image descriptions, commonly referred to as Multi30k [10]. We only utilise
the text modality for the translation task. Data is taken straight from torchtext.
TinyImagenet. The TinyImagenet dataset [28] is a image classification challenge similar to
ILSVRC [7]. The task it to classify an 64×64 RGB image among 200 classes, with each class
having 500 training samples. The test set contains 10,000 images.

E.2 Models

LeNet. LeNet is a simple convolutional network, introduced by LeCun at al. for recognizing
handwritten digits [29]. It consists of a sequence of two convolutional layers, followed by three
fully-connected layers. However, we are using a ReLU instead of the initially proposed sigmoid
activation. The detailed architecture of the network is depicted in Tab. 5
ResNet. ResNet [14] is a deep neural network whose prominent feature is the existence of skip (or
residual) connections, that is connections that perform identity mappings merged with the target layer
it joins with through summation. Multiple residual blocks are stacked to form the network. The result
is an easier to optimise network that offers enhanced accuracy. We use ResNet-18 in our experiments,
the architecture of which is depicted in Tab. 6, except for TinyImageNet, where we use ResNet-50.

Table 6: The hybrid ResNet architecture for the CIFAR-

10 and TinyImageNet datasets used in the experiments.
Layer Name ResNet-18 ResNet-50
Table 5: Detailed architecture of the LeNet-5 archi- conv1 3×3, 64, stride 1, padding 1 7×7, 64, stride 2, padding 1
tecture used in our experiments. Each convolution 3×3 maxpool, stride 2
and linear layer is followed by a ReLU activation conv2_x

3×3, 64
"
1×1, 64
#
×2 3×3, 64 ×3
that is ommitted from the table. The shapes for 3×3, 64
1×1, 256
convolution layers follows (m, n, k, k). "
1×1, 128
#
3×3, 128
Parameter Shape Layer hyper-parameter conv3_x
3×3, 128 ×2 3×3, 128 ×4
1×1, 512
layer1.conv1.weight 1×6×5×5 stride:1;padding:1 " #
1×1, 256
pooling.max N/A kernel size:2;stride:1;dilation:1 3×3, 256
conv4_x
3×3, 256 ×2 3×3, 256 ×6
layer2.conv2.weight 6 × 16 × 5 × 5 stride:1;padding:0;dilation:1 1×1, 1024
pooling.max N/A kernel size:2;stride:2 "
1×1, 512
#
3×3, 512
layer3.fc1.weight 256 × 120 N/A conv5_x
3×3, 512 ×2 3×3, 512 ×3
layer4.fc2.weight 120 × 84 N/A 1×1, 2048
layer5.fc3.weight 84 × 10 N/A Avg Pool, 10-dim FC, SoftMax Avg Pool, 20-dim FC, SoftMax

VGG. VGG [47] is a also a convolutional network that leverages smaller 3×3 convolutions that
enables deeper architecture than before. For our experiments we are using VGG-19, the architecture
of which is depicted in Tab. 7.
Transformers. The transformer architecture [51] has been lately revolutionising deep learning.
Based on the notion of self-attention, for each input token, it produces a weighted combination of
other relevant tokens weighed by the attention weight. Each attention unit has three weight matrices,
namely WQ , WK , WV , for query, key and value weights respectively producing the equivalent
vectors. Attention is defined as the scaled dot product between key and query. For our translation
task, we use the architecture depicted in Tab. 9.

16
Table 7: Detailed architecture of the VGG-19 architecture used in our experiments. There is a BatchNorm layer
followed by a ReLU activation (omitted in the table) after each convolution layer. The shapes for convolution
layers follows (m, n, k, k).
Parameter Shape Layer hyper-parameter

layer1.conv1.weight 3 × 64 × 3 × 3 stride:1;padding:1
layer2.conv2.weight 64 × 64 × 3 × 3 stride:1;padding:1
pooling.max N/A kernel size:2;stride:2
layer3.conv3.weight 64 × 128 × 3 × 3 stride:1;padding:1
layer4.conv4.weight 128 × 128 × 3 × 3 stride:1;padding:1
pooling.max N/A kernel size:2;stride:2
layer5.conv5.weight 128 × 256 × 3 × 3 stride:1;padding:1
layer6.conv6.weight 256 × 256 × 3 × 3 stride:1;padding:1
layer7.conv7.weight 256 × 256 × 3 × 3 stride:1;padding:1
layer8.conv8.weight 256 × 256 × 3 × 3 stride:1;padding:1
pooling.max N/A kernel size:2;stride:2
layer9.conv9.weight 256 × 512 × 3 × 3 stride:1;padding:1
layer10.conv10.weight 512 × 512 × 3 × 3 stride:1;padding:1
layer11.conv11.weight 512 × 512 × 3 × 3 stride:1;padding:1
layer12.conv12.weight 512 × 512 × 3 × 3 stride:1;padding:1
pooling.max N/A kernel size:2;stride:2
layer13.conv13.weight 512 × 512 × 3 × 3 stride:1;padding:1
layer14.conv14.weight 512 × 512 × 3 × 3 stride:1;padding:1
layer15.conv15.weight 512 × 512 × 3 × 3 stride:1;padding:1
layer16.conv16.weight 512 × 512 × 3 × 3 stride:1;padding:1
pooling.avg N/A kernel size:2
classifier.weight 512 × 10 N/A
classifier.bias 10 N/A

Table 9: Detailed information of the decoder layer in

the Transformer architecture in our experiment
Parameter Shape Hyper-param.

embedding 9521 × 512 padding index: 1

Table 8: Detailed information of the encoder layer in positional encoding N/A N/A
the Transformer architecture in our experiment dropout N/A p = 0.1
decoder.self-attention.wq(W Q ) 512 × 512 N/A
Parameter Shape Hyper-param. decoder.self-attention.wk(W K ) 512 × 512 N/A

embedding 9521 × 512 padding index: 1

decoder.self-attention.wv(W V ) 512 × 512 N/A
decoder.self-attention.wo(W O ) 512 × 512 N/A
positional encoding N/A N/A
decoder.self-attention.dropout N/A p = 0.1
dropout N/A p = 0.1
Q decoder.self-attention.layernorm 512 ε = 10−6
encoder.self-attention.wq(W ) 512 × 512 N/A
decoder.enc-attention.wq(W Q ) 512 × 512 N/A
encoder.self-attention.wk(W K ) 512 × 512 N/A
decoder.enc-attention.wk(W K ) 512 × 512 N/A
encoder.self-attention.wv(W V ) 512 × 512 N/A
decoder.enc-attention.wv(W V ) 512 × 512 N/A
encoder.self-attention.wo(W O ) 512 × 512 N/A
decoder.enc-attention.wo(W O ) 512 × 512 N/A
encoder.self-attention.dropout N/A p = 0.1
decoder.enc-attention.dropout N/A p = 0.1
encoder.self-attention.layernorm 512 ε = 10−6 decoder.enc-attention.layernorm 512 ε = 10−6
encoder.ffn.layer1 512 × 2048 N/A decoder.ffn.layer1 512 × 2048 N/A
encoder.ffn.layer2 2048 × 512 N/A decoder.ffn.layer2 2048 × 512 N/A
encoder.layernorm 512 ε = 10−6 encoder.layernorm 512 ε = 10−6
dropout N/A p = 0.1 dropout N/A p = 0.1

E.3 Hyperparameter selection

LeNet. We use a standard configuration that is commonly employed for training LeNet models — a
step size of 0.01, a momentum of 0.9, and no weight decay. We train for a total of 20 epochs.

17
VGG and ResNet-18. Similarly, we use a standard configuration that is commonly employed for
training VGG and ResNet-18 models — a step size of 0.01, a momentum of 0.9, weight decay of
1e−4 , and a learning schedule with step size reductions by a factor of 10 at epochs 150 and 250. We
train for a total of 300 epochs.
ResNet-50. Similarly, we use a standard configuration that is commonly employed for training
ResNet-50 models — a step size of 0.01, a momentum of 0.9, weight decay of 1e−4 , and a learning
schedule with step size reductions by a factor of 10 at epochs 30 and 60. We train for a total of 90
epochs.
Transformers. For the Transformer model, we use the Adam optimizer with an initial learning rate
at 0.001, βs = (0.9, 0.98), ε = 10−8 batch size at 256. We also conduct gradient norm clipping with
norm bound at 0.25. The entire training takes 400 epochs. For the vanilla warm-up training, we use
warm-up epoch Ewu = 10. We enable label smoothing, weight sharing for the source and target
word embedding, and weight sharing between target word embedding and the last dense layer. The
learning rate schedule follows directly from the one proposed [51].

E.4 Deciding against decomposition

During inference, if the rank of a given layer is so large that keeping it as a non-decomposed layer is
more efficient, we opt not to decompose that particular layer.

F Extended evaluation
F.1 M AESTRO recovers correct ordering

In the main text, we pointed out that SVD fails to consider data. Indeed, even in the case of linear NN,
the acquired singular vectors may exhibit incorrect ordering. To illustrate this problem, we provide a
simple example in which we use a matrix A with a rank of 3. We organize the dataset X such that the
third singular vector has the highest importance, followed by the second and then the first singular
vector in decreasing order of significance. It is clear that SVD doesn’t consider the data, and as a
result, it cannot comprehend this behavior. Below (in Fig. 5), we demonstrate how M AESTRO is able
to correctly discern the order.

k = ... (p = ...)
Singular Values

6 1 (0.2)
2 (0.3)
3 (0.5)
4 (0.7)
4 5 (0.8)
6 (1.0)

0
0 1000 2000 3000 4000 5000 6000 7000
Iterations

Figure 5: Verification that M AESTRO recovers correct order of importance. Target mapping is of rank 3, and the
dataset is constructed in such a way that the singular vectors have reversed the order of importance. p and k
stand for relative and actual rank, respectively.

F.2 Training behaviour of M AESTRO

For completeness, we also include an extended version of Fig. 3 from the main paper, where we
presented the training dynamics for M AESTRO. Fig.6, 7 and 8 present similar plots, but across both
MNIST and CIFAR-10. Specifically, Fig. 6 illustrates the evolution of total rank throughout the
training steps. We observe that the ranks are pruned incrementally. This aligns with the observations
made during Pufferfish [53] training, where the authors suggested warm-start training with full

18
precision to enhance the final model performance. In our case, the necessity to implement this
heuristic is avoided, as M AESTRO prunes rank automatically. Fig. 7 demonstrates the ranks across
layers post-training. An intriguing trend is observed: the ranks are nested for increasing λgl ,
suggesting a potential inherent ordering of ranks not only within each layer but also possibly a global
one. We provide a preliminary exploration of this fascinating pattern in the subsequent section and
intend to probe it more deeply in future studies. We believe this may enhance the rank selection
and sampling process. Finally, Fig. 8 portrays the evolution of the training loss. Our premise that
sampling does not negatively affect training is validated by empirical performance.

Maestro: = 4e-05
Maestro: = 8e-05 4000 5000
200 Maestro: = 0.00016
Maestro: = 0.00032 Maestro: = 4e-06
Maestro: = 0.00064 Maestro: = 8e-06 4000 Maestro: = 4e-06
Maestro: = 0.00128 3000 Maestro: = 1.6e-05 Maestro: = 8e-06
150
Total Rank

Total Rank

Total Rank
Maestro: = 0.00256 Maestro: = 3.2e-05 Maestro: = 1.6e-05
Maestro: = 6.4e-05 3000 Maestro: = 3.2e-05
2000 Maestro: = 0.000128 Maestro: = 6.4e-05
Maestro: = 0.000128
100 Maestro:
Maestro:
= 0.000256
= 0.000512
2000 Maestro: = 0.000256
Maestro: = 0.001024 Maestro: = 0.000512
1000 1000
50
0 0
2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 0 50 100 150 200 250 300 0 50 100 150 200 250 300
Epochs Epochs Epochs

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Pd
Figure 6: Total rank ( i=1 ri ) progression during training.

120 Full Rank 500 Full Rank 500

Maestro: = 4e-05 Maestro: = 4e-06
100 Maestro: = 8e-05 Maestro: = 8e-06
Maestro: = 0.00016 400 Maestro: = 1.6e-05 400 Full Rank
Maestro: = 0.00032 Maestro: = 3.2e-05 Maestro: = 4e-06
80 Maestro: = 0.00064 Maestro: = 6.4e-05 Maestro: = 8e-06
Maestro: = 0.00128 300 Maestro: = 0.000128 300 Maestro: = 1.6e-05
Rank

Rank

Rank
60 Maestro: = 0.00256 Maestro: = 0.000256 Maestro: = 3.2e-05
Maestro: = 0.000512 Maestro: = 6.4e-05
200 Maestro: = 0.001024 200 Maestro: = 0.000128
40 Maestro: = 0.000256
Maestro: = 0.000512
20 100 100

0 0 0
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 2 4 6 8 10 12 14 16
Layer Index Layer Index Layer Index

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Figure 7: Ranks ri ’s across different layers after training.

100
100

10 1
Train Loss

Train Loss

10 1
10 2

10 1
10 3
10 2

2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 0 50 100 150 200 250 300 0 50 100 150 200 250 300
Epochs Epochs Epochs

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Figure 8: Convergence of M AESTRO with λgl = 0.

F.3 Model size-accuracy trade-off at training and deployment time

In addition to the original illustrations, we present an extended interpretation of Fig. 4, where we

depict diverse strategies to maintain a balance between model size and accuracy in the process
of model training and deployment. In Fig. 9, we demonstrate the effective pruning of M AESTRO
(λgl = 0) for deployment, utilizing the greedy search methodology discussed in Section 3.4. This is
juxtaposed with the greedy pruning of a model not originally factorized but later factorized through
SVD. Our findings reveal that this straightforward baseline does not match the performance of
M AESTRO’s learned decomposition, leading to a considerable performance drop.

19
100 94 Maestro
95 SVD
90
90 92
85
Test Accuracy
85

Test Accuracy

Test Accuracy
90
80 80
88
75 75
70 Maestro
86
70
65 SVD
84 Maestro
65 SVD
00

0
0
0
0
0
0
0
0
00
00
00
00
00
00
00
00
0.5 0.6 0.7 0.8 0.9 1.0 1.1 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0
50
10
15
20
25
30
35
40
45
Number of Parameters Number of Parameters 1e7 Number of Parameters 1e7

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Figure 9: Accuracy-latency trade-off comparing M AESTRO (λgl=0 ) and SVD.
Subsequently, Fig. 10 displays the end accuracy and the count of model parameters corresponding to
various hierarchical group lasso penalties. This results in an optimal compromise between latency
and accuracy for both the training and inference stages. It’s worth noting, though, that each model
was trained separately, in contrast to greedy pruning, which demands just a single training round.
Additionally, we scrutinize the training expense for each model illustrated in Fig. 10, the results of
which are exhibited in Tables 10, 11, 12, 13 and 14, where we display and the final accuracy of the
model, MACs and the number of parameters for inference, and relative total training cost in terms of
the number of model parameters and MACs compared to the non-factorized model. Interestingly,
smaller models are not only advantageous in terms of inference efficiency, but they can also be trained
at a small portion of the cost required by full-rank models. On the downside, the smallest models
cause a non-negligible reduction in performance.

99.0 Maestro (With Hierarchical Regularization)

94 93
98.5
93
Test Accuracy

92
Test Accuracy

Test Accuracy
98.0
92 91
97.5
91 90
97.0
96.5 90 89
Maestro (With Hierarchical Regularization) Maestro (With Hierarchical Regularization)
00

0
0
0
0
0
0
0
0
00
00
00
00
00
00
00
00

0.0 0.2 0.4 0.6 0.8 1.0 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75
50
10
15
20
25
30
35
40
45

Number of Parameters Number of Parameters 1e7 Number of Parameters 1e7

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Figure 10: Impact of hierarchical group lasso on the accuracy-memory trade-off. Exact values are provided in
Tables 10, 11 and 12, respectively.

Table 10: LeNet performance on MNIST for different regularization parameters. The last column in the table
displays the relative total training cost in terms of the number of Multiply-Accumulate operations (MACs) and
model parameters, compared to the non-factorized model.
Variant Acc. (%) MACs (Inf.) Params. (Inf.) Rel. MACs / Params. (Train.)
Non-Factorized 98.99±0.09 281640±0 (1.00×) 44426±0 (1.00×) 1.00× / 1.00×
M AESTRO (λgp = 0.) 99.06±0.09 281640±0 (1.00×) 44426±0(1.00×) 1.14×/ 1.49×
M AESTRO (λgp = 8e−5 ) 98.91±0.09 268577±389 (0.95×) 31363±0 (0.71×) 1.08×/ 1.14×
M AESTRO (λgp = 16e−5 ) 98.92±0.05 255369±217 (0.91×) 44426±217 (0.41×) 1.06×/ 0.80×
M AESTRO (λgp = 32e−5 ) 98.31±0.39 237084±6268 (0.84×) 18155±271 (0.26×) 0.93×/ 0.53×
M AESTRO (λgp = 64e−5 ) 98.20±0.49 178165±19098 (0.63×) 7996±662 (0.18×) 0.77×/ 0.33×
M AESTRO (λgp = 128e−5 ) 97.92±0.22 131789±8965 (0.47×) 6375±77 (0.14×) 0.54×/ 0.21×
M AESTRO (λgp = 256e−5 ) 96.65±0.14 99969±6252 (0.35×) 5293±214 (0.12×) 0.39×/ 0.14×

Lastly, we delve deeper into the observation of nested ranks with increasing λgl . Fig. 11 outlines
the performance of M AESTRO (λgl = 0) across various ranks chosen by smaller models M AESTRO
(λgl > 0). We observe that M AESTRO (λgl = 0) delivers impressive results—for example, we can
reduce its parameters by 10x for VGG while preserving an accuracy of 87.7% without any fine-tuning
simply by leveraging rank structure from separate runs. For LeNet, a reduction in model size by a
factor of three is achievable without sacrificing accuracy. Last, for ResNet-18 the reduction is 1.7×.
As highlighted earlier, we aim to delve deeper into this subject in future studies.

20
Table 11: ResNet-18 performance on CIFAR10 for different regularization parameters. The last column in the
table displays the relative total training cost in terms of the number of Multiply-Accumulate operations (MACs)
and model parameters, compared to the non-factorized model.
Variant Acc. (%) GMACs (Inf.) Params. (M) (Inf.) Rel. MACs / Params. (Train.)
Non-Factorized 93.86±0.20 0.56±0 (1.00×) 11.2±0 (1.00×) 1.00× / 1.00×
M AESTRO (λgp = 0.) 94.04±0.10 0.56±0 (1.00×) 11.2±0 (1.00×) 1.10× / 1.13×
M AESTRO (λgp = 4e−6 ) 94.22±0.16 0.55±0.0047 (1.00×) 11.1±0.030 (0.99×) 1.09× / 1.10×
M AESTRO (λgp = 8e−6 ) 94.09±0.01 0.49±0.0002 (0.89×) 7.41±0.004 (0.66×) 1.00× / 0.85×
M AESTRO (λgp = 16e−6 ) 94.19±0.07 0.39±0.0008 (0.70×) 4.08±0.020 (0.37×) 0.83× / 0.58×
M AESTRO (λgp = 32e−6 ) 93.97±0.25 0.25±0.0013 (0.45×) 2.19±0.007 (0.20×) 0.60× / 0.36×
M AESTRO (λgp = 64e−6 ) 93.86±0.11 0.15±0.0006 (0.27×) 1.23±0.004 (0.11×) 0.39× / 0.22×
M AESTRO (λgp = 128e−6 ) 93.37±0.07 0.094±0.0006 (0.17×) 0.79±0.009 (0.07×) 0.25× / 0.13×
M AESTRO (λgp = 256e−6 ) 92.48±0.04 0.064±0.0002 (0.12×) 0.54±0.006 (0.05×) 0.16× / 0.08×
M AESTRO (λgp = 512e−6 ) 91.14±0.16 0.044±0.0004 (0.08×) 0.37±0.007 (0.03×) 0.11× / 0.05×
M AESTRO (λgp = 1024e−6 ) 89.55±0.30 0.032±0.0002 (0.06×) 0.27±0.007 (0.02×) 0.07× / 0.03×

Table 12: VGG19 performance on CIFAR10 for different regularization parameters. The last column in the
table displays the relative total training cost in terms of the number of Multiply-Accumulate operations (MACs)
and model parameters, compared to the non-factorized model.
Variant Acc. (%) GMACs (Inf.) Params. (M) (Inf.) Rel. MACs / Params. (Train.)
Non-Factorized 92.94±0.17 0.40±0 (1.00×) 20±0 (1.00×) 1.00× / 1.00×
M AESTRO (λgp = 0.) 93.06±0.17 0.40±0 (1.00×) 20±0 (1.00×) 1.10× / 1.12×
M AESTRO (λgp = 4e−6 ) 93.33±0.08 0.39±0.0017 (0.97×) 18.8±0 (0.94×) 1.06× / 1.04×
M AESTRO (λgp = 8e−6 ) 93.27±0.33 0.30±0.0017 (0.76×) 9.91±0.008 (0.49×) 0.90× / 0.73×
M AESTRO (λgp = 16e−6 ) 93.13±0.07 0.21±0.0014 (0.53×) 4.66±0.052 (0.23×) 0.69× / 0.46×
M AESTRO (λgp = 32e−6 ) 93.10±0.10 0.13±0.0009 (0.33×) 2.20±0.025 (0.11×) 0.47× / 0.27×
M AESTRO (λgp = 64e−6 ) 92.70±0.34 0.08±0.0005 (0.20×) 1.17±0.010 (0.06×) 0.30× / 0.16×
M AESTRO (λgp = 128e−6 ) 92.34±0.12 0.05±0.0005 (0.13×) 0.72±0.002 (0.04×) 0.19× / 0.09×
M AESTRO (λgp = 256e−6 ) 91.12±0.19 0.04±0.0007 (0.09×) 0.50±0.023 (0.02×) 0.12× / 0.05×
M AESTRO (λgp = 512e−6 ) 88.53±0.13 0.03±0.0003 (0.06×) 0.35±0.003 (0.02×) 0.08× / 0.03×

Table 13: Transformer performance on Multi30k for different regularization parameters. The last column in the
table displays the relative total training cost in terms of the number of Multiply-Accumulate operations (MACs)
and model parameters, compared to the non-factorized model.
Variant Acc. (%) Ppl. GMACs (Inf.) Params. (M) (Inf.) Rel. MACs / Params. (Train.)
Non-Factorized 65.33±1.13 9.85±0.10 1.370±0.0000 (1.00×) 53.9±0.000 (1.00×) 1.00× / 1.00×
M AESTRO (λgp = 0.32) 61.30±0.26 12.99±0.31 1.125±0.0030 (0.82×) 45.1±0.101 (0.84×) 1.03× / 1.14×
M AESTRO (λgp = 0.64) 63.78±0.14 9.37±0.32 0.957±0.0112 (0.70×) 39.1±0.413 (0.73×) 0.95× / 1.05×
M AESTRO (λgp = 1.28) 66.14±0.08 7.02±0.17 0.570±0.0088 (0.42×) 25.3±0.315 (0.47×) 0.75× / 0.86×
M AESTRO (λgp = 2.56) 66.08±0.09 6.90±0.07 0.248±0.0032 (0.18×) 13.8±0.113 (0.26×) 0.47× / 0.58×
M AESTRO (λgp = 5.12) 57.70±0.13 13.97±0.43 0.123±0.0002 (0.9×) 9.3±0.001 (0.17×) 0.28× / 0.39×

Table 14: ResNet50 performance on Tiny-Imagenet-200 for different regularization parameters. The last column
in the table displays the relative total training cost in terms of the number of Multiply-Accumulate operations
(MACs) and model parameters, compared to the non-factorized model.
Variant Acc. (%) GMACs (Inf.) Params. (M) (Inf.) Rel. MACs / Params. (Train.)
Non-Factorized 61.74±0.27 5.19±0.0000 (1.00×) 23.9±0.000 (1.00×) 1.22× / 1.22×
M AESTRO (λgp = 0.) 61.05±0.09 5.19±0.0000 (1.00×) 23.9±0.000 (1.00×) 1.21× / 1.20×
M AESTRO (λgp = 4e−5 ) 60.13±0.34 4.72±0.0013 (0.91×) 18.8±0.017 (0.79×) 0.81× / 0.69×
M AESTRO (λgp = 8e−5 ) 59.20±0.40 3.01±0.0064 (0.58×) 9.64±0.023 (0.40×) 0.00× / 0.00×
M AESTRO (λgp = 16e−5 ) 58.35±0.40 1.49±0.0142 (0.29×) 4.48±0.022 (0.19×) 0.61× / 0.54×
M AESTRO (λgp = 32e−5 ) 56.52±0.08 0.72±0.0022 (0.14×) 2.25±0.013 (0.09×) 0.51× / 0.47×

21
100
90
95
80 80
90
85 70
Test Accuracy

Test Accuracy
Test Accuracy

60 60
80
75 50
40
70 40
65 30
Pruned Maestro Model 20 Pruned Maestro Model Pruned Maestro Model
60 Maestro models with varying Maestro models with varying 20 Maestro models with varying
5000 10000 15000 20000 25000 30000 0 1 2 3 4 5 6 7 0.0 0.2 0.4 0.6 0.8 1.0
Number of Parameters Number of Parameters 1e6 Number of Parameters 1e7

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

Figure 11: M AESTRO with progressive pruning to showcase nested rank importance structure. The original
model corresponds to an evaluation in Fig. 10, and pruned models are based on M AESTRO with λgl = 0, and
they are pruned using the same ranks as selected by M AESTRO with λgl > 0.

BDO ATM-Debit-Card-Form
0% (1)
BDO ATM-Debit-Card-Form
1 page
Literature Review Smart Street Light
75% (16)
Literature Review Smart Street Light
2 pages
Exploring Low Rank Training of Deep Neural Networks
No ratings yet
Exploring Low Rank Training of Deep Neural Networks
7 pages
Deep Learning Unit2
No ratings yet
Deep Learning Unit2
16 pages
DeepLearning 4 and 5
No ratings yet
DeepLearning 4 and 5
60 pages
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
No ratings yet
Riemannian Low-Rank Model Compression for Federated Learning With Over-The-Air Aggregation
16 pages
Deep-Learning-Module-2-Important-Topics-PYQs
No ratings yet
Deep-Learning-Module-2-Important-Topics-PYQs
30 pages
DL Test-2
No ratings yet
DL Test-2
28 pages
UNIT 5
No ratings yet
UNIT 5
36 pages
1 - A Day in The Life of ChatGPT As A Researcher
No ratings yet
1 - A Day in The Life of ChatGPT As A Researcher
20 pages
MAY DEEP LEARNING
No ratings yet
MAY DEEP LEARNING
16 pages
DL Class3
No ratings yet
DL Class3
28 pages
Chen, Deng et al 2021 - Effective and Efficient Batch Normalization
No ratings yet
Chen, Deng et al 2021 - Effective and Efficient Batch Normalization
15 pages
EDP An Efficient Decomposition and Pruning Scheme
No ratings yet
EDP An Efficient Decomposition and Pruning Scheme
15 pages
Regularization in Machine Learning
No ratings yet
Regularization in Machine Learning
17 pages
DL UNIT 3
No ratings yet
DL UNIT 3
14 pages
konigstein2_v-8_ScrambledContent_chapter-9
No ratings yet
konigstein2_v-8_ScrambledContent_chapter-9
10 pages
Compressing Neural Networks Using The Variational Information Bottleneck
No ratings yet
Compressing Neural Networks Using The Variational Information Bottleneck
27 pages
Unit – IV
No ratings yet
Unit – IV
24 pages
Unit Ii
No ratings yet
Unit Ii
8 pages
1810 01075 PDF
No ratings yet
1810 01075 PDF
59 pages
Deep Learning Module-03
No ratings yet
Deep Learning Module-03
20 pages
UQ Review
No ratings yet
UQ Review
129 pages
Unit-2 Improving-Deep-Neural-Networks
No ratings yet
Unit-2 Improving-Deep-Neural-Networks
18 pages
U O D L J M L C: Nderstanding Ptimization of EEP Earning Via Acobian Atrix and Ipschitz Onstant
No ratings yet
U O D L J M L C: Nderstanding Ptimization of EEP Earning Via Acobian Atrix and Ipschitz Onstant
48 pages
Lecture 6
No ratings yet
Lecture 6
41 pages
UNIT3
No ratings yet
UNIT3
17 pages
ML System Optimization - Lecture 10 - Model Optimization Techniques
No ratings yet
ML System Optimization - Lecture 10 - Model Optimization Techniques
33 pages
UNIT V NNHDL
No ratings yet
UNIT V NNHDL
33 pages
Pure Optimization
No ratings yet
Pure Optimization
23 pages
Jntuk r20 Unit-V Deep Learning Techniques (WWW - Jntumaterials.co - In)
No ratings yet
Jntuk r20 Unit-V Deep Learning Techniques (WWW - Jntumaterials.co - In)
61 pages
5 Regularization
No ratings yet
5 Regularization
79 pages
The Modern Mathematics of Deep Learning
No ratings yet
The Modern Mathematics of Deep Learning
78 pages
Weight Dropout for Preventing Neural Networks From Overfitting
No ratings yet
Weight Dropout for Preventing Neural Networks From Overfitting
4 pages
3
No ratings yet
3
11 pages
Deep Learning Module-03 Search Creators
No ratings yet
Deep Learning Module-03 Search Creators
20 pages
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
No ratings yet
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
3 pages
Lecture 5-6
No ratings yet
Lecture 5-6
45 pages
Sciadv Abi8605
No ratings yet
Sciadv Abi8605
10 pages
Lecture_2
No ratings yet
Lecture_2
31 pages
DL mod 2
No ratings yet
DL mod 2
4 pages
A Quick Guide On Basic Regularization Methods For Neural Networks - by Jaime Durán - Yottabytes - Medium
No ratings yet
A Quick Guide On Basic Regularization Methods For Neural Networks - by Jaime Durán - Yottabytes - Medium
9 pages
UNIT-5 part1
No ratings yet
UNIT-5 part1
15 pages
Op Tim Ization
No ratings yet
Op Tim Ization
22 pages
DL (1-10)
No ratings yet
DL (1-10)
10 pages
ModelCompressionTechniquesinDeepLearning
No ratings yet
ModelCompressionTechniquesinDeepLearning
23 pages
Module 3-DL
No ratings yet
Module 3-DL
12 pages
Machine Learning (ML) :: Aim: Analysis and Implementation of Deep Neural Network. Definitions
No ratings yet
Machine Learning (ML) :: Aim: Analysis and Implementation of Deep Neural Network. Definitions
6 pages
Applsci 12 11184
No ratings yet
Applsci 12 11184
18 pages
cst414-deep learning module 2
No ratings yet
cst414-deep learning module 2
13 pages
Face Recognition Based On Deep Autoencoder Networks With Dropout
No ratings yet
Face Recognition Based On Deep Autoencoder Networks With Dropout
4 pages
MODULE 2 Deep Learning
No ratings yet
MODULE 2 Deep Learning
26 pages
Deep Learning Meets Sparse Regularization: A Signal Processing Perspective
No ratings yet
Deep Learning Meets Sparse Regularization: A Signal Processing Perspective
23 pages
A Survey of Model Compression and Acceleration For Deep Neural Networks
No ratings yet
A Survey of Model Compression and Acceleration For Deep Neural Networks
10 pages
Nndl Notes
No ratings yet
Nndl Notes
73 pages
Secrets of Deep Learning 1716536527
No ratings yet
Secrets of Deep Learning 1716536527
12 pages
Nverse Deep Learning Methods and Benchmarks For Artificial Electromagnetic Material Design
No ratings yet
Nverse Deep Learning Methods and Benchmarks For Artificial Electromagnetic Material Design
22 pages
A Review On Basic Deep Learning
No ratings yet
A Review On Basic Deep Learning
9 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Regularization and Normalization
No ratings yet
Regularization and Normalization
29 pages
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
No ratings yet
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
12 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Study Material - Consumer Behaviour
No ratings yet
Study Material - Consumer Behaviour
19 pages
Full download QoS and Energy Management in Cognitive Radio Network: Case Study Approach 1st Edition Vishram Mishra pdf docx
100% (3)
Full download QoS and Energy Management in Cognitive Radio Network: Case Study Approach 1st Edition Vishram Mishra pdf docx
65 pages
Mwos Saturday Fixtures-2
No ratings yet
Mwos Saturday Fixtures-2
14 pages
2017 Annual Report Updated
No ratings yet
2017 Annual Report Updated
263 pages
Self Curing Concrete
No ratings yet
Self Curing Concrete
10 pages
Ent Table Viva
No ratings yet
Ent Table Viva
7 pages
GBPJPY PSYCHE
No ratings yet
GBPJPY PSYCHE
14 pages
A View From The Top Script
No ratings yet
A View From The Top Script
118 pages
Highlighted Unit 2 Notes
No ratings yet
Highlighted Unit 2 Notes
3 pages
2020 ESAH Complete v3.1
No ratings yet
2020 ESAH Complete v3.1
130 pages
Kinh doanh quốc tế task 3
No ratings yet
Kinh doanh quốc tế task 3
19 pages
Locke’s Goal-Setting Theory of Motivation
No ratings yet
Locke’s Goal-Setting Theory of Motivation
20 pages
Adams Fire Stop PO
No ratings yet
Adams Fire Stop PO
3 pages
Project Name: Period Reported On: Action Items: Checkpoint Report
No ratings yet
Project Name: Period Reported On: Action Items: Checkpoint Report
1 page
PSS-R1 - Sworn Statement
No ratings yet
PSS-R1 - Sworn Statement
2 pages
Happiness Excellence and Optimal Human Functioning Revisited Examining The Peer Reviewed Literature Linked To Positive Psychology - 2015
No ratings yet
Happiness Excellence and Optimal Human Functioning Revisited Examining The Peer Reviewed Literature Linked To Positive Psychology - 2015
12 pages
Lm1 Examination Guide for Exams From 1 January 2024 Until 31 December 2024
No ratings yet
Lm1 Examination Guide for Exams From 1 January 2024 Until 31 December 2024
19 pages
Gujarat Technological University
No ratings yet
Gujarat Technological University
10 pages
How To Setup L3 Roaming Based Client IP Subnet
No ratings yet
How To Setup L3 Roaming Based Client IP Subnet
9 pages
Anexa SAF-T - Structura - D
No ratings yet
Anexa SAF-T - Structura - D
134 pages
Exercise 1: Reactors: 1.1 Reactor Selection
No ratings yet
Exercise 1: Reactors: 1.1 Reactor Selection
4 pages
expense
No ratings yet
expense
7 pages
Form B
No ratings yet
Form B
3 pages
Fa2e Law of Crimes 1
No ratings yet
Fa2e Law of Crimes 1
3 pages
ECON2010 Topic 4 Worksheets
No ratings yet
ECON2010 Topic 4 Worksheets
9 pages
MediMatch Project Report
No ratings yet
MediMatch Project Report
23 pages
Code Division Multiple Access CDMA System in Multipath Environment
No ratings yet
Code Division Multiple Access CDMA System in Multipath Environment
6 pages
Clinical Pharmacology 11th Edition Morris J Brown Pankaj Sharma Peter N Bennett instant download
No ratings yet
Clinical Pharmacology 11th Edition Morris J Brown Pankaj Sharma Peter N Bennett instant download
90 pages

Maestro: Uncovering Low-Rank Structures Via Trainable Decomposition

Uploaded by

Maestro: Uncovering Low-Rank Structures Via Trainable Decomposition

Uploaded by

Maestro: Uncovering Low-Rank Structures via

Samuel Horvath Stefanos Laskaridis

Shashank Rajput Hongyi Wang

Preprint. Under review.

Our contributions can be summarized as follows:

3.2 Layer factorization

3.3 Training techniques

Algorithm 1: M AESTRO (Training Process)

+λgl di=1 rb=1

5.1 Experimental setup

0.56 11.18 Image classification

5.2 Performance comparison

Figure 3: Training dynamics of M AESTRO for ResNet18 on CIFAR10.

6 Conclusion and future work

D Theoretical Properties of Low-Rank Layers 14

As for the previous case, we can reparametrise the problem to obtain

This can be further simplified using reparametrization V ⊤ → V ⊤ Σ̃−1

Table 6: The hybrid ResNet architecture for the CIFAR-

Table 9: Detailed information of the decoder layer in

embedding 9521 × 512 padding index: 1

embedding 9521 × 512 padding index: 1

E.3 Hyperparameter selection

E.4 Deciding against decomposition

F.2 Training behaviour of M AESTRO

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

120 Full Rank 500 Full Rank 500

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

F.3 Model size-accuracy trade-off at training and deployment time

In addition to the original illustrations, we present an extended interpretation of Fig. 4, where we

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

99.0 Maestro (With Hierarchical Regularization)

Number of Parameters Number of Parameters 1e7 Number of Parameters 1e7

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

(a) LeNet on MNIST (b) ResNet-18 on CIFAR10 (c) VGG19 on CIFAR10

You might also like