Communication-Efficient LLM Training for Federated Learning
Arian Raje
CMU-CS-24-123
May 2024
Thesis Committee:
Virginia Smith, Chair
Zhihao Jia
Gauri Joshi
1 Introduction 1
2 Background 3
2.1 Federated Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Large Language Model Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Sparsity and Model Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Related Works 11
3.1 Communication Efficiency in Federated Learning . . . . . . . . . . . . . . . . . 11
3.2 Fine-Tuning in Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Pruning Adapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Methods 15
4.1 Federated LoRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Sparsity for LoRA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.3 FLoSS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5 Results 19
5.1 Model Performance with Fixed Communication . . . . . . . . . . . . . . . . . . 19
5.2 Robustness to Statistical Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . 20
5.3 Improvements to Communication Efficiency . . . . . . . . . . . . . . . . . . . . 21
5.4 Privacy-Preserving Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Bibliography 27
List of Figures
2.1 FL procedures involve 4 steps in each communication round. (1) Clients down-
load model weights from the central server. (2) Clients train the model on local
data. (3) Clients upload local model weights to the central server. (4) The central
server aggregates client updates into a new global model. . . . . . . . . . . . . . 4
2.2 LLM fine-tuning involves taking an open-source model trained on generic data
and adapting the weights on task-specific data. . . . . . . . . . . . . . . . . . . . 5
2.3 LoRA inserts small trainable matrices into the model architecture and freezes the
remaining model weights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Structured pruning sets entire structures to 0 while unstructured pruning sets
individual weights to 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Pruning at initialization prunes once before the start of model training while
iterative pruning prunes repeatedly throughout the training procedure. . . . . . . 9
4.1 FLoSS procedure involves (1) sparsifying adapters prior to download (2) train-
ing dense adapters (3) sparsifying adapters prior to upload (4) aggregating using
FedAdam. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
List of Tables
Chapter 1
Introduction
The increased ubiquity of edge devices, such as IoT devices and smartphones, has made federated
learning (FL) paradigms a feasible way to train machine learning (ML) models [35]. In tradi-
tional ML applications, a central server would first aggregate the data from disparate sources. A
model would then be trained at the server level using this aggregated data. While FL applications
similarly involve a central server and disparate data sources, which we refer to as “clients”, FL
offers an alternative to traditional training procedures. FL applications will instead first distribute
the model parameters from the central server to the clients (download phase). The clients then
independently train the model parameters they receive from the central server on their local data.
Once local training is complete, the clients send the model parameters back to the central server
(upload phase) where the central server aggregates the clients’ models into a new global model.
In total, the above steps constitute a single “communication round” for the training procedure.
This process is repeated for multiple communication rounds, with each round involving sampling
clients, training local models, and aggregating the models at the central server. In this training
scheme, data never leaves the clients’ devices and only the model parameters are communicated
between the central server and the clients. As a consequence, FL offers potential privacy bene-
fits over traditional ML [5]. FL training offers other benefits, including improved personalization
[45] and scalability [59]. FL-based training schemes are particularly useful in large networks of
similar devices and have already been productionized at scale [6, 24].
While FL continues to become more valuable in practice, practical concerns about its utility
remain. In particular, the use of Large Language Models (LLMs) has become standard for a
vast number of ML problems [47]. In many settings, the LLM being used may have billions
of trainable parameters. The scale of these LLMs presents serious compute and communication
bottlenecks for distributed training schemes. Since the advent of LLMs, a common strategy to
use LLMs has been the pretrain-then-fine-tune framework. In essence, LLMs like GPT [8] or
BERT [29], which have already been pretrained on a large corpus of data, can be fine-tuned on
a downstream task. A consequence of this framework is that the updates to the pretrained model
weights take on low-rank structures, eliminating the need for full fine-tuning of all the model
weights. In recent years, adapter methods for LLMs have been developed to incorporate these
ideas by injecting a small set of trainable parameters into each transformer block and freezing
the remaining parameters of the model [21, 23]. Therefore, when a pretrained LLM is being
fine-tuned on a downstream task, only a smaller portion of parameters must be trained.
Adapter methods offer a concrete way to reduce the compute and communication loads of LLM
training in a distributed or federated setting. However, since FL training can still require a large
number of communication rounds and clients, the communication costs of adapter methods in
FL can still be prohibitively high. While adapter methods reduce some of the communication
bottlenecks associated with distributed LLM training, in practice they can still lead to slow com-
munication while seeing more significant drops in model utility. They may additionally increase
storage and inference costs by increasing the number of total parameters included in the model
[51].
LLMs broadly have achieved state-of-the-art performance in multiple domains and in many ways
have become the standard approach for modeling with large amounts of data. Nonetheless, they
remain difficult to implement in FL. The fact that FL applications involve communication over
a wireless network means that coordinating large model updates from a large array of clients
is especially difficult and has been a primary concern in utilizing LLMs for FL. The goal of
this thesis is to introduce a method to perform communication-efficient LLM training for
FL. To this end, we introduce Federated LoRA with Simple Sparsity (FLoSS). We specifically
employ sparsity to low-rank adaptation (LoRA) during only the download and upload phases of
FL training in order to retain the model’s utility while restricting communication. We addition-
ally suggest heuristics to select a rank and download/upload sparsity ratios that yield accurate
model training under an arbitrary communication budget. We summarize our contributions as follows:
1. To the best of our knowledge, we are the first to apply unstructured sparsity to LoRA for
efficient federated fine-tuning. We focus on unstructured (weight-level) sparsity because it
has been shown to outperform structured (block-level) sparsity in centralized settings.
2. We propose FLoSS, a simple baseline that applies a constant top-k sparsity only to com-
munication. This method can reduce communication costs up to 10× while matching the
performance of dense LoRA on several FL image and text tasks.
3. We simulate an FL training procedure on a network with realistic download and upload
communication speeds. Given a communication budget, we recommend heuristics to accu-
rately select a LoRA rank and download/upload sparsity ratios that maximize the model’s
utility within that budget.
FLoSS aims to make LLM training more feasible in a federated setting. LLM training in
resource-constrained environments remains an open research problem and an important field
of study as LLMs grow in size. We hope to contribute to this growing field of work by proposing
methods to improve efficiency while retaining utility in real-world settings.
Chapter 2
Background
Fi describes the local objective for client i. In our setup, we treat Fi as the loss of the model
parameterized by w with respect to the local data on client device i -
F_i(w) = \frac{1}{m_i} \sum_{j=1}^{m_i} l_i\big(x^{(j)}, y^{(j)}; w\big)        (2.2)

Each client has m_i local examples and a local loss function l_i. Our global objective F is weighted
by parameters p_i, where each p_i ≥ 0 and \sum_{i=1}^{k} p_i = 1. While multiple weighting schemes for
clients exist, we define p_i = 1/k for all i ∈ [1 . . . k]. With this weighting, the global objective function
treats each client equally regardless of the distribution of local data. Therefore, an FL training
procedure aims to find parameters w that minimize the average of the loss across the k clients.
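For reference, the global FL objective F that this paragraph refers to (defined earlier in the full thesis, presumably as equation 2.1) has the standard weighted form:

F(w) = \sum_{i=1}^{k} p_i F_i(w), \qquad p_i \geq 0, \quad \sum_{i=1}^{k} p_i = 1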
We note that there exist certain FL settings that train multiple models (optimizing multiple wi ’s)
[4] or have different tasks for different clients (using different li ’s) [15]. Our focus is single-
task, single-model FL where each client uses the same loss function and optimizes a single set of
parameters w. The next section describes two common procedures used to achieve this objective,
namely FedAvg and FedAdam.
Figure 2.1: FL procedures involve 4 steps in each communication round. (1) Clients download
model weights from the central server. (2) Clients train the model on local data. (3) Clients
upload local model weights to the central server. (4) The central server aggregates client updates
into a new global model.
We present two algorithms that aim to find model parameters w that minimize the objective
F . These methods draw from historical works in distributed optimization. The first method is
Federated Averaging (FedAvg) [35]. Consider a total of k client devices. At communication
round t, n clients are sampled from the k total clients. These n clients each download a copy of
the global model with parameters wt from the central server and train the model for e epochs on
local client data. This results in local models w_t^{(1)}, w_t^{(2)}, \ldots, w_t^{(n)}. These models are uploaded
back to the central server. The central server then aggregates by averaging these models to define
the new global model. This update is formulated as follows:

w_{t+1} \leftarrow \frac{1}{n} \sum_{i=1}^{n} w_t^{(i)}        (2.3)
This process is repeated for a total of T communication rounds. In this straightforward method,
successive averaging of client updates aims to make the final global model w_T accurate for all
k clients despite sampling only a fraction of the clients in each communication round. In settings with
large k, it is possible for n ≪ k to still result in an accurate final model, as has been shown
empirically in many studies. Federated Adam (FedAdam) [40] proceeds similarly to FedAvg
with a slight adjustment to the aggregation mechanism. In FedAdam, once the clients have
performed local training to obtain w_t^{(1)}, w_t^{(2)}, \ldots, w_t^{(n)}, they each calculate ∆_t^{(i)} = w_t^{(i)} − w_t
and upload ∆_t^{(i)} back to the central server. The central server calculates the average of these
differences, ∆_t = \frac{1}{n} \sum_{i=1}^{n} ∆_t^{(i)}, and uses ∆_t as a “pseudo-gradient” for an Adam optimizer.
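To make the two aggregation rules concrete, below is a minimal NumPy sketch of FedAvg averaging (eq. 2.3) and the FedAdam pseudo-gradient step. It operates on flattened parameter vectors, omits Adam's bias correction, and uses illustrative hyperparameter values rather than the settings used in this thesis.

import numpy as np

def fedavg_aggregate(client_weights):
    """FedAvg: the new global model is the mean of the sampled clients' models (eq. 2.3)."""
    return np.mean(np.stack(client_weights), axis=0)

class FedAdamServer:
    """FedAdam: the averaged client delta is used as a pseudo-gradient for Adam.

    Bias correction is omitted and the hyperparameter values are illustrative,
    not the settings used in this thesis.
    """
    def __init__(self, global_w, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.w = np.asarray(global_w, dtype=float)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros_like(self.w)
        self.v = np.zeros_like(self.w)

    def step(self, client_deltas):
        # Average the uploaded deltas Delta_t^(i) = w_t^(i) - w_t into a pseudo-gradient.
        delta = np.mean(np.stack(client_deltas), axis=0)
        self.m = self.beta1 * self.m + (1 - self.beta1) * delta
        self.v = self.beta2 * self.v + (1 - self.beta2) * delta ** 2
        # Moving in the direction of the averaged delta plays the role of a gradient step.
        self.w = self.w + self.lr * self.m / (np.sqrt(self.v) + self.eps)
        return self.w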
Figure 2.2: LLM fine-tuning involves taking an open-source model trained on generic data and
adapting the weights on task-specific data.
Figure 2.3: LoRA inserts small trainable matrices into the model architecture and freezes the
remaining model weights.
While LLMs achieve state-of-the-art accuracy, training them presents serious logistical and computational challenges. To address
some of these challenges, recent literature has focused on a pretrain-then-fine-tune setup for
LLM training. The core idea is to use open-access LLMs that have been pretrained on a large
corpus of public data and then fine-tune the model weights on a domain-specific task [18]. Mod-
els like LLaMA [46], GPT [8], BERT [29], and ViT [13] have become standard open-source
models to use in this training paradigm for tasks including representation learning, chat, and
image classification. While this pretrain-then-fine-tune setup helps resolve some of the massive
data requirements for LLM training, it still suffers from the computational, memory, and storage
requirements for training all the weights of an LLM.
Adapter methods improve the computational efficiency of LLM training by reparameterizing the
updates to the model [18]. Instead of training all the weights of the LLM, adapter methods inject
a small set of trainable weights into the model architecture and freeze the original model weights
at their pretrained value. These methods are inspired by the idea that the change in weights from
a pretrained model to a fine-tuned model exists in a low-rank space. The most frequently used
adapter is Low-Rank Adaptation (LoRA) [23]. LoRA reparameterizes weight updates as follows.
Consider an initial weight matrix W_0 ∈ R^{d×d}. The update to W_0, which we call ∆W ∈ R^{d×d}, can
be defined as a product BA where B ∈ R^{d×r} and A ∈ R^{r×d}. Here r is a hyperparameter
and is generally chosen such that r ≪ d. To make training with LoRA more efficient, W_0 is
frozen at its pretrained value and only B and A receive gradient updates. The forward pass of
the model with input x can be written as:

W x = (W_0 + BA)x = W_0 x + BAx        (2.5)

B is initialized as the zero matrix while A is initialized from N(0, σ²), so at the initialization of the
LoRA parameters, W x = (W_0 + 0·A)x = W_0 x. Ultimately, by training the LoRA parameters
instead of W_0 directly, we train only 2dr parameters as opposed to d² parameters.
Choosing an adequately small r can greatly improve the efficiency of fine-tuning.
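The reparameterization in equation (2.5) is straightforward to express in PyTorch. The following is a minimal sketch of a LoRA-wrapped linear layer following the initialization described above; it omits the rank-dependent scaling factor used in some LoRA implementations.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base weight W0 plus a trainable low-rank update BA (eq. 2.5)."""

    def __init__(self, base: nn.Linear, r: int = 8, sigma: float = 0.02):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # freeze W0 at its pretrained value
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * sigma)   # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B = 0, so BA = 0 at init

    def forward(self, x):
        # W x = W0 x + B A x
        return self.base(x) + x @ self.A.T @ self.B.T

In a transformer, a wrapper like this would be applied to the attention projection matrices W_Q, W_K, and W_V discussed below.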
Inserting LoRA parameters into an LLM is straightforward: LoRA adapters are attached to the weight matrices of the multi-headed self-attention (MSA) layer in each transformer block of the LLM.
(a) Structured Pruning (b) Unstructured Pruning
Figure 2.4: Structured pruning sets entire structures to 0 while unstructured pruning sets individ-
ual weights to 0.
With queries, keys, and values computed as Q = XW_Q, K = XW_K, and V = XW_V, the MSA layer computes

\mathrm{MSA}(X) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V        (2.6)
In this calculation, WQ , WK , and WV are all trainable parameters. With LoRA, we insert BQ , AQ ,
BK , AK , and BV , AV and freeze WQ , WK , and WV at their pretrained weights. We use LoRA
for our experiments for the following reasons -
• LoRA parameters can be easily merged back into the model by calculating W = W0 + BA
to reduce inference and storage costs.
• LoRA can be easily integrated with any transformer-based architecture as these architec-
tures all use some form of MSA and LoRA parameters can be defined for the weight
matrices used in MSA.
• Using LoRA significantly reduces VRAM consumption, making local LLM training feasible for client devices in an FL setting.
Sparsity and pruning methods set a portion of the model weights to 0 and represent the model in a sparse matrix format [10]. These methods rank the model weights ac-
cording to a scoring function and prune a fraction of the weights with the lowest scores. While
there are numerous approaches to pruning and sparsification, we broadly divide current pruning
literature along the following axes: structured vs. unstructured and pruning at initialization vs.
iterative pruning. We describe these categories and their consequences. First, we compare struc-
tured and unstructured pruning -
• Structured pruning - In structured pruning, the scoring function ranks existing structures
within the model architecture (filters, channels, layers, etc.) [52]. Structures are then
pruned in their entirety by setting every parameter that exists in that structure to 0. Struc-
tured pruning is useful in accelerating training because block level sparsity can speed up
matrix multiplication on GPUs. However, the downside to structured pruning is that there
is less flexibility in pruning patterns, often resulting in a model that is significantly less
accurate than its dense counterpart.
• Unstructured pruning - In unstructured pruning, the scoring function ranks individual
weights within the model architecture [57]. Weights with the lowest scores are set to 0
independently of other weights in the model. While unstructured pruning offers more
flexibility in sparsity patterns and usually results in more accurate models, it offers few
benefits in terms of training efficiency and inference latency. This is because random spar-
sity patterns require custom hardware to see noticeable speedups for operations like matrix
multiplication. For general consumer hardware, there are few efficiency advantages to
applying unstructured pruning [49].
The next dimension on which pruning techniques differ is pruning at initialization versus iterative
pruning: pruning at initialization prunes once before the start of model training, while iterative
pruning prunes repeatedly throughout the training procedure (Figure 2.5).
(a) Pruning at Initialization (b) Iterative Pruning
Figure 2.5: Pruning at initialization prunes once before the start of model training while iterative
pruning prunes repeatedly throughout the training procedure.
We choose to focus on sparsity because recent literature has demonstrated that models can be
significantly compressed using these techniques while retaining performance close to the original
model. In contrast, techniques like quantization, which aim to represent weights in lower-bit
precision formats, tend to degrade more in accuracy while offering smaller memory savings.
Even so, FLoSS must balance the utility-memory trade-offs described in the above definitions.
Without careful consideration of system design, it remains difficult to train a model using high
levels of sparsity that is as accurate as its dense counterpart. We describe how we achieve both
strong performance and efficiency in our methods section.
Chapter 3
Related Works
Modern approaches consider two different ways to tackle the issue of communication efficiency
in FL, namely either reducing the number of communication rounds or reducing the size of com-
municated messages. Some approaches that fall into these categories are described below -
• Reducing the number of communication rounds - Approaches that aim to reduce the num-
ber of communication rounds focus on systematically improving model convergence time.
For example, CMFL [48] does not upload outlier updates from clients to keep updates in
each round relevant to global model convergence. CA-FL [1] and FL+HC [7] use clus-
tering to determine the most representative update from a set of clients and send only
that update to the central server. FedBoost [17] uses an ensemble of models that converge
faster than a single larger model to reduce the number of updates required for each of the
ensemble models. The method described in [9] uses a probabilistic model to select devices
most likely to contribute to faster model convergence. Finally, methods like One-Shot FL
[16] and k-Fed [11] propose performing all FL training in a single communication round
as opposed to iteratively updating the global model over multiple communication rounds.
While some of these methods have empirically demonstrated improvements over vanilla
FL, many have concrete failure modes that make it difficult to demonstrate their effective-
ness in a broad array of real-world settings. Additionally, many of these methods require
changes to the FL training procedure that are difficult to implement at scale.
• Reducing the size of communicated messages - Methods that reduce the size of commu-
nicated messages focus on sending or receiving partial updates or compressing the size
of updates. LFL [2] and FedPaq [41] use quantization to represent model updates at a
lower bit precision. FedMP [27], PruneFL [26], and model pruning for HFL [34] all use
various types of pruning during the communication phases of FL. Finally, FedKD [53]
and DS-FL [25] use knowledge distillation to compress the model parameters and improve
communication efficiency. We focus on these methods as they demonstrably reduce the
communication cost at each communication round in training. However, we note that a
significant drawback of these methods is that reducing the amount of information commu-
nicated at each round may degrade the model’s utility. We design FLoSS in a way that
balances communication efficiency and model performance across clients.
Pruning adapters can also reduce the cost of communication in FL. There are important design considerations when utilizing sparse LoRA
methods in FL. For example, depending on how sparsity is applied, communication rounds may
have varying levels of sparse communication. The result is that certain rounds may be slower
than others during the training process leading to lags and inefficient training. More impor-
tantly, sparsity, as a form of lossy model compression, can result in inaccurate models that fail
to perform at the level of their dense counterpart. Therefore, applying extreme levels of sparsity,
while helpful for efficient communication, may diminish the model’s utility. For these reasons,
there are critical design elements in a sparse LoRA implementation for FL that do not exist in
centralized settings.
Chapter 4
Methods
This section details the combination of adapter methods and sparsity techniques we employ in
FLoSS. We additionally highlight the importance of each step in resolving real-world bottle-
necks, such as on-device computation and communication constraints, for FL training. Finally,
we describe a set of benchmarks we compare against to demonstrate the effectiveness of our
training procedure.
4.1 Federated LoRA
Using LoRA for FL is particularly useful as it enables efficient on-device training. Training
the full LLM at the client level would be difficult given memory and computation constraints.
In cross-device FL settings, client devices tend to be especially resource-constrained. As such,
reducing the computational load of FL training is important in this setting.
4.2 Sparsity for LoRA
While LoRA enables on-device training by significantly reducing VRAM usage, LoRA param-
eters can still be expensive to communicate. LoRA adapters are inserted for each weight matrix
in the MSA layer and for each transformer block in the model. Communicating all these pa-
rameters is difficult on a wireless network where communication, and specifically the upload
phase, is slow and time-consuming. There can also be a large number of clients and communi-
cation rounds, necessitating faster communication. Thus, it is important to retain the on-device
computational benefits of LoRA training while reducing communication latency for this train-
ing procedure. We reduce communication cost by applying top-k sparsity to LoRA during the
download and upload phase of FL training.
To perform top-k sparsification, consider a weight matrix W ∈ R^{d1×d2} with n nonzero entries.
We define a sparsity ratio α ∈ [0, 1]. We apply the top-k function to |W| with k = α · n
and retrieve the indices of these top-k values. We call this set of indices I = {(y_i, x_i) | y_i ∈
[0, d1 − 1], x_i ∈ [0, d2 − 1]}. We then define a binary mask M ∈ R^{d1×d2} where M[y_i, x_i] = 1 if
(y_i, x_i) ∈ I and M[y_i, x_i] = 0 otherwise. W is then updated as W ⊙ M. This functions to mask
the smallest-magnitude weights in W.
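A minimal PyTorch sketch of this masking step (the function name and the tie-breaking behavior at the threshold are illustrative):

import torch

def topk_sparsify(W: torch.Tensor, alpha: float) -> torch.Tensor:
    """Keep the alpha fraction of largest-magnitude entries of W and zero out the rest."""
    n = int(W.count_nonzero())
    k = max(1, int(alpha * n))                 # number of entries to keep
    threshold = torch.topk(W.abs().flatten(), k).values.min()
    mask = (W.abs() >= threshold).to(W.dtype)  # binary mask M
    return W * mask                            # W ⊙ M

In FLoSS, this operation is applied to the LoRA matrices immediately before the download and upload phases, with potentially different values of α for each phase.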
In our training scheme, top-k sparsity is applied to the LoRA adapters prior to download and
prior to upload. In this way, fewer weights are communicated in each phase, reducing com-
munication latency as a result. There are a few important design considerations in determining
how sparsity should be specifically applied to LoRA. First, the sparsity ratio can be different
for download and upload. This is because download bandwidth is typically much greater than
upload bandwidth (on a wireless network, download bandwidth can be 10× greater than upload
bandwidth). Therefore, being able to configure separate sparsity ratios is important, as sparser
downloads may be less helpful than sparser uploads. Second, un-
structured sparsity is more useful in our setup than structured sparsity. An important distinction
is that FLoSS does not retain sparsity during client local training. This is because sparsity would
be unhelpful in accelerating training on devices such as smartphones or IoT devices where the
hardware is not specifically designed to handle sparse computation. Therefore, dense training
would yield results just as quickly and much more accurately than sparse training at the client
level. Since unstructured sparsity has been shown to boost performance relative to structured
sparsity, and training acceleration is not a consideration, unstructured sparsity is the preferred
sparsity method for communication in FLoSS.
Figure 4.1: FLoSS procedure involves (1) sparsifying adapters prior to download (2) training
dense adapters (3) sparsifying adapters prior to upload (4) aggregating using FedAdam.
Each communication round ends with the central server aggregating these sparse adapters.
FLoSS offers a few key advantages for federated LLM training. First, FLoSS enables more
efficient on-device training for LLMs. It does so by training only LoRA adapter weights and
limiting local training to a single epoch in each communication round for all clients. Since cross-
device settings usually involve many clients where each client has limited data, we argue that this
training scheme is feasible at the client level. Second, FLoSS reduces communication latency
despite the large number of adapter weights being communicated in each round. By sparsify-
ing adapters prior to communication, fewer weights are communicated, and communication can
occur more quickly. Finally, FLoSS retains model performance by allowing dense fine-tuning
at the client level. We demonstrate the importance of this last feature in an ablation study that
identifies the consequence of various pruning methods on model performance [30]. Ultimately,
FLoSS achieves high communication efficiency without sacrificing model performance.
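Putting these pieces together, a single FLoSS communication round might be sketched as follows. The helper names (topk_sparsify, client.local_train, fedadam_update) are placeholders for the components described above, and whether clients upload sparsified adapters or sparsified deltas is an implementation detail not spelled out here; the sketch uses deltas to match the FedAdam description in Chapter 2.

def floss_round(server_adapters, clients, alpha_down, alpha_up):
    """One FLoSS communication round with sparse download/upload and dense local training.

    server_adapters maps adapter names to tensors; topk_sparsify, client.local_train,
    and fedadam_update are placeholders for the components sketched earlier.
    """
    deltas = []
    for client in clients:
        # (1) Sparsify the global LoRA adapters prior to download.
        downloaded = {name: topk_sparsify(W, alpha_down)
                      for name, W in server_adapters.items()}
        # (2) Train dense adapters locally (one epoch, no sparsity constraint).
        local = client.local_train(downloaded, epochs=1)
        # (3) Sparsify the client update prior to upload.
        delta = {name: topk_sparsify(local[name] - server_adapters[name], alpha_up)
                 for name in server_adapters}
        deltas.append(delta)
    # (4) Aggregate the sparse uploads with FedAdam at the central server.
    return fedadam_update(server_adapters, deltas)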
4.4 Benchmarks
We present experiments on three datasets: CIFAR10, 20NewsGroups, and Reddit. We resize the
CIFAR10 images to 224 × 224 to match ImageNet, the pretraining dataset for the ViT model
architecture we chose. We use the GPT2 tokenizer to preprocess the examples of 20NewsGroups
and Reddit into sequences of length 128 and 25, respectively. We partition CIFAR10 and
20NewsGroups across the client devices. As described in the Results section, we test both I.I.D.
clients as well as non-I.I.D. clients for partitioning both datasets. The Reddit comments are nat-
urally partitioned by user.
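The non-I.I.D. partitions with concentration parameter α used in Section 5.2 are consistent with the Dirichlet label-partitioning scheme of [22]; the following is a minimal sketch under that assumption, since the thesis does not spell out its exact partitioning code.

import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign example indices to clients with a per-class Dirichlet prior.

    Smaller alpha concentrates each client's data on fewer labels, following
    the scheme of Hsu et al. [22] (assumed here).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, chunk in enumerate(np.split(idx, cut_points)):
            client_indices[client_id].extend(chunk.tolist())
    return client_indices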
In all experiments, we sample 5 clients at each round and perform one epoch of local train-
ing with a batch size of 16. We fine-tune all models for 200 rounds. For the pretrained models,
we used ViT-B-16 (85M params) and GPT2-Small (124M params). For all datasets, we report
the accuracy on the validation partition. More details on the task setups can be found in Table 4.1.
We compare FLoSS against two other sparse LoRA methods, SparseAdapter and AdapterLTH.
Both these methods are described in section 3.3. To use AdapterLTH in FL, we consider training
LoRA weights A and B using FedAdam. After each aggregation round, we apply increasingly
sparse masks to the LoRA weights. We use the efficient “fine-tuning” version of LTH which con-
tinues training from the pruned state rather than rewinding the weights after pruning. This allows
the model to recover from pruning within fewer rounds and is necessary to keep communication
costs competitive with the dense LoRA baseline. For both of these methods our choice of scor-
ing function is top-k applied to the magnitude of the weight. This allows for a direct comparison
to FLoSS which uses the same unstructured scoring function but only sparsifies communication
as opposed to sparsifying both communication and model training. For these pruning methods,
we perform an initial round of dense LoRA training, so the B adapter is non-zero and can be
effectively pruned using our score function.
Chapter 5
Results
We describe our key findings when running FL training experiments with FLoSS. There are
several key metrics by which we can measure the effectiveness of this training procedure. Concretely,
we evaluate improvements in model performance, communication efficiency, and privacy over
other efficient adapter methods for FL. The first section compares the performance of models
trained with FLoSS with both dense training as well as other pruning benchmarks. We highlight
that FLoSS achieves performance comparable to dense training, even at high levels of sparsity,
while outperforming other pruning benchmarks. The second section shows that we can retain
this high level of performance across multiple heterogeneous settings. The third section serves to
demonstrate the communication benefits of FLoSS by reducing communication time in realistic
wireless networks. Finally, we use an implementation of DP-FedAdam [36] to show that FLoSS
achieves a desirable privacy-utility tradeoff.
Figure 5.1: (a) LoRA Rank = 4, (b) LoRA Rank = 16.
The results in Figure 5.1 demonstrate that FLoSS outperforms other efficient fine-tuning methods
in a federated setting. Specifically, FLoSS performs significantly better than the one-shot pruning
method of SparseAdapter and marginally better than AdapterLTH and dense training of a LoRA
adapter with 1/4 the rank. These findings are consistent across experiments on CIFAR10, 20NewsGroups,
and Reddit, demonstrating that sparsifying an adapter prior to download and upload is an
effective way to reduce communication costs across multiple tasks. Additionally, this sparsification
works even with smaller ranks (e.g., rank 4), meaning we can achieve especially efficient
communication by combining a LoRA rank significantly smaller than the embedding dimension
with sparse updates. While some of these performance benefits appear marginal at first, we show
in later sections that FLoSS offers significant communication benefits over the other methods, as
its performance does not degrade under extreme levels of sparsity or uneven download/upload
sparsity ratios.
Figure 5.2: Comparison of communication-efficient LoRA methods in FL with non-I.I.D clients.
Based on the results in figure 5.2, FLoSS performs well even in the presence of extreme statistical
heterogeneity. For example, even at α = 0.1 and α = 0.01 where clients are primarily sampling
examples from a single label, there is little degradation in the model’s utility in comparison to the
I.I.D. setting. Another important observation is that [22] reports that statistical heterogeneity has
an adverse impact on a CNN architecture trained on CIFAR10 for FL. In comparison, we find
that fine-tuning a pretrained LLM on a similarly heterogeneous CIFAR10 partitioning for FL is
not hindered by the same performance losses. This suggests that LLMs may offer a potential
solution to the performance problems that arise as a result of statistical heterogeneity in FL.
(a) Equal upload and download sparsity (b) Unequal upload and download sparsity
Figure 5.3: Performance of FLoSS on CIFAR10 with varying rank and sparsity configurations.
There are two critical reasons why FLoSS is able to reduce communication costs considerably.
First, the method is able to utilize extremely sparse communication (e.g. approximately 0.01
sparsity ratio) and retain performance. In comparison, SparseAdapter noticeably degrades in
performance even at sparsity ratios of 0.25 as demonstrated in section 5.1. AdapterLTH relies on
pruning very few additional parameters at every communication round. However, this iterative
pruning technique is not well-suited for extreme sparsity. In order for AdapterLTH to produce
an average sparsity ratio of 0.01 across 200 communication rounds, the method would have to
iteratively prune 50% of the weights at every communication round. In the last communication
round, only 1.24e-60 of the LoRA weights would remain, leading to a model that is no more
accurate than the pretrained model. Figure 5.3a demonstrates that across multiple ranks, FLoSS
can support extreme sparsity and result in a model significantly better than the pretrained model.
The second reason why FLoSS is able to significantly cut down on communication costs is
because the method is able to define separate download and upload sparsity ratios. In wireless
networks, upload is significantly more expensive than download (with upload sometimes being
up to 10x slower) [42]. In cases where upload speed is significantly reduced in comparison to
download speed, methods like AdapterLTH and SparseAdapter are too rigid in that AdapterLTH
can only prune prior to download and SparseAdapter can only prune prior to model training.
Figure 5.3b demonstrates the model performance of FLoSS trained with a fixed download spar-
sity of 0.25 and varying upload sparsity. In real-world settings where upload is much slower
than download, we can use stricter sparsification of uploaded parameters. This prevents lags and
delays during the upload phase of communication. Our results demonstrate that we can retain
model performance with uneven download and upload sparsity configurations making FLoSS
better suited for real-world FL configurations that have disparate download and upload speeds.
Figure 5.4: (a) DP for 20NewsGroups, (b) DP for Reddit.
Even though raw data never leaves client devices in FL, the model updates that are communicated can still lead to leaks of user data. For this reason, it is important to consider notions of privacy in our efficient
LLM training method for FL. We apply the definition of differential privacy (DP) used in [36].
DP ensures user-level privacy, as opposed to example-level privacy, so that the parameters for
the model can be publicly released with the guarantee that an adversary has a limited ability to
learn about the data used to train the model. This is particularly important in FL as the model
parameters are communicated between clients and a central server. An adversarial client or an
adversary that has access to the server’s global model must be restricted in what they can learn
about other clients’ private data.
We implement the DP aggregation algorithm described in [38]. This process includes two critical
steps. First, when clients upload their local ∆_i, ∆_i is clipped by scaling each value by
1/max(1, ‖∆_i‖_2 / C), where C is the clipping norm. Noise is then added to this clipped delta to obscure
the original weights. This method ensures user-level DP with certain guarantees based on the
amount of noise added and the norm of the uploaded parameters. However, while smaller norms
and more noise enforce stricter privacy guarantees, they can diminish the model’s utility. We
analyze the performance of FLoSS using DP-FedAdam with varying levels of privacy guaran-
tees. We further compare the method to FFA [44] which aims to preserve privacy for federated
LoRA by freezing the LoRA A parameters at initialization. This method approximately halves
the communication cost of LoRA and reduces the norm of uploaded model updates.
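A minimal sketch of the clipping and noising steps under a Gaussian-mechanism convention similar to [36, 38]; the noise calibration and privacy accounting shown here are illustrative, not the exact procedure used in our experiments.

import torch

def clip_update(delta: torch.Tensor, clip_norm: float) -> torch.Tensor:
    """Scale a client update so that its L2 norm is at most clip_norm."""
    return delta / torch.clamp(delta.norm(p=2) / clip_norm, min=1.0)

def dp_aggregate(client_deltas, clip_norm: float, noise_multiplier: float):
    """Clip each client delta, average, and add Gaussian noise scaled to the clip norm."""
    clipped = [clip_update(d, clip_norm) for d in client_deltas]
    avg = torch.stack(clipped).mean(dim=0)
    noise_std = noise_multiplier * clip_norm / len(client_deltas)
    return avg + torch.randn_like(avg) * noise_std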
Figure 5.4 demonstrates that FLoSS can effectively preserve privacy while retaining perfor-
mance with a variety of privacy budgets. Even though the method is not specifically designed for
privacy-preserving FL, the method can be easily integrated with DP-FedAdam to ensure user-
level privacy.
Chapter 6
Conclusion
This thesis examines the problem of training LLMs in a federated environment. The scale of
LLMs presents a problem in FL settings due to the limited compute on client devices and limited
communication across the network. We present a few critical steps that can be taken to tackle
these bottlenecks and enable efficient training. First, we employ adapter methods, specifically
LoRA, to reduce the computational load of local model training. LoRA is especially useful in
FL as client devices do not have to train full model parameters for the LLM, and the adapters
can be merged back into the backbone of the model to eliminate additional storage and inference
costs after training. We augment the communication efficiency of LoRA using sparsity applied
only to communication. Crucially, we find that applying sparsity in this way is important to
retain performance. Alternative methods that prune adapters at initialization or iteratively prune
adapters throughout training do not perform as well under limited communication budgets or in
cases where upload speed is slower than download speed. These limitations suggest that sparsity must
be carefully applied during the training process to ensure that communication efficiency does not
degrade model utility. A few crucial directions remain to be explored in the context of efficient
LLM training for FL. We describe some of these future directions below.
Local Compute While LoRA certainly makes local computation easier for client devices,
methods to further enable LLM training in resource-constrained environments may improve the
functionality of this method. Some recent research has focused on the importance of efficient
on-device training [33]. A few methods are natural choices for improving the on-device effi-
ciency of FLoSS. For example, approaches like QLoRA [12] and LoftQ [32] significantly reduce
memory usage in LoRA training by quantizing the model backbone. These methods could be
integrated into FLoSS to further augment the on-device efficiency of the method.
Rank and Sparsity Configurations Currently, FLoSS requires manual selection of the rank
and sparsity configurations used in testing. Methods to automatically configure the rank and
sparsity ratios for FLoSS are important in preventing expensive FL hyperparameter tuning. Our
experiments suggest that certain rank and sparsity configurations do not perform as well as others
within the same communication budget. For example, Figure 6.1 demonstrates that configura-
tions with similar performance can have vastly different overall communication times. This is
especially true in our simulated procedure where upload speed is 10× slower than download
speed (20 MBps vs. 200 MBps).
(a) Communication time heatmap (b) Performance heatmap
Figure 6.1: Comparison of communication time vs. performance for FLoSS on CIFAR10
Therefore, being able to automatically determine the optimal value for
these hyperparameters prior to training may be crucial in extracting the full value of LLMs in
this setting.
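As a rough illustration of the budget calculation behind such heuristics, the following estimates per-round communication time for a given rank and sparsity configuration. The parameter count of 2·d·r per adapter, 4-byte floats, and the omission of sparse-index overhead are simplifying assumptions; the thesis's simulator may model more detail.

def round_comm_time(d_model, rank, num_adapters, alpha_down, alpha_up,
                    down_mbps=200.0, up_mbps=20.0, bytes_per_param=4):
    """Estimate per-round download + upload time (seconds) for sparse LoRA communication.

    Each LoRA adapter holds 2 * d_model * rank parameters; only the alpha fraction of
    entries is sent in each phase (index overhead of the sparse format is ignored).
    Default bandwidths follow the 200/20 MBps simulation described above.
    """
    params_per_round = 2 * d_model * rank * num_adapters
    down_bytes = alpha_down * params_per_round * bytes_per_param
    up_bytes = alpha_up * params_per_round * bytes_per_param
    return down_bytes / (down_mbps * 1e6) + up_bytes / (up_mbps * 1e6)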
Bibliography
[1] Ahmed A. Al-Saedi, Veselka Boeva, and Emiliano Casalicchio. Reducing communication
overhead of federated learning through clustering analysis. In 2021 IEEE Symposium on
Computers and Communications (ISCC), pages 1–7, 2021. doi: 10.1109/ISCC53001.2021.
9631391. 3.1
[2] Mohammad Mohammadi Amiri, Deniz Gunduz, Sanjeev R Kulkarni, and H Vincent Poor.
Federated learning with quantized global model updates. arXiv preprint arXiv:2006.10672,
2020. 3.1
[3] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient
fine-tuning for transformer-based masked language-models. In Smaranda Muresan, Preslav
Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, Dublin,
Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.
acl-short.1. URL https://aclanthology.org/2022.acl-short.1. 3.2
[4] Neelkamal Bhuyan, Sharayu Moharir, and Gauri Joshi. Multi-model federated learning
with provable guarantees. In Esa Hyytiä and Veeraruna Kavitha, editors, Performance Eval-
uation Methodologies and Tools, pages 207–222, Cham, 2023. Springer Nature Switzer-
land. ISBN 978-3-031-31234-2. 2.1.1
[5] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMa-
han, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggrega-
tion for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC
Conference on Computer and Communications Security, CCS ’17, page 1175–1191, New
York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450349468.
doi: 10.1145/3133956.3133982. URL https://doi.org/10.1145/3133956.
3133982. 1
[6] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman,
Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMa-
han, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. To-
wards federated learning at scale: System design. In A. Talwalkar, V. Smith, and M. Za-
haria, editors, Proceedings of Machine Learning and Systems, volume 1, pages 374–
388, 2019. URL https://proceedings.mlsys.org/paper_files/paper/
2019/file/7b770da633baf74895be22a8807f1a8f-Paper.pdf. 1
[7] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical
clustering of local updates to improve training on non-iid data. In 2020 International Joint
Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020. 3.1
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei. Language models are few-shot learners. In H. Larochelle, M. Ran-
zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Infor-
mation Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.,
2020. URL https://proceedings.neurips.cc/paper_files/paper/
2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. 1, 2.2
[9] Mingzhe Chen, Nir Shlezinger, H. Vincent Poor, Yonina C. Eldar, and Shuguang Cui.
Communication-efficient federated learning. Proceedings of the National Academy of
Sciences, 118(17):e2024789118, 2021. doi: 10.1073/pnas.2024789118. URL https:
//www.pnas.org/doi/abs/10.1073/pnas.2024789118. 3.1
[10] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural net-
work pruning-taxonomy, comparison, analysis, and recommendations. arXiv preprint
arXiv:2308.06767, 2023. 2.3
[11] Don Kurian Dennis, Tian Li, and Virginia Smith. Heterogeneity for the win: One-
shot federated clustering. In Marina Meila and Tong Zhang, editors, Proceedings of
the 38th International Conference on Machine Learning, volume 139 of Proceedings of
Machine Learning Research, pages 2611–2620. PMLR, 18–24 Jul 2021. URL https:
//proceedings.mlr.press/v139/dennis21a.html. 3.1
[12] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient
finetuning of quantized llms. Advances in Neural Information Processing Systems, 36,
2024. 6
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for
image recognition at scale. In International Conference on Learning Representations, 2021.
URL https://openreview.net/forum?id=YicbFdNTTy. 2.2
[14] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning
neural networks at initialization: Why are we missing the mark? In International Confer-
ence on Learning Representations, 2021. URL https://openreview.net/forum?
id=Ig-VyQc-MLK. 2.3
[15] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An ef-
ficient framework for clustered federated learning. In H. Larochelle, M. Ran-
zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Informa-
tion Processing Systems, volume 33, pages 19586–19597. Curran Associates, Inc.,
2020. URL https://proceedings.neurips.cc/paper_files/paper/
2020/file/e32cc80bf07915058ce90722ee17bb71-Paper.pdf. 2.1.1
[16] Neel Guha, Ameet Talwalkar, and Virginia Smith. One-shot federated learning. arXiv
preprint arXiv:1902.11175, 2019. 3.1
[17] Jenny Hamer, Mehryar Mohri, and Ananda Theertha Suresh. FedBoost: A communication-
efficient algorithm for federated learning. In Hal Daumé III and Aarti Singh, editors, Pro-
ceedings of the 37th International Conference on Machine Learning, volume 119 of Pro-
ceedings of Machine Learning Research, pages 3973–3983. PMLR, 13–18 Jul 2020. URL
https://proceedings.mlr.press/v119/hamer20a.html. 3.1
[18] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-efficient fine-tuning
for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608, 2024. 2.2
[19] Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. SparseAdapter: An
easy approach for improving the parameter-efficiency of adapters. In Yoav Goldberg, Zor-
nitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Lin-
guistics: EMNLP 2022, pages 2184–2190, Abu Dhabi, United Arab Emirates, December
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.
160. URL https://aclanthology.org/2022.findings-emnlp.160. 3.3
[20] Agrin Hilmkil, Sebastian Callh, Matteo Barbieri, Leon René Sütfeld, Edvin Listo Zec, and
Olof Mogren. Scaling federated learning for fine-tuning of large language models. In Elis-
abeth Métais, Farid Meziane, Helmut Horacek, and Epaminondas Kapetanios, editors, Nat-
ural Language Processing and Information Systems, pages 15–23, Cham, 2021. Springer
International Publishing. ISBN 978-3-030-80599-9. 3.2
[21] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Larous-
silhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer
learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceed-
ings of the 36th International Conference on Machine Learning, volume 97 of Proceed-
ings of Machine Learning Research, pages 2790–2799. PMLR, 09–15 Jun 2019. URL
https://proceedings.mlr.press/v97/houlsby19a.html. 1, 3.2
[22] Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data
distribution for federated visual classification. 2019. URL https://arxiv.org/abs/
1909.06335. 5.2, 5.2
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang,
Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod-
els. In International Conference on Learning Representations, 2022. URL https:
//openreview.net/forum?id=nZeVKeeFYf9. 1, 2.2, 3.2
[24] Dzmitry Huba, John Nguyen, Kshitiz Malik, Ruiyu Zhu, Mike Rabbat, Ashkan Yousefpour,
Carole-Jean Wu, Hongyuan Zhan, Pavel Ustinov, Harish Srinivas, et al. Papaya: Practical,
private, and scalable federated learning. Proceedings of Machine Learning and Systems, 4:
814–832, 2022. 1
[25] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto.
Distillation-based semi-supervised federated learning for communication-efficient collab-
orative training with non-iid private data. IEEE Transactions on Mobile Computing, 22
(1):191–205, January 2023. ISSN 2161-9875. doi: 10.1109/tmc.2021.3070013. URL
http://dx.doi.org/10.1109/TMC.2021.3070013. 3.1
[26] Yuang Jiang, Shiqiang Wang, Víctor Valls, Bong Jun Ko, Wei-Han Lee, Kin K. Leung, and
Leandros Tassiulas. Model pruning enables efficient federated learning on edge devices.
IEEE Transactions on Neural Networks and Learning Systems, 34(12):10374–10386, 2023.
doi: 10.1109/TNNLS.2022.3166101. 3.1
[27] Zhida Jiang, Yang Xu, Hongli Xu, Zhiyuan Wang, Chunming Qiao, and Yangming Zhao.
Fedmp: Federated learning through adaptive model pruning in heterogeneous edge com-
puting. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages
767–779, 2022. doi: 10.1109/ICDE53745.2022.00062. 3.1
[28] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon
Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural
language models. arXiv preprint arXiv:2001.08361, 2020. 2.2
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT,
volume 1, page 2, 2019. 1, 2.2
[30] Kevin Kuo, Arian Raje, Kousik Rajesh, and Virginia Smith. Sparsity for communication-
efficient loRA. In 5th Workshop on practical ML for limited/low resource settings, 2024.
URL https://openreview.net/forum?id=wibit67d29. 4.3
[31] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient
prompt tuning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-
tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Repub-
lic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
3.2
[32] Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen,
and Tuo Zhao. Loftq: LoRA-fine-tuning-aware quantization for large language models. In
The Twelfth International Conference on Learning Representations, 2024. URL https:
//openreview.net/forum?id=LzPWWPAdY4. 6
[33] Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. On-
device training under 256kb memory. In Proceedings of the 36th International Conference
on Neural Information Processing Systems, NeurIPS ’22, Red Hook, NY, USA, 2024. Cur-
ran Associates Inc. ISBN 9781713871088. 6
[34] Xiaonan Liu, Shiqiang Wang, Yansha Deng, and Arumugam Nallanathan. Adaptive fed-
erated pruning in hierarchical wireless networks. IEEE TRANSACTIONS ON WIRELESS
COMMUNICATIONS, page 1, 2023. ISSN 1536-1276. doi: 10.1109/TWC.2023.3329450.
Publisher Copyright: IEEE. 3.1
[35] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera
y Arcas. Communication-efficient learning of deep networks from decentralized data. In
Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017. 1, 2.1.1, 2.1.2, 3.1
[36] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differen-
tially private recurrent language models. In International Conference on Learning Repre-
sentations, 2018. URL https://openreview.net/forum?id=BJ0hF1Z0b. 5,
5.4
[37] Michela Paganini and Jessica Forde. On iterative neural network pruning, reinitialization,
and the similarity of masks. arXiv preprint arXiv:2001.05050, 2020. 2.3
[38] Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison,
H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Guha Thakurta.
How to dp-fy ml: A practical guide to machine learning with differential privacy. Jour-
nal of Artificial Intelligence Research, 77:1113–1201, July 2023. ISSN 1076-9757. doi:
10.1613/jair.1.14649. URL http://dx.doi.org/10.1613/jair.1.14649. 5.4
[39] Libo Qin, Qiguang Chen, Yuhang Zhou, Zhi Chen, Yinghui Li, Lizi Liao, Min Li, Wanx-
iang Che, and Philip S Yu. Multilingual large language model: A survey of resources,
taxonomy and frontiers. arXiv preprint arXiv:2404.04925, 2024. 2.2
[40] Sashank Reddi, Zachary Burr Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub
Konečný, Sanjiv Kumar, and Brendan McMahan, editors. Adaptive Federated Optimiza-
tion, 2021. URL https://openreview.net/forum?id=LkFG3lB13U5. 2.1.2
[41] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin
Pedarsani. Fedpaq: A communication-efficient federated learning method with periodic
averaging and quantization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings
of the Twenty Third International Conference on Artificial Intelligence and Statistics, vol-
ume 108 of Proceedings of Machine Learning Research, pages 2021–2031. PMLR, 26–28
Aug 2020. URL https://proceedings.mlr.press/v108/reisizadeh20a.
html. 3.1
[42] Osama Shahid, Seyedamin Pouriyeh, Reza M. Parizi, Quan Z. Sheng, Gautam Srivastava,
and Liang Zhao. Communication efficiency in federated learning: Achievements and chal-
lenges, 2021. 5.3
[43] Guangyu Sun, Umar Khalid, Matias Mendieta, Taojiannan Yang, and Chen Chen. Conquer-
ing the communication constraints to enable large pre-trained models in federated learning.
arXiv preprint arXiv:2210.01708, 2024. 3.2
[44] Youbang Sun, Zitao Li, Yaliang Li, and Bolin Ding. Improving loRA in privacy-preserving
federated learning. In The Twelfth International Conference on Learning Representations,
2024. URL https://openreview.net/forum?id=NLPzL6HWNl. 5.4
[45] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized feder-
ated learning. IEEE Transactions on Neural Networks and Learning Systems, 34(12):
9587–9603, December 2023. ISSN 2162-2388. doi: 10.1109/tnnls.2022.3160699. URL
http://dx.doi.org/10.1109/TNNLS.2022.3160699. 1
[46] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.
Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
2023. 2.2
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon,
U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, ed-
itors, Advances in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/
2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. 1, 2.2
[48] Luping Wang, Wei Wang, and Bo Li. CMFL: Mitigating communication overhead for
federated learning. In 2019 IEEE 39th International Conference on Distributed Computing
Systems (ICDCS), pages 954–964, 2019. doi: 10.1109/ICDCS.2019.00099. 3.1
[49] Wei Wang and Liqiang Zhu. Structured feature sparsity training for convolutional neu-
ral network compression. Journal of Visual Communication and Image Representa-
tion, 71:102867, 2020. ISSN 1047-3203. doi: https://doi.org/10.1016/j.jvcir.2020.
102867. URL https://www.sciencedirect.com/science/article/pii/
S1047320320301176. 2.3
[50] Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Bin-
bin Lin, Deng Cai, and Xiaofei He. Model compression and efficient inference for large
language models: A survey. arXiv preprint arXiv:2402.09748, 2024. 2.3
[51] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao,
Ahmed Hassan Awadallah, and Jianfeng Gao. AdaMix: Mixture-of-adaptations for
parameter-efficient model tuning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang,
editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing, pages 5744–5760, Abu Dhabi, United Arab Emirates, December 2022. As-
sociation for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.388. URL
https://aclanthology.org/2022.emnlp-main.388. 1
[52] Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language
models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP). Association for Computational Linguistics, 2020. doi:
10.18653/v1/2020.emnlp-main.496. URL http://dx.doi.org/10.18653/v1/
2020.emnlp-main.496. 2.3
[53] Chuhan Wu, Fangzhao Wu, Lingjuan Lyu, Yongfeng Huang, and Xing Xie.
Communication-efficient federated learning via knowledge distillation. Nature Commu-
nications, 13(1), April 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-29763-x. URL
http://dx.doi.org/10.1038/s41467-022-29763-x. 3.1
[54] Jiarun Wu and Qingliang Chen. Pruning adapters with lottery ticket. Algorithms, 15(2),
2022. ISSN 1999-4893. doi: 10.3390/a15020063. URL https://www.mdpi.com/
1999-4893/15/2/63. 3.3
[55] Guang Yang, Ke Mu, Chunhe Song, Zhijia Yang, and Tierui Gong. Ringfed: Re-
ducing communication costs in federated learning on non-iid data. arXiv preprint
arXiv:2107.08873, 2021. 3.1
[56] Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang,
Shaochen Zhong, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A
survey on chatgpt and beyond. ACM Trans. Knowl. Discov. Data, 18(6), apr 2024. ISSN
1556-4681. doi: 10.1145/3649506. URL https://doi.org/10.1145/3649506.
2.2
[57] Zhengwu Yang and Han Zhang. Comparative analysis of structured pruning and unstruc-
tured pruning. In Jason C. Hung, Neil Y. Yen, and Jia-Wei Chang, editors, Frontier Com-
puting, pages 882–889, Singapore, 2022. Springer Nature Singapore. ISBN 978-981-16-
8052-6. 2.3
[58] Zhuo Zhang, Yuanhang Yang, Yong Dai, Qifan Wang, Yue Yu, Lizhen Qu, and Zenglin
Xu. FedPETuning: When federated learning meets the parameter-efficient tuning meth-
ods of pre-trained language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023,
pages 9963–9977, Toronto, Canada, July 2023. Association for Computational Linguis-
tics. doi: 10.18653/v1/2023.findings-acl.632. URL https://aclanthology.org/
2023.findings-acl.632. 3.2
[59] Yong Zhou, Yuanming Shi, Haibo Zhou, Jingjing Wang, Liqun Fu, and Yang Yang. Toward
scalable wireless federated learning: Challenges and solutions. IEEE Internet of Things
Magazine, 6(4):10–16, 2023. doi: 10.1109/IOTM.001.2300099. 1
[60] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compres-
sion for large language models. arXiv preprint arXiv:2308.07633, 2023. 2.3
[61] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. ToolQA: A dataset
for LLM question answering with external tools. In Thirty-seventh Conference on Neural
Information Processing Systems Datasets and Benchmarks Track, 2023. URL https:
//openreview.net/forum?id=pV1xV2RK6I. 2.2