
A Federated Learning-Friendly Approach for Parameter-Efficient Fine-Tuning of SAM in 3D Segmentation

Mothilal Asokan, Joseph Geo Benjamin, Mohammad Yaqub, and Karthik Nandakumar

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, United Arab Emirates
{mothilal.asokan, joseph.benjamin, mohammad.yaqub, karthik.nandakumar}@mbzuai.ac.ae

arXiv:2407.21739v1 [cs.CV] 31 Jul 2024

Abstract. Adapting foundation models for medical image analysis requires fine-tuning them on a considerable amount of data because of extreme distribution shifts between the natural (source) data used for pretraining and the medical (target) data. However, collecting task-specific medical data for such fine-tuning at a central location raises many privacy concerns. Although federated learning (FL) provides an effective means for training on private decentralized data, communication costs in federating large foundation models can quickly become a significant bottleneck, impacting the solution's scalability. In this work, we address this problem of 'efficient communication while ensuring effective learning in FL' by combining the strengths of Parameter-Efficient Fine-tuning (PEFT) with FL. Specifically, we study plug-and-play Low-Rank Adapters (LoRA) in a federated manner to adapt the Segment Anything Model (SAM) for 3D medical image segmentation. Unlike prior works that utilize LoRA and fine-tune the entire decoder, we critically analyze the contribution of each granular component of SAM to fine-tuning performance. We thus identify specific layers to be federated that are very efficient in terms of communication cost while producing on-par accuracy. Our experiments show that retaining the parameters of the SAM model (including most of the decoder) in their original state during adaptation is beneficial, because fine-tuning on small datasets tends to distort the inherent capabilities of the underlying foundation model. On Fed-KiTS, our approach decreases communication cost (∼48× ↓) compared to full fine-tuning while increasing performance (∼6% ↑ Dice score) in 3D segmentation tasks. Our approach performs similarly to SAMed while achieving a ∼2.8× reduction in communication and parameters to be fine-tuned. We further validate our approach with experiments on the Fed-IXI and Prostate MRI datasets. Our code is available at https://github.com/BioMedIA-MBZUAI/FLAP-SAM.

Keywords: Federated Learning · Foundation Model · 3D Medical Image Segmentation · Parameter-Efficient Fine-Tuning

Fig. 1. (Left) The proposed FLAP-SAM framework: a server/global aggregator and clients, each holding a SAM encoder and decoder with frozen weights and trainable adapters, collaborating via FedAvg on 3D input scans to produce segmentation masks. (Right) Comparison (in terms of Dice score (%), collaborative gain (%), and communication cost (×↓)) of our method against other fine-tuning methods on Fed-KITS2019 [26]. The right panel reports, per method, the Dice score after 25 rounds of FedAvg (higher is better), the collaborative gain over localized training in brackets (higher is better), and the FedAvg transmission overhead in gigabytes (lower is better) with the reduction factor relative to Full FT in parentheses:

Method               Dice (25 rounds FedAvg)   Gain vs. localized   Transmission overhead
Full FT              54.50%                    [+9.51%]             67.45 GB
Attention FT         57.20%                    [+8.86%]             23.60 GB (2.8×)
Full Decoder FT      50.70%                    [+8.55%]              2.91 GB (23.2×)
Partial Decoder FT   45.40%                    [+7.72%]              0.36 GB (187.4×)
LoRA FT              36.90%                    [−0.30%]              1.02 GB (66.1×)
LoRA + Decoder FT    59.80%                    [+9.34%]              3.93 GB (17.2×)
FLAP-SAM (Ours)      60.50%                    [+9.77%]              1.38 GB (48.9×)

1 Introduction
Segmentation is one of the cornerstone tasks in modern medical image analysis for automated diagnosis and disease monitoring. While the advent of foundation models has pushed the boundaries of the state of the art in many computer vision applications, such benefits are yet to transfer fully to the medical imaging domain [33]. For example, the Segment Anything Model (SAM) [14] has excellent zero-shot generalization to new distributions and tasks involving natural images. However, SAM fails to generalize well across diverse medical imaging modalities due to the insurmountable distribution shifts [11]. Works like [6,10] highlight a substantial performance gap between zero-shot inference and training on domain-specific medical images, despite using various prompts in SAM. MSA [29] and SAM-Med2D [4] improve SAM by using tailored prompting techniques on 2D medical images. However, creating such prompts for each 2D slice of 3D data is labor-intensive.

The usual approach has been to fine-tune SAM for the target application using large task-specific datasets, e.g., MedSAM [20]. For tasks with significant distribution shifts and limited data, the decoder block of SAM is fine-tuned while leaving the encoder untouched. In contrast to fine-tuning all parameters, parameter-efficient fine-tuning (PEFT) methods fine-tune only a minimal number of parameters, using techniques such as prompt tuning [12] and low-rank adapters (LoRA) [9]. Several works [22,28,32] have employed LoRA for fine-tuning, resulting in superior performance in various 2D segmentation tasks. For 3D segmentation, [2] uses factor-tuning (FacT) adapters [13] for training.

However, accessing diverse medical data for such fine-tuning is not always feasible, because relevant datasets specific to a medical task may not be available with any single entity. Data is often spread across multiple institutions and cannot be shared due to privacy and confidentiality constraints [26]. While the data decentralization issue can be handled effectively with federated learning (FL) [21,16], the enormous parameter count of SAM (∼100M–700M) makes it impractical for FL, as it imposes a substantial communication cost.

In this work, we uncover the importance of fine-tuning various components of SAM when adapting it for 3D medical segmentation and propose FLAP-SAM, a PEFT approach involving LoRA that is amenable to FL (see Fig. 1). Moreover, high parameter efficiency reduces communication costs in FL and prevents overfitting in data-limited scenarios. Methods like MA-SAM [2], which uses FacT adapters for 3D segmentation, are challenging to federate because of the tensor decomposition involved. Hence, we employ LoRA, which is FL-friendly, to customize SAM for 3D medical image segmentation. SAMed [32] uses a similar approach, where LoRA and the entire SAM decoder are fine-tuned for 2D segmentation. Our approach selectively fine-tunes certain decoder parts, further reducing parameters and communication costs.

2 Preliminaries

Overview of SAM: The architecture of SAM [14] can be decoupled into three major components: the Image Encoder (IE) that computes image embeddings, the Prompt Encoder (PE) that generates prompt embeddings, and the Mask Decoder (MD) that combines the image and prompt embeddings to generate segmentation masks, as shown in Fig. 2. Utilizing ViT [5] as the backbone, IE extracts image features through a sequence of L transformer blocks. Meanwhile, PE takes various input prompts in the form of points, boxes, or masks and encodes them into prompt embeddings to aid in segmentation tasks. We operate SAM in the fully automatic mode, in which a regular grid of foreground points is presented as input prompts to the PE, thus eliminating the dependence on user-defined prompts. MD performs cross-attention between the image and prompt embeddings, employing transposed convolutional layers for up-scaling back to the image dimension (UP) and a hyper multi-layer perceptron (HYP) to produce segmentation masks. Following [2], we use a slightly modified SAM mask decoder that has two additional transposed convolutional layers, which up-sample the feature maps by 16× to match the resolution of the input while ensuring improved discrimination of small anatomical structures or lesions in medical images [24].

For simplicity, let θ_IE, θ_PE, and θ_MD denote the parameters of IE, PE, and MD, respectively. θ_IE can be further partitioned as θ_IE-AT and θ_IE-NA, where θ_IE-AT denotes the parameters of all the attention layers within IE and θ_IE-NA represents all the other parameters in IE not related to the attention layers. On the other hand, θ_MD can be partitioned as θ_MD-TR, θ_MD-UP, and θ_MD-HYP, where θ_MD-TR denotes the parameters of all the transformer blocks within MD (such as self-attention, cross-attention from tokens to image embeddings (t2i), and cross-attention from image embeddings to tokens (i2t)), θ_MD-UP denotes the parameters of the transposed convolutional layers used for upscaling, and θ_MD-HYP represents the parameters of HYP.
PEFT Formulation: The most straightforward approach to adapt a SAM model for a downstream task is to fine-tune all its parameters, including θ_IE, θ_PE, and θ_MD. This full fine-tuning (FullFT) strategy requires a larger memory footprint to store a copy of all the updated parameters and often leads to overfitting when data is severely limited. Recent works have shown that fine-tuning only the attention layers of a transformer encoder is sufficient for good adaptation [27]. This approach is called attention fine-tuning (AttnFT), where only θ_IE-AT and θ_MD are updated. Typically, the attention-related parameters of a transformer encoder constitute one-third of its overall parameters; thus, AttnFT leads to some improvement in parameter efficiency and generally provides good performance on downstream tasks. Another common approach is to freeze the image encoder and fine-tune the entire mask decoder, because the cross-attention layers in the decoder focus on specific patches in the image embeddings corresponding to the prompts and transform them into segmentation predictions [31]. We call this approach DecFT, where only θ_MD is updated. It is also possible to freeze all the parameters and fine-tune only the output layers, namely UP and HYP. This approach is analogous to linear probing in classification tasks, and we refer to it as partial decoder fine-tuning (PDecFT). Since only a small fraction of parameters, namely θ_MD-UP and θ_MD-HYP, are updated, PDecFT has high parameter efficiency but usually provides only sub-optimal performance.
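To make these variants concrete, selecting a fine-tuning strategy amounts to toggling which parameters receive gradients. A minimal PyTorch sketch follows; the name fragments ("mask_decoder", "attn", "output_upscaling", "output_hypernetworks_mlps") are assumptions about the SAM implementation's module names, not verified identifiers:

import torch.nn as nn

def set_trainable(model: nn.Module, keep) -> None:
    # Freeze every parameter, then unfreeze those whose name satisfies `keep`.
    for name, param in model.named_parameters():
        param.requires_grad = keep(name)

# Illustrative selectors for the variants described above:
# DecFT  (update θ_MD only):
#   set_trainable(sam, lambda n: "mask_decoder" in n)
# AttnFT (update θ_IE-AT and θ_MD):
#   set_trainable(sam, lambda n: ("image_encoder" in n and "attn" in n) or "mask_decoder" in n)
# PDecFT (update θ_MD-UP and θ_MD-HYP only):
#   set_trainable(sam, lambda n: "output_upscaling" in n or "output_hypernetworks_mlps" in n)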
LoRA adapters: Low-rank adaptation [9] is a promising PEFT technique widely used for adapting foundation models to downstream tasks. Each attention layer within a transformer block has four weight (projection) matrices: W_q (query), W_k (key), W_v (value), and W_o (output), where each W ∈ R^{d×d′}. The core idea of LoRA is to constrain the modification of a pre-trained weight matrix W to a linear update matrix ΔW, which can be further constrained using a low-rank decomposition, i.e., ΔW = BA, where B ∈ R^{d×r}, A ∈ R^{r×d′}, and the rank r ≪ min{d, d′}. This approach effectively reduces the parameter space while preserving the essential information needed for adaptation. W is frozen during fine-tuning, and only the A and B matrices are updated. For input x, the output x̃ is computed as follows:

$$\tilde{x} = (W + \alpha \Delta W)x = Wx + \alpha \Delta W x = Wx + \alpha BAx, \quad (1)$$

where α is a scale parameter. Following [9], we apply LoRA only to the projection matrices W_q and W_v in all the Γ attention layers of IE and MD. Let θ_LoRA represent the set of all LoRA parameters, where θ_LoRA = {A_q^ℓ, A_v^ℓ, B_q^ℓ, B_v^ℓ} for ℓ = 1, …, Γ. When only the LoRA parameters are updated during fine-tuning, we refer to this case as LoRAFT. While this approach also has good parameter efficiency, it typically results in sub-optimal performance for segmentation tasks. To improve this, in SAMed [32], both the decoder and LoRA are fine-tuned. Here, θ_MD and θ_LoRA are updated together, and we represent this approach as LoRADecFT.
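To illustrate Eq. (1), here is a minimal PyTorch sketch of a LoRA-wrapped projection layer; the initialization (Gaussian A, zero B, rank 32) follows Sec. 4, but the module itself is a simplified stand-in, not the repository implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Adds a trainable low-rank update α·B·A on top of a frozen projection W.
    def __init__(self, base: nn.Linear, r: int = 32, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W (and its bias) stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init => ΔW = 0 at start
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x̃ = Wx + α·B(Ax), as in Eq. (1)
        return self.base(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

In SAM, only the W_q and W_v projections of each attention layer would be wrapped this way, leaving W_k and W_o untouched.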

Federated Learning: In an FL system, K clients can collaboratively train a global model with parameters Θ. The goal is to solve the following optimization problem:

$$\min_{\Theta \in \mathbb{R}^T} \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}^{seg}_k(\Theta), \quad (2)$$

where $\mathcal{L}^{seg}_k(\Theta) = \mathbb{E}_{x \sim \mathcal{D}_k}[\mathcal{L}^{seg}_k(\Theta; x)]$ is the loss function of the k-th client (k ∈ [1, K]), D_k represents the data distribution of the k-th client, and T = |Θ| is the number of parameters that need to be learned. Note that if the distributions D_i and D_j are different for clients i and j, the scenario is referred to as non-iid (not independent and identically distributed). A widely used method for solving this optimization problem is FedAvg [21]. At each round, the server broadcasts the global model to each client. Then, all clients conduct local training on their data and send back the updated model to the server. Finally, the server updates the global model as a weighted average of these local model updates. The server update at round n in FedAvg can be formulated as follows:

$$\Theta^{n+1} = \sum_{k=1}^{K} \alpha_k \Theta^n_k, \quad (3)$$

where $\Theta^n_k$ denotes the local model of the k-th client in round n and α_k is the weight assigned to each client.
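As a concrete illustration of Eq. (3), a minimal sketch of the server update, assuming each client returns a full PyTorch state dict and α_k is set proportional to the client's dataset size:

import copy

def fedavg(client_states, client_sizes):
    # Θ^{n+1} = Σ_k α_k · Θ_k^n, with α_k = |D_k| / Σ_j |D_j|
    total = float(sum(client_sizes))
    alphas = [s / total for s in client_sizes]
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(a * st[key].float() for a, st in zip(alphas, client_states))
    return global_state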

3 Proposed FLAP-SAM Approach


We aim to efficiently and effectively adapt SAM for medical image segmentation tasks using limited data distributed across multiple entities. Based on this consideration, we propose to update both θ_LoRA and the final decoder output layers θ_MD-UP and θ_MD-HYP. This leads to the proposed fine-tuning scheme for SAM, a hybrid of PDecFT and LoRAFT, as shown in Fig. 2. Though there is a marginal increase in the number of parameters compared to LoRAFT, the proposed approach performs well because FLAP-SAM provides enough flexibility to be effectively fine-tuned. Moreover, since almost all the parameters of the original foundation model are retained without modification, its inherent capabilities remain unaffected. Another benefit is its memory efficiency: the small parameter size of the proposed adapter makes it possible to learn the adapters collaboratively via FL while greatly reducing communication costs.

When aggregating the LoRA parameters (θ_LoRA), the k-th client sends {A_{q,k}^ℓ, A_{v,k}^ℓ, B_{q,k}^ℓ, B_{v,k}^ℓ} for ℓ = 1, …, Γ to the server. The server first needs to reconstruct ΔW_{q,k}^ℓ = B_{q,k}^ℓ · A_{q,k}^ℓ and ΔW_{v,k}^ℓ = B_{v,k}^ℓ · A_{v,k}^ℓ for each ℓ and k, then performs FedAvg as shown in Eq. (3) to get the aggregated global weight matrices ΔW_q^ℓ and ΔW_v^ℓ of each attention layer ℓ. Finally, the server applies singular value decomposition to factor the aggregated matrices back into global LoRA parameters {A_q^ℓ, B_q^ℓ, A_v^ℓ, B_v^ℓ}, which are sent back to the clients. We refer to this federated learning of a plug-and-play SAM adapter as FLAP-SAM.
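A minimal sketch of this server-side step for a single attention projection; truncating the SVD back to rank r and splitting the singular values symmetrically between the two factors are assumptions about how the global factors are restored, not necessarily the exact scheme in the released code:

import torch

def aggregate_lora(A_list, B_list, alphas, r):
    # Averaging A and B separately is not equivalent to averaging ΔW_k = B_k·A_k,
    # so first reconstruct and FedAvg the full updates (Eq. 3)...
    delta_w = sum(a * (B @ A) for a, A, B in zip(alphas, A_list, B_list))
    # ...then decompose the aggregate back into rank-r factors: ΔW ≈ B_new·A_new.
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    sqrt_s = S[:r].sqrt()
    B_new = U[:, :r] * sqrt_s          # shape (d, r)
    A_new = sqrt_s[:, None] * Vh[:r]   # shape (r, d')
    return A_new, B_new                # broadcast back to all clients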
Fig. 2. Architecture of the proposed plug-and-play adapter. Only the decoder output layers and LoRAs are fine-tuned, while the remaining parameters in SAM are frozen. (The original figure depicts a 3D input scan passing through the Image Encoder, whose 12 transformer blocks carry LoRA on the query and value projections, the Prompt Encoder in automatic mode, and the Mask Decoder with its t2i/i2t cross-attention blocks, DeConv upscaling layers, and hyper MLP producing the segmentation masks.)

4 Experiments

Datasets: We utilize Fed-KiTS2019, a 6-client federated version of the KiTS19 dataset from FLamby [26], which was created from the Kidney Tumor Segmentation Challenge 2019 on CT scans [7,8]. The clients' train/test splits are 9/3, 11/3, 9/3, 9/3, 12/4, and 24/6, respectively. The preprocessing pipeline comprises intensity clipping (to the 5th and 95th percentiles of the image intensities of each client) followed by z-score normalization, where we subtract the mean and divide by the standard deviation of the image intensities. Fed-IXI, extracted from the Information eXtraction from Images (IXI) database [23,25] of brain T1 MRIs from 3 hospitals (Guys, HH, and IOP), contains 249/62, 145/36, and 59/15 train/test splits, respectively. As a preprocessing step, min-max normalization was applied to each scan, which was then padded with zeros in the axial plane (final shape 83 × 64 × 64). Prostate MRI is a multi-site segmentation dataset proposed by Liu et al. [18], comprising prostate T2-weighted MRI data from six sources (Site A to F) drawn from three publicly available datasets: the NCI-ISBI13 dataset [1], the I2CVB dataset [15], and the PROMISE12 dataset [17]. The sites have 30, 30, 19, 13, 12, and 12 MRI scans of patients, respectively, which were randomly divided into train (≈80%) and test (≈20%) sets. Since the scans were acquired with varying imaging protocols and contain heterogeneous data distributions, we normalized each site to zero mean and unit variance to reduce the intensity variance among sites, and resized the scans to 224 × 224 in the axial plane.
Table 1. Comparison of mean Dice scores on all datasets for different fine-tuning methods. ‡ The FL setting of MA-SAM is not feasible, since decomposing FacT tensors after aggregation is not possible; a centralized score is provided for performance comparison in 3D segmentation. ∗∗ Parameter counts are for a single-class segmentation task (add 0.134M params for each additional class). The baseline (full fine-tuning) is highlighted in blue and our method in pink.

Experiments               Setting       Fed-KiTS   Fed-IXI   Prost.MRI   Trainable/Total params∗∗
FullFT (baseline)         Local         0.4493     0.9777    0.8421
{θ_IE, θ_PE, θ_MD}        Federated     0.5444     0.9811    0.9084      90.399M/90.399M (100%)
                          Centralized   0.5274     0.9834    0.8955
AttnFT                    Local         0.4838     0.8848    0.6315
{θ_IE-AT, θ_MD}           Federated     0.5724     0.9674    0.8797      29.575M/90.399M (32.7%)
                          Centralized   0.5486     0.9774    0.8957
DecFT                     Local         0.4213     0.9750    0.8200
{θ_MD}                    Federated     0.5068     0.9771    0.8101      3.768M/90.399M (4.2%)
                          Centralized   0.5179     0.9789    0.8587
LoRAFT                    Local         0.3717     0.9728    0.8386
{θ_LoRA}                  Federated     0.3687     0.9777    0.8578      1.368M/91.767M (1.5%)
                          Centralized   0.5242     0.9798    0.8893
LoRADecFT (SAMed) [32]    Local         0.5053     0.9829    0.8929
{θ_LoRA, θ_MD}            Federated     0.5987     0.9836    0.8949      5.270M/91.767M (5.8%)
                          Centralized   0.6100     0.9852    0.9039
PDecFT                    Local         0.3764     0.9678    0.7890
{θ_UP, θ_HYP}             Federated     0.4536     0.9693    0.7017      0.344M/90.399M (0.4%)
                          Centralized   0.4793     0.9711    0.8008
FLAP-SAM (ours)           Local         0.5069     0.9829    0.8845
{θ_LoRA, θ_UP, θ_HYP}     Federated     0.6046     0.9834    0.8867      1.712M/91.767M (1.9%)
                          Centralized   0.5980     0.9851    0.9044
MA-SAM‡ [2]               Centralized   0.6023     0.9707    0.9125      28.667M/115.298M (25%)

Implementation details: We follow the input format and data augmentation described in [2], conducting all experiments using the "vit_b" version of SAM on an NVIDIA A100-SXM4-40GB GPU. The input to the model is of size (N × H × W), consisting of every set of N consecutive slices (N = 5). In LoRA, we initialize matrix A from a random Gaussian distribution, set matrix B to zero, and use rank 32. The fine-tuning process employs a hybrid segmentation loss, combining cross-entropy loss and Dice loss as L_seg = αL_CE + βL_Dice, with weighting factors α = 0.2 and β = 0.8, following [32]. Training uses the Adam optimizer with a batch size of 32. We compare federated learning with localized (using client-owned data alone) and centralized (all data pooled) settings. We test each site's data separately for all fine-tuning strategies, and the results are tabulated in Table 1.
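For reference, a minimal PyTorch sketch of this hybrid objective; the soft-Dice formulation and smoothing constant below are common choices assumed for illustration, since the text specifies only the weighted combination:

import torch
import torch.nn.functional as F

def seg_loss(logits, target, alpha=0.2, beta=0.8, eps=1e-5):
    # L_seg = α·L_CE + β·L_Dice, with logits (N, C, ...) and integer labels (N, ...)
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).movedim(-1, 1).float()
    dims = tuple(range(2, logits.ndim))              # reduce over spatial dimensions
    inter = (probs * one_hot).sum(dims)
    dice = (2 * inter + eps) / (probs.sum(dims) + one_hot.sum(dims) + eps)
    return alpha * ce + beta * (1.0 - dice.mean())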

Table 2. Average Dice score across local test data in the federated setting on the Fed-KiTS19 dataset. (Left) Different rank values of LoRA; (Right) different low-rank adapter methods.

LoRA rank   Mean Dice   Trainable/Total params
32          0.605       1.846M/91.901M
16          0.599       1.162M/91.217M
4           0.600       0.649M/90.704M

Adapter     Mean Dice   Trainable/Total params
LoRA        0.605       1.846M/91.901M
DoRA        0.592       1.846M/91.901M
MoLE        0.603       4.583M/94.637M

4.1 Results and Discussion


Our proposed FLAP-SAM method achieves a 6% absolute improvement in Dice score compared to the FullFT approach, with a ∼49× reduction in communication overhead on Fed-KiTS2019. Due to the small size of the dataset, the FullFT approach easily results in overfitting, highlighting the importance of using PEFT methods in limited-data settings [3]. Attention fine-tuning (AttnFT) achieves only half of our improvement (2.8% less than FLAP-SAM) and still incurs ∼17× more communication cost than our method. Both LoRAFT and PDecFT are more efficient but have lower Dice scores than our method. LoRADecFT achieves a Dice score equivalent to our method, but our method is ∼2.8× more efficient in terms of parameters and communication. We conduct an ablation on the rank parameter of LoRA in FL, with results shown in Table 2. We observe that a lower LoRA rank significantly reduces the trainable parameters with only a marginal degradation in the Dice score. We also conduct experiments with other low-rank adapters, DoRA [19] and MoLE [30], which show only marginal performance differences among the adapters.

We also benchmark FLAP-SAM against MA-SAM [2], which uses a 3D adapter along with FacT for fine-tuning. We perform this comparison only in the centralized setting, because decomposing ΔW back into the FacT-Tensor-Train or FacT-Tucker formats [13] after federated aggregation is not straightforward. Although MA-SAM produces results comparable to our method (see Table 1) in the centralized setting, it uses 28.7M trainable parameters (∼16× more than our method). This validates our choice of LoRA, which is both parameter-efficient and FL-friendly.

5 Conclusion
In this work, we have tackled adapting a foundational segmentation model (SAM) for 3D medical image segmentation by incorporating an effective PEFT strategy. We critically analyze the impact of the LoRA adapter and various SAM components to make fine-tuning for dense 3D segmentation tasks amenable to FL. Our approach simultaneously addresses the challenges of data scarcity, overfitting, and communication overhead, resulting in a practical and cost-efficient solution. Our current work analyzes various fine-tuning methods in the context of FedAvg [21]; an interesting future direction would be studying the effects of various federated optimization strategies on low-rank adapters for datasets with considerable distribution shifts.

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References
1. Bloch, N., Madabhushi, A., Huisman, H., Freymann, J., Kirby, J., Grauer, M., Enquobahrie, A., Jaffe, C., Clarke, L., Farahani, K.: NCI-ISBI 2013 challenge: Automated segmentation of prostate structures. The Cancer Imaging Archive 370(6), 5 (2015)
2. Chen, C., Miao, J., Wu, D., Yan, Z., Kim, S., Hu, J., Zhong, A., Liu, Z., Sun, L., Li, X., et al.: MA-SAM: Modality-agnostic SAM adaptation for 3D medical image segmentation. arXiv preprint arXiv:2309.08842 (2023)
3. Chen, G., Liu, F., Meng, Z., Liang, S.: Revisiting parameter-efficient tuning: Are
we really there yet? In: Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing. pp. 2612–2626 (2022)
4. Cheng, D., Qin, Z., Jiang, Z., Zhang, S., Lao, Q., Li, K.: SAM on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035 (2023)
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. In:
ICLR (2021)
6. He, S., Bao, R., Li, J., Grant, P.E., Ou, Y.: Accuracy of Segment Anything Model (SAM) in medical image segmentation tasks. arXiv preprint arXiv:2304.09324 (2023)
7. Heller, N., Isensee, F., Maier-Hein, K.H., Hou, X., Xie, C., Li, F., Nan, Y., Mu, G., Lin, Z., Han, M., et al.: The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Medical Image Analysis p. 101821 (2020)
8. Heller, N., Sathianathen, N., Kalapara, A., Walczak, E., Moore, K., Kaluzniak, H., Rosenberg, J., Blake, P., Rengel, Z., Oestreich, M., et al.: The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445 (2019)
9. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
10. Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J.,
Chen, J., Chen, C., Liu, S., Chi, H., Hu, X., Yue, K., Li, L., Grau, V., Fan, D.P.,
Dong, F., Ni, D.: Segment anything model for medical images? Medical Image
Analysis 92, 103061 (2024)

11. Huix, J.P., Ganeshan, A.R., Haslum, J.F., Söderberg, M., Matsoukas, C., Smith,
K.: Are natural domain foundation models useful for medical image classification?
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision. pp. 7634–7643 (2024)
12. Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.:
Visual prompt tuning. In: ECCV. pp. 709–727. Springer (2022)
13. Jie, S., Deng, Z.H.: FacT: Factor-tuning for lightweight adaptation on vision transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 1060–1068 (2023)
14. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp.
4015–4026 (2023)
15. Lemaître, G., Martí, R., Freixenet, J., Vilanova, J.C., Walker, P.M., Meriaudeau, F.: Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review. Computers in Biology and Medicine 60, 8–31 (2015)
16. Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, 429–450 (2020)
17. Litjens, G., Toth, R., Van De Ven, W., Hoeks, C., Kerkstra, S., Van Ginneken, B., Vincent, G., Guillard, G., Birbeck, N., Zhang, J., et al.: Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Medical Image Analysis 18(2), 359–373 (2014)
18. Liu, Q., Dou, Q., Heng, P.A.: Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In: MICCAI (2020)
19. Liu, S.Y., Wang, C.Y., Yin, H., Molchanov, P., Wang, Y.C.F., Cheng, K.T., Chen, M.H.: DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353 (2024)
20. Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical
images. Nature Communications 15(1), 654 (2024)
21. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.:
Communication-efficient learning of deep networks from decentralized data. In:
Artificial intelligence and statistics. pp. 1273–1282. PMLR (2017)
22. Paranjape, J.N., Nair, N.G., Sikder, S., Vedula, S.S., Patel, V.M.: AdaptiveSAM: Towards efficient tuning of SAM for surgical scene segmentation. arXiv preprint arXiv:2308.03726 (2023)
23. Pérez-García, F., Sparks, R., Ourselin, S.: TorchIO: A Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in Biomedicine 208, 106236 (2021)
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. pp. 234–241. Springer (2015)
25. Team, B.D.: IXI dataset, https://brain-development.org/ixi-dataset/
26. Ogier du Terrail, J., Ayed, S.S., Cyffers, E., Grimberg, F., He, C., Loeb, R., Mangold, P., Marchand, T., Marfoq, O., Mushtaq, E., et al.: FLamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings. Advances in Neural Information Processing Systems 35, 5315–5334 (2022)
27. Touvron, H., Cord, M., El-Nouby, A., Verbeek, J., Jégou, H.: Three things everyone
should know about vision transformers. In: European Conference on Computer
Vision. pp. 497–515. Springer (2022)

28. Wang, A., Islam, M., Xu, M., Zhang, Y., Ren, H.: SAM meets robotic surgery: An empirical study in robustness perspective. arXiv preprint arXiv:2304.14674 (2023)
29. Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical SAM adapter: Adapting Segment Anything Model for medical image segmentation. arXiv preprint arXiv:2304.12620 (2023)
30. Wu, X., Huang, S., Wei, F.: MoLE: Mixture of LoRA experts. In: The Twelfth International Conference on Learning Representations (2023)
31. Xie, W., Willems, N., Patil, S., Li, Y., Kumar, M.: SAM fewshot finetuning for anatomical segmentation in medical images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3253–3261 (2024)
32. Zhang, K., Liu, D.: Customized segment anything model for medical image seg-
mentation. arXiv preprint arXiv:2304.13785 (2023)
33. Zhang, S., Metaxas, D.: On the challenges and perspectives of foundation models
for medical image analysis. Medical Image Analysis 91, 102996 (2024)

A Appendix

Table A. Results on the Prostate MRI dataset for different fine-tuning methods. ‡ The FL setting of MA-SAM is not shown due to challenges in decomposing FacT tensors after federated aggregation. The baseline (full fine-tuning) is highlighted in blue and our method in pink; the best and second-best methods for FL are highlighted in bold and underline, respectively. Best viewed in color.

Experiments (Trainable/Total params)   Setting       SiteA   SiteB   SiteC   SiteD   SiteE   SiteF   Average Dice
FullFT (Baseline) Local 0.9152 0.8551 0.8471 0.6781 0.9016 0.8556 0.8421
{θIE , θPE , θMD } Federated 0.9432 0.9109 0.8797 0.8984 0.9148 0.9036 0.9084
(90.399M/90.399M) 100% Centralized 0.9336 0.8926 0.8618 0.8851 0.9012 0.8989 0.8955
AttnFT Local 0.5103 0.8953 0.2937 0.3754 0.8844 0.8301 0.6315
{θIE-AT , θMD } Federated 0.9243 0.8863 0.8516 0.8362 0.8960 0.8839 0.8797
(29.575M/90.399M) 32.7% Centralized 0.9312 0.8997 0.8637 0.8768 0.9087 0.8943 0.8957
DecFT Local 0.9123 0.8297 0.8005 0.7734 0.8121 0.7922 0.8200
{θMD } Federated 0.9019 0.8427 0.7561 0.8097 0.7164 0.8337 0.8101
(3.768M/90.399M) 4.2% Centralized 0.9104 0.8549 0.8349 0.8348 0.8738 0.8433 0.8587
LoRAFT Local 0.8661 0.8856 0.8409 0.7769 0.8406 0.8218 0.8386
{θLoRA } Federated 0.9045 0.8632 0.8283 0.8510 0.8595 0.8402 0.8578
(1.368M/91.767M) 1.5% Centralized 0.9229 0.8985 0.8541 0.8699 0.9083 0.8822 0.8893
LoRADecFT Local 0.9355 0.9132 0.8639 0.8542 0.9063 0.8842 0.8929
{θMD , θLoRA } Federated 0.9283 0.8996 0.8615 0.8798 0.9065 0.8939 0.8949
(5.136M/91.767M) 5.7% Centralized 0.9416 0.9106 0.8606 0.8906 0.9130 0.9072 0.9039
PDecFT Local 0.8719 0.7825 0.7969 0.7308 0.7687 0.7830 0.7890
{θUP , θHYP } Federated 0.8526 0.7790 0.7176 0.7066 0.4350 0.7197 0.7017
(0.344M/90.399M) 0.4% Centralized 0.8783 0.7893 0.7853 0.7769 0.7901 0.7848 0.8008
FLAP-SAM (Ours) Local 0.9339 0.9037 0.8634 0.8206 0.9063 0.8791 0.8845
{θLoRA , θUP , θHYP } Federated 0.9336 0.8886 0.8516 0.8665 0.8991 0.8808 0.8867
(1.712M/91.767M) 1.9% Centralized 0.9392 0.9081 0.8636 0.8992 0.9103 0.9058 0.9044
MA-SAM‡ (28.667M/115.297M) 25%   Centralized   0.945   0.922   0.877   0.897   0.923   0.910   0.912

Table B. Results on the Fed-KiTS2019 dataset for different fine-tuning methods. ‡ The FL setting of MA-SAM is not shown due to challenges in decomposing FacT tensors after federated aggregation. The baseline (full fine-tuning) is highlighted in blue and our method in pink; the best and second-best methods for FL are highlighted in bold and underline, respectively.

Experiments (Trainable/Total params)   Setting       SiteA   SiteB   SiteC   SiteD   SiteE   SiteF   Average Dice
FullFT (Baseline) Local 0.4689 0.3830 0.4652 0.4566 0.4767 0.4454 0.4493
{θIE , θPE , θMD } Federated 0.6270 0.5258 0.5634 0.4800 0.5732 0.4967 0.5444
(90.533M/90.533M) 100% Centralized 0.6955 0.4273 0.5823 0.4705 0.5189 0.4698 0.5274
AttnFT Local 0.5074 0.4306 0.5153 0.4673 0.5052 0.4772 0.4838
{θIE-AT , θMD } Federated 0.6749 0.4942 0.6025 0.5534 0.5839 0.5252 0.5724
(30.186M/90.533M) 33.3% Centralized 0.7015 0.4789 0.5972 0.4999 0.5164 0.4977 0.5486
DecFT Local 0.3754 0.3933 0.4285 0.4475 0.4037 0.4791 0.4213
{θMD } Federated 0.5986 0.4404 0.5214 0.4627 0.5293 0.4882 0.5068
(3.902M/90.533M) 4.3% Centralized 0.6366 0.4288 0.5573 0.4732 0.5210 0.4902 0.5179
LoRAFT Local 0.3999 0.2885 0.3978 0.3816 0.3374 0.4247 0.3717
{θLoRA } Federated 0.3984 0.2888 0.3697 0.3906 0.3470 0.4179 0.3687
(1.368M/91.901M) 1.5% Centralized 0.6028 0.4671 0.5228 0.4995 0.4981 0.5551 0.5242
LoRADecFT Local 0.4800 0.5038 0.5607 0.4633 0.5177 0.5065 0.5053
{θMD , θLoRA } Federated 0.7193 0.5172 0.6155 0.5398 0.5861 0.6145 0.5987
(5.270M/91.901M) 5.8% Centralized 0.7322 0.4908 0.6068 0.6291 0.5918 0.6095 0.6100
PDecFT Local 0.2992 0.3466 0.3994 0.3845 0.3696 0.4589 0.3764
{θUP , θHYP } Federated 0.5236 0.3660 0.4513 0.4307 0.4769 0.4730 0.4536
(0.478M/90.533M) 0.5% Centralized 0.5649 0.3992 0.4964 0.4516 0.4868 0.4766 0.4793
FLAP-SAM (Ours) Local 0.5107 0.4173 0.5738 0.4657 0.5289 0.5451 0.5069
{θLoRA , θUP , θHYP } Federated 0.7234 0.5770 0.6169 0.5554 0.5762 0.5786 0.6046
(1.846M/91.901M) 2.0% Centralized 0.7041 0.4500 0.6292 0.5772 0.6201 0.6075 0.5980
MA-SAM‡ (28.721M/116.605M) 25%   Centralized   0.725   0.544   0.605   0.513   0.566   0.661   0.602

Table C. Results on the Fed-IXI dataset for different fine-tuning methods. ‡ The FL setting of MA-SAM is not shown due to challenges in decomposing FacT tensors after federated aggregation. The baseline (full fine-tuning) is highlighted in blue and our method in pink; the best and second-best methods for FL are highlighted in bold and underline, respectively.

Experiments (Trainable/Total params)   Setting       SiteA   SiteB   SiteC   Average Dice
FullFT (Baseline) Local 0.9808 0.9797 0.9728 0.9777
{θIE , θPE , θMD } Federated 0.9837 0.9835 0.9761 0.9811
(90.399M/90.399M) 100% Centralized 0.9838 0.9846 0.9818 0.9834
AttnFT Local 0.9592 0.9708 0.7244 0.8848
{θIE-AT , θMD } Federated 0.9713 0.9701 0.9608 0.9674
(29.575M/90.399M) 32.7% Centralized 0.9781 0.9791 0.9749 0.9774
DecFT Local 0.9783 0.9787 0.9680 0.9750
{θMD } Federated 0.9780 0.9787 0.9746 0.9771
(3.768M/90.399M) 4.2% Centralized 0.9791 0.9806 0.9770 0.9789
LoRAFT Local 0.9788 0.9776 0.9620 0.9728
{θLoRA } Federated 0.9790 0.9791 0.9749 0.9777
(1.368M/91.767M) 1.5% Centralized 0.9803 0.9808 0.9785 0.9798
LoRADecFT Local 0.9850 0.9853 0.9784 0.9829
{θMD , θLoRA } Federated 0.9850 0.9853 0.9806 0.9836
(5.136M/91.767M) 5.7% Centralized 0.9855 0.9864 0.9837 0.9852
PDecFT Local 0.9713 0.9722 0.9600 0.9678
{θUP , θHYP } Federated 0.9697 0.9725 0.9658 0.9693
(0.344M/90.399M) 0.4% Centralized 0.9718 0.9729 0.9685 0.9711
FLAP-SAM (Ours) Local 0.9850 0.9849 0.9788 0.9829
{θLoRA , θUP , θHYP } Federated 0.9847 0.9852 0.9803 0.9834
(1.712M/91.767M) 1.9% Centralized 0.9854 0.9864 0.9835 0.9851
MA-SAM‡ (28.667M/115.297M) 25%   Centralized   0.972   0.972   0.968   0.971

Table D. Results across local site test data for different rank values of LoRA in the
federated setting on Fed-KiTS19 dataset.

LoRA rank   SiteA   SiteB   SiteC   SiteD   SiteE   SiteF   Average Dice   Trainable/Total params
32 0.723 0.577 0.617 0.555 0.576 0.579 0.605 1.846M / 91.901M
16 0.720 0.572 0.616 0.518 0.561 0.604 0.599 1.162M / 91.217M
4 0.703 0.590 0.609 0.538 0.588 0.572 0.600 0.649M / 90.704M

Table E. Results across local site test data for different PEFT methods in the federated setting on the Fed-KiTS19 dataset.

Methods   SiteA   SiteB   SiteC   SiteD   SiteE   SiteF   Average Dice   Trainable/Total params
LoRA 0.723 0.577 0.617 0.555 0.576 0.578 0.605 1.846M / 91.901M
DoRA 0.725 0.470 0.615 0.537 0.609 0.594 0.592 1.846M / 91.901M
MoLE 0.712 0.517 0.613 0.552 0.615 0.611 0.603 4.583M / 94.637M
