[Fig. 1, left panel: the FLAP-SAM framework. A server and clients (e.g., Client A) each hold a SAM model that maps a 3D input scan through the image encoder (with trainable adapters), the prompt encoder, and the mask decoder (DeConv + HYP) to a segmentation mask; only the trainable components are exchanged.]

[Fig. 1, right panel: Dice score and transmission overhead after 25 rounds of FedAvg, per fine-tuning strategy:

Fine-tuning strategy    Dice score    Collaborative gain    Transmission overhead
Full Decoder FT         50.70%        [+8.55%]              2.91GB (23.2×)
Partial Decoder FT      45.40%        [+7.72%]              0.36GB (187.4×)
LoRA FT                 36.90%        [−0.30%]              1.02GB (66.1×)
LoRA + Decoder FT       59.80%        [+9.34%]              3.93GB (17.2×)
FLAP-SAM (Ours)         60.50%        [+9.77%]              1.38GB (48.9×)

[+/−] indicates the collaborative gain in Dice score from FedAvg; (×) indicates the reduction in communication cost compared to Full FT.]
Fig. 1. (Left) The proposed FLAP-SAM framework and (right) the comparison (in terms of Dice score (%), collaborative gains (%), and communication cost (×↓)) of our method against other fine-tuning methods on Fed-KiTS19 [26].
1 Introduction
Segmentation is one of the cornerstone tasks in modern medical image analysis for automated diagnosis and disease monitoring. While the advent of foundation models has pushed the boundaries of the state of the art in many computer vision applications, such benefits are yet to transfer fully to the medical imaging domain [33]. For example, the Segment Anything Model (SAM) [14] exhibits excellent zero-shot generalization to new distributions and tasks involving natural images. However, SAM fails to generalize well across diverse medical imaging modalities because of substantial distribution shifts [11]. Studies such as [6,10] highlight a substantial performance gap between zero-shot inference and training on domain-specific medical images, despite the use of various prompts in SAM. MSA [29] and SAM-Med2D [4] improve SAM through tailored prompting techniques for 2D medical images. However, creating such prompts for every 2D slice of a 3D volume is labor-intensive.
The usual approach has been to fine-tune SAM for the target application using large task-specific datasets, e.g., MedSAM [20]. For tasks with significant distribution shifts and limited data, the decoder block of SAM is fine-tuned while the encoder is left untouched. In contrast to fine-tuning all parameters, parameter-efficient fine-tuning (PEFT) methods fine-tune only a minimal number of parameters using representative data; examples include prompt tuning [12] and low-rank adapters (LoRA) [9]. Several works [22,28,32] have employed LoRA for fine-tuning, achieving superior performance in various 2D segmentation tasks. For 3D segmentation, [2] uses factor-tuning (FacT) adapters [13] for training.
However, accessing diverse medical data for such fine-tuning is not always feasible: relevant datasets for a given medical task are rarely held by any single entity. They are often spread across multiple institutions and cannot be shared due to privacy and confidentiality constraints [26]. While this data decentralization can be handled effectively with federated learning (FL) [21,16], the enormous parameter count of SAM (∼100M-700M) makes it impractical for FL, as it imposes a substantial communication cost.
In this work, we uncover the importance of fine-tuning the various components of SAM when adapting it for 3D medical segmentation and propose FLAP-SAM, a PEFT approach involving LoRA that is amenable to FL (see Fig. 1). Moreover, high parameter efficiency reduces communication costs in FL and prevents overfitting in data-limited scenarios. Methods such as MA-SAM [2], which uses FacT adapters for 3D segmentation, are challenging to federate because of the tensor decomposition involved. Hence, we employ LoRA, which is FL-friendly, to customize SAM for 3D medical image segmentation. SAMed [32] uses a similar approach, in which LoRA and the entire SAM decoder are fine-tuned for 2D segmentation. Our approach instead fine-tunes only selected parts of the decoder, further reducing parameter count and communication cost.
2 Preliminaries
attention from image embeddings to tokens (i2t)), θMD-UP denotes the parameters
of the transposed convolutional layers used for upscaling, and θMD-HYP represents
the parameters of HYP.
PEFT Formulation: The most straightforward way to adapt a SAM model to a downstream task is to fine-tune all of its parameters, including θIE, θPE, and θMD. This full fine-tuning (FullFT) strategy requires a large memory footprint to store a copy of all the updated parameters and often leads to overfitting when data is severely limited. Recent work has shown that fine-tuning only the attention layers of a transformer encoder is sufficient for good adaptation [27]. This approach is called attention fine-tuning (AttnFT); here, only θIE-AT and θMD are updated. Typically, the attention-related parameters of a transformer encoder constitute one-third of its overall parameters, so AttnFT yields some improvement in parameter efficiency and generally performs well on downstream tasks. Another common approach is to freeze the image encoder and fine-tune the entire mask decoder, since the cross-attention layers in the decoder focus on the patches of the image embeddings that correspond to the prompts and transform them into segmentation predictions [31]. We call this approach DecFT; here, θMD is updated. It is also possible to freeze all parameters and fine-tune only the output layers, namely UP and HYP. This approach is analogous to linear probing in classification tasks, and we refer to it as partial decoder fine-tuning (PDecFT). Since only a small fraction of the parameters, namely θMD-UP and θMD-HYP, is updated, PDecFT has high parameter efficiency but usually provides only sub-optimal performance.
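To make these strategies concrete, the following is a minimal PyTorch-style sketch of the corresponding parameter freezing. The name prefixes (image_encoder, mask_decoder, output_upscaling, output_hypernetworks_mlps) are assumed to follow the reference SAM implementation; the helper itself and the strategy labels are illustrative, not the paper's released code.

import torch.nn as nn

def set_trainable(model: nn.Module, strategy: str) -> None:
    """Freeze or unfreeze SAM parameters according to the chosen strategy."""
    predicates = {
        # FullFT: update theta_IE, theta_PE, and theta_MD.
        "FullFT": lambda n: True,
        # AttnFT: encoder attention layers (theta_IE-AT) plus the mask decoder.
        "AttnFT": lambda n: (n.startswith("image_encoder") and ".attn." in n)
                            or n.startswith("mask_decoder"),
        # DecFT: the entire mask decoder (theta_MD).
        "DecFT": lambda n: n.startswith("mask_decoder"),
        # PDecFT: only the UP and HYP output layers.
        "PDecFT": lambda n: "output_upscaling" in n or "output_hypernetworks" in n,
    }
    keep = predicates[strategy]
    for name, param in model.named_parameters():
        param.requires_grad = keep(name)

# e.g., set_trainable(sam, "PDecFT") leaves only theta_MD-UP and theta_MD-HYP trainable.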
LoRA adapters: Low-rank adaptation [9] is a promising PEFT technique widely used for adapting foundation models to downstream tasks. Each attention layer within a transformer block has four weight (projection) matrices: Wq (query), Wk (key), Wv (value), and Wo (output), where each W ∈ R^{d×d′}. The core idea of LoRA is to constrain the modification of a pre-trained weight matrix W to a linear update matrix ∆W, which can be further constrained using a low-rank decomposition, i.e., ∆W = BA, where B ∈ R^{d×r}, A ∈ R^{r×d′}, and the rank r ≪ min{d, d′}. This approach effectively reduces the parameter space while preserving the essential information needed for adaptation. W is frozen during fine-tuning, and only the A and B matrices are updated. For an input x, the output x̃ is computed as

x̃ = W x + α ∆W x = W x + α B A x,
where α is a scale parameter. Following [9], we apply LoRA only to the projection matrices Wq and Wv in all Γ attention layers of IE and MD. Let θLoRA represent the set of all LoRA parameters, θLoRA = {A_q^ℓ, A_v^ℓ, B_q^ℓ, B_v^ℓ} for ℓ = 1, ..., Γ. When only the LoRA parameters are updated during fine-tuning, we refer to this case as LoRAFT. While this approach also has good parameter efficiency, it typically yields sub-optimal performance on segmentation tasks. To improve on this, SAMed [32] fine-tunes both the decoder and LoRA; here, θMD and θLoRA are updated together, and we represent this approach as LoRADecFT.
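As a concrete sketch of this update rule, the module below freezes a projection matrix W and adds the trainable low-rank branch αBAx. The class and the q_proj/v_proj attribute names in the usage comment are illustrative assumptions, not the authors' released code; note that SAM's image encoder fuses the three projections into a single qkv linear layer, so in practice the low-rank branches are applied to the corresponding slices, as done in SAMed [32].

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen projection W plus a trainable low-rank update: x_tilde = Wx + alpha*B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # W (and its bias) stay frozen
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))  # B = 0, so the update starts at zero
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_tilde = W x + alpha * B A x
        return self.base(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

# Wrapping the query and value projections of one attention layer (illustrative names):
# attn.q_proj = LoRALinear(attn.q_proj, r=4)
# attn.v_proj = LoRALinear(attn.v_proj, r=4)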
The server then aggregates the local models with FedAvg [21],

Θ^{n+1} = Σ_{k=1}^{K} α_k Θ_k^n,

where Θ_k^n denotes the local model of the k-th client in round n and α_k is the weight assigned to each client.
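A sketch of this aggregation step, under the assumption that each client uploads only its trainable tensors (for FLAP-SAM, the LoRA matrices and the decoder output layers); the function and variable names below are illustrative, not the paper's code.

import torch

def fedavg(client_states: list[dict[str, torch.Tensor]],
           weights: list[float]) -> dict[str, torch.Tensor]:
    """Weighted FedAvg over the tensors that clients actually transmit."""
    alphas = [w / sum(weights) for w in weights]  # alpha_k, e.g., proportional to local data size
    return {
        key: sum(a * state[key] for a, state in zip(alphas, client_states))
        for key in client_states[0]
    }

# Each client uploads only its trainable tensors, keeping communication small:
# shared = {n: p.detach() for n, p in sam.named_parameters() if p.requires_grad}
# new_global = fedavg([shared_A, shared_B], weights=[n_scans_A, n_scans_B])
# sam.load_state_dict(new_global, strict=False)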
[Figure 2: a 3D input scan passes through the SAM image encoder (12 attention blocks, each with multi-head self-attention over Wq, Wk, Wv, an output projection Wo via scaled dot-product attention, and a feed-forward layer), with trainable LoRA branches (A, B) attached to the query and value projections; the image embedding and the prompt encoder (automatic mode) feed the mask decoder (2 blocks containing self-attention, T2I and I2T cross-attention, an MLP, and layer norm), which outputs the segmentation masks.]
Fig. 2. Architecture of the proposed plug-and-play adapter. Only the decoder output
layers and LoRAs are fine-tuned, while the remaining parameters in SAM are frozen.
4 Experiments
(all data pooled) settings. We test each site's data separately for all fine-tuning strategies, and the results are tabulated in Table 1.
Table 2. Average Dice score across local test data in the federated setting on the Fed-KiTS19 dataset. (Left) Different rank values of LoRA; (right) different low-rank adapter methods.
5 Conclusion
In this work, we have tackled the adaptation of a foundational segmentation model (SAM) to 3D medical image segmentation by incorporating an effective PEFT strategy. We critically analyzed the impact of the LoRA adapters and of the various SAM components in order to make fine-tuning for dense 3D segmentation tasks amenable to FL. Our approach simultaneously addresses the challenges of data scarcity, overfitting, and communication overhead, resulting in a practical and cost-efficient solution. Our current work analyzes various fine-tuning methods in the context of FedAvg [21]; an interesting future direction would be to study the effects of various federated optimization strategies on low-rank adapters for datasets with considerable distribution shifts.
References
1. Bloch, N., Madabhushi, A., Huisman, H., Freymann, J., Kirby, J., Grauer, M., Enquobahrie, A., Jaffe, C., Clarke, L., Farahani, K.: NCI-ISBI 2013 challenge: Automated segmentation of prostate structures. The Cancer Imaging Archive 370(6), 5 (2015)
2. Chen, C., Miao, J., Wu, D., Yan, Z., Kim, S., Hu, J., Zhong, A., Liu, Z., Sun, L., Li, X., et al.: MA-SAM: Modality-agnostic SAM adaptation for 3D medical image segmentation. arXiv preprint arXiv:2309.08842 (2023)
3. Chen, G., Liu, F., Meng, Z., Liang, S.: Revisiting parameter-efficient tuning: Are
we really there yet? In: Proceedings of the 2022 Conference on Empirical Methods
in Natural Language Processing. pp. 2612–2626 (2022)
4. Cheng, D., Qin, Z., Jiang, Z., Zhang, S., Lao, Q., Li, K.: SAM on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035 (2023)
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. In:
ICLR (2021)
6. He, S., Bao, R., Li, J., Grant, P.E., Ou, Y.: Accuracy of Segment Anything Model (SAM) in medical image segmentation tasks. arXiv preprint arXiv:2304.09324 (2023)
7. Heller, N., Isensee, F., Maier-Hein, K.H., Hou, X., Xie, C., Li, F., Nan, Y., Mu, G., Lin, Z., Han, M., et al.: The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Medical Image Analysis p. 101821 (2020)
8. Heller, N., Sathianathen, N., Kalapara, A., Walczak, E., Moore, K., Kaluzniak, H., Rosenberg, J., Blake, P., Rengel, Z., Oestreich, M., et al.: The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445 (2019)
9. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
10. Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J.,
Chen, J., Chen, C., Liu, S., Chi, H., Hu, X., Yue, K., Li, L., Grau, V., Fan, D.P.,
Dong, F., Ni, D.: Segment anything model for medical images? Medical Image
Analysis 92, 103061 (2024)
11. Huix, J.P., Ganeshan, A.R., Haslum, J.F., Söderberg, M., Matsoukas, C., Smith,
K.: Are natural domain foundation models useful for medical image classification?
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision. pp. 7634–7643 (2024)
12. Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.:
Visual prompt tuning. In: ECCV. pp. 709–727. Springer (2022)
13. Jie, S., Deng, Z.H.: FacT: Factor-tuning for lightweight adaptation on vision transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 1060–1068 (2023)
14. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp.
4015–4026 (2023)
15. Lemaître, G., Martí, R., Freixenet, J., Vilanova, J.C., Walker, P.M., Meriaudeau, F.: Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: A review. Computers in Biology and Medicine 60, 8–31 (2015)
16. Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, 429–450 (2020)
17. Litjens, G., Toth, R., Van De Ven, W., Hoeks, C., Kerkstra, S., Van Ginneken, B., Vincent, G., Guillard, G., Birbeck, N., Zhang, J., et al.: Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge. Medical Image Analysis 18(2), 359–373 (2014)
18. Liu, Q., Dou, Q., Heng, P.A.: Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In: MICCAI (2020)
19. Liu, S.Y., Wang, C.Y., Yin, H., Molchanov, P., Wang, Y.C.F., Cheng, K.T., Chen, M.H.: DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353 (2024)
20. Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical
images. Nature Communications 15(1), 654 (2024)
21. McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.:
Communication-efficient learning of deep networks from decentralized data. In:
Artificial intelligence and statistics. pp. 1273–1282. PMLR (2017)
22. Paranjape, J.N., Nair, N.G., Sikder, S., Vedula, S.S., Patel, V.M.: AdaptiveSAM: Towards efficient tuning of SAM for surgical scene segmentation. arXiv preprint arXiv:2308.03726 (2023)
23. Pérez-García, F., Sparks, R., Ourselin, S.: TorchIO: A Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in Biomedicine 208, 106236 (2021)
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
25. Team, B.D.: IXI dataset, https://brain-development.org/ixi-dataset/
26. Ogier du Terrail, J., Ayed, S.S., Cyffers, E., Grimberg, F., He, C., Loeb, R., Mangold, P., Marchand, T., Marfoq, O., Mushtaq, E., et al.: FLamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings. Advances in Neural Information Processing Systems 35, 5315–5334 (2022)
27. Touvron, H., Cord, M., El-Nouby, A., Verbeek, J., Jégou, H.: Three things everyone
should know about vision transformers. In: European Conference on Computer
Vision. pp. 497–515. Springer (2022)
28. Wang, A., Islam, M., Xu, M., Zhang, Y., Ren, H.: SAM meets robotic surgery: An empirical study in robustness perspective. arXiv preprint arXiv:2304.14674 (2023)
29. Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical SAM adapter: Adapting Segment Anything Model for medical image segmentation. arXiv preprint arXiv:2304.12620 (2023)
30. Wu, X., Huang, S., Wei, F.: MoLE: Mixture of LoRA experts. In: The Twelfth International Conference on Learning Representations (2023)
31. Xie, W., Willems, N., Patil, S., Li, Y., Kumar, M.: SAM few-shot fine-tuning for anatomical segmentation in medical images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3253–3261 (2024)
32. Zhang, K., Liu, D.: Customized Segment Anything Model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
33. Zhang, S., Metaxas, D.: On the challenges and perspectives of foundation models
for medical image analysis. Medical Image Analysis 91, 102996 (2024)
A Appendix
Table D. Results across local site test data for different rank values of LoRA in the federated setting on the Fed-KiTS19 dataset.
Table E. Results across local site test data for different PEFT methods in the federated setting on the Fed-KiTS19 dataset.