FFSML - Thesis Presentation
Presented by,
Lahari Voleti
Contents • Introduction
• Related Work
• Our Proposed Methodology
• Datasets
• Experimental Designs
• Empirical Results
• Conclusions & Future Work
• References
Introduction
• In recent years, mobile devices, as a convenient way of connecting users to
the internet, have become a platform for deploying ML models.
Federated Learning
Wang, H. (2021). GitHub - wangshusen/DeepLearning.
GitHub. https://github.com/wangshusen/DeepLearning
Federated Learning
Li, Q. (2019, July 23). A Survey on Federated Learning Systems: Vision, Hype and Reality. https://arxiv.org/abs/1907.09693
Few-shot classification using Meta Learning
• Few-shot learning means building predictive models that can
efficiently solve prediction tasks using only a limited amount of
data for each object class. Tasks are organized as N-way K-shot
Q-query episodes (e.g., 3-way 2-shot 1-query).
• Meta-learning is learning-to-learn: a technique that makes a
model more experienced on a distribution of tasks and uses this
experience to improve future learning performance.
• The goal is to make machine learning models more human-like,
reduce data collection, and improve predictability.
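The N-way K-shot Q-query episode construction described above can be sketched as follows. This is a minimal illustration of our own, not code from the thesis; `data_by_class` is a hypothetical dict mapping each class label to its list of examples:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, q_query, seed=None):
    """Sample one N-way K-shot Q-query few-shot task.

    data_by_class: dict mapping class label -> list of examples.
    Returns (support, query) lists of (example, episode_label) pairs,
    where episode_label is the class index within this episode (0..n_way-1).
    """
    rng = random.Random(seed)
    # Pick N classes for this episode.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw K + Q distinct examples per class; first K form the support set.
        examples = rng.sample(data_by_class[cls], k_shot + q_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```

For a 3-way 2-shot 1-query episode this yields 6 support examples and 3 query examples, matching the example configuration on the slide.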
Meta-Training
Yasd, J. (2018). GitHub - johnnyasd12/awesome-few-
shot-meta-learning: awesome few shot / meta learning
papers. GitHub.
Problem Statement and Solution Overview
• During the client local updates, for every round of training we minimize the loss function using a proximal
term, as in FedProx [8]:

  min_w h_k(w; w^t) = F_k(w) + (μ/2) ‖w − w^t‖²

• Here F_k(w) is the original local loss, (μ/2)‖w − w^t‖² is the proximal term (which varies according to the
data), w is the client's local model weight parameters, and w^t is the global model parameters at round t.
• Advantage: this specifically addresses the varying resource constraints of clients during federated learning
and the heterogeneity of local data at the clients.
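As a minimal sketch of the proximal objective above (our own illustration, treating a model's weights as a flat list of floats rather than framework tensors):

```python
def fedprox_loss(base_loss, w, w_global, mu):
    """FedProx local objective: h_k(w; w^t) = F_k(w) + (mu/2) * ||w - w^t||^2.

    base_loss: the client's original loss F_k(w), already computed.
    w:         client local weight parameters (flat list of floats).
    w_global:  global model parameters w^t at the current round.
    mu:        proximal coefficient controlling how far local updates may drift.
    """
    prox = 0.5 * mu * sum((wi - gi) ** 2 for wi, gi in zip(w, w_global))
    return base_loss + prox
```

Setting mu = 0 recovers the plain local loss, which is one way to see why varying mu over a modest range can leave prediction performance largely unchanged.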
Few-shot learning using meta-learning
• Relation Networks:
• They use one module, called the embedding module, for generating
feature maps, and a relation module for calculating a
relation score between support and query images.
❖ Drawbacks: the architecture is more cumbersome, and
adaptation to new few-shot tasks yields only minimal accuracy.
Tan, F. (2022). Learning to Compare: Relation Network for Few-shot Learning. https://medium.com/mlearning-
ai/learning-to-compare-relation-network-for-few-shot-learning-fa9c40c22701
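To illustrate the relation-module interface described above, here is a toy sketch of our own. The actual paper uses a small CNN as the relation module; this linear stand-in only shows how concatenated support and query features map to a score in (0, 1):

```python
import math

def relation_score(support_feat, query_feat, weights, bias):
    """Toy relation module: concatenate the support and query feature
    vectors, then apply a single linear layer + sigmoid to produce a
    relation score in (0, 1). Higher score = stronger match.

    weights must have length len(support_feat) + len(query_feat).
    """
    combined = support_feat + query_feat  # list concatenation = feature concat
    z = sum(w * x for w, x in zip(weights, combined)) + bias
    return 1.0 / (1.0 + math.exp(-z))    # sigmoid squashes to (0, 1)
```

A query image is classified by computing its relation score against each class's support features and picking the class with the highest score.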
Few-shot learning using meta-learning
Step 3: Individual clients fine-tune model M with their distinct support sets S1, S2, S3 and perform
prediction on their query sets Q1, Q2, Q3 using their respective fine-tuned models M'1, M'2, M'3.
(1) compare the performance of three federated learning approaches on few-shot tasks:
Federated Averaging (FedAvg), FedProx, and FedPer;
(2) explore the effect of data heterogeneity (using different datasets on different edge devices) on few-shot
learning performance.
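For reference, the server aggregation step of FedAvg, the first approach compared above, can be sketched as a data-size-weighted average of client parameters. This is our own simplified flat-vector view, not the thesis implementation:

```python
def fedavg(client_weights, client_sizes):
    """FedAvg server step: weighted average of client model parameters.

    client_weights: list of flat parameter lists, one per client.
    client_sizes:   number of local training examples per client,
                    used to weight each client's contribution.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n / total for w, n in zip(client_weights, client_sizes))
        for i in range(n_params)
    ]
```

With equal client data sizes this reduces to a plain mean; clients with more data pull the global model further toward their local solution.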
Datasets
Fashion-MNIST Dataset
• It is a dataset of Zalando's article images consisting of 70,000 images from 10 different classes.
The images show different types of clothing such as trousers and shirts. Every image is a 28 x 28 grayscale
image. [4]
Omniglot dataset
• It contains 1,623 handwritten characters with 20 samples each (1623 × 20 = 32,460 images). [5]
CIFAR-100 dataset
• We also limit the number of shots in the few-shot learning tasks on the clients. We consider three few-shot learning task
configurations, namely:
❑ 3-way 5-shot 10-query
❑ 5-way 5-shot 10-query
❑ 5-way 5-shot 5-query
• The feature extractor we use in our Prototypical Networks is ResNet18, and optimization of loss is done using SGD.
• Software used: Google Colaboratory Pro+ with GPU (NVIDIA P100), 52 GB RAM
• Performance measure: Accuracy
• Number of rounds: 20
• Number of clients: either 2 or 3
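As a minimal sketch of the Prototypical Networks classification rule used above (our own illustration, assuming embeddings have already been produced by a feature extractor such as ResNet18): class prototypes are the mean of the support embeddings, and each query is assigned to the nearest prototype:

```python
def prototypes(support_embeddings):
    """Compute one prototype per class as the mean support embedding.

    support_embeddings: dict mapping label -> list of embedding vectors.
    """
    protos = {}
    for label, vecs in support_embeddings.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos

def classify(query_vec, protos):
    """Predict the label whose prototype is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query_vec, protos[label]))
```

In training, the negative distances act as logits and SGD updates the feature extractor so that same-class embeddings cluster around their prototype.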
Experimental Designs
Experimental Design of FedPer Algorithm
Of these two configurations, we use the first (78 base and 42 personalized layers) for the rest of
the experiments; it places less emphasis on the importance of local data.
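A minimal sketch of the FedPer idea behind this split, with hypothetical layer names of our own: only the base layers are averaged on the server, while the personalized layers never leave the clients:

```python
def split_layers(layer_names, n_base):
    """FedPer-style split: the first n_base layers are shared (base),
    the remainder are personalized and stay local to each client."""
    return layer_names[:n_base], layer_names[n_base:]

def fedper_aggregate(client_base_weights):
    """FedPer server step: average only the shared base-layer weights
    (flat lists of floats, one per client). Personalized-layer weights
    are deliberately excluded from aggregation."""
    n = len(client_base_weights)
    dim = len(client_base_weights[0])
    return [sum(w[i] for w in client_base_weights) / n for i in range(dim)]
```

Choosing 78 base layers out of 120 means roughly two thirds of the network is shared, which is the "considerable amount of base layers" configuration carried through the remaining experiments.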
Experimental Design of FedProx Algorithm
Question: What is the effect of the proximal term in FedProx?
Effect of FedProx proximal term on Prediction Performance using Fashion-MNIST
Fashion-MNIST(2 clients)
Empirical Results
Part A: Single-Dataset Scenario (Homogeneous data)
Omniglot (only 2-client scenarios):
• In the figures below, we can see sample individual round performances of
the clients (green) and the aggregated server model (red).
[Figures: 5-5-10 and 5-5-5 configurations, 2 clients]
Fashion-MNIST (Both 2 and 3-client scenarios):
CIFAR-100 (Both 2-client and 3-client scenarios):
Empirical Results
Part B: Multiple-Dataset Scenario (Heterogeneous Data)
Important Investigations
1) How well do the edge devices and the server-aggregated model perform on a completely new task
they haven't seen before?
2) How do these edge devices and the server-aggregated model perform under heterogeneous data?
Experimental Results-Multiple Dataset Scenario
Server pre-training accuracy, client fine-tune accuracy, and server aggregated M' testing accuracy:

Clients & algo | N-way | K-shot | Q-query | Server base M train acc (CIFAR, Omniglot) | C1 acc S1-Q1 (CIFAR) | C2 acc S2-Q2 (Omniglot) | C3 acc S3-Q3 (Fashion-MNIST) | M' on S-Q (CIFAR) | M' on S-Q (Omniglot) | M' on S-Q (Fashion-MNIST)
2, FedAvg | 3 | 5 | 10 | 63.933 | 41.33 | 90.00 | - | 41.16 | 79.33 | 66.50
2, FedAvg | 5 | 5 | 5 | 53.02 | 26.285 | 85.904 | - | 33.40 | 76.20 | 60.60
[Figures: Omniglot, 5-5-5 configuration]
Conclusions
• The main observations and conclusions from the empirical results on our proposed framework are as follows:
1) Varying the ratio of base and personalization layers has shown that a considerable number of base layers is
good for the FedPer algorithm.
2) Varying the FedProx proximal term between 0.01 and 1.5 does not have a significant effect on the prediction
performance for our proposed approach using FedProx for federated learning.
3) For few-shot classification tasks with reasonable difficulty (> 50% accuracy), the proposed approach is able
to improve the edge devices’ individual prediction performance and improve significantly on the global
model (on the server) using any of the federated learning approaches when the few-shot tasks are from the
same datasets.
4) Unsurprisingly, the aggregated (global) models from FedPer perform the best most frequently, followed by
aggregated models from FedProx.
5) The data heterogeneity problem affects the prediction performance of our proposed solution no matter which
federated learning approach is used.
Future Work
❖ In our thesis, we perform experiments under the assumption that the data used by the server and clients to
generate few-shot tasks come from all classes (even though the data for the server and clients are non-
overlapping). Future work includes testing our proposed approach in an experimental setting where the server
and the clients have data from different (non-overlapping) classes.
❖ Improve our proposed framework to handle the heterogeneous data problem using transfer learning approaches.
References
[1] Wang, Y., Yao, Q., Kwok, J. T., & Ni, L. M. (2021). Generalizing from a Few Examples. ACM Computing Surveys, 53(3),
1–34. https://doi.org/10.1145/3386252
[2] Li, Q. (2019). A Survey on Federated Learning Systems: Vision, Hype and Reality. https://arxiv.org/abs/1907.09693
[3] Benchmarks CIFAR-100. (n.d.). [Database]. CIFAR-100; benchmarks.ai. https://benchmarks.ai/cifar-100
[4] Zalando Research. (2017). GitHub - zalandoresearch/fashion-mnist: A MNIST-like fashion product database [Dataset].
Retrieved from https://github.com/zalandoresearch/fashion-mnist
[5] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program
induction (Vol. 350, Issue 6266). https://doi.org/10.1126/science.aab3050
[6] Snell, J. (2017). Prototypical Networks for Few-shot Learning. Advances in Neural Information Processing Systems 30
(NIPS 2017). Retrieved from https://papers.nips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html
[7] McMahan, B. H., Moore, E., Ramage, D., Hampson, S., & Arcas, B. A. (2016). Communication-Efficient Learning of
Deep Networks from Decentralized Data. arXiv:1602.05629 [Cs.LG]. Retrieved from https://arxiv.org/abs/1602.05629
[8] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., & Smith, V. (2020). Federated Optimization In Heterogeneous
Networks. arXiv:1812.06127 [Cs.LG]. Retrieved from https://arxiv.org/pdf/1812.06127
[9] Arivazhagan, M. G., Agarwal, V., Singh, A. K., & Choudhary, S.(2019). Federated Learning with Personalization Layers.
arXiv:1912.00818 [Cs.LG]. Retrieved from https://arxiv.org/abs/1912.00818
Thank you
Questions?
Appendix: Additional Results
Federated learning algorithms on Few-shot data of Fashion-MNIST
• In the figures below, we can see sample individual round performances of the clients
(green) and the aggregated server model (red).
[Figures: 3-5-10 and 5-5-5 configurations; 2-client and 3-client scenarios; 5-5-10 and 5-5-5 with 3 clients]
Observations on Fashion-MNIST:
Omniglot:
• In the figures below, we can see sample individual round performances of
the clients (green) and the aggregated server model (red).
[Figures: 3-5-10 and 5-5-5 configurations; 2-client and 3-client scenarios; 5-5-10 and 5-5-5 with 3 clients]
Observations on Single-Dataset Scenario
CIFAR-100:
1. For this dataset there is only a slight improvement (29% to 39%) in aggregated server testing performance for
the 5-5-10 and 5-5-5 configurations.
2. None of the three federated learning methods outperforms the others as their aggregated global models do not
consistently help improve clients’ predictive performance.
3. No consistent improvement (or convergence) in prediction performance for the few-shot learning task as more
rounds are iterated.
4. Time taken:
- For 2-client prediction, each experimental trial takes about 275 seconds (FedPer, 3-5-10) to
452 seconds (FedPer, 5-5-10).
- For 3 clients, an experimental trial takes 380 seconds (FedPer, 3-5-10) to 609 seconds (FedProx, 5-5-5).
Experimental Results-Multiple Dataset Scenario
Clients & algo | N-way | K-shot | Q-query | Server base M train acc (CIFAR, Omniglot) | C1 acc S1-Q1 (CIFAR) | C2 acc S2-Q2 (Omniglot) | M' on S-Q (Fashion-MNIST) | M' on S-Q (CIFAR) | M' on S-Q (Omniglot)
2, FedAvg | 3 | 5 | 10 | 63.933 | 41.33 | 90.00 | 66.50 | 41.16 | 79.33
2, FedAvg | 5 | 5 | 10 | 60.62 | 32.30 | 91.400 | 59.90 | 30.30 | 87.50
2, FedAvg | 5 | 5 | 5 | 53.02 | 26.285 | 85.904 | 60.60 | 33.40 | 76.20
CIFAR-100
Comparing Federated Algorithms in the Single-Data and Multiple-Data Scenarios (5-5-5)
We investigate how the global model reacts to few-shot learning tasks on an unseen dataset when relevant
information is provided only during client model fine-tuning.
SINGLE-DATA Scenario
MULTIPLE-DATA Scenario MULTIPLE-DATA Scenario
With and without F-MNIST
Observations:
- When the client has not been trained on Fashion-MNIST, server performance does not improve.
- When the client has been trained on Fashion-MNIST, the server model trained using FedProx is the best
performer among all three aggregation algorithms.
Comparing the Single-Dataset and Multiple-Dataset Scenarios
1. Fashion-MNIST test accuracies for single-dataset scenarios are in the range of 80% to 86%, whereas for
multiple-dataset scenarios they are only 60% to 70% across all few-shot learning task configurations.
2. CIFAR-100 under multiple-dataset scenarios has accuracies in the range of 27% to 33% for the 5-5-5 and 5-5-
10 configurations and 40% to 48% for the 3-5-10 configuration, whereas in the single-dataset case it is just
30% to 35% across all three configurations.
3. Omniglot in single-dataset scenarios is more accurate, in the range of 88% to 90%, whereas in multiple-dataset
scenarios its accuracy is only between 71% and 87%.