2024 MTH058 Lecture07 FederatedLearning
Devices communicate with a central server periodically to learn a global model.
FL helps preserve user privacy and reduces strain on the network by keeping data localized.
Federated learning: A definition
Generate a global model shared by all nodes by exchanging parameters between these local nodes.
Federated learning workflow
• Only the updated model is sent to the server side.
• The actual data based on user behavior does not need to be included.
Federated learning workflow
• The client updates are aggregated into a new global model on the server side, which is then distributed to all client devices.
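To make the workflow concrete, here is a minimal sketch of one communication round, assuming the model is a NumPy weight vector; the helper names (local_update, aggregate) are illustrative and not taken from any FL framework. Only model weights cross the network, never the clients' raw data.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.1, epochs=1):
    """Client side: train on private data and return only the updated weights."""
    w = global_weights.copy()
    X, y = local_data                      # raw data never leaves this function
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of a least-squares loss
        w -= lr * grad
    return w                               # only the model update is communicated

def aggregate(client_weights, client_sizes):
    """Server side: weighted average of the client models."""
    total = sum(client_sizes)
    return sum(n / total * w for w, n in zip(client_weights, client_sizes))

# One communication round over two clients with private datasets.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]
global_w = np.zeros(3)
updates = [local_update(global_w, data) for data in clients]
global_w = aggregate(updates, [len(y) for _, y in clients])  # new global model, sent back to all clients
```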
Centralized vs. Decentralized techniques
Types of federated learning
• Centralized federated learning
• A central server coordinates all the participating nodes during the learning process → possibly a bottleneck of the system.
Diao, Enmao, Jie Ding, and Vahid Tarokh. "HeteroFL: Computation and communication efficient federated learning for heterogeneous clients." ICLR 2021.
Federated vs. Distributed learning
• Distributed learning aims at parallelizing computing power, while federated learning aims at training on heterogeneous datasets.
Canonical problem formulation
• FL was originally introduced as a new setting for distributed optimization with a few distinctive properties.
Canonical problem formulation
• Objective: Learn a single global statistical model from data stored on tens to potentially millions of remote devices.
• In particular, minimize the following objective function:

    w^{*} = \arg\min_{w \in \mathbb{R}^{d}} F(w) := \sum_{k=1}^{K} p_k F_k(w)

• K: the total number of devices.
• F_k: the local objective function for the k-th device, defined as the empirical risk over local data.
• p_k: the relative impact of each device, with p_k ≥ 0 and \sum_{k=1}^{K} p_k = 1.
• It is user-defined, usually p_k = 1/K or p_k = n_k/n, where n_k is the number of samples on device k and n is the total number of samples over all devices.
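A small illustrative sketch of this objective, assuming each local objective F_k is the mean-squared error over device k's data and p_k = n_k/n; the data and helper names are made up.

```python
import numpy as np

def local_objective(w, X_k, y_k):
    """F_k(w): empirical risk (here, mean-squared error) over device k's local data."""
    return np.mean((X_k @ w - y_k) ** 2)

def global_objective(w, devices):
    """F(w) = sum_k p_k F_k(w) with p_k = n_k / n (samples on device k over total samples)."""
    n = sum(len(y_k) for _, y_k in devices)
    return sum(len(y_k) / n * local_objective(w, X_k, y_k) for X_k, y_k in devices)

# Three devices with different numbers of local samples.
rng = np.random.default_rng(0)
devices = [(rng.normal(size=(n_k, 3)), rng.normal(size=n_k)) for n_k in (20, 80, 100)]
print(global_objective(np.zeros(3), devices))  # F(w) evaluated at w = 0
```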
The FedAvg (or Local SGD) method
• The clients optimize their local objective functions for multiple steps to obtain θ_i^t.
• Then, they send the pseudo-gradients Δ_i^t = θ^t − θ_i^t to the server.
• θ^t: the initial (global) state; θ_i^t: the local update at client i at timestep t.

    \theta^{t+1} = \theta^{t} - \alpha_t \sum_{i=1}^{K} p_i \Delta_i^{t}
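A minimal sketch of one FedAvg round under these definitions, assuming full participation and simple quadratic local objectives (the function names and constants are illustrative): clients run several local SGD steps, send Δ_i^t = θ^t − θ_i^t, and the server applies a weighted pseudo-gradient step, which with α_t = 1 reduces to weighted model averaging.

```python
import numpy as np

def client_local_sgd(theta_server, grad_fn, steps=10, lr=0.05):
    """Run several local SGD steps starting from the current server model theta^t."""
    theta = theta_server.copy()
    for _ in range(steps):
        theta -= lr * grad_fn(theta)
    return theta_server - theta            # pseudo-gradient Delta_i^t = theta^t - theta_i^t

def server_update(theta, deltas, weights, server_lr=1.0):
    """theta^{t+1} = theta^t - alpha_t * sum_i p_i Delta_i^t (alpha_t = 1 recovers weighted averaging)."""
    return theta - server_lr * sum(p * d for p, d in zip(weights, deltas))

# Two clients with different quadratic objectives F_i(theta) = 0.5 * ||theta - c_i||^2.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad_fns = [lambda th, c=c: th - c for c in centers]

theta = np.zeros(2)
for _ in range(20):                                      # communication rounds
    deltas = [client_local_sgd(theta, g) for g in grad_fns]
    theta = server_update(theta, deltas, weights=[0.5, 0.5])
print(theta)   # approaches the average of the two client optima, (0.5, 0.5)
```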
FedAvg: The “client drift” problem
• Clients make additional SGD steps locally → FedAvg converges much faster, both in the number of rounds and in wall-clock time.
FedAvg: The “client drift” problem
A toy 2D setting with two clients and quadratic objectives that illustrates the convergence issues of FedAvg. Left: convergence trajectories in the parameter space. Right: convergence in terms of distance from the global optimum. Each trajectory corresponds to a run of federated optimization from a different starting point in the parameter space. More local SGD steps per round speed up training, but the progress eventually stagnates at an inferior point farther away from the global optimum.
Al-Shedivat, Maruan, Jennifer Gillenwater, Eric Xing, and Afshin Rostamizadeh. "Federated learning via posterior averaging: A new perspective and practical algorithms." ICLR 2021.
Image credit: CMU ML Blog
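The sketch below reproduces the flavor of this toy experiment, assuming two quadratic client objectives with mismatched curvatures (the specific matrices, step sizes, and starting point are made up): with more local steps per round, FedAvg's fixed point drifts farther from the true global optimum.

```python
import numpy as np

# Two quadratic client objectives F_i(x) = 0.5 * (x - b_i)^T A_i (x - b_i) with different curvature.
A = [np.diag([1.0, 10.0]), np.diag([10.0, 1.0])]
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

def fedavg(local_steps, rounds=200, lr=0.02):
    """Each round, both clients take `local_steps` gradient steps; the server averages the models."""
    x = np.array([2.0, 2.0])
    for _ in range(rounds):
        client_models = []
        for A_i, b_i in zip(A, b):
            x_i = x.copy()
            for _ in range(local_steps):
                x_i -= lr * A_i @ (x_i - b_i)     # local gradient step
            client_models.append(x_i)
        x = np.mean(client_models, axis=0)        # server aggregation (equal weights)
    return x

# Minimizer of the global objective sum_i F_i(x) (equal client weights do not change the argmin).
x_star = np.linalg.solve(sum(A), sum(A_i @ b_i for A_i, b_i in zip(A, b)))

for E in (1, 10, 100):
    x_E = fedavg(local_steps=E)
    print(f"E={E:3d}  distance to global optimum: {np.linalg.norm(x_E - x_star):.4f}")
# More local steps make faster progress per round, but the final point drifts away from x_star.
```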
Federated Posterior Averaging (2021)
• FedPA uses stochastic gradient Markov chain Monte Carlo (SG-MCMC) for approximate sampling from local posteriors on the clients.
FedPA vs. FedAvg in the toy 2D setting with two clients and quadratic objectives.
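As a heavily simplified illustration of the client-side idea only (not the full FedPA algorithm), the sketch below uses stochastic gradient Langevin dynamics, one member of the SG-MCMC family, to draw approximate samples from a toy local posterior and estimate its mean; FedPA's actual client delta also uses the estimated local covariance, which is omitted here, and all names and constants are made up.

```python
import numpy as np

def sgld_local_samples(theta0, grad_log_post, n_samples=500, step=2e-2, burn_in=200, seed=0):
    """Stochastic Gradient Langevin Dynamics on a client:

    theta_{k+1} = theta_k + (step/2) * grad log p(theta | local data) + N(0, step * I)
    yields approximate samples from the local posterior.
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    samples = []
    for k in range(burn_in + n_samples):
        noise = rng.normal(scale=np.sqrt(step), size=theta.shape)
        theta = theta + 0.5 * step * grad_log_post(theta) + noise
        if k >= burn_in:
            samples.append(theta.copy())
    return np.array(samples)

# Toy local posterior: Gaussian with mean mu and diagonal covariance sigma2,
# so grad log p(theta) = -(theta - mu) / sigma2.
mu, sigma2 = np.array([1.0, -2.0]), np.array([0.5, 0.2])
grad_log_post = lambda th: -(th - mu) / sigma2

theta_server = np.zeros(2)
samples = sgld_local_samples(theta_server, grad_log_post)
local_mean = samples.mean(axis=0)          # approximate local posterior mean
delta = theta_server - local_mean          # simplified client delta (FedPA also uses the covariance)
print(local_mean, delta)
```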
Federated learning platforms
TensorFlow Federated
Another application of federated learning is personal healthcare, via learning over heterogeneous electronic medical records distributed across multiple hospitals.
Image credit: CMU ML Blog