Federated Learning: Training Models Without Moving Data


Federated learning flips the usual gravity of ML: instead of hauling sensitive data to one warehouse, it sends the model out like a traveling teacher and brings back only the lessons. This guide explains the math and the operational trade-offs.

11 min read · March 17, 2026
Federated Learning · Privacy · Differential Privacy · FedAvg · Edge ML

The core insight

Move the computation to the data, not the data to the computation

Traditional ML has a data gravity problem: you collect data in one place, train one model, and expose it through an API. Federated Learning (McMahan et al., 2017) reverses this. The model travels to each participant, trains on local private data, and sends only weight updates — never raw samples — back to a central aggregator. The aggregator merges the updates into an improved global model and broadcasts it again. The raw data never leaves the originating device. This enables learning from data that is legally, physically, or commercially impossible to centralize.

Consumer

Cross-device FL

Millions of mobile or IoT devices each contributing a small update per training round. Canonical example: Google Gboard next-word prediction trained on typing patterns without keystrokes leaving your phone.

Enterprise

Cross-silo FL

Tens to hundreds of institutional participants — hospitals training shared diagnostic models, banks detecting fraud patterns — with strict audit requirements and contractual agreements.

Motivation

Why not just anonymize?

De-anonymization attacks can re-identify individuals from "anonymized" records with 80%+ success rate. FL provides a stronger guarantee because raw data never leaves the source at all.

The algorithm

FedAvg: weighted averaging closes the training loop

FedAvg is the canonical federated algorithm. Think of it as distributed SGD with local accumulation: the server broadcasts the global model, a random subset of clients run E epochs of local SGD on their private data, clients return their updated weights, and the server computes a weighted average proportional to dataset size. Visualize as a radial diagram: server at the hub, clients as spokes on two rings (inner = selected this round, outer = idle), gradient arrows pulsing inward during aggregation.
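The round described above can be sketched in a few lines. This is a minimal toy illustration, not a production implementation: the least-squares local objective, the function names, and the `clients` list of `(X, y)` arrays are all stand-ins chosen for the example.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """Run E epochs of full-batch local SGD on a least-squares
    objective (a stand-in for the client's real loss)."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(w_global, clients, lr=0.1, epochs=5):
    """One FedAvg round: broadcast the global model, train locally on
    each client, then average weighted by local dataset size n_k / n."""
    n_total = sum(len(y) for _, y in clients)
    w_new = np.zeros_like(w_global)
    for X, y in clients:
        w_k = local_sgd(w_global, X, y, lr, epochs)
        w_new += (len(y) / n_total) * w_k
    return w_new
```

In a real deployment only a sampled subset of clients participates each round and the updates travel over the network; here every client is visited in-process to keep the averaging step visible.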

Core challenge

Client drift (non-IID)

When clients have heterogeneous data distributions, local updates pull in conflicting directions. Visualize as diverging arrows around the server hub. FedProx adds a proximal regularization term to limit how far local updates stray from the global model.
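FedProx's proximal term can be shown with a one-line change to the local update. A hedged sketch: `grad_fn` stands for the client's true loss gradient, and the coefficient `mu` is a tunable hyperparameter, not a value prescribed by the paper.

```python
import numpy as np

def fedprox_local_step(w, w_global, grad_fn, lr=0.1, mu=0.01):
    """One local step with FedProx's proximal term: the extra
    mu * (w - w_global) gradient pulls the local model back toward
    the global weights, limiting client drift on non-IID data."""
    grad = grad_fn(w) + mu * (w - w_global)
    return w - lr * grad
```

With `mu = 0` this reduces to plain local SGD; larger `mu` anchors clients more tightly to the global model at the cost of slower local adaptation.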

Efficiency

Communication compression

Gradient sparsification (top-K values only), random sketching, and 1-4 bit quantization reduce upload bandwidth by 100-1000x — critical for mobile devices on metered connections.
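Top-K sparsification, the first of those techniques, is simple to sketch. One caveat: practical systems pair this with an error-feedback buffer that carries the zeroed residual into the next round; that buffer is omitted here for brevity.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of the gradient;
    everything else is zeroed before upload. Only the k (index, value)
    pairs need to be transmitted."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse
```

For a million-parameter model with k = 1000, the client uploads roughly 0.1% of the dense gradient, which is where the 100-1000x figures come from.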

Scale

Asynchronous FL

Slow or intermittently connected clients stall synchronous rounds. Async aggregation accepts staleness-bounded updates, enabling participation from heterogeneous device pools at the cost of slightly noisier gradients.
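Staleness-bounded aggregation can be sketched as a mixing rule that down-weights old updates. The `alpha / (1 + staleness)` weighting below is one illustrative choice (in the spirit of FedAsync-style schemes), not the only option.

```python
import numpy as np

def async_update(w_global, w_client, client_round, server_round, alpha=0.5):
    """Mix a (possibly stale) client update into the global model,
    down-weighting it by how many rounds old its base model is."""
    staleness = server_round - client_round
    weight = alpha / (1 + staleness)
    return (1 - weight) * w_global + weight * w_client
```

A fresh update (staleness 0) moves the global model by a factor `alpha`; an update five rounds old moves it six times less, which is the "slightly noisier gradients" trade-off in action.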

w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{(k)}, \quad n = \sum_{k} n_k

Global model is the dataset-size-weighted mean of client updates. Clients with more data have proportionally more influence on the global model.

w_{t+1}^{(k)} = w_t - \eta\, \nabla \mathcal{L}_k(w_t)

Each client runs E epochs of local SGD starting from the same global weights. A higher E speeds convergence but amplifies client drift on non-IID data.

Provable privacy

Differential Privacy + Secure Aggregation: defense in depth

Gradient updates can leak private training data. Gradient inversion attacks can sometimes reconstruct full training images from a single gradient vector. Two complementary defenses combine into a mathematically auditable privacy guarantee. Visualize the epsilon budget as a depleting bar that decreases with each training round — when it reaches zero, training stops.

Privacy target: epsilon <= 8

Bandwidth reduction: 100-1000x

\tilde{g}_k = \text{clip}(g_k, C) + \mathcal{N}\!\left(0,\,\sigma^2 C^2 \mathbf{I}\right)

DP-SGD: clip gradient to norm C (bounding sensitivity), then add calibrated Gaussian noise. Privacy cost (epsilon, delta) is tracked via the moments accountant across rounds.
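The clip-then-noise step translates directly to code. A minimal sketch of the sanitizer alone: the moments-accountant bookkeeping that converts `sigma` and the sampling rate into an (epsilon, delta) guarantee is a separate component, handled in practice by libraries such as Opacus or TensorFlow Privacy.

```python
import numpy as np

def dp_sanitize(grad, clip_norm, sigma, rng):
    """Clip the gradient to L2 norm C (bounding sensitivity), then add
    Gaussian noise with standard deviation sigma * C per coordinate."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = grad * scale
    noise = rng.normal(0.0, sigma * clip_norm, size=grad.shape)
    return clipped + noise
```

Clipping is what makes the noise scale meaningful: without a bound on any single gradient's norm, no finite amount of noise yields a DP guarantee.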

M(\mathcal{D}) \text{ is } (\varepsilon,\delta)\text{-DP iff } \Pr[M(\mathcal{D})\in S] \leq e^{\varepsilon}\Pr[M(\mathcal{D}^{\prime})\in S] + \delta

The output distribution of a DP mechanism changes by at most a factor of e^epsilon between any two neighboring datasets (up to the slack delta). Lower epsilon means stronger privacy but a higher utility cost.

Checklist

  • Target epsilon <= 8.0 for consumer FL (Google and Apple production standard).
  • Use Secure Aggregation (SecAgg) so the server sees only the sum of updates, never individual gradients.
  • Track cumulative privacy budget per client across all training rounds — stop training when budget is exhausted.
  • Audit for model inversion attacks quarterly using membership inference probes on your own global model.
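The third checklist item, per-client budget tracking, can be sketched as a small accountant. This toy version uses simple composition (summing per-round costs), which is looser than the moments accountant the text recommends; the class and method names are illustrative.

```python
class PrivacyBudget:
    """Track cumulative epsilon per client under simple composition.
    Real deployments use the moments accountant (RDP) for tighter bounds."""

    def __init__(self, epsilon_max=8.0):
        self.epsilon_max = epsilon_max
        self.spent = {}  # client_id -> epsilon spent so far

    def can_participate(self, client_id, round_cost):
        """A client may join a round only if charging it stays in budget."""
        return self.spent.get(client_id, 0.0) + round_cost <= self.epsilon_max

    def charge(self, client_id, round_cost):
        """Deduct one round's privacy cost; refuse if the budget is exhausted."""
        if not self.can_participate(client_id, round_cost):
            raise RuntimeError(f"client {client_id}: privacy budget exhausted")
        self.spent[client_id] = self.spent.get(client_id, 0.0) + round_cost
```

The server would call `can_participate` during client selection each round, so exhausted clients simply drop out of the sampling pool rather than failing mid-round.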

System visualization

Three diagrams that explain FL to any audience

The most effective FL presentations use layered visuals: start with the radial network (the what), then the round timeline (the how), then the privacy budget chart (the guarantee). Each layer answers a different stakeholder question.

Radial client sync diagram

Server at center. Clients as nodes on two concentric rings: inner ring = selected this round (glowing), outer ring = idle. Gradient arrows pulse inward during aggregation. Model broadcast arrows radiate outward. Color clients by data distribution label: clusters = IID, scatter = non-IID risk.

Training round timeline

Horizontal Gantt chart with one row per client. Columns = rounds. Each cell shows: selected (green), training (yellow), uploading (blue), idle (grey). Makes stragglers and dropout clients immediately visible.

Privacy budget depletion

A vertical bar per client starting at epsilon_max. Each round decreases the bar by the per-round privacy cost. When any bar reaches zero, that client stops participating. Helps stakeholders understand the privacy-utility tradeoff at a glance.