Federated Learning: Training Models Without Moving Data
Federated learning flips the usual gravity of ML: instead of hauling sensitive data to one warehouse, it sends the model out like a traveling teacher and brings back only the lessons. This guide explains the math and the operational trade-offs.
The core insight
Move the computation to the data, not the data to the computation
Traditional ML has a data gravity problem: you collect data in one place, train one model, and expose it through an API. Federated Learning (McMahan et al., 2017) reverses this. The model travels to each participant, trains on local private data, and sends only weight updates — never raw samples — back to a central aggregator. The aggregator merges the updates into an improved global model and broadcasts it again. The raw data never leaves the originating device. This enables learning from data that is legally, physically, or commercially impossible to centralize.
Cross-device FL
Millions of mobile or IoT devices each contributing a small update per training round. Canonical example: Google Gboard next-word prediction trained on typing patterns without keystrokes leaving your phone.
Cross-silo FL
Tens to hundreds of institutional participants — hospitals training shared diagnostic models, banks detecting fraud patterns — with strict audit requirements and contractual agreements.
Why not just anonymize?
De-anonymization attacks have re-identified individuals from "anonymized" records with success rates above 80% in published studies. FL provides a stronger guarantee because raw data never leaves the source at all.
The algorithm
FedAvg: weighted averaging closes the training loop
FedAvg is the canonical federated algorithm. Think of it as distributed SGD with local accumulation: the server broadcasts the global model, a random subset of clients runs E epochs of local SGD on their private data, clients return their updated weights, and the server computes a weighted average proportional to dataset size. Visualize as a radial diagram: server at the hub, clients as spokes on two rings (inner = selected this round, outer = idle), gradient arrows pulsing inward during aggregation.
Client drift (non-IID)
When clients have heterogeneous data distributions, local updates pull in conflicting directions. Visualize as diverging arrows around the server hub. FedProx adds a proximal regularization term to limit how far local updates stray from the global model.
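The proximal term can be sketched as a single modified SGD step. This is an illustrative sketch, not FedProx's reference implementation; the hyperparameter `mu` and the name `fedprox_local_step` are assumptions:

```python
import numpy as np

def fedprox_local_step(w, global_w, grad, lr=0.01, mu=0.1):
    """One FedProx-style SGD step: loss gradient plus a proximal pull.

    The mu * (w - global_w) term penalizes drift away from the global
    model, limiting how far non-IID clients stray during local training.
    mu = 0 recovers plain FedAvg local SGD.
    """
    return w - lr * (grad + mu * (w - global_w))
```

Larger `mu` keeps clients closer to the global model (less drift, slower local progress); the paper treats it as a tunable trade-off.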
Communication compression
Gradient sparsification (top-K values only), random sketching, and 1-4 bit quantization reduce upload bandwidth by 100-1000x — critical for mobile devices on metered connections.
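Top-K sparsification is the simplest of these to show concretely. A minimal sketch, assuming gradients are NumPy vectors; in practice clients would upload only the surviving (index, value) pairs:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude gradient entries; zero the rest.

    Transmitting just the survivors cuts upload size by roughly
    len(grad) / k. Dropped mass is often carried over to the next
    round via error feedback (not shown here).
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse
```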
Asynchronous FL
Slow or intermittently connected clients stall synchronous rounds. Async aggregation accepts staleness-bounded updates, enabling participation from heterogeneous device pools at the cost of slightly noisier gradients.
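A staleness-weighted merge in the style of FedAsync can be sketched in a few lines. The mixing schedule and `base_mix` default are assumptions for illustration:

```python
def fedasync_update(global_w, client_w, staleness, base_mix=0.5):
    """Merge one asynchronous client update into the global model.

    staleness = rounds elapsed since the client downloaded the model.
    The mixing weight alpha shrinks as staleness grows, so outdated
    updates nudge the global model less than fresh ones.
    """
    alpha = base_mix / (1 + staleness)
    return (1 - alpha) * global_w + alpha * client_w
```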
The global model is the dataset-size-weighted mean of client updates. Clients with more data have proportionally more influence on the global model.
Each client runs E steps of local SGD starting from the same global weights. Higher E speeds convergence but amplifies client drift on non-IID data.
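The full FedAvg round described above fits in a few lines. A minimal sketch assuming weights are NumPy vectors and a caller-supplied `local_sgd` routine (both illustrative, not a reference implementation):

```python
import numpy as np

def fedavg_round(global_w, client_datasets, local_sgd, E=5, lr=0.01):
    """One synchronous FedAvg round: broadcast, local training, weighted mean."""
    updates, sizes = [], []
    for data in client_datasets:
        # Each client starts from the same global weights and runs E local epochs.
        w = local_sgd(global_w.copy(), data, epochs=E, lr=lr)
        updates.append(w)
        sizes.append(len(data))
    # Dataset-size-weighted average: more data, more influence.
    weights = np.array(sizes, dtype=float) / sum(sizes)
    return sum(wi * w for wi, w in zip(weights, updates))
```

In production the round would also sample a random client subset and handle dropouts; this sketch shows only the aggregation math.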
Provable privacy
Differential Privacy + Secure Aggregation: defense in depth
Gradient updates can leak private training data. Gradient inversion attacks can sometimes reconstruct full training images from a single gradient vector. Two complementary defenses combine into a mathematically auditable privacy guarantee. Visualize the epsilon budget as a depleting bar that decreases with each training round — when it reaches zero, training stops.
Privacy target: epsilon <= 8
Bandwidth reduction: 100-1000x
DP-SGD: clip gradient to norm C (bounding sensitivity), then add calibrated Gaussian noise. Privacy cost (epsilon, delta) is tracked via the moments accountant across rounds.
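The clip-then-noise step can be sketched directly. This is an illustrative sketch of the mechanism only; real deployments use a vetted library and a proper accountant, and the function name and `rng` handling are assumptions:

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, C=1.0, sigma=1.0, rng=None):
    """Clip each per-example gradient to L2 norm C, sum, add Gaussian noise.

    Clipping bounds any single example's contribution (sensitivity C);
    noise with standard deviation sigma * C then calibrates the
    (epsilon, delta) guarantee tracked across rounds by the accountant.
    """
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, sigma * C, size=total.shape)
    return (total + noise) / len(per_example_grads)
```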
The output distribution of a DP mechanism changes by at most e^epsilon between any two neighboring datasets. Lower epsilon = stronger privacy, higher utility cost.
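Stated formally, for any two neighboring datasets D and D' (differing in one record) and any set of outputs S, an (epsilon, delta)-DP mechanism M satisfies:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \,\Pr[M(D') \in S] + \delta
```

With delta = 0 this recovers the pure "at most e^epsilon" bound described above.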
Checklist
- Target epsilon <= 8.0 for consumer FL (Google and Apple production standard).
- Use Secure Aggregation (SecAgg) so the server sees only the sum of updates, never individual gradients.
- Track cumulative privacy budget per client across all training rounds — stop training when budget is exhausted.
- Audit for model inversion attacks quarterly using membership inference probes on your own global model.
System visualization
Three diagrams that explain FL to any audience
The most effective FL presentations use layered visuals: start with the radial network (the what), then the round timeline (the how), then the privacy budget chart (the guarantee). Each layer answers a different stakeholder question.
Radial client sync diagram
Server at center. Clients as nodes on two concentric rings: inner ring = selected this round (glowing), outer ring = idle. Gradient arrows pulse inward during aggregation. Model broadcast arrows radiate outward. Color clients by data distribution label: clusters = IID, scatter = non-IID risk.
Training round timeline
Horizontal Gantt chart with one row per client. Columns = rounds. Each cell shows: selected (green), training (yellow), uploading (blue), idle (grey). Makes stragglers and dropout clients immediately visible.
Privacy budget depletion
A vertical bar per client starting at epsilon_max. Each round decreases the bar by the per-round privacy cost. When any bar reaches zero, that client stops participating. Helps stakeholders understand the privacy-utility tradeoff at a glance.
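The depletion logic behind that chart is a simple accounting loop. A sketch assuming linear per-round composition, which is conservative; production systems use a moments/RDP accountant that composes sub-linearly:

```python
def deplete_budget(epsilon_max, per_round_cost, rounds):
    """Remaining epsilon after each round, stopping at exhaustion.

    Returns the per-round history that the depletion bars plot.
    Linear composition is an illustrative worst case.
    """
    remaining = epsilon_max
    history = []
    for _ in range(rounds):
        remaining = max(0.0, remaining - per_round_cost)
        history.append(remaining)
        if remaining == 0.0:
            break  # client stops participating
    return history
```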