DevOps to MLOps: Building the Shared Delivery Muscle
DevOps taught teams to ship code like a disciplined factory line; MLOps adds a third moving part, data, and suddenly the factory floor shifts under your feet. This guide shows what transfers cleanly and what breaks.
The new variable
Code + Config was two variables. Data makes it three.
In standard DevOps, the same code and config deterministically produce the same artifact. Add a training dataset and that guarantee breaks. Two identical codebases trained on different data snapshots produce different models. That single fact explains why every DevOps practice needs a data-aware twin in MLOps. Visualize as a Venn diagram: the DevOps circle (CI/CD, IaC, observability) and the MLOps circle (data versioning, model registry, drift monitoring) share a large intersection — but the non-overlapping parts are where teams get surprised.
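A minimal sketch of that identity shift, in Python: the trained artifact is a function of three inputs, so pinning code and config alone no longer pins the model. The hashing scheme and input names are illustrative, not any specific tool's format.

```python
import hashlib

def artifact_id(code_sha: str, config: str, data_snapshot_sha: str) -> str:
    """Identify a trained model by all three inputs, not code alone."""
    h = hashlib.sha256()
    for part in (code_sha, config, data_snapshot_sha):
        h.update(part.encode())
        h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()[:12]

# Same code + config, different data snapshot => different artifact.
a = artifact_id("9f1c2d3", "lr=0.01", "data-2024-01-01")
b = artifact_id("9f1c2d3", "lr=0.01", "data-2024-02-01")
```

The same pair of SHAs always yields the same ID, which is what makes the fingerprint usable as a registry key.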
What DevOps gives you free
Version control, CI pipelines, containerization, blue/green deploys, and observability dashboards — all transfer unchanged to model serving. Use them; do not reinvent them.
What MLOps adds
Data versioning, feature stores, experiment tracking, model registries, and drift-triggered retraining pipelines. Bolt these onto existing CI/CD; they extend the loop, not replace it.
Where teams get surprised
Models degrade silently without retraining, fail on unseen distributions, and encode training-set biases. Unlike a crashing service, a drifting model gives no stack trace — only slow KPI decline.
CALMS applied to ML
Five pillars — all five have direct ML analogs
CALMS is DevOps culture in an acronym. Animate as five concentric rings lighting up progressively as a team matures through each pillar. Skipping any ring leaves a gap that surfaces as toil or outages later.
Culture
Blameless postmortems for model failures. Shared on-call across SWE and DS — the team that trains the model also pages for it. Pair reviews across discipline boundaries reduce "throw over the wall" launches.
Automation
IaC for training clusters, pipeline-as-code for data and model workflows, automated evaluation gates. Every step reproducible from a single git SHA. Animate the pipeline DAG lighting up green as each stage passes.
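The "pipeline-as-code with evaluation gates" idea reduces to a few lines: run stages in order, stop the moment a gate fails. Stage names and the threshold below are invented for illustration; a real setup would sit inside an orchestrator such as Airflow or Kubeflow.

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order; a failed gate halts everything downstream."""
    passed = []
    for name, stage in stages:
        if not stage():
            raise RuntimeError(f"stage '{name}' failed; downstream stages skipped")
        passed.append(name)
    return passed

completed = run_pipeline([
    ("validate_data", lambda: True),
    ("train",         lambda: True),
    ("evaluate_gate", lambda: 0.91 >= 0.90),  # e.g. AUC must clear a floor
])
```

Because each stage is a plain callable keyed to one git SHA, re-running the whole list reproduces the run end to end.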
Lean
Shadow deployments before production traffic. Offline A/B evaluation before any canary. Minimize experiment batch size for faster feedback loops. Every step should produce a measurable signal before the next step starts.
Measurement
Track DORA metrics (deployment frequency, lead time, change failure rate, MTTR) alongside ML-specific metrics (PSI, accuracy drift, feature freshness). One dashboard, both signal families.
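PSI, the drift metric named above, is simple enough to sketch from scratch. A minimal implementation assuming equal-width binning; production code would usually bin on the reference distribution's quantiles instead.

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))

    def bin_fractions(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo + 1e-12) * bins), bins - 1)
            counts[i] += 1
        # Floor empty bins so the log term stays finite.
        return [max(c / len(xs), 1e-4) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.25 warrants watching, above 0.25 warrants investigation or rollback.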
Sharing
Model cards communicate assumptions, limitations, and known failure modes to downstream consumers. Internal demo days and shared registries prevent duplicate work and silent assumptions.
Four DORA metrics, ML edition
Track the same four metrics for both artifacts: code and model
DORA metrics were validated against survey data from more than 33,000 software professionals — they apply directly to ML. The key insight: track them for two parallel pipelines and overlay them on one dashboard. The bottleneck is always visible as the wider gap. Visualize as a dual-lane funnel: the left lane is code deploys, the right lane is model deploys, narrowing together toward production.
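Computed from a deploy log, the four metrics reduce to a few lines. This sketch assumes a hypothetical `Deploy` record per release; the field names are illustrative, not any tool's schema.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Deploy:
    commit_hour: float          # when the change (or dataset snapshot) landed
    deploy_hour: float          # when it reached production
    failed: bool                # degraded KPIs or triggered a rollback?
    restore_hours: float = 0.0  # time to recover, if it failed

def dora(deploys: list[Deploy], window_weeks: float) -> dict:
    """The four DORA metrics over a deploy log covering window_weeks."""
    failures = [d for d in deploys if d.failed]
    return {
        "deploy_frequency_per_week": len(deploys) / window_weeks,
        "lead_time_hours_median": median([d.deploy_hour - d.commit_hour for d in deploys]),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_hours": sum(d.restore_hours for d in failures) / len(failures) if failures else 0.0,
    }
```

Feed it two logs, one for code deploys and one for model deploys, and the dual-lane dashboard described above falls out directly.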
Deployment Frequency
Goal: weekly+. How often do you ship a new model version to production? Elite ML teams retrain weekly or on drift events. Visualize as a frequency histogram updated in real time — gaps reveal blocked pipelines.
Lead Time for Change
Goal: < 24 h. From dataset snapshot to trained artifact to deployed endpoint. Cached feature stores, containerized trainers, and evaluation automation compress this to hours. Color-code each stage as a Gantt strip.
Change Failure Rate
Goal: < 10 %. Percentage of model deployments that degrade KPIs or trigger a rollback. Automated quality gates checking PSI, AUC delta, and P99 latency keep this below 10 %. Chart as a rolling 30-day percentage.
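A gate like that can be a pure function evaluated in CI before the router shifts any traffic. The thresholds below mirror the targets quoted in the text but are illustrative defaults, not universal values.

```python
def quality_gate(metrics: dict) -> list[str]:
    """Return the list of violated checks; an empty list means safe to deploy."""
    checks = [
        (metrics["psi"] <= 0.25,           "PSI above 0.25"),
        (metrics["auc_delta"] >= -0.03,    "AUC dropped more than 3 points"),
        (metrics["p99_latency_ms"] <= 200, "P99 latency above 200 ms"),
    ]
    return [msg for ok, msg in checks if not ok]
```

Returning the violations rather than a bare boolean makes the gate's verdict directly usable in the deploy log and the incident runbook.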
Mean Time to Restore
Goal: < 1 h. How quickly does the team recover after a bad model deploy? Blue/green routing and a warm fallback model (serving 5 % of traffic at all times) cut MTTR from hours to minutes.
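The warm-fallback pattern is essentially a weighted router whose rollback path is a weight flip. A toy sketch; a real system would express the same idea in service-mesh or load-balancer config rather than application code.

```python
import random

class Router:
    """Warm fallback at 5 % of traffic; rollback is a weight flip, not a redeploy."""
    def __init__(self):
        self.weights = {"candidate": 0.95, "fallback": 0.05}

    def pick(self) -> str:
        # Weighted choice between the two always-running models.
        return "candidate" if random.random() < self.weights["candidate"] else "fallback"

    def rollback(self):
        # The fallback is already warm, so restoring service takes seconds.
        self.weights = {"candidate": 0.0, "fallback": 1.0}
```

Because the fallback model has been serving live traffic all along, there is no cold-start risk when it suddenly takes 100 %.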
Reliability in numbers
Put concrete targets on what "reliable" means
Engineering SLO conversations improve dramatically when targets are written as equations. These two formulas appear on every SRE runbook — apply them equally to model serving. Present them in a leadership briefing to align technical and business stakeholders on the same language.
Track both median (typical case) and P95 (long tail). P95 spikes reveal data validation failures or resource contention — the usual culprits in ML pipelines.
ML MTTR often inflates because teams must diagnose whether the failure is data, model, or infrastructure before they can fix it. Runbooks with decision trees cut this time in half.
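Written out, the two formulas are presumably the standard definitions, consistent with the lead-time and MTTR descriptions above: lead time measured per change (then summarized as median and P95), and MTTR averaged over incidents.

```latex
\text{LeadTime}_i = t^{\text{deployed}}_i - t^{\text{snapshot}}_i
\qquad\text{(report } \operatorname{median}_i \text{ and } P_{95}\text{)}
```

```latex
\text{MTTR} = \frac{1}{n} \sum_{i=1}^{n} \left( t^{\text{restored}}_i - t^{\text{detected}}_i \right)
```

Measuring MTTR from detection rather than deployment is what makes the data-vs-model-vs-infrastructure diagnosis time visible in the number.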
Checklist
- Run weekly error-budget reviews with separate budgets for software SLOs and model quality SLOs.
- Maintain a warm fallback model serving 5 % of traffic — rollback becomes a router config change, not a redeploy.
- Write incident runbooks as decision trees: "Is PSI > 0.25? Roll back. Is AUC delta > 3 %? Page DS on-call."
- Conduct quarterly DR drills: simulate losing the feature store, the model registry, and the training cluster sequentially.
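The decision-tree runbook in the checklist translates almost directly into code, which is what makes it fast to execute under pressure. A sketch using the thresholds quoted above; the branch order and paging targets are illustrative.

```python
def incident_runbook(psi: float, auc_delta: float, infra_healthy: bool) -> str:
    """Walk the incident decision tree and return the next action."""
    if not infra_healthy:
        return "page platform on-call (infrastructure)"
    if psi > 0.25:
        return "roll back model (data drift)"
    if auc_delta < -0.03:
        return "page DS on-call (model quality)"
    return "monitor (no action)"
```

Checking infrastructure first reflects the MTTR point above: the costliest minutes are spent deciding whether the failure is data, model, or infra.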
Reference architecture
Three planes, clear ownership, explicit interfaces
Structure the shared DevOps/MLOps stack into three planes — each with a distinct owner and a versioned interface to the others. Animate as layered tiles that slide into place on hover, revealing team boundaries and data-flow arrows between planes.
Data Plane
Owned by Data Engineering. Ingestion pipelines, feature store, labeling queues, quality monitors. Interface to the Model Plane: a versioned feature API with freshness SLA — visualize as a glowing connector between tiles.
Model Plane
Owned by ML Engineering. Training services, experiment tracker, evaluation harness, model registry. Interface to the Delivery Plane: a versioned model artifact + evaluation report card. Animate lineage arrows flowing upward.
Delivery Plane
Owned by Platform/SRE. CI/CD, canary router, serving infrastructure, observability mesh. Interface back to Data and Model Planes: a drift alert stream that automatically triggers remediation jobs.
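The drift-alert interface between the planes can be sketched as a small dispatcher. The alert kinds and job names here are hypothetical, standing in for whatever the observability mesh and job scheduler actually expose.

```python
from typing import Callable

def on_drift_alert(alert: dict, trigger_job: Callable[..., str]) -> str:
    """Route an alert from the Delivery Plane to the plane that owns the fix."""
    if alert["kind"] == "feature_freshness":
        return trigger_job("data_plane.backfill", feature=alert["feature"])
    if alert["kind"] == "prediction_drift":
        return trigger_job("model_plane.retrain", model=alert["model"])
    # Unknown alert kinds fall through to a human.
    return trigger_job("delivery_plane.page_oncall", reason=alert["kind"])
```

Keeping the dispatcher in the Delivery Plane preserves the ownership boundaries: it never fixes data or models itself, it only triggers the owning plane's remediation job.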