DevOps to MLOps: Building the Shared Delivery Muscle
Delivery Culture · DevOps · MLOps

DevOps taught teams to ship code like a disciplined factory line; MLOps adds a third moving part, data, and suddenly the factory floor shifts under your feet. This guide shows what transfers cleanly and what breaks.

10 min read · March 17, 2026
DevOps · MLOps · DORA · CALMS · Automation

The new variable

Code + config were two variables. Data makes it three.

In standard DevOps, the same code and config deterministically produce the same artifact. Add a training dataset and that guarantee breaks: two identical codebases trained on different data snapshots produce different models. That single fact explains why every DevOps practice needs a data-aware twin in MLOps. Visualize as a Venn diagram: the DevOps circle (CI/CD, IaC, observability) and the MLOps circle (data versioning, model registry, drift monitoring) share a large intersection — but the non-overlapping parts are where teams get surprised.
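To make the third variable concrete, here is a minimal sketch of identifying a trained artifact by all three inputs. `artifact_fingerprint` and its arguments are illustrative names, not a specific tool's API:

```python
import hashlib

def artifact_fingerprint(code_sha: str, config: str, data_snapshot_id: str) -> str:
    """Identify a trained model by ALL three inputs, not just code + config.

    `data_snapshot_id` (e.g. a data-version hash or snapshot date) is the
    third variable: two runs with the same code SHA and config but different
    snapshots get different fingerprints, making the non-determinism explicit.
    """
    payload = "\n".join([code_sha, config, data_snapshot_id]).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

a = artifact_fingerprint("9f1c2ab", "lr=0.01,epochs=10", "snapshot-2026-03-01")
b = artifact_fingerprint("9f1c2ab", "lr=0.01,epochs=10", "snapshot-2026-03-08")
assert a != b  # same code + config, different data -> different model identity
```

The point of the sketch: once data is part of the identity, "same build" means "same code, same config, *and* same snapshot".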

Inherited

What DevOps gives you free

Version control, CI pipelines, containerization, blue/green deploys, and observability dashboards — all transfer unchanged to model serving. Use them; do not reinvent them.

Extended

What MLOps adds

Data versioning, feature stores, experiment tracking, model registries, and drift-triggered retraining pipelines. Bolt these onto existing CI/CD; they extend the loop, not replace it.

Risk

Where teams get surprised

Models degrade silently without retraining, fail on unseen distributions, and encode training-set biases. Unlike a crashing service, a drifting model gives no stack trace — only slow KPI decline.

CALMS applied to ML

Five pillars — all five have direct ML analogs

CALMS is DevOps culture in an acronym. Animate as five concentric rings lighting up progressively as a team matures through each pillar. Skipping any ring leaves a gap that surfaces as toil or outages later.

People

Culture

Blameless postmortems for model failures. Shared on-call across SWE and DS — the team that trains the model also pages for it. Pair reviews across discipline boundaries reduce "throw over the wall" launches.

Pipeline

Automation

IaC for training clusters, pipeline-as-code for data and model workflows, automated evaluation gates. Every step reproducible from a single git SHA. Animate the pipeline DAG lighting up green as each stage passes.
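One way to make "reproducible from a single git SHA" tangible is a run manifest stored beside every model artifact. The schema below is a sketch; the field names are assumptions, not a particular orchestrator's format:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass(frozen=True)
class RunManifest:
    """Everything needed to reproduce one pipeline run from a single git SHA.

    Illustrative schema: the git SHA pins pipeline-as-code, the trainer
    Dockerfile, and the IaC; the other fields pin what git alone cannot.
    """
    git_sha: str                      # source of truth for code + pipeline
    data_snapshot: str                # exact training data version
    image_digest: str                 # container the trainer actually ran in
    params: dict = field(default_factory=dict)  # hyperparameters at launch

manifest = RunManifest(
    git_sha="9f1c2ab",
    data_snapshot="snapshot-2026-03-01",
    image_digest="sha256:ab12cd34",
    params={"lr": 0.01, "epochs": 10},
)
print(json.dumps(asdict(manifest), indent=2))  # stored next to the model artifact
```

If any of these four fields is missing from a run record, "reproducible from a single SHA" is a slogan rather than a property.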

Flow

Lean

Shadow deployments before production traffic. Offline A/B evaluation before any canary. Minimize experiment batch size for faster feedback loops. Every step should produce a measurable signal before the next step starts.

Metrics

Measurement

Track DORA metrics (deployment frequency, lead time, change failure rate, MTTR) alongside ML-specific metrics (PSI, accuracy drift, feature freshness). One dashboard, both signal families.
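PSI is worth showing concretely. A minimal implementation under common conventions — bins cut on the reference sample, and the usual PSI > 0.25 rule of thumb for significant drift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample.

    Bin edges come from the reference distribution's quantiles; a small
    epsilon guards against empty bins. PSI > 0.25 is a common drift alarm.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range live values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6
    e_frac, a_frac = e_frac + eps, a_frac + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
assert psi(ref, rng.normal(0, 1, 10_000)) < 0.1   # same distribution: stable
assert psi(ref, rng.normal(1, 1, 10_000)) > 0.25  # shifted mean: drift alarm
```

A metric like this sits naturally next to deployment frequency on the shared dashboard: one number per feature per day, alarming on the same threshold the quality gate uses.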

Enablement

Sharing

Model cards communicate assumptions, limitations, and known failure modes to downstream consumers. Internal demo days and shared registries prevent duplicate work and silent assumptions.

Four DORA metrics, ML edition

Track the same four metrics for both artifacts: code and model

DORA metrics were validated against more than 33,000 survey responses in the State of DevOps research — they apply directly to ML. The key insight: track them for two parallel pipelines and overlay them on one dashboard; the bottleneck is always visible as the wider gap. Visualize as a dual-lane funnel: the left lane is code deploys, the right lane is model deploys, narrowing together toward production.

1

Deployment Frequency

Goal: weekly+

How often do you ship a new model version to production? Elite ML teams retrain weekly or on drift events. Visualize as a frequency histogram updated in real time — gaps reveal blocked pipelines.

2

Lead Time for Change

Goal: < 24 h

From dataset snapshot to trained artifact to deployed endpoint. Cached feature stores, containerized trainers, and evaluation automation compress this to hours. Color-code each stage as a Gantt strip.

3

Change Failure Rate

Goal: < 10 %

Percentage of model deployments that degrade KPIs or trigger a rollback. Automated quality gates checking PSI, AUC delta, and P99 latency keep this below 10 %. Chart as a rolling 30-day percentage.
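As a sketch of such a gate, using the thresholds above (PSI 0.25, 3 % AUC regression) plus an illustrative 200 ms P99 budget — the names and limits are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    passed: bool
    reasons: list

def quality_gate(psi: float, auc_delta: float, p99_latency_ms: float) -> GateResult:
    """Automated pre-promotion gate; blocks deploys that would inflate
    change failure rate. Thresholds are illustrative defaults."""
    reasons = []
    if psi > 0.25:
        reasons.append(f"PSI {psi:.2f} > 0.25: input drift")
    if auc_delta < -0.03:
        reasons.append(f"AUC delta {auc_delta:+.3f} below -0.03: quality regression")
    if p99_latency_ms > 200:
        reasons.append(f"P99 {p99_latency_ms:.0f} ms > 200 ms budget")
    return GateResult(passed=not reasons, reasons=reasons)

assert quality_gate(psi=0.05, auc_delta=0.001, p99_latency_ms=80).passed
assert not quality_gate(psi=0.31, auc_delta=-0.05, p99_latency_ms=80).passed
```

Deploys the gate blocks never count against change failure rate — that is the whole mechanism for keeping it under 10 %.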

4

Mean Time to Restore

Goal: < 1 h

How quickly does the team recover after a bad model deploy? Blue/green routing and a warm fallback model (serving 5 % of traffic at all times) cut MTTR from hours to minutes.

< 24 h · Lead time target
< 10 % · Change failure rate

Reliability in numbers

Put concrete targets on what "reliable" means

Engineering SLO conversations improve dramatically when targets are written as equations. These two formulas appear on every SRE runbook — apply them equally to model serving. Present them in a leadership briefing to align technical and business stakeholders on the same language.

LeadTime = t_{\text{deploy}} - t_{\text{commit}}

Track both median (typical case) and P95 (long tail). P95 spikes reveal data validation failures or resource contention — the usual culprits in ML pipelines.
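A minimal sketch of computing both statistics from paired commit/deploy timestamps; epoch seconds are assumed, and the inclusive percentile method is one reasonable choice:

```python
import statistics

def lead_time_stats(deploy_times: list, commit_times: list):
    """Median and P95 lead time in hours from paired timestamps (epoch seconds)."""
    hours = sorted((d - c) / 3600 for d, c in zip(deploy_times, commit_times))
    p95 = statistics.quantiles(hours, n=20, method="inclusive")[-1]
    return statistics.median(hours), p95

# 19 routine deploys (1-19 h) plus one 100 h outlier:
commits = [i * 3600.0 for i in range(20)]
deploys = [c + h * 3600 for c, h in zip(commits, list(range(1, 20)) + [100])]
med, p95 = lead_time_stats(deploys, commits)
# the single outlier barely moves the median but drags P95 far right --
# exactly the long-tail signal worth alerting on
```

This asymmetry is why tracking the median alone hides the data-validation and contention failures that live in the tail.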

\text{MTTR} = \frac{1}{n}\sum_{i=1}^{n}\left(t_{\text{restore},i} - t_{\text{incident},i}\right)

ML MTTR often inflates because teams must diagnose whether the failure is data, model, or infrastructure before they can fix it. Runbooks with decision trees cut this time in half.

Checklist

  • Run weekly error-budget reviews with separate budgets for software SLOs and model quality SLOs.
  • Maintain a warm fallback model serving 5 % of traffic — rollback becomes a router config change, not a redeploy.
  • Write incident runbooks as decision trees: "Is PSI > 0.25? Roll back. Is AUC delta > 3 %? Page DS on-call."
  • Conduct quarterly DR drills: simulate losing the feature store, the model registry, and the training cluster sequentially.
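The runbook decision tree in the checklist can be sketched as a function. The `infra_healthy` signal and the return labels are illustrative; the thresholds come from the checklist above:

```python
def triage(psi: float, auc_delta: float, infra_healthy: bool) -> str:
    """Incident decision tree: diagnose data vs. model vs. infrastructure
    in a fixed order so MTTR is not spent deciding who to page."""
    if not infra_healthy:
        return "page-sre"          # infrastructure failure: not a model problem
    if psi > 0.25:
        return "rollback-model"    # input drift: restore last known-good model
    if auc_delta < -0.03:
        return "page-ds-oncall"    # quality regression: needs DS diagnosis
    return "monitor"               # no actionable signal yet

assert triage(psi=0.30, auc_delta=0.0, infra_healthy=True) == "rollback-model"
assert triage(psi=0.10, auc_delta=-0.05, infra_healthy=True) == "page-ds-oncall"
```

Encoding the tree keeps the diagnosis order fixed during an incident, which is precisely how runbooks cut ML MTTR in half.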

Reference architecture

Three planes, clear ownership, explicit interfaces

Structure the shared DevOps/MLOps stack into three planes — each with a distinct owner and a versioned interface to the others. Animate as layered tiles that slide into place on hover, revealing team boundaries and data-flow arrows between planes.

Data Plane

Owned by Data Engineering. Ingestion pipelines, feature store, labeling queues, quality monitors. Interface to the Model Plane: a versioned feature API with freshness SLA — visualize as a glowing connector between tiles.

Model Plane

Owned by ML Engineering. Training services, experiment tracker, evaluation harness, model registry. Interface to the Delivery Plane: a versioned model artifact + evaluation report card. Animate lineage arrows flowing upward.

Delivery Plane

Owned by Platform/SRE. CI/CD, canary router, serving infrastructure, observability mesh. Interface back to Data and Model Planes: a drift alert stream that automatically triggers remediation jobs.
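The drift alert stream that closes the loop between planes can be sketched as an event schema plus a router. All field names and routing labels here are assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DriftAlert:
    """One message on the Delivery Plane's drift alert stream (illustrative schema)."""
    feature: str          # which feature drifted (Data Plane owns the fix)
    psi: float            # drift magnitude at detection time
    model_version: str    # registry entry that was serving (Model Plane)
    action: str           # remediation the consumer should trigger

def remediate(alert: DriftAlert) -> str:
    """Route the alert to the plane that owns the remediation job."""
    routes = {"retrain": "model-plane", "backfill": "data-plane"}
    return routes.get(alert.action, "manual-review")

alert = DriftAlert(feature="age", psi=0.31, model_version="fraud-v12", action="retrain")
assert remediate(alert) == "model-plane"
```

Because the alert names both the drifted feature and the serving model version, each plane can act on its own side of the interface without a cross-team ticket.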