Data ArchitectureData Architecture

Data Warehouse, Data Lake, and Lakehouse: A Visual Architecture Guide

Warehouses, lakes, and lakehouses are really three answers to one question: when should raw data be forced into shape? This guide turns that architectural choice into concrete diagrams and decision rules.

10 min readMarch 17, 2026

Data WarehouseData LakeLakehouseDelta LakeAnalytics

One key question

Schema-on-write vs. schema-on-read: where does structure live?

The entire Data Warehouse vs. Data Lake debate reduces to a single architectural decision: do you enforce structure when data arrives (schema-on-write), or when data is queried (schema-on-read)? Schema-on-write guarantees query performance and data quality at the cost of ingestion flexibility. Schema-on-read maximizes ingestion speed and handles raw, heterogeneous data at the cost of query-time overhead and potential quality surprises. The Lakehouse pattern attempts to get both.

Structured

Data Warehouse (schema-on-write)

Structured, modeled data. ETL pipelines transform and load into defined schemas before storage. Optimized for fast analytical queries via columnar storage and pre-built aggregates. Examples: Snowflake, BigQuery, Redshift.

Raw

Data Lake (schema-on-read)

Raw data in native format — CSV, JSON, Parquet, images, audio. Store everything, model it later. Flexible and cheap storage. Risk: "data swamp" if governance is absent. Examples: S3, ADLS, GCS.

Unified

Lakehouse (both)

Open table formats (Delta Lake, Iceberg, Hudi) add ACID transactions, schema evolution, and time travel on top of object storage. One copy of data serves both BI dashboards and ML training jobs.

Medallion architecture

Bronze, Silver, Gold: the three-layer data quality staircase

The Medallion (or Delta) architecture is the most widely adopted pattern for organizing a Lakehouse. Data enters raw, gets progressively cleaned and enriched, and finally arrives in a form optimized for consumption. Visualize as three horizontal layers with data flowing upward through refinement stages, each layer labeled with quality guarantees and latency characteristics.

Bronze (Raw landing)

Near real-time

Raw ingestion from all sources: event streams, API webhooks, database CDC feeds, file drops. Data is stored exactly as received with an ingestion timestamp. No transformations, no quality checks. Append-only, fully auditable.

Silver (Validated & conformed)

Minutes to hours

Cleansed, deduplicated, and schema-enforced. PII masked or tokenized. Referential integrity checks applied. This layer is the single source of truth for data engineers — all downstream pipelines read from here.

Gold (Aggregated & domain-specific)

Scheduled refresh

Pre-aggregated metrics, feature tables for ML, domain-specific denormalized views for BI tools. Optimized for specific consumption patterns. One Gold table per team or product area.

~$23/TB/mo

Storage cost

100ms – 10s

Query latency

Decision framework

Choosing between warehouse, lake, and lakehouse

Map your requirements to the right paradigm using this MECE decision tree. The three options are non-overlapping in their primary strengths — choose the one that matches your dominant constraint, then use the others for their secondary strengths.

Analytics-first

Use a Warehouse when...

Your primary consumers are SQL analysts and BI dashboards. Data is structured and well-understood. Query performance SLAs are strict. Budget allows for managed compute. Governance and access control are paramount.

Storage-first

Use a Lake when...

You need to store raw, heterogeneous data (logs, images, sensor streams) before you know how it will be used. ML training jobs need direct file access. Cost per byte is the primary constraint.

Unified-first

Use a Lakehouse when...

You have both analytical (SQL) and ML workloads reading the same data. You need ACID transactions and time travel for data corrections. You want one storage layer instead of two synced systems.

Data flow animation

Animate raw events entering Bronze as a stream of colored dots. On Silver layer, filter out red (invalid) dots and merge duplicates. On Gold, aggregate into bar charts representing metrics per business domain. The entire flow moves upward through three translucent horizontal bands.

Query latency vs. flexibility spectrum

Horizontal axis: query flexibility (low-to-high). Vertical axis: query latency (low-to-high). Plot Warehouse (bottom-left: low flexibility, fast), Lake (top-right: high flexibility, slow), Lakehouse (bottom-right: high flexibility, fast with caching). Animate the Lakehouse dot moving toward the bottom-right over time as caching warms.

Making it sustainable

Data governance prevents the data swamp

Every data lake eventually becomes a data swamp without governance. The same three pillars apply regardless of storage paradigm — the tooling differs but the intent is identical. Animate as three interlocking gears that must all turn for the system to function.

Checklist

Catalog every dataset with ownership, freshness SLA, schema version, and data classification (PII level).
Enforce column-level access control — analysts should never see raw PII even if they can query the table.
Implement data contracts between producers and consumers: schema changes require a version bump and migration path.
Run automated data quality checks (Great Expectations, Soda) on every Silver table before Gold refresh.
Set retention policies per data class: raw logs 90 days, PII-masked Silver 2 years, aggregated Gold indefinitely.

DevOps · MLOps

DevOps to MLOps: Building the Shared Delivery Muscle

DevOps taught teams to ship code like a disciplined factory line; MLOps adds a third moving part, data, and suddenly the factory floor shifts under your feet. This guide shows what transfers cleanly and what breaks.

10 min read

MLOps

MLOps Systems Blueprint for Reliable AI

Production ML behaves like a three-body problem: code, data, and live behavior all pull in different directions. This guide shows how to turn that motion into a stable, self-correcting delivery loop.

9 min read

Privacy-Preserving ML

Federated Learning: Training Models Without Moving Data

Federated learning flips the usual gravity of ML: instead of hauling sensitive data to one warehouse, it sends the model out like a traveling teacher and brings back only the lessons. This guide explains the math and the operational trade-offs.

11 min read

All articles