Data Warehouse, Data Lake, and Lakehouse: A Visual Architecture Guide
Warehouses, lakes, and lakehouses are really three answers to one question: when should raw data be forced into shape? This guide turns that architectural choice into concrete diagrams and decision rules.
One key question
Schema-on-write vs. schema-on-read: where does structure live?
The entire Data Warehouse vs. Data Lake debate reduces to a single architectural decision: do you enforce structure when data arrives (schema-on-write), or when data is queried (schema-on-read)? Schema-on-write guarantees query performance and data quality at the cost of ingestion flexibility. Schema-on-read maximizes ingestion speed and handles raw, heterogeneous data at the cost of query-time overhead and potential quality surprises. The Lakehouse pattern attempts to get both.
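The difference can be made concrete with a minimal Python sketch. Everything here is hypothetical and for illustration only: `ORDER_SCHEMA`, `WarehouseTable`, and `LakeStore` are invented names, not the API of any real warehouse or lake product. The point is where `conform` gets called: at write time in one case, at read time in the other.

```python
import json

# Hypothetical schema for illustration: field name -> type used to cast the value.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def conform(record, schema):
    """Return the record cast to the schema, or raise ValueError if it does not fit."""
    try:
        return {field: cast(record[field]) for field, cast in schema.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not fit schema: {record!r}") from exc


class WarehouseTable:
    """Schema-on-write: structure is enforced before anything is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        # Bad records are rejected here, at ingestion.
        self.rows.append(conform(record, self.schema))

    def read(self):
        # Rows are already typed, so there is no query-time shaping.
        return list(self.rows)


class LakeStore:
    """Schema-on-read: raw payloads are stored as-is and shaped at query time."""

    def __init__(self):
        self.blobs = []

    def write(self, raw):
        # Nothing is checked at ingestion.
        self.blobs.append(raw)

    def read(self, schema):
        # Parsing and validation happen on every read; bad payloads surface here.
        good, bad = [], []
        for raw in self.blobs:
            try:
                good.append(conform(json.loads(raw), schema))
            except ValueError:
                bad.append(raw)
        return good, bad
```

In this sketch the warehouse fails fast on a malformed record, while the lake accepts it silently and only reports it when a reader applies a schema. That is the "quality surprise" trade-off in miniature.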
Data Warehouse (schema-on-write)
Structured, modeled data. ETL pipelines transform and load into defined schemas before storage. Optimized for fast analytical queries via columnar storage and pre-built aggregates. Examples: Snowflake, BigQuery, Redshift.
Data Lake (schema-on-read)
Raw data in native format — CSV, JSON, Parquet, images, audio. Store everything, model it later. Flexible and cheap storage. Risk: "data swamp" if governance is absent. Examples: S3, ADLS, GCS.
Lakehouse (both)
Open table formats (Delta Lake, Iceberg, Hudi) add ACID transactions, schema evolution, and time travel on top of object storage. One copy of data serves both BI dashboards and ML training jobs.
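To give a feel for what atomic commits and time travel mean, here is a toy in-memory sketch. It is not how Delta Lake, Iceberg, or Hudi are implemented or called; those formats track metadata and data files on object storage rather than copying rows. `VersionedTable` and its methods are hypothetical names chosen for this example.

```python
class VersionedTable:
    """Toy commit log: every successful commit produces a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._snapshots) - 1

    def commit(self, new_rows, validate=lambda row: True):
        """Append rows atomically: either every row is committed or none are."""
        if not all(validate(row) for row in new_rows):
            raise ValueError("commit rejected; table unchanged")
        self._snapshots.append(self._snapshots[-1] + tuple(new_rows))
        return self.latest_version

    def read(self, version=None):
        """Read the latest snapshot, or an earlier version ("time travel")."""
        if version is None:
            version = self.latest_version
        return list(self._snapshots[version])
```

A usage example under the same assumptions: after two commits, `table.read(version=1)` still returns the rows as they stood after the first commit, which is what makes data corrections and audits tractable.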
Medallion architecture
Bronze, Silver, Gold: the three-layer data quality staircase
The Medallion (or Delta) architecture is the most widely adopted pattern for organizing a Lakehouse. Data enters raw, gets progressively cleaned and enriched, and finally arrives in a form optimized for consumption. Visualize as three horizontal layers with data flowing upward through refinement stages, each layer labeled with quality guarantees and latency characteristics.
Bronze (Raw landing)
Latency: near real-time. Raw ingestion from all sources: event streams, API webhooks, database CDC feeds, file drops. Data is stored exactly as received with an ingestion timestamp. No transformations, no quality checks. Append-only, fully auditable.
Silver (Validated & conformed)
Latency: minutes to hours. Cleansed, deduplicated, and schema-enforced. PII masked or tokenized. Referential integrity checks applied. This layer is the single source of truth for data engineers: all downstream pipelines read from here.
Gold (Aggregated & domain-specific)
Latency: scheduled refresh. Pre-aggregated metrics, feature tables for ML, domain-specific denormalized views for BI tools. Optimized for specific consumption patterns. One Gold table per team or product area.
Storage cost: ~$23/TB/mo
Query latency: 100ms – 10s
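The Bronze, Silver, Gold flow can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the event fields (`event_id`, `domain`, `amount`) and the function names are assumptions made for this example, and PII masking and referential integrity checks are left out to keep it short.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_bronze(raw_events):
    """Bronze: store each payload exactly as received, plus an ingestion timestamp."""
    now = datetime.now(timezone.utc).isoformat()
    return [{"payload": event, "ingested_at": now} for event in raw_events]


def to_silver(bronze_rows):
    """Silver: drop invalid records, enforce types, and deduplicate on event_id."""
    seen, silver = set(), []
    for row in bronze_rows:
        payload = row["payload"]
        try:
            record = {
                "event_id": str(payload["event_id"]),
                "domain": str(payload["domain"]),
                "amount": float(payload["amount"]),
            }
        except (KeyError, TypeError, ValueError):
            continue  # invalid record: it stays in Bronze for audit only
        if record["event_id"] in seen:
            continue  # duplicate delivery of an event already kept
        seen.add(record["event_id"])
        silver.append(record)
    return silver


def to_gold(silver_rows):
    """Gold: pre-aggregate one metric (total amount) per business domain."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["domain"]] += row["amount"]
    return dict(totals)
```

Calling `to_gold(to_silver(to_bronze(events)))` walks a batch through all three layers. Note that Bronze never discards anything, so a rejected record can always be replayed once the Silver rules change.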
Decision framework
Choosing between warehouse, lake, and lakehouse
Map your requirements to the right paradigm using these decision rules, which are meant to be MECE (mutually exclusive, collectively exhaustive). The three options do not overlap in their primary strengths: choose the one that matches your dominant constraint, then use the others for their secondary strengths.
Use a Warehouse when...
Your primary consumers are SQL analysts and BI dashboards. Data is structured and well-understood. Query performance SLAs are strict. Budget allows for managed compute. Governance and access control are paramount.
Use a Lake when...
You need to store raw, heterogeneous data (logs, images, sensor streams) before you know how it will be used. ML training jobs need direct file access. Cost per byte is the primary constraint.
Use a Lakehouse when...
You have both analytical (SQL) and ML workloads reading the same data. You need ACID transactions and time travel for data corrections. You want one storage layer instead of two synced systems.
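The three rules above can be encoded as a small function. This is one possible reading of them, not an authoritative decision procedure: the `Requirements` fields, the order in which rules are checked, and the `"undecided"` fallback are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    sql_bi_consumers: bool = False           # SQL analysts and BI dashboards
    ml_file_access: bool = False             # ML jobs reading the data directly
    raw_heterogeneous_data: bool = False     # logs, images, sensor streams
    needs_acid_or_time_travel: bool = False  # transactional data corrections
    strict_query_slas: bool = False
    cost_per_byte_dominant: bool = False


def recommend(req):
    """Return 'lakehouse', 'warehouse', 'lake', or 'undecided' if no rule fires."""
    # Lakehouse: SQL and ML workloads share the same data.
    if req.sql_bi_consumers and req.ml_file_access:
        return "lakehouse"
    # Lakehouse: ACID or time travel is needed on lake-style data.
    if req.needs_acid_or_time_travel and (
        req.raw_heterogeneous_data or req.ml_file_access
    ):
        return "lakehouse"
    # Warehouse: SQL and BI consumers, or strict query performance SLAs.
    if req.sql_bi_consumers or req.strict_query_slas:
        return "warehouse"
    # Lake: raw heterogeneous data, direct file access, or cost per byte.
    if req.raw_heterogeneous_data or req.ml_file_access or req.cost_per_byte_dominant:
        return "lake"
    return "undecided"
```

Checking the lakehouse conditions first is a deliberate choice in this sketch: a team with both SQL and ML consumers would otherwise match the warehouse rule and end up running two synced systems.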
Data flow animation
Animate raw events entering Bronze as a stream of colored dots. On the Silver layer, filter out red (invalid) dots and merge duplicates. On the Gold layer, aggregate the remaining dots into bar charts representing metrics per business domain. The entire flow moves upward through three translucent horizontal bands.
Query latency vs. flexibility spectrum
Horizontal axis: query flexibility (low to high). Vertical axis: query latency (low to high). Plot Warehouse (bottom-left: low flexibility, fast), Lake (top-right: high flexibility, slow), and Lakehouse (bottom-right once caches are warm: high flexibility, fast). Animate the Lakehouse dot starting higher on the latency axis and settling toward the bottom-right over time as caching warms.
Making it sustainable
Data governance prevents the data swamp
Every data lake eventually becomes a data swamp without governance. The same three pillars apply regardless of storage paradigm — the tooling differs but the intent is identical. Animate as three interlocking gears that must all turn for the system to function.
Checklist
- Catalog every dataset with ownership, freshness SLA, schema version, and data classification (PII level).
- Enforce column-level access control — analysts should never see raw PII even if they can query the table.
- Implement data contracts between producers and consumers: schema changes require a version bump and migration path.
- Run automated data quality checks (Great Expectations, Soda) on every Silver table before Gold refresh.
- Set retention policies per data class: raw logs 90 days, PII-masked Silver 2 years, aggregated Gold indefinitely.
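As one example of turning the checklist into something enforceable, the retention item can be expressed as a small policy table. This is a sketch under stated assumptions: the class names (`raw_logs`, `silver_pii_masked`, `gold_aggregated`) are hypothetical labels, "2 years" is approximated as 730 days, and a real system would apply the result through its storage or catalog tooling.

```python
from datetime import date, timedelta

# Retention windows from the checklist; None means keep indefinitely.
RETENTION = {
    "raw_logs": timedelta(days=90),
    "silver_pii_masked": timedelta(days=730),  # "2 years", approximated
    "gold_aggregated": None,
}


def is_expired(data_class, created, today):
    """True if a partition created on `created` is past its retention window."""
    window = RETENTION[data_class]  # unknown classes raise KeyError on purpose
    if window is None:
        return False
    return today - created > window
```

Raising `KeyError` for an unclassified dataset is intentional here: a dataset without a data class should block the cleanup job rather than silently fall under a default policy.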