Retrieval-Augmented Generation: Architecture, Evaluation, and Production
RAG gives an LLM a memory it can check instead of bluffing from a frozen past. This guide follows the full pipeline from chunking to evaluation so a prototype can grow into a production system.
The core motivation
LLMs are frozen knowledge snapshots — RAG makes them current
Every LLM has a knowledge cutoff. Ask a GPT-4-class model about last quarter's earnings or your internal product documentation and it either hallucinates or declines. RAG (Lewis et al., 2020) solves this by retrieving relevant passages from an external corpus at inference time and injecting them into the context window alongside the user query. The model now reasons over live, verifiable facts — without retraining. The critical insight is that retrieval and generation are decoupled: you can upgrade either independently. Picture a two-stage pipeline: a fast retriever that returns ranked passages, and a slower generator that synthesizes them into a coherent answer.
Why not just fine-tune?
Fine-tuning bakes knowledge into weights — opaque and expensive to update. RAG keeps knowledge in an external store you can version, audit, and refresh in hours. Use fine-tuning for style/format/task adaptation; use RAG for factual grounding.
Why not just extend context?
128k-token context windows sound like a solution, but stuffing every document into the prompt degrades answer quality (the lost-in-the-middle effect), inflates cost, and explodes latency. Retrieval selects the relevant 5%: precision beats brute force.
Hybrid: RAG + fine-tuning
The current best practice: fine-tune for task format and tone, RAG for factual recall. The fine-tuned model learns how to use retrieved context; the retriever keeps facts current. Separate concerns, independent upgrade paths.
System architecture
Five stages, two paths — the full RAG pipeline
A production RAG system has two paths: an offline indexing path (runs once or on update) and an online serving path (runs per query). The offline path chunks, embeds, and stores documents. The online path embeds the query, retrieves top-K chunks, optionally re-ranks, and generates an answer. Picture a dual-lane diagram: the offline lane on top, feeding the online lane below through a shared vector store.
Document Ingestion & Chunking
Offline. Split source documents into chunks that fit the embedding model's context window (typically 256–512 tokens). Overlapping chunks (10–15% overlap) prevent answers from being split across chunk boundaries. Recursive character splitting respects sentence and paragraph boundaries.
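The chunking step can be sketched in a few lines. This is a sliding-window splitter with overlap, a simplification of full recursive character splitting (which would also try paragraph and sentence separators before hard cuts); tokens are approximated here by whitespace-separated words:

```python
def chunk_text(text, max_tokens=256, overlap=32):
    """Split text into overlapping chunks. Tokens are approximated
    by whitespace-separated words for simplicity; a real pipeline
    would count tokens with the embedding model's tokenizer."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, start = [], 0
    step = max_tokens - overlap  # advance less than a full chunk
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
        start += step
    return chunks
```

Each chunk repeats the last `overlap` words of its predecessor, so a sentence that straddles a boundary is fully contained in at least one chunk.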
Embedding & Indexing
Offline. Encode each chunk into a dense vector using a bi-encoder model (OpenAI text-embedding-3, Cohere embed-v3, BGE). Store vectors in a vector database (Pinecone, Weaviate, pgvector, ChromaDB). Build an ANN index (HNSW or IVF) for sub-millisecond retrieval at scale.
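As a sketch of the indexing stage, here is a toy in-memory vector store using brute-force cosine search with numpy. A production system would replace the search with an HNSW index (via hnswlib or a vector database), and the embeddings would come from a real bi-encoder rather than being passed in directly:

```python
import numpy as np

class VectorStore:
    """Toy vector store: unit-normalized embeddings + brute-force search.
    Real systems replace the argsort with an ANN index (HNSW/IVF)."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.chunks = []

    def add(self, embeddings, chunks):
        # Unit-normalize so cosine similarity reduces to a dot product.
        emb = np.asarray(embeddings, dtype=np.float32)
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, emb])
        self.chunks.extend(chunks)

    def search(self, query_embedding, k=5):
        q = np.asarray(query_embedding, dtype=np.float32)
        q /= np.linalg.norm(q)
        scores = self.vectors @ q          # cosine via dot product
        top = np.argsort(-scores)[:k]
        return [(self.chunks[i], float(scores[i])) for i in top]
```

The decoupling claimed earlier is visible here: swapping the embedding model or the index implementation touches this class only, never the generator.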
Query Encoding & Retrieval
Online · ~10ms. Encode the user query with the same embedding model. Retrieve top-K candidates by cosine similarity. Optionally fuse with BM25 keyword scores (hybrid retrieval) — sparse+dense fusion catches exact-match terms that embeddings sometimes miss.
Re-ranking
Online · ~50ms. A cross-encoder re-ranker (BGE-reranker, Cohere Rerank) scores each candidate against the query jointly — much more accurate than dot products but too slow for full-corpus search. Apply to the top-20 candidates, keep the top-5.
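The retrieve-then-rerank funnel has a simple shape, sketched below. `cross_encoder_score` is a hypothetical stand-in for a real cross-encoder forward pass (e.g. a BGE-reranker scoring the query and passage jointly); here it scores by query-term overlap purely so the example runs:

```python
def cross_encoder_score(query, passage):
    # Hypothetical stand-in: a production system would run the
    # (query, passage) pair jointly through a cross-encoder model.
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def rerank(query, candidates, keep=5):
    """Score each retrieved candidate jointly with the query,
    then keep only the best few for the generator."""
    scored = [(cross_encoder_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```

The funnel is the point: the cheap bi-encoder narrows millions of chunks to ~20, and the expensive joint scorer only ever sees that shortlist.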
Generation with Grounding
Online · ~500ms. Inject the retrieved chunks plus citations into the prompt context. Instruct the model to answer strictly from the provided context and cite sources. Parse citations in the response to enable downstream fact-checking and UI attribution.
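Assembling the grounded prompt is mostly string work. A minimal sketch — the instruction wording and the `[n]` citation format are choices for illustration, not a standard:

```python
def build_grounded_prompt(query, chunks):
    """Number each retrieved chunk so the model can cite it as [n],
    and instruct it to answer only from the provided context.
    chunks: list of (text, source_id) pairs."""
    context = "\n\n".join(
        f"[{i}] (source: {src})\n{text}"
        for i, (text, src) in enumerate(chunks, start=1)
    )
    return (
        "Answer strictly from the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Because the source IDs ride along in the prompt, the `[n]` markers the model emits can be parsed back into document references for attribution.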
Retrieval mechanics
Similarity search: the math behind finding relevant chunks
Vector retrieval reduces to computing distances in high-dimensional space. Cosine similarity is preferred over Euclidean distance because it is magnitude-invariant — a long document and a short document about the same topic should be equally retrievable. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss (0.1–1%) for orders-of-magnitude speed gains on million-scale corpora.
HNSW index
Hierarchical Navigable Small World graphs are the default ANN algorithm. Build time O(n log n), query time O(log n). ef_construction controls build quality vs. time; ef_search controls recall vs. latency at query time.
Chunk size trade-off
Smaller chunks (128 tokens) → higher precision, weaker context. Larger chunks (1024 tokens) → richer context, lower precision. Sentence Window Retrieval: retrieve small chunks, expand to full paragraph at generation time.
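Sentence Window Retrieval can be sketched by indexing sentences while keeping a pointer back to the enclosing paragraph: match on the small unit, return the large one. The naive sentence split and substring match below stand in for a real tokenizer and embedding search:

```python
def build_window_index(paragraphs):
    """Map each sentence to its enclosing paragraph. Sentences are
    the retrieval unit (precision); paragraphs are the generation
    context (richness)."""
    index = []
    for para in paragraphs:
        for sent in para.split(". "):
            if sent.strip():
                index.append((sent.strip().rstrip("."), para))
    return index

def retrieve_with_window(query, index):
    """Match on the small chunk, return the parent paragraph.
    Substring match stands in for embedding similarity here."""
    for sent, para in index:
        if query.lower() in sent.lower():
            return para
    return None
```
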
Cosine similarity between query vector q and document chunk vector d: cos(q, d) = (q · d) / (‖q‖ ‖d‖). Range −1 to 1; higher = more relevant. For unit-normalized vectors (as most embedding models produce), this equals the dot product — enabling extremely fast BLAS-accelerated computation.
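A quick numerical check of that identity, assuming nothing beyond numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = rng.normal(size=128), rng.normal(size=128)

# Cosine similarity in full: dot product over the norm product.
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# After unit-normalizing, the plain dot product gives the same value,
# which is why indexes store normalized vectors.
q_hat, d_hat = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert np.isclose(cosine, q_hat @ d_hat)
```
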
Hybrid retrieval fuses dense (embedding) and sparse (BM25) scores: score = α · dense + (1 − α) · sparse. α ≈ 0.6 typically favors semantic over lexical matching. Tune α on a held-out validation set with NDCG@10 as the target metric.
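The fusion can be sketched as a convex combination of per-list normalized scores. The α = 0.6 default follows the text; the min-max normalization is an assumption (reciprocal rank fusion is a common alternative that skips score normalization entirely):

```python
def normalize(scores):
    """Min-max normalize so dense and sparse scores, which live on
    different scales, can be combined on [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.6):
    """score_i = alpha * dense_i + (1 - alpha) * sparse_i after
    per-list normalization; alpha > 0.5 favors the semantic signal."""
    d, s = normalize(dense), normalize(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```
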
Measuring RAG quality
RAGAS: four metrics that cover every failure mode
Evaluating RAG is harder than evaluating static models — you must measure the retriever and the generator separately, then their composition. The RAGAS framework (Es et al., 2023) decomposes quality into four metrics measurable without human annotation using an LLM-as-judge. Picture a 2×2 grid: two metrics judge the generator (faithfulness, answer relevance) and two judge the retriever (context precision, context recall).
- Faithfulness (Claims ↔ Context): every claim in the answer is supported by the retrieved context.
- Answer Relevance (Answer ↔ Query): the answer actually addresses the question asked.
- Context Precision (Context ↔ Query): the retrieved chunks are relevant to the query, with the most relevant ranked first.
- Context Recall (Context ↔ Ground Truth): the retrieved context covers the information needed to produce the ground-truth answer.
Checklist
- Build a 100–500 question golden test set (question, ground truth answer, relevant document IDs) before optimizing anything.
- Instrument every query with retrieved chunk IDs and their similarity scores — this is your debugging surface.
- Set up A/B experiments in your retrieval config: chunk size, overlap, K, re-ranker on/off.
- Monitor retrieval latency separately from generation latency — they fail for different reasons.
- Add fallback behavior: if top-1 similarity < threshold (e.g., 0.6), respond "I don't have information on this" rather than hallucinate.
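The fallback rule in the last item can be made concrete. The 0.6 threshold comes from the checklist; the refusal message and the `generate` callable are placeholders:

```python
def answer_or_refuse(hits, generate, threshold=0.6):
    """hits: list of (chunk, similarity) pairs sorted best-first.
    Refuse instead of generating when even the best retrieved
    chunk is a weak match for the query."""
    if not hits or hits[0][1] < threshold:
        return "I don't have information on this."
    return generate([chunk for chunk, _ in hits])
```

The guard runs before the LLM is ever called, so a low-similarity query costs no generation tokens and produces no hallucinated answer.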
Beyond naive RAG
Advanced patterns: query rewriting, HyDE, and agentic RAG
Naive RAG (embed query → retrieve → generate) fails on multi-hop questions, ambiguous queries, and queries that require synthesis across many documents. Advanced RAG patterns address each failure mode systematically.
Query Rewriting
Use an LLM to rephrase the user query into multiple search-optimized sub-queries before retrieval. Decompose "compare LSTM and Transformer for time series" into separate retrievals for each architecture, then synthesize.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the question, embed that answer, and retrieve using it. The hypothesis is never shown to the user — it just serves as a high-quality retrieval probe that outperforms embedding the question directly on knowledge-intensive tasks.
Agentic RAG
Replace the single retrieval step with a reasoning loop: the agent decides what to retrieve, evaluates the result, and iterates if needed. ReAct, FLARE, and Self-RAG are concrete implementations. Useful when the query requires multi-step evidence chaining.
GraphRAG
Build a knowledge graph from document entities and relationships. Retrieve by traversing graph edges, not just nearest-neighbor search. Microsoft GraphRAG shows significant gains on "global" queries requiring cross-document synthesis.