Retrieval-Augmented Generation: Architecture, Evaluation, and Production
RAG gives an LLM a memory it can check instead of bluffing from a frozen past. This guide follows the full pipeline from chunking to evaluation so a prototype can grow into a production system.
The core motivation
LLMs are frozen knowledge snapshots — RAG makes them current
Every LLM has a knowledge cutoff. Ask a GPT-4-class model about last quarter's earnings or your internal product documentation and it either hallucinates or declines. RAG (Lewis et al., 2020) solves this by retrieving relevant passages from an external corpus at inference time and injecting them into the context window alongside the user query. The model now reasons over live, verifiable facts — without retraining. The critical insight is that retrieval and generation are decoupled: you can upgrade either independently. Picture a two-stage pipeline: a fast retriever that returns ranked passages, and a slower generator that synthesizes them into a coherent answer.
Why not just fine-tune?
Fine-tuning bakes knowledge into weights — opaque and expensive to update. RAG keeps knowledge in an external store you can version, audit, and refresh in hours. Use fine-tuning for style/format/task adaptation; use RAG for factual grounding.
Why not just extend context?
128k-token context windows sound like a solution, but stuffing every document into the prompt degrades answer quality (the lost-in-the-middle effect), inflates cost, and explodes latency. Retrieval selects the relevant 5%: precision beats brute force.
Hybrid: RAG + fine-tuning
The current best practice: fine-tune for task format and tone, RAG for factual recall. The fine-tuned model learns how to use retrieved context; the retriever keeps facts current. Separate concerns, independent upgrade paths.
System architecture
Five stages, two paths — the full RAG pipeline
A production RAG system has two paths: an offline indexing path (runs once or on update) and an online serving path (runs per query). The offline path chunks, embeds, and stores documents. The online path embeds the query, retrieves top-K chunks, optionally re-ranks, and generates an answer. Picture a dual-lane diagram: the offline lane on top, feeding the online lane below through a shared vector store.
Document Ingestion & Chunking
Offline. Split source documents into chunks that fit the embedding model's context window (typically 256–512 tokens). Overlapping chunks (10–15% overlap) prevent answers from being split across chunk boundaries. Recursive character splitting respects sentence and paragraph boundaries.
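The chunking step can be sketched in a few lines. This is a sliding-window splitter with overlap, a simplification of full recursive character splitting (which would also try paragraph and sentence separators before hard cuts); tokens are approximated here by whitespace-separated words:

```python
def chunk_text(text, max_tokens=256, overlap=32):
    """Split text into overlapping chunks. Tokens are approximated
    by whitespace-separated words for simplicity; a real pipeline
    would count tokens with the embedding model's tokenizer."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, start = [], 0
    step = max_tokens - overlap  # advance less than a full chunk
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
        start += step
    return chunks
```

Each chunk repeats the last `overlap` words of its predecessor, so a sentence that straddles a boundary is fully contained in at least one chunk.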
Embedding & Indexing
Offline. Encode each chunk into a dense vector using a bi-encoder model (OpenAI text-embedding-3, Cohere embed-v3, BGE). Store vectors in a vector database (Pinecone, Weaviate, pgvector, ChromaDB). Build an ANN index (HNSW or IVF) for sub-millisecond retrieval at scale.
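As a sketch of the indexing stage, here is a toy in-memory vector store using brute-force cosine search with numpy. A production system would replace the search with an HNSW index (via hnswlib or a vector database), and the embeddings would come from a real bi-encoder rather than being passed in directly:

```python
import numpy as np

class VectorStore:
    """Toy vector store: unit-normalized embeddings + brute-force search.
    Real systems replace the argsort with an ANN index (HNSW/IVF)."""
    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.chunks = []

    def add(self, embeddings, chunks):
        # Unit-normalize so cosine similarity reduces to a dot product.
        emb = np.asarray(embeddings, dtype=np.float32)
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)
        self.vectors = np.vstack([self.vectors, emb])
        self.chunks.extend(chunks)

    def search(self, query_embedding, k=5):
        q = np.asarray(query_embedding, dtype=np.float32)
        q /= np.linalg.norm(q)
        scores = self.vectors @ q          # cosine via dot product
        top = np.argsort(-scores)[:k]
        return [(self.chunks[i], float(scores[i])) for i in top]
```

The decoupling claimed earlier is visible here: swapping the embedding model or the index implementation touches this class only, never the generator.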
Query Encoding & Retrieval
Online · ~10ms. Encode the user query with the same embedding model. Retrieve top-K candidates by cosine similarity. Optionally fuse with BM25 keyword scores (hybrid retrieval) — sparse+dense fusion catches exact-match terms that embeddings sometimes miss.
Re-ranking
Online · ~50ms. A cross-encoder re-ranker (BGE-reranker, Cohere Rerank) scores each candidate against the query jointly — much more accurate than dot products but too slow for full-corpus search. Apply to the top-20 candidates, keep the top-5.
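The retrieve-then-rerank funnel has a simple shape, sketched below. `cross_encoder_score` is a hypothetical stand-in for a real cross-encoder forward pass (e.g. a BGE-reranker scoring the query and passage jointly); here it scores by query-term overlap purely so the example runs:

```python
def cross_encoder_score(query, passage):
    # Hypothetical stand-in: a production system would run the
    # (query, passage) pair jointly through a cross-encoder model.
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def rerank(query, candidates, keep=5):
    """Score each retrieved candidate jointly with the query,
    then keep only the best few for the generator."""
    scored = [(cross_encoder_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```

The funnel is the point: the cheap bi-encoder narrows millions of chunks to ~20, and the expensive joint scorer only ever sees that shortlist.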
Generation with Grounding
Online · ~500ms. Inject the retrieved chunks plus citations into the prompt context. Instruct the model to answer strictly from the provided context and cite sources. Parse citations in the response to enable downstream fact-checking and UI attribution.
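Assembling the grounded prompt is mostly string work. A minimal sketch — the instruction wording and the `[n]` citation format are choices for illustration, not a standard:

```python
def build_grounded_prompt(query, chunks):
    """Number each retrieved chunk so the model can cite it as [n],
    and instruct it to answer only from the provided context.
    chunks: list of (text, source_id) pairs."""
    context = "\n\n".join(
        f"[{i}] (source: {src})\n{text}"
        for i, (text, src) in enumerate(chunks, start=1)
    )
    return (
        "Answer strictly from the context below. Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Because the source IDs ride along in the prompt, the `[n]` markers the model emits can be parsed back into document references for attribution.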
Retrieval mechanics
Similarity search: the math behind finding relevant chunks
Vector retrieval reduces to computing distances in high-dimensional space. Cosine similarity is preferred over Euclidean distance because it is magnitude-invariant — a long document and a short document about the same topic should be equally retrievable. Approximate Nearest Neighbor (ANN) algorithms trade a small accuracy loss (0.1–1%) for orders-of-magnitude speed gains on million-scale corpora.
HNSW index
Hierarchical Navigable Small World graphs are the default ANN algorithm. Build time O(n log n), query time O(log n). ef_construction controls build quality vs. time; ef_search controls recall vs. latency at query time.
Chunk size trade-off
Smaller chunks (128 tokens) → higher precision, weaker context. Larger chunks (1024 tokens) → richer context, lower precision. Sentence Window Retrieval: retrieve small chunks, expand to full paragraph at generation time.
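Sentence Window Retrieval can be sketched by indexing sentences while keeping a pointer back to the enclosing paragraph: match on the small unit, return the large one. The naive sentence split and substring match below stand in for a real tokenizer and embedding search:

```python
def build_window_index(paragraphs):
    """Map each sentence to its enclosing paragraph. Sentences are
    the retrieval unit (precision); paragraphs are the generation
    context (richness)."""
    index = []
    for para in paragraphs:
        for sent in para.split(". "):
            if sent.strip():
                index.append((sent.strip().rstrip("."), para))
    return index

def retrieve_with_window(query, index):
    """Match on the small chunk, return the parent paragraph.
    Substring match stands in for embedding similarity here."""
    for sent, para in index:
        if query.lower() in sent.lower():
            return para
    return None
```
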
Cosine similarity between query vector q and document chunk vector d: cos(q, d) = (q · d) / (‖q‖ ‖d‖). Range −1 to 1; higher = more relevant. For unit-normalized vectors (as most embedding models produce), this equals the dot product — enabling extremely fast BLAS-accelerated computation.
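A quick numerical check of that identity, assuming nothing beyond numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = rng.normal(size=128), rng.normal(size=128)

# Cosine similarity in full: dot product over the norm product.
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# After unit-normalizing, the plain dot product gives the same value,
# which is why indexes store normalized vectors.
q_hat, d_hat = q / np.linalg.norm(q), d / np.linalg.norm(d)
assert np.isclose(cosine, q_hat @ d_hat)
```
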
Hybrid retrieval fuses dense (embedding) and sparse (BM25) scores: score = α · dense + (1 − α) · sparse. α ≈ 0.6 typically favors semantic over lexical matching. Tune α on a held-out validation set with NDCG@10 as the target metric.
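The fusion can be sketched as a convex combination of per-list normalized scores. The α = 0.6 default follows the text; the min-max normalization is an assumption (reciprocal rank fusion is a common alternative that skips score normalization entirely):

```python
def normalize(scores):
    """Min-max normalize so dense and sparse scores, which live on
    different scales, can be combined on [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.6):
    """score_i = alpha * dense_i + (1 - alpha) * sparse_i after
    per-list normalization; alpha > 0.5 favors the semantic signal."""
    d, s = normalize(dense), normalize(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```
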
Measuring RAG quality
RAGAS: four metrics that cover every failure mode
Evaluating RAG is harder than evaluating static models — you must measure the retriever and the generator separately, then their composition. The RAGAS framework (Es et al., 2023) decomposes quality into four metrics measurable without human annotation using an LLM-as-judge. Picture a 2×2 grid: two metrics judge the generator (faithfulness, answer relevance) and two judge the retriever (context precision, context recall).
- Faithfulness (Claims ↔ Context): every claim in the answer is supported by the retrieved context.
- Answer Relevance (Answer ↔ Query): the answer actually addresses the question asked.
- Context Precision (Context ↔ Query): the retrieved chunks are relevant to the query, with the most relevant ranked first.
- Context Recall (Context ↔ Ground Truth): the retrieved context covers the information needed to produce the ground-truth answer.
Checklist
- Build a 100–500 question golden test set (question, ground truth answer, relevant document IDs) before optimizing anything.
- Instrument every query with retrieved chunk IDs and their similarity scores — this is your debugging surface.
- Set up A/B experiments in your retrieval config: chunk size, overlap, K, re-ranker on/off.
- Monitor retrieval latency separately from generation latency — they fail for different reasons.
- Add fallback behavior: if top-1 similarity < threshold (e.g., 0.6), respond "I don't have information on this" rather than hallucinate.
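The fallback rule in the last item can be made concrete. The 0.6 threshold comes from the checklist; the refusal message and the `generate` callable are placeholders:

```python
def answer_or_refuse(hits, generate, threshold=0.6):
    """hits: list of (chunk, similarity) pairs sorted best-first.
    Refuse instead of generating when even the best retrieved
    chunk is a weak match for the query."""
    if not hits or hits[0][1] < threshold:
        return "I don't have information on this."
    return generate([chunk for chunk, _ in hits])
```

The guard runs before the LLM is ever called, so a low-similarity query costs no generation tokens and produces no hallucinated answer.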
Beyond naive RAG
Advanced patterns: query rewriting, HyDE, and agentic RAG
Naive RAG (embed query → retrieve → generate) fails on multi-hop questions, ambiguous queries, and queries that require synthesis across many documents. Advanced RAG patterns address each failure mode systematically.
Query Rewriting
Use an LLM to rephrase the user query into multiple search-optimized sub-queries before retrieval. Decompose "compare LSTM and Transformer for time series" into separate retrievals for each architecture, then synthesize.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer to the question, embed that answer, and retrieve using it. The hypothesis is never shown to the user — it just serves as a high-quality retrieval probe that outperforms embedding the question directly on knowledge-intensive tasks.
Agentic RAG
Replace the single retrieval step with a reasoning loop: the agent decides what to retrieve, evaluates the result, and iterates if needed. ReAct, FLARE, and Self-RAG are concrete implementations. Useful when the query requires multi-step evidence chaining.
GraphRAG
Build a knowledge graph from document entities and relationships. Retrieve by traversing graph edges, not just nearest-neighbor search. Microsoft GraphRAG shows significant gains on "global" queries requiring cross-document synthesis.