LLM Fine-Tuning: LoRA, QLoRA, DPO, and Mixture-of-Experts
A base LLM is a general instrument; fine-tuning changes how tightly it resonates with your task. This guide maps the adaptation spectrum from prompting to MoE, with the math behind each trade-off.
Start with the right question
Do you need to fine-tune at all? The adaptation spectrum.
The most expensive mistake in LLM engineering is fine-tuning when prompting would suffice. Work left-to-right across the spectrum: only move to the next technique when the current one demonstrably fails. Each step increases capability — and cost, risk of forgetting, and maintenance burden.
Prompting / few-shot ICL
Zero or few labeled examples in the context window. Zero compute overhead. Limited by context length and the base model's reasoning ceiling. Start here for every new task.
RAG (Retrieval-Augmented)
Inject retrieved documents as context at inference time. Keeps the model factually current without retraining. Fails when the task requires new reasoning patterns, not just new facts.
PEFT (Parameter-Efficient FT)
Freeze the base model. Train only a tiny adapter (< 1 % of params). LoRA, adapters, prefix tuning, IA3. The current industry default for LLM specialization.
Full fine-tuning
Update all weights on domain-specific data. Maximum capacity, maximum cost. Justified for deep domain shift (medical imaging reports, legal contract parsing, code generation).
The PEFT workhorse
LoRA and QLoRA — low-rank adaptation is a change-of-basis trick
The LoRA insight: weight updates during fine-tuning are intrinsically low-rank. Rather than storing the full delta matrix, factorize it into two small matrices. At inference, merge them back into the frozen weights — zero added latency. QLoRA extends this by quantizing the frozen base to 4-bit NF4 format, making it possible to fine-tune a 65B model on a single 48 GB GPU.
Which layers to adapt
Apply LoRA to Q, K, V projections in attention — these carry most task-specific signal. Add MLP layers for domain shift. Skip embeddings and LayerNorm unless you are adding new vocabulary.
Rank selection heuristic
r=4 for style/tone adaptation, r=8 for task specialization, r=16-64 for strong domain shift. Higher rank = more expressiveness, higher overfitting risk on small datasets.
QLoRA: quantize, then adapt
Freeze the base model in 4-bit NF4 format with double quantization. Train LoRA adapters in BF16. The quantization error is absorbed by the adapter, achieving near-full-FT quality at 1/4 the memory.
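A QLoRA setup along these lines can be sketched with the Hugging Face transformers and peft libraries (a configuration sketch, not a full training script; the checkpoint name is a placeholder and exact class names assume recent library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Frozen base: 4-bit NF4 with double quantization, BF16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Trainable BF16 adapters on the attention projections only
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the LoRA matrices receive gradients; the 4-bit base weights are dequantized on the fly during the forward pass.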
At a glance: 3-4x less memory than full fine-tuning; only ~0.1 % of parameters trained.
W_0 is frozen. Only B and A (rank r) are trained. With r=8, a 4096x4096 weight matrix reduces from 16 M to 65 K trainable parameters — a 250x compression.
At rank r=16, a Llama-2 70B model requires only ~0.1 % of original parameters to fine-tune. Visualize as two thin colored strips (B and A) against a large frozen grey matrix.
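The factorization above can be sketched in a few lines of plain PyTorch (a minimal illustration, not the peft implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer W_0 with a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # W_0 stays frozen
        # A: small random init; B: zeros, so the adapter is a no-op at step 0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
out = layer(torch.randn(2, 4096))
print(trainable)  # 2 * 4096 * 8 = 65536, matching the 250x compression above
```

After training, `B @ A` can be added into the frozen weight once, so inference pays no extra cost.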
Beyond task tuning
Alignment: RLHF, DPO, and ORPO
Task fine-tuning teaches what to do. Alignment training teaches how to behave — making models helpful, harmless, and honest. RLHF runs a full RL loop with a reward model. DPO eliminates the reward model by reframing preference learning as a binary classification problem, and it has largely displaced RLHF in practice because it is substantially simpler to run while matching RLHF's quality in most reported comparisons.
RLHF (three stages)
Step 1: SFT on demonstrations. Step 2: Train a reward model on human preference pairs. Step 3: PPO to maximize reward while a KL penalty keeps the policy close to the SFT model. Powerful but brittle and slow to train.
DPO (one loss)
Reformulates RLHF as a contrastive supervised loss on (prompt, winning, losing) triples. Same final quality as RLHF, no RL training loop, no reward model to maintain.
ORPO (one stage)
Merges SFT and alignment into a single training stage by adding a preference penalty to the standard language modeling cross-entropy loss. Fewer hyperparameters, faster convergence.
DPO optimizes the policy to prefer winning response y_w over losing response y_l, relative to a frozen reference policy. No reward model needed — just labeled preference pairs.
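Under these definitions the DPO objective can be sketched as follows (beta controls how far the policy may drift from the reference; the log-probabilities are summed over response tokens):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward is beta * log(pi_theta / pi_ref); the loss is binary
    # classification of "winner beats loser" on the reward margin.
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Sanity check: when the policy equals the reference, the margin is 0
# and the loss is -log(0.5) = log 2.
zeros = torch.zeros(4)
loss = dpo_loss(zeros, zeros, zeros, zeros)
print(loss)  # tensor(0.6931)
```

Training then reduces to gradient descent on batches of (prompt, winner, loser) triples, with the reference policy kept frozen.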
Sparse scaling
Mixture-of-Experts: 8x the capacity, same inference cost
Standard Transformer FFN blocks activate every parameter for every token. MoE replaces the single FFN with N parallel expert networks and a learned router that sends each token to its top-K experts. With 8 experts and K=2, only 2/8 of the expert compute runs per token — but the model has 8x the total FFN capacity. Mixtral 8x7B has ~47B total parameters but uses only ~13B active per forward pass, matching Llama-2 70B on most benchmarks.
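A minimal top-K routing step looks like this (gate computation only; dispatching tokens to the expert FFNs is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def top_k_gates(tokens, router_weight, k=2):
    # tokens: [T, d]; router_weight: [d, N] gives one logit per expert per token
    probs = F.softmax(tokens @ router_weight, dim=-1)   # [T, N] full distribution
    gates, experts = probs.topk(k, dim=-1)              # top-K experts per token
    gates = gates / gates.sum(dim=-1, keepdim=True)     # renormalize over the K picked
    return gates, experts, probs

tokens = torch.randn(16, 64)          # 16 tokens, model dim 64
router_weight = torch.randn(64, 8)    # 8 experts
gates, experts, probs = top_k_gates(tokens, router_weight)
# gates: [16, 2] mixing weights summing to 1 per token
# experts: [16, 2] indices of the chosen experts
```

Each token's output is the gate-weighted sum of its K expert outputs; all other experts contribute nothing and cost nothing.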
Expert activation heatmap
Visualize as a grid of N experts x T tokens. Cell brightness = gate weight. Overloaded experts (bright columns) and dead experts (dark columns) are both failure modes visible at a glance.
Fine-tuning MoE models
Apply LoRA per expert independently, or use shared LoRA adapters with per-expert residuals. Monitor expert utilization histograms per epoch — routing behavior shifts during fine-tuning.
Capacity factor tuning
Each expert has a fixed token capacity budget per batch. Token dropping occurs when capacity is exceeded. Tune capacity_factor (default 1.25) to balance throughput and quality on your hardware.
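The budget follows directly from those quantities (a hypothetical helper for illustration; frameworks differ on the exact rounding):

```python
import math

def expert_capacity(tokens_per_batch, num_experts, k=2, capacity_factor=1.25):
    # Max tokens a single expert may accept per batch; routed tokens
    # beyond this budget are dropped (they skip the expert FFN).
    return math.ceil(capacity_factor * tokens_per_batch * k / num_experts)

print(expert_capacity(1024, 8))  # 1.25 * 1024 * 2 / 8 = 320
```

Raising `capacity_factor` trades memory and padding waste for fewer dropped tokens under skewed routing.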
At a glance: 2/8 experts active per token; ~4x effective capacity gain.
Router G assigns gate weights to experts for each token. Only top-K experts receive nonzero weights. Animate tokens routing to different expert columns in a grid visualization.
An auxiliary load-balancing loss, L_aux = alpha * N * sum_i f_i * P_i, prevents expert collapse (all tokens routed to one expert), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i.
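With f_i and P_i defined this way, the Switch-Transformer-style auxiliary term can be sketched as:

```python
import torch

def load_balancing_loss(router_probs, top1_experts, num_experts, alpha=0.01):
    # f_i: fraction of tokens whose top-1 expert is i
    f = torch.bincount(top1_experts, minlength=num_experts).float() / top1_experts.numel()
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(dim=0)
    # alpha * N * sum_i f_i * P_i -- minimized by uniform routing
    return alpha * num_experts * (f * P).sum()

# Perfectly balanced routing over 4 experts attains the minimum, alpha:
probs = torch.full((8, 4), 0.25)                 # 8 tokens, uniform router probs
top1 = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])    # balanced top-1 assignments
loss_val = load_balancing_loss(probs, top1, num_experts=4)
print(loss_val)  # tensor(0.0100)
```

Because f_i is non-differentiable, the gradient flows only through P_i, gently pushing the router toward uniform usage.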