LLM Fine-Tuning: LoRA, QLoRA, DPO, and Mixture-of-Experts
Adaptation Methods · Large Language Models


A base LLM is a general instrument; fine-tuning changes how tightly it resonates with your task. This guide maps the adaptation spectrum from prompting to MoE, with the math behind each trade-off.

13 min read · March 17, 2026
LLM · LoRA · QLoRA · RLHF · DPO · MoE · PEFT

Start with the right question

Do you need to fine-tune at all? The adaptation spectrum.

The most expensive mistake in LLM engineering is fine-tuning when prompting would suffice. Work left-to-right across the spectrum: only move to the next technique when the current one demonstrably fails. Each step increases capability — and cost, risk of forgetting, and maintenance burden.

Zero cost

Prompting / few-shot ICL

Zero or a few labeled examples in the context window, with zero compute overhead. Limited by context length and the base model's reasoning ceiling. Start here for every new task.

No weight change

RAG (Retrieval-Augmented)

Inject retrieved documents as context at inference time. Keeps the model factually current without retraining. Fails when the task requires new reasoning patterns, not just new facts.

< 1 % params

PEFT (Parameter-Efficient FT)

Freeze the base model. Train only a tiny adapter (< 1 % of params). LoRA, adapters, prefix tuning, IA3. The current industry default for LLM specialization.

100 % params

Full fine-tuning

Update all weights on domain-specific data. Maximum capacity, maximum cost. Justified for deep domain shift (medical imaging reports, legal contract parsing, code generation).

The PEFT workhorse

LoRA and QLoRA — low-rank adaptation is a change-of-basis trick

The LoRA insight: weight updates during fine-tuning are intrinsically low-rank. Rather than storing the full delta matrix, factorize it into two small matrices. At inference, merge them back into the frozen weights — zero added latency. QLoRA extends this by quantizing the frozen base to 4-bit NF4 format, making it possible to fine-tune a 65B model on a single 48 GB GPU.

Architecture

Which layers to adapt

Apply LoRA to Q, K, V projections in attention — these carry most task-specific signal. Add MLP layers for domain shift. Skip embeddings and LayerNorm unless you are adding new vocabulary.

Hyperparameter

Rank selection heuristic

r=4 for style/tone adaptation, r=8 for task specialization, r=16-64 for strong domain shift. Higher rank = more expressiveness, higher overfitting risk on small datasets.

Memory-efficient

QLoRA: quantize, then adapt

Freeze the base model in 4-bit NF4 format with double quantization. Train LoRA adapters in BF16. The quantization error is absorbed by the adapter, achieving near-full-FT quality at 1/4 the memory.
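The quantize-then-adapt idea can be sketched with blockwise absmax quantization. This is an illustrative NumPy sketch using a uniform int4 grid; real QLoRA uses the non-uniform NF4 codebook plus double quantization, but the mechanics of storing (quantized weights, per-block scales) and letting the adapter absorb the residual error are the same:

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax quantization to 4-bit integers.
    Uniform grid for illustration; QLoRA's NF4 uses a non-uniform codebook."""
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # int4 range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    return (q * scale).reshape(shape)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((128, 128)).astype(np.float32)  # stand-in frozen weight

q, s = quantize_4bit(W0)
W0_hat = dequantize_4bit(q, s, W0.shape)

# The frozen base is stored as (q, s) at ~4 bits/weight; the LoRA adapter
# trains in BF16 on top of W0_hat and soaks up the quantization residual.
err = np.abs(W0 - W0_hat).mean()
```

In practice this is what `bitsandbytes` does under the hood when a model is loaded in 4-bit before attaching adapters.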

3-4x less

Memory vs full FT

~0.1 %

Params trained

W^{\prime} = W_0 + \Delta W = W_0 + B\,A, \quad B\in\mathbb{R}^{d\times r},\; A\in\mathbb{R}^{r\times k}

W_0 is frozen; only B and A (rank r) are trained. With r=8, a 4096x4096 weight matrix shrinks from 16.8 M weights to 65 K trainable parameters, a ~256x reduction.

\text{trainable params} = r\,(d + k) \ll d\cdot k

At rank r=16, fine-tuning a Llama-2 70B model trains only ~0.1 % of the original parameter count. Visualize the factorization as two thin colored strips (B and A) against a large frozen grey matrix.

Beyond task tuning

Alignment: RLHF, DPO, and ORPO

Task fine-tuning teaches what to do. Alignment training teaches how to behave — making models helpful, harmless, and honest. RLHF runs a full RL loop with a reward model. DPO eliminates the reward model by reframing preference learning as a binary classification problem. DPO has largely displaced RLHF in practice because it is far simpler to run, with comparable final quality.

Classic

RLHF (three stages)

Step 1: SFT on demonstrations. Step 2: Train a reward model on human preference pairs. Step 3: PPO to maximize reward while a KL penalty keeps the policy close to the SFT model. Powerful but brittle and slow to train.

Preferred

DPO (one loss)

Reformulates RLHF as a contrastive supervised loss on (prompt, winning, losing) triples. Same final quality as RLHF, no RL training loop, no reward model to maintain.

Latest (2024)

ORPO (one stage)

Merges SFT and alignment into a single training stage by adding a preference penalty to the standard language modeling cross-entropy loss. Fewer hyperparameters, faster convergence.

\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]

DPO optimizes the policy to prefer winning response y_w over losing response y_l, relative to a frozen reference policy. No reward model needed — just labeled preference pairs.
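The loss for a single preference triple reduces to a few arithmetic operations on sequence log-probabilities. A minimal sketch, where the log-prob values are made up for illustration and each argument is the summed token log-prob of a whole response:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, winner, loser) triple.
    Inputs are summed token log-probs of each response under the
    trainable policy (logp_*) and the frozen reference (ref_logp_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Policy already prefers the winner more than the reference does -> low loss
low = dpo_loss(logp_w=-10.0, logp_l=-20.0, ref_logp_w=-12.0, ref_logp_l=-18.0)
# Policy prefers the loser -> high loss, gradient pushes toward the winner
high = dpo_loss(logp_w=-20.0, logp_l=-10.0, ref_logp_w=-12.0, ref_logp_l=-18.0)
```

The reference log-probs are computed once with the frozen model, which is why DPO needs no reward model and no sampling loop during training.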

Sparse scaling

Mixture-of-Experts: 8x the capacity, same inference cost

Standard Transformer FFN blocks activate every parameter for every token. MoE replaces the single FFN with N parallel expert networks and a learned router that sends each token to the top-K experts. At 8 experts with K=2, only 2/8 of expert compute activates per token — but the model has 8x more total capacity. Mixtral 8x7B has ~47B total parameters but activates only ~13B per forward pass, matching Llama-2 70B on most benchmarks.

Visualization

Expert activation heatmap

Visualize as a grid of N experts x T tokens. Cell brightness = gate weight. Overloaded experts (bright columns) and dead experts (dark columns) are both failure modes visible at a glance.

Adaptation

Fine-tuning MoE models

Apply LoRA per expert independently, or use shared LoRA adapters with per-expert residuals. Monitor expert utilization histograms per epoch — routing behavior shifts during fine-tuning.

Engineering

Capacity factor tuning

Each expert has a fixed token capacity budget per batch. Token dropping occurs when capacity is exceeded. Tune capacity_factor (default 1.25) to balance throughput and quality on your hardware.
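One common convention computes the per-expert budget as ceil(capacity_factor · tokens · K / N); exact formulas vary by framework, so treat this as an illustrative sketch of the trade-off rather than any specific library's behavior:

```python
import numpy as np

def expert_capacity(tokens, num_experts, top_k, capacity_factor=1.25):
    """Per-expert token budget per batch (Switch-style convention);
    assignments beyond the budget are dropped by the layer."""
    return int(np.ceil(capacity_factor * tokens * top_k / num_experts))

# 4096 tokens in the batch, 8 experts, top-2 routing, default 1.25 factor
cap = expert_capacity(4096, 8, 2)        # ceil(1.25 * 4096 * 2 / 8) = 1280

# If the router sends 1500 assignments to one hot expert, 220 are dropped;
# dropped tokens typically pass through via the residual connection only.
dropped = max(0, 1500 - cap)
```

Raising `capacity_factor` reduces dropping at the cost of more padded compute per expert; lowering it does the reverse.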

2 / 8 experts

Active params per token

~4x

Effective capacity gain

y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x), \quad G(x) = \text{TopK}\!\left(\text{softmax}(Wx),\,K\right)

Router G assigns gate weights to experts for each token. Only top-K experts receive nonzero weights. Animate tokens routing to different expert columns in a grid visualization.
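The routing equation can be sketched directly in NumPy. This is a toy forward pass: the experts are random linear maps standing in for FFNs, and the gate weights are the raw softmax probabilities (some implementations, including Mixtral, renormalize over the selected top-K):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, W_gate, experts, k=2):
    """y_t = sum_i G(x_t)_i * E_i(x_t), with G nonzero only on top-k experts."""
    probs = softmax(x @ W_gate)                     # (T, N) router probabilities
    topk = np.argsort(probs, axis=-1)[:, -k:]       # indices of top-k experts
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for i in topk[t]:
            y[t] += probs[t, i] * experts[i](x[t])  # only k of N experts run
    return y, topk

rng = np.random.default_rng(0)
T, d, N = 6, 16, 8                                  # tokens, hidden dim, experts
x = rng.standard_normal((T, d))
W_gate = rng.standard_normal((d, N))                # learned router weights
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(N)]
experts = [lambda v, W=W: W @ v for W in Ws]        # toy stand-ins for FFNs

y, topk = moe_forward(x, W_gate, experts, k=2)      # y: (6, 16), topk: (6, 2)
```

The inner loop makes the sparsity explicit: each token touches exactly k expert networks, regardless of N.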

\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i

Auxiliary load-balancing loss prevents expert collapse (all tokens routed to one expert). f_i = fraction of tokens routed to expert i, P_i = mean router probability for expert i.
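The auxiliary loss is cheap to compute from the router outputs. A sketch following the Switch-Transformer-style formula above, with alpha = 0.01 as an assumed coefficient (the actual value is a tuned hyperparameter):

```python
import numpy as np

def load_balancing_loss(router_probs, topk_idx, alpha=0.01):
    """alpha * N * sum_i f_i * P_i.
    router_probs: (T, N) softmax outputs; topk_idx: (T, k) chosen experts.
    f_i = fraction of routed tokens hitting expert i,
    P_i = mean router probability assigned to expert i."""
    T, N = router_probs.shape
    counts = np.bincount(topk_idx.ravel(), minlength=N)
    f = counts / topk_idx.size          # fraction of assignments per expert
    P = router_probs.mean(axis=0)       # mean router probability per expert
    return alpha * N * np.sum(f * P)

# Perfectly balanced routing: uniform probs, each expert used equally (k=1).
# This attains the loss's minimum value, alpha * 1.
probs = np.full((8, 4), 0.25)
idx = np.array([[i % 4] for i in range(8)])
balanced = load_balancing_loss(probs, idx)   # = 0.01 * 4 * (4 * 0.25 * 0.25)
```

Because f_i and P_i both concentrate on an expert that attracts too many tokens, the product penalizes exactly the bright-column failure mode visible in the activation heatmap.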