Large Language Models
Predicting the next token at scale
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
The parrot that learned to think. By predicting the next word billions of times, these models picked up surprising skills, from reasoning to coding.
How this lesson fits
These lessons power the language revolution. We turn words into math and teach models to track context and meaning as they read.
The big question
How can a model capture word meaning, hold onto context, and generate fluent language one token at a time?
Why You Should Care
Students need a grounded view: what LLMs do well, where they stumble, and why next-token prediction can still feel smart.
Where this is used today
- ChatGPT / Claude / Gemini
- Code autocompletion (GitHub Copilot)
- Summarizing long documents
Think of it like this
Like an improv actor who never knows the ending, but is so well-read they can riff convincingly on almost anything.
Easy mistake to make
Fluent answers aren't guaranteed truths. Confidence and correctness are not the same.
By the end, you should be able to:
- Explain next-token prediction in simple language
- Connect pretraining, prompting, and fine-tuning
- Discuss why fluent output is not the same as guaranteed truth
Think about this first
Why might predicting the next word over and over teach a model more than just grammar?
Words we will keep using
Large Language Models
An LLM is basically a super-powered autocomplete. It reads text, learns the patterns, and predicts what comes next. That sounds simple, but do it enough times on enough data, and "predicting the next word" starts to look a lot like reasoning.
P(x₁, x₂, ..., x_T) = ∏ₜ P(xₜ | x₁, ..., xₜ₋₁)
That training objective sounds small, but it turns out to be enough to teach the model a surprising amount about language and pattern structure.
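The chain-rule factorization above can be sketched in a few lines of Python. The per-token probabilities here are invented for illustration; a real model would produce them from its softmax output at each step.

```python
import math

# Toy illustration: the probability of a whole sequence is the product of
# each token's probability given its prefix, P(x1..xT) = prod_t P(x_t | x_<t).
# These per-token probabilities are made up for the example.
next_token_probs = [0.4, 0.7, 0.9, 0.5]  # P(x_t | prefix) for t = 1..4

sequence_prob = math.prod(next_token_probs)
print(round(sequence_prob, 6))  # 0.126
```

Note that real sequence probabilities shrink multiplicatively with length, which is why implementations sum log-probabilities instead of multiplying raw probabilities.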
Next-Token Prediction Demo
When a chatbot writes a poem, it isn't "thinking" of the whole poem at once. It's just rolling the dice on the very next word, over and over again. The magic is that the probabilities are so well-tuned that the result makes sense.
Context (last 2 words):
Next token probabilities (T-adjusted):
Low temperature makes the model more conservative. High temperature spreads probability more widely and makes outputs less predictable.
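A minimal sketch of how temperature reshapes those next-token probabilities, using made-up logits for three candidate tokens (the widget's actual numbers are not reproduced). Dividing logits by T before the softmax sharpens the distribution when T is low and flattens it when T is high.

```python
import math

def temperature_softmax(logits, T):
    """Convert raw scores to probabilities, sharpened or flattened by T."""
    scaled = [l / T for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

cold = temperature_softmax(logits, T=0.5)  # conservative: mass concentrates on the top token
hot = temperature_softmax(logits, T=2.0)   # adventurous: mass spreads across candidates
print(max(cold) > max(hot))  # True
```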
Key Concepts
The Reading Phase. The model reads the internet to learn grammar, facts, and reasoning patterns. It is self-taught.
The Manners Phase. Humans step in to teach the model how to be helpful, harmless, and follow instructions.
The "Just Ask" Phase. You don't need to retrain the model to teach it a new trick: just show it an example in the prompt.
The Surprise Factor. When models get big enough, they suddenly learn to do things (like coding) that they weren't explicitly built for.
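The "Just Ask" idea can be made concrete with a hypothetical few-shot prompt. The task and examples below are invented for illustration; nothing is retrained, the examples simply sit in the context window.

```python
# A hypothetical few-shot prompt: the model infers the task from the
# in-context examples and continues the pattern. No weights change.
prompt = """Translate English to French.

sea otter -> loutre de mer
cheese -> fromage
hello ->"""

# Sent to any completion-style model, a well-trained LLM will very likely
# continue with the French translation of "hello".
print(prompt)
```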
Scaling Laws
Here is the weirdest thing about LLMs: they are incredibly predictable. If you add more data or make the model bigger, it gets smarter at a very specific, mathematical rate. We can literally graph the "intelligence" (perplexity) before we even build the model.
The exact curve is less important than the message: progress often looked smooth and predictable as models got larger.
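The smooth-progress claim can be illustrated with a toy power-law curve. The constants below are illustrative placeholders in the spirit of published scaling laws, not fitted values; the point is only that loss falls smoothly and predictably as parameter count grows.

```python
# Illustrative power-law scaling sketch: loss ~ (N_c / N) ** alpha.
# The constants are placeholders chosen for the example, not real fits.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```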
The Scaling Timeline
2017: Attention Is All You Need, the Transformer architecture breakthrough
2018: BERT brings bidirectional pre-training and the fine-tuning paradigm
2019: GPT-2, an autoregressive language model initially deemed "too dangerous to release"
2020: GPT-3 shows few-shot learning and emergent capabilities at scale
2022: RLHF alignment; conversational AI goes mainstream with ChatGPT
2023: Multimodal reasoning; open-source competition begins in earnest
2024: Long context (1M+ tokens), an efficiency push, open source closes the gap
2025: Test-time compute scaling and chain-of-thought reasoning models; DeepSeek disrupts cost assumptions globally
Looking ahead: Reasoning-native by default, multimodal as baseline, efficiency over raw parameter count becomes the metric
NLP Evaluation Metrics
Scoring generated text is harder than scoring a yes/no answer. There is usually more than one acceptable sentence, so NLP uses a family of metrics that each capture only part of quality.
Interactive BLEU / ROUGE Calculator
Green = matched in reference · Red = unmatched
| Metric | Score | Measures | Detail |
|---|---|---|---|
| BLEU-1 | 71.4 | unigram precision | BP = 1.000 |
| BLEU-2 | 59.8 | + bigram precision | geo-mean |
| ROUGE-1 | 76.9 | unigram F1 | R = 83%, P = 71% |
| ROUGE-2 | 54.5 | bigram F1 | R = 60%, P = 50% |
BLEU precision bias: try making the hypothesis much shorter than the reference. BLEU-1 stays high but BLEU-2 drops; the brevity penalty partially corrects for this.
ROUGE recall bias: now make the hypothesis much longer. ROUGE recall stays high because it measures how much of the reference you covered, regardless of extra words.
Both fail on paraphrase: replace a reference word with a perfect synonym. Both scores drop; neither metric understands meaning, only surface overlap.
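The precision/recall asymmetry described in the experiments above can be reproduced in a few lines of Python. This is a toy sketch using clipped unigram counts only, with no brevity penalty and no higher-order n-grams; the sentences are invented.

```python
from collections import Counter

def unigram_overlap(hypothesis, reference):
    """Clipped unigram matches, the core of BLEU-1 precision and ROUGE-1 recall."""
    hyp, ref = hypothesis.split(), reference.split()
    matched = sum((Counter(hyp) & Counter(ref)).values())  # clip repeats
    precision = matched / len(hyp)  # BLEU-1 core (before brevity penalty)
    recall = matched / len(ref)     # ROUGE-1 recall
    return precision, recall

# Padding the hypothesis with extra words hurts precision but leaves recall alone:
p1, r1 = unigram_overlap("the cat sat", "the cat sat on the mat")
p2, r2 = unigram_overlap("the cat sat yes yes yes", "the cat sat on the mat")
print(p1, r1)  # 1.0 0.5
print(p2, r2)  # 0.5 0.5
```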
| Metric | What it measures | Formula core | Best for | Weakness |
|---|---|---|---|---|
| Perplexity | How surprised the model is by a test corpus | exp(−(1/N) Σₜ log P(xₜ | x<t)) | LM comparison | Not comparable across tokenisations |
| BLEU | N-gram precision of hypothesis vs reference(s) | BP · exp(Σₙ wₙ log pₙ) | Machine translation | Precision-biased; punishes paraphrase |
| ROUGE-N | N-gram recall of reference covered by hypothesis | matched n-grams / ref n-grams | Summarisation | Recall-biased; padding inflates score |
| ROUGE-L | Longest Common Subsequence F1 | F1 of LCS-based precision & recall | Summarisation, fluency | Ignores non-contiguous order quality |
| METEOR | Alignment F1 with synonym matching & stemming | F_mean · (1 − penalty) | Translation; handles paraphrase | Language-dependent synonym tables |
| BERTScore | Token cosine similarity via contextual BERT embeddings | P/R/F1 over embedding matches | Any generation; semantic quality | Expensive; model-dependent |
| chrF | Character n-gram F-score; no word-boundary dependence | F1 over char n-grams | Morphologically rich languages | Less interpretable than word-level |
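Perplexity is the easiest of these metrics to compute by hand: it is just the exponentiated average negative log-probability the model assigns to held-out tokens. The probability lists below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp(-(1/N) * sum_t log P(x_t | x_<t)) over a held-out sequence."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.95, 0.85]  # model rarely surprised: low perplexity
uncertain = [0.2, 0.1, 0.3, 0.25]   # model often surprised: high perplexity
print(perplexity(confident) < perplexity(uncertain))  # True
```

A useful sanity check: a model that assigns probability 0.5 to every token has perplexity exactly 2, i.e. it is as surprised as a fair coin flip at each step.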
BLEU vs ROUGE in one sentence
BLEU cares more about how much of your output matches the reference. ROUGE cares more about how much of the reference you managed to cover.
The paraphrase problem
If two sentences mean the same thing but use different words, simple overlap metrics can score them unfairly. That is a major weakness to keep in mind.
Perplexity ≠ quality
A model can be good at predicting text patterns and still say false, unsafe, or unhelpful things. Fluent output is not the same as trustworthy output.