Large Language Models
Predicting the next token at scale
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
The parrot that learned to think. By predicting the next word billions of times, these models picked up surprising skills, from reasoning to coding.
How this lesson fits
These lessons power the language revolution. We turn words into math and teach models to track context and meaning as they read.
The big question
How can a model capture word meaning, hold onto context, and generate fluent language one token at a time?
Why You Should Care
Students need a grounded view: what LLMs do well, where they stumble, and why next-token prediction can still feel smart.
Where this is used today
- ChatGPT / Claude / Gemini
- Code autocompletion (GitHub Copilot)
- Summarizing long documents
Think of it like this
Like an improv actor who never knows the ending, but is so well-read they can riff convincingly on almost anything.
Easy mistake to make
Fluent answers aren't guaranteed truths. Confidence and correctness are not the same.
By the end, you should be able to:
- Explain next-token prediction in simple language
- Connect pretraining, prompting, and fine-tuning
- Discuss why fluent output is not the same as guaranteed truth
Think about this first
Why might predicting the next word over and over teach a model more than just grammar?
Words we will keep using
Large Language Models
An LLM is basically a super-powered autocomplete. It reads text, learns the patterns, and predicts what comes next. That sounds simple, but do it enough times on enough data, and "predicting the next word" starts to look a lot like reasoning.
P(x₁, x₂, ..., x_T) = ∏ₜ P(xₜ | x₁, ..., xₜ₋₁)
That training objective sounds small, but it turns out to be enough to teach the model a surprising amount about language and pattern structure.
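The chain-rule factorization above can be sketched in a few lines of Python. The per-token probabilities here are invented for illustration; a real model would produce them from its softmax output at each step.

```python
import math

# Toy illustration: the probability of a whole sequence is the product of
# each token's probability given its prefix, P(x1..xT) = prod_t P(x_t | x_<t).
# These per-token probabilities are made up for the example.
next_token_probs = [0.4, 0.7, 0.9, 0.5]  # P(x_t | prefix) for t = 1..4

sequence_prob = math.prod(next_token_probs)
print(round(sequence_prob, 6))  # 0.126
```

Note that real sequence probabilities shrink multiplicatively with length, which is why implementations sum log-probabilities instead of multiplying raw probabilities.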
Next-Token Prediction Demo
When a chatbot writes a poem, it isn't "thinking" of the whole poem at once. It's just rolling the dice on the very next word, over and over again. The magic is that the probabilities are so well-tuned that the result makes sense.
Context (last 2 words):
Next token probabilities (T-adjusted):
Low temperature makes the model more conservative. High temperature spreads probability more widely and makes outputs less predictable.
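A minimal sketch of how temperature reshapes those next-token probabilities, using made-up logits for three candidate tokens (the widget's actual numbers are not reproduced). Dividing logits by T before the softmax sharpens the distribution when T is low and flattens it when T is high.

```python
import math

def temperature_softmax(logits, T):
    """Convert raw scores to probabilities, sharpened or flattened by T."""
    scaled = [l / T for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

cold = temperature_softmax(logits, T=0.5)  # conservative: mass concentrates on the top token
hot = temperature_softmax(logits, T=2.0)   # adventurous: mass spreads across candidates
print(max(cold) > max(hot))  # True
```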
Key Concepts
The Reading Phase. The model reads the internet to learn grammar, facts, and reasoning patterns. It is self-taught.
The Manners Phase. Humans step in to teach the model how to be helpful, harmless, and follow instructions.
The "Just Ask" Phase. You don't need to retrain the model to teach it a new trick: just show it an example in the prompt.
The Surprise Factor. When models get big enough, they suddenly learn to do things (like coding) that they weren't explicitly built for.
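The "Just Ask" idea can be made concrete with a hypothetical few-shot prompt. The task and examples below are invented for illustration; nothing is retrained, the examples simply sit in the context window.

```python
# A hypothetical few-shot prompt: the model infers the task from the
# in-context examples and continues the pattern. No weights change.
prompt = """Translate English to French.

sea otter -> loutre de mer
cheese -> fromage
hello ->"""

# Sent to any completion-style model, a well-trained LLM will very likely
# continue with the French translation of "hello".
print(prompt)
```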
Scaling Laws
Here is the weirdest thing about LLMs: they are incredibly predictable. If you add more data or make the model bigger, it gets smarter at a very specific, mathematical rate. We can literally graph the "intelligence" (perplexity) before we even build the model.
The exact curve is less important than the message: progress often looked smooth and predictable as models got larger.
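The smooth-progress claim can be illustrated with a toy power-law curve. The constants below are illustrative placeholders in the spirit of published scaling laws, not fitted values; the point is only that loss falls smoothly and predictably as parameter count grows.

```python
# Illustrative power-law scaling sketch: loss ~ (N_c / N) ** alpha.
# The constants are placeholders chosen for the example, not real fits.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```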
The Scaling Timeline
2017: Attention Is All You Need, the Transformer architecture breakthrough
2018: BERT brings bidirectional pre-training and the fine-tuning paradigm
2019: GPT-2, an autoregressive language model initially deemed "too dangerous to release"
2020: GPT-3 shows few-shot learning and emergent capabilities at scale
2022: RLHF alignment; conversational AI goes mainstream with ChatGPT
2023: Multimodal reasoning; open-source competition begins in earnest
2024: Long context (1M+ tokens), an efficiency push, open source closes the gap
2025: Test-time compute scaling and chain-of-thought reasoning models; DeepSeek disrupts cost assumptions globally
Looking ahead: Reasoning-native by default, multimodal as baseline, efficiency over raw parameter count becomes the metric
NLP Evaluation Metrics
Scoring generated text is harder than scoring a yes/no answer. There is usually more than one acceptable sentence, so NLP uses a family of metrics that each capture only part of quality.
Interactive BLEU / ROUGE Calculator
Green = matched in reference · Red = unmatched
| Metric | Score | Measures | Detail |
|---|---|---|---|
| BLEU-1 | 71.4 | unigram precision | BP = 1.000 |
| BLEU-2 | 59.8 | + bigram precision | geo-mean |
| ROUGE-1 | 76.9 | unigram F1 | R = 83%, P = 71% |
| ROUGE-2 | 54.5 | bigram F1 | R = 60%, P = 50% |
BLEU precision bias: try making the hypothesis much shorter than the reference. BLEU-1 stays high but BLEU-2 drops; the brevity penalty partially corrects for this.
ROUGE recall bias: now make the hypothesis much longer. ROUGE recall stays high because it measures how much of the reference you covered, regardless of extra words.
Both fail on paraphrase: replace a reference word with a perfect synonym. Both scores drop; neither metric understands meaning, only surface overlap.
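The precision/recall asymmetry described in the experiments above can be reproduced in a few lines of Python. This is a toy sketch using clipped unigram counts only, with no brevity penalty and no higher-order n-grams; the sentences are invented.

```python
from collections import Counter

def unigram_overlap(hypothesis, reference):
    """Clipped unigram matches, the core of BLEU-1 precision and ROUGE-1 recall."""
    hyp, ref = hypothesis.split(), reference.split()
    matched = sum((Counter(hyp) & Counter(ref)).values())  # clip repeats
    precision = matched / len(hyp)  # BLEU-1 core (before brevity penalty)
    recall = matched / len(ref)     # ROUGE-1 recall
    return precision, recall

# Padding the hypothesis with extra words hurts precision but leaves recall alone:
p1, r1 = unigram_overlap("the cat sat", "the cat sat on the mat")
p2, r2 = unigram_overlap("the cat sat yes yes yes", "the cat sat on the mat")
print(p1, r1)  # 1.0 0.5
print(p2, r2)  # 0.5 0.5
```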
| Metric | What it measures | Formula core | Best for | Weakness |
|---|---|---|---|---|
| Perplexity | How surprised the model is by a test corpus | exp(−(1/N) Σₜ log P(xₜ | x<t)) | LM comparison | Not comparable across tokenisations |
| BLEU | N-gram precision of hypothesis vs reference(s) | BP · exp(Σₙ wₙ log pₙ) | Machine translation | Precision-biased; punishes paraphrase |
| ROUGE-N | N-gram recall of reference covered by hypothesis | matched n-grams / ref n-grams | Summarisation | Recall-biased; padding inflates score |
| ROUGE-L | Longest Common Subsequence F1 | F1 of LCS-based precision & recall | Summarisation, fluency | Ignores non-contiguous order quality |
| METEOR | Alignment F1 with synonym matching & stemming | F_mean · (1 − penalty) | Translation; handles paraphrase | Language-dependent synonym tables |
| BERTScore | Token cosine similarity via contextual BERT embeddings | P/R/F1 over embedding matches | Any generation; semantic quality | Expensive; model-dependent |
| chrF | Character n-gram F-score; no word-boundary dependence | F1 over char n-grams | Morphologically rich languages | Less interpretable than word-level |
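Perplexity is the easiest of these metrics to compute by hand: it is just the exponentiated average negative log-probability the model assigns to held-out tokens. The probability lists below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp(-(1/N) * sum_t log P(x_t | x_<t)) over a held-out sequence."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.95, 0.85]  # model rarely surprised: low perplexity
uncertain = [0.2, 0.1, 0.3, 0.25]   # model often surprised: high perplexity
print(perplexity(confident) < perplexity(uncertain))  # True
```

A useful sanity check: a model that assigns probability 0.5 to every token has perplexity exactly 2, i.e. it is as surprised as a fair coin flip at each step.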
BLEU vs ROUGE in one sentence
BLEU cares more about how much of your output matches the reference. ROUGE cares more about how much of the reference you managed to cover.
The paraphrase problem
If two sentences mean the same thing but use different words, simple overlap metrics can score them unfairly. That is a major weakness to keep in mind.
Perplexity ≠ quality
A model can be good at predicting text patterns and still say false, unsafe, or unhelpful things. Fluent output is not the same as trustworthy output.