🔍Attention & Transformers
How models decide what to focus on
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Attention is all you need. Transformers view the whole sentence at once, letting each word grab the context that matters most.
How this lesson fits
These lessons power the language revolution. We turn words into math and teach models to track context and meaning as they read.
The big question
How can a model capture word meaning, hold onto context, and generate fluent language one token at a time?
Why You Should Care
Transformers power modern language models. Understanding attention demystifies how they hold context.
Where this is used today
- ✓Google Translate
- ✓BERT for search understanding
- ✓Protein folding prediction (AlphaFold)
Think of it like this
Like highlighting key lines in a textbook. You skim the filler but zero in on the important parts to build meaning.
Easy mistake to make
Attention isn’t consciousness. It’s a math trick for weighting information from different positions.
By the end, you should be able to say:
- Explain attention as weighted focus across tokens
- Describe the roles of queries, keys, and values at a high level
- Explain why transformers handle context better than older sequence models
Think about this first
In “The animal did not cross the street because it was tired,” what does “it” refer to, and which words help you decide?
From Text to Vectors
A transformer doesn't read like you do. First, it chops text into "tokens" (pieces of words). Then it turns those tokens into ID numbers, and finally into lists of coordinates (vectors). Only then can it start doing math on meaning.
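The pipeline above can be sketched in a few lines. This is a toy illustration, not a real tokenizer: the tiny vocabulary, the "##" suffix convention (borrowed from WordPiece-style tokenizers), and the random embedding values are all made up for demonstration.

```python
import numpy as np

# Toy vocabulary: whole stems plus "##"-prefixed subword suffixes.
vocab = {"think": 0, "##ing": 1, "machine": 2, "##s": 3}

def tokenize(text):
    """Greedy toy tokenizer: try the whole word, else split stem + ##suffix."""
    tokens = []
    for word in text.lower().split():
        if word in vocab:
            tokens.append(word)
        else:
            for i in range(len(word), 0, -1):
                if word[:i] in vocab and "##" + word[i:] in vocab:
                    tokens += [word[:i], "##" + word[i:]]
                    break
    return tokens

# Text -> tokens -> IDs -> vectors
tokens = tokenize("Thinking Machines")   # ["think", "##ing", "machine", "##s"]
ids = [vocab[t] for t in tokens]         # [0, 1, 2, 3]

d_model = 4                              # embedding width (tiny for display)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
vectors = embedding_table[ids]           # one row of coordinates per token
```

Only after this last step does the model have something it can multiply and add: a matrix with one vector per token.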
The Encoder Block
This is the engine room. Every token looks at every other token to figure out context, then processes that information through a private neural network. It does this dozens of times in a row.
- Self-attention — the "social" step. Each word asks: "Who else in this sentence helps explain me?"
- Feed-forward network — the "thinking" step. Each token digests what it gathered from its neighbors and updates its own representation, independently of the others.
- Residual connections and layer normalization — the stabilizers. They keep the signal well-scaled so gradients flow and deep stacks train without blowing up.
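The three steps above can be put together as one function. This is a minimal single-head sketch in numpy with random weights, assuming a post-norm layout; real implementations add multiple heads, biases, and learned norm parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(x, Wq, Wk, Wv, W1, W2):
    # "Social" step: self-attention mixes information across tokens.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    x = layer_norm(x + attn)            # residual + norm: the stabilizer
    # "Thinking" step: a per-token feed-forward network (ReLU MLP).
    ffn = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ffn)          # residual + norm again

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))             # 5 tokens, d-dimensional each
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
out = encoder_block(x, Wq, Wk, Wv, W1, W2)   # same shape as x
```

"Dozens of times in a row" just means stacking this block: the output of one call becomes the input of the next.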
Scaled Dot-Product Attention — Step by Step
This is the secret sauce. A token asks a question (Query), matches it against others (Key), and if they match, it absorbs information (Value). It's a soft, fuzzy lookup table.
Tokens: Thinking, Machines — d_k=3 (simplified). Each token's embedding is linearly projected into Q, K, and V vectors.
Each token's new representation is a weighted blend of all value vectors.
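The whole lookup is one formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, and it can be run on the two-token example directly. The numbers in Q, K, V below are illustrative stand-ins, not values from a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Two tokens ("Thinking", "Machines"), d_k = 3. Illustrative projections.
Q = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
K = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

scores = Q @ K.T / np.sqrt(3)        # how well each query matches each key
weights = softmax(scores, axis=-1)   # soft lookup: each row sums to 1
output = weights @ V                 # weighted blend of all value vectors
```

Each row of `weights` is one token's attention distribution, and each row of `output` is that token's new representation: a blend of every value vector, weighted by how well the query matched each key.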
Interactive Visualizations
“The animal did not cross the street because it was too tired”
Click a purple word to see its attention distribution. Darker orange = higher weight.
Watch how one word pulls information from the words that help explain it. That is the practical power of attention.
Transformers vs RNNs
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Parallelism | ❌ Token-by-token | ✅ All tokens at once |
| Long-range | ⚠️ Vanishing gradient | ✅ Direct O(1) path between any pair |
| Training speed | 🐢 Slow on GPU | 🚀 Highly parallelisable |
| Scalability | ⚠️ Plateaus early | ✅ Scales → foundation of LLMs |
The big story is simple: transformers handle long-range relationships and parallel training much better, which is why they became the foundation for modern LLMs.