Language & Transformers · Advanced

🔍 Attention & Transformers

How models decide what to focus on

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

35 min · Explore at your own pace

Before We Begin

What we are learning today

Attention is all you need. Transformers view the whole sentence at once, letting each word grab the context that matters most.

How this lesson fits

These lessons power the language revolution. We turn words into math and teach models to track context and meaning as they read.

The big question

How can a model capture word meaning, hold onto context, and generate fluent language one token at a time?

  • Explain how text becomes vectors in plain language
  • Interpret attention as a smart way of choosing context
  • Describe the basic workflow of large language models

Why You Should Care

Transformers power modern language models. Understanding attention demystifies how they hold context.

Where this is used today

  • Google Translate
  • BERT for search understanding
  • Protein folding prediction (AlphaFold)

Think of it like this

Like highlighting key lines in a textbook. You skim the filler but zero in on the important parts to build meaning.

Easy mistake to make

Attention isn’t consciousness. It’s a math trick for weighting information from different positions.

By the end, you should be able to say:

  • Explain attention as weighted focus across tokens
  • Describe the roles of queries, keys, and values at a high level
  • Explain why transformers handle context better than older sequence models

Think about this first

In “The animal did not cross the street because it was tired,” what does “it” refer to, and which words help you decide?

Words we will keep using

attention · token · query · key · value

From Text to Vectors

A transformer doesn't read like you do. First, it chops text into "tokens" (pieces of words). Then it turns those tokens into ID numbers, and finally into lists of coordinates called vectors. Only then can it start doing math on meaning.

  • Raw text: “Hello world”
  • Tokens: Hello, world
  • Token IDs: 7592, 2088
  • Embeddings: [0.2, -0.5, 0.8] and [0.9, 0.3, -0.2]
  • + Positional encoding: [0.3, -0.4, 0.9] and [0.8, 0.4, -0.1]
  • Encoder ×N: multi-head attention + feed-forward network, repeated
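The pipeline above can be sketched in a few lines of NumPy. The vocabulary, token IDs, and embedding rows here are toy stand-ins for the learned tables a real model uses; only the sinusoidal positional-encoding formula follows the original transformer recipe.

```python
import numpy as np

# Toy vocabulary and embedding table (real models learn tables with
# tens of thousands of tokens; these values are illustrative).
vocab = {"Hello": 7592, "world": 2088}
rows = {7592: [0.2, -0.5, 0.8], 2088: [0.9, 0.3, -0.2]}  # d_model = 3

def embed(text):
    """Text -> tokens -> IDs -> embedding vectors."""
    tokens = text.split()                      # crude whitespace tokenizer
    ids = [vocab[t] for t in tokens]           # look up token IDs
    return ids, np.array([rows[i] for i in ids])

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

ids, emb = embed("Hello world")
x = emb + positional_encoding(*emb.shape)  # what the encoder actually sees
```

Adding the positional encoding is what lets an order-blind attention mechanism tell "Hello world" apart from "world Hello".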

The Encoder Block

Input Embeddings + Positional Encoding → Multi-Head Self-Attention → Add & LayerNorm (residual) → Feed-Forward Network (×4 wider) → Add & LayerNorm (residual) → Output

This is the engine room. Every token looks at every other token to figure out context, then processes what it gathered through a small feed-forward network. A full model stacks dozens of these blocks in a row.

Multi-Head Self-Attention
The "social" step. Each word asks: "Who else in this sentence helps explain me?"
Feed-Forward Network
The "thinking" step. The word digests what it learned from its neighbors and updates its own meaning.
Residual + LayerNorm
The stabilizer. These connections keep the signal clean so the network doesn't crash during training.
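Here is a minimal single-head sketch of one encoder block in NumPy. The random weights stand in for learned parameters, and real blocks add multiple heads, bias terms, and learned LayerNorm gains; the structure (attention, then FFN, each wrapped in a residual + norm) is the point.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention (multi-head runs several of these in parallel)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def encoder_block(x, params):
    # 1. "Social" step: tokens exchange information; residual + norm stabilize it.
    a = self_attention(x, params["Wq"], params["Wk"], params["Wv"])
    x = layer_norm(x + a)
    # 2. "Thinking" step: position-wise FFN with a hidden layer ×4 wider.
    h = np.maximum(0, x @ params["W1"])   # ReLU
    return layer_norm(x + h @ params["W2"])

d = 4
rng = np.random.default_rng(0)
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
          "W1": (d, 4 * d), "W2": (4 * d, d)}
params = {k: rng.normal(size=s) for k, s in shapes.items()}
out = encoder_block(rng.normal(size=(3, d)), params)  # 3 tokens in, 3 out
```

Note that the block maps a sequence of token vectors to a same-shaped sequence, which is exactly what lets you stack these blocks dozens deep.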

Scaled Dot-Product Attention — Step by Step

This is the secret sauce. A token asks a question (Query), matches it against others (Key), and if they match, it absorbs information (Value). It's a soft, fuzzy lookup table.

\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
Q (Query): "What am I looking for?"
K (Key): "What do I contain?"
V (Value): "Here is my content."

Tokens: Thinking, Machines — d_k=3 (simplified). Each token's embedding is linearly projected into Q, K, V:

Q (Queries):
  Thinking → [0.90, 0.20, 0.50]
  Machines → [0.30, 0.80, 0.40]

K (Keys):
  Thinking → [0.70, 0.40, 0.60]
  Machines → [0.20, 0.90, 0.30]

V (Values):
  Thinking → [0.50, 0.80, 0.20]
  Machines → [0.70, 0.10, 0.90]

Scores = QKᵀ ÷ √3:
  Thinking → [0.58, 0.29]
  Machines → [0.44, 0.52]

Attention = softmax(Scores):
  Thinking → [0.57, 0.43]
  Machines → [0.48, 0.52]

Output = Attention × V:
  Thinking → [0.59, 0.50, 0.50]
  Machines → [0.60, 0.44, 0.56]

Each token's new representation is a weighted blend of all value vectors.
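You can reproduce the worked example directly; the three matrix operations below are the entire computation.

```python
import numpy as np

# Projected matrices from the worked example above (d_k = 3).
Q = np.array([[0.90, 0.20, 0.50],   # Thinking
              [0.30, 0.80, 0.40]])  # Machines
K = np.array([[0.70, 0.40, 0.60],
              [0.20, 0.90, 0.30]])
V = np.array([[0.50, 0.80, 0.20],
              [0.70, 0.10, 0.90]])

scores = Q @ K.T / np.sqrt(3)                                     # QKᵀ ÷ √d_k
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
output = weights @ V                                              # blend of values

print(scores.round(2))   # [[0.58 0.29] [0.44 0.52]]
print(weights.round(2))  # [[0.57 0.43] [0.48 0.52]]
print(output.round(2))   # [[0.59 0.5  0.5 ] [0.6  0.44 0.56]]
```

Dividing by √d_k keeps the dot products from growing with dimension, so the softmax stays soft instead of collapsing onto a single token.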

Interactive Visualizations

The animal did not cross the street because it was too tired


Watch how one word pulls information from the words that help explain it. That is the practical power of attention.

Transformers vs RNNs

| Property | RNN/LSTM | Transformer |
| --- | --- | --- |
| Parallelism | ❌ Token-by-token | ✅ All tokens at once |
| Long-range | ⚠️ Vanishing gradient | ✅ Direct O(1) path between any pair |
| Training speed | 🐢 Slow on GPU | 🚀 Highly parallelisable |
| Scalability | ⚠️ Plateaus early | ✅ Scales → foundation of LLMs |

The big story is simple: transformers handle long-range relationships and parallel training much better, which is why they became the foundation for modern LLMs.
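The contrast shows up directly in code: a toy RNN must loop over the sequence one step at a time, while self-attention connects every pair of tokens in a single matrix product. The weights here are random, purely to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
x = rng.normal(size=(T, d))  # a sequence of 6 token vectors

# RNN: T sequential steps. Information from token 0 reaches token T-1
# only through T-1 intermediate states (hence vanishing gradients).
Wh = rng.normal(size=(d, d)) * 0.1
Wx = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):                 # inherently serial, cannot parallelise
    h = np.tanh(h @ Wh + x[t] @ Wx)

# Self-attention: one matrix product scores every pair of positions
# directly (the O(1) path), and all T outputs come out together.
scores = x @ x.T / np.sqrt(d)
w = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = w @ x                        # whole sequence processed at once
```

The serial loop is why RNN training crawls on GPUs, and the single matmul is why transformers scale with hardware.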