Language & Transformers · Intermediate

📝Embeddings & Word2Vec

How words become meaningful vectors

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

25 min · Explore at your own pace

Before We Begin

What we are learning today

Words become numbers. In this space, “King” - “Man” + “Woman” lands near “Queen,” and synonyms become neighbors.

How this lesson fits

These lessons power the language revolution. We turn words into math and teach models to track context and meaning as they read.

The big question

How can a model capture word meaning, hold onto context, and generate fluent language one token at a time?

  • Explain how text becomes vectors in plain language
  • Interpret attention as a smart way of choosing context
  • Describe the basic workflow of large language models

Why You Should Care

Embeddings are the bridge between language and geometry. They make modern language tech possible.

Where this is used today

  • Semantic search (finding "dog" when you search "puppy")
  • Recommendation systems
  • Language translation alignment

Think of it like this

It’s a map of meaning. Synonyms share a neighborhood; opposites live across town. Distance signals relatedness.

Easy mistake to make

Embeddings reflect patterns in training text, biases included—they’re not perfect dictionaries.

By the end, you should be able to say:

  • Explain why words must be converted into numbers
  • Describe what it means for similar words to be close in vector space
  • Summarize the idea behind skip-gram training

Think about this first

If “king” and “queen” are related, how might a computer discover that from text alone?

Words we will keep using

embedding · vector · context · similarity · skip-gram

Vector Space Semantics

Imagine if words were places on a map. "King" and "Queen" would live next door. "Apple" and "Banana" would be down the street. This is what embeddings do: they turn meaning into geometry.

Classic analogy
king − man + woman ≈ queen

In the explorer below, you are looking at Word2Vec embeddings that were originally far larger. We squash them down to 3D so you can move around them and notice that language begins to form neighborhoods.
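You can test the analogy yourself in plain code. Here is a minimal sketch with made-up 3-D vectors (real Word2Vec embeddings have hundreds of learned dimensions, so these toy values only illustrate the arithmetic):

```python
import math

# Toy 3-D vectors invented for this example -- NOT real Word2Vec values.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.0],
    "queen": [0.9, 0.0, 0.1],
    "apple": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Compute king - man + woman, component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest word to the result, excluding the query words themselves.
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # -> queen (with these toy vectors)
```

With real embeddings the result rarely lands exactly on "queen", which is why the nearest-neighbor search (excluding the query words) is the standard way to read off an analogy.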

Word2Vec 10K

PCA 3D Projection

(The interactive explorer fetches the labels and vectors, parses them, centers the data, computes the covariance, runs PCA, and projects everything down to 3-D.)
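The projection pipeline behind the explorer can be sketched in a few lines: center the embeddings, compute the covariance matrix, keep the three eigenvectors with the largest eigenvalues, and project. The random matrix below stands in for real Word2Vec vectors:

```python
import numpy as np

# Stand-in data: 100 "words" with 50-D embeddings (random, for illustration).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 50))

centered = embeddings - embeddings.mean(axis=0)   # center
cov = np.cov(centered, rowvar=False)              # covariance (50 x 50)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigendecomposition
top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]  # 3 largest components
projected = centered @ top3                       # project to 3-D

print(projected.shape)  # (100, 3)
```

Squashing hundreds of dimensions into three necessarily discards information, so neighborhoods in the explorer are suggestive rather than exact.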

How it works: Skip-gram

The training is surprisingly simple: pick a word and ask the model to guess its neighbors. Do this billions of times. Words that appear in similar contexts naturally drift closer together in vector space.
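The "pick a word, guess its neighbors" step boils down to generating (center, context) training pairs with a sliding window. A minimal sketch, using a window size of c = 2:

```python
# Generate skip-gram training pairs: for each center word, every word
# within a window of c positions becomes a context target to predict.
sentence = "the quick brown fox jumps".split()
c = 2  # window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-c, c + 1):
        # Skip the center word itself and positions outside the sentence.
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

Each pair is one tiny prediction task; repeated over billions of sentences, these tasks are what pull contextually similar words together.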

Training Objective

The formal goal says: given the center word $w_t$, make the nearby context words $w_{t+j}$ as predictable as possible.

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \neq 0} \log p(w_{t+j} \mid w_t)$$

Where $p(w_O \mid w_I)$ is defined by the softmax of the dot product:

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}$$
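The softmax above is easy to compute by hand on a tiny vocabulary. A sketch with invented 2-D vectors ($v$ for center words, $v'$ for context words; all values are toy assumptions):

```python
import math

# Toy input ("center") vectors v and output ("context") vectors v'.
v = {"king": [1.0, 0.0], "dog": [0.0, 1.0]}
v_prime = {"queen": [0.9, 0.1], "banana": [0.1, 0.9], "throne": [0.8, 0.2]}

def p(context, center):
    """Softmax of dot products: exp(v'_context . v_center) / sum over vocab."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = {w: math.exp(dot(vec, v[center])) for w, vec in v_prime.items()}
    return scores[context] / sum(scores.values())

# The probabilities over the context vocabulary sum to 1,
# and "queen" scores highest as a context for "king".
print(round(p("queen", "king"), 3))
```

The denominator sums over the entire vocabulary $V$, which is why real implementations replace the full softmax with approximations such as negative sampling.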

Cosine Similarity

If two arrows point in the same direction, the words are related. If they are perpendicular, the words are unrelated. It's that simple.

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
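The formula translates directly into a few lines of code. Here it is on toy 2-D vectors so the three regimes (same direction, perpendicular, opposite) are easy to see:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  -- same direction: related
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  -- perpendicular: unrelated
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 -- opposite directions
```

Note that cosine similarity ignores vector length entirely; only the angle between the arrows matters, which is exactly the property we want for comparing word meanings.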

Try searching for “good” in the explorer above and inspect the nearest neighbors. That is where the abstract idea suddenly starts to feel real.