🔄 RNNs & LSTMs
Neural networks with memory
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Networks with memory. They read one word at a time, carrying a running “thought” about the sequence. LSTMs add gates that decide what to keep and what to forget.
How this lesson fits
Time matters. Language, music, weather—they all happen in a sequence. These models learn to remember the past to predict the future.
The big question
How can a model use the past to make sense of what comes next in a sequence?
Why You Should Care
This shows the evolution from explicit probability tables to learned neural memory. It’s the bridge to modern sequence models.
Where this is used today
- ✓ Predicting the next word in a sentence
- ✓ Stock market time-series analysis
- ✓ Music generation
Think of it like this
Like reading a story—you don’t forget the previous sentences every time you see a new word.
Easy mistake to make
LSTMs don’t remember everything forever. They simply manage memory better than basic RNNs.
By the end, you should be able to:
- Explain hidden state as a form of memory
- Describe why vanilla RNNs struggle with long sequences
- Explain what LSTM gates are trying to control
Think about this first
Why is it tough to make sense of the last word in a sentence if you forgot the first few?
Words we will keep using
- Hidden state: the vector the network carries forward as its memory of the sequence so far
- Gate: a learned switch between 0 and 1 that controls how much information flows through
- Cell state: the LSTM's long-term memory line, carried alongside the hidden state
Recurrent Neural Networks & LSTMs
Standard neural networks have amnesia—they treat every input as brand new. RNNs have a memory. They read one word at a time, carrying a "thought" forward that summarizes everything they've seen so far.
The vanilla RNN: simple, but it forgets quickly, because the learning signal shrinks as it is passed back through many steps. Good for short sentences, bad for paragraphs.
The LSTM: the pro version. It has special "gates" that let it choose what to remember and what to forget, so it can track ideas over long distances.
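The "carry a thought forward" idea fits in a few lines of code. Here is a minimal sketch of one RNN step in pure Python; the weights are made-up toy numbers (in a real network they would be learned), but the update rule is the standard one: mix the previous hidden state with the new input, then squash with tanh.

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_prev + W_x @ x + b)."""
    h_new = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        s += sum(W_x[i][j] * x[j] for j in range(len(x)))
        h_new.append(math.tanh(s))
    return h_new

# Toy 2-dimensional hidden state reading a 3-step sequence of 1-d inputs.
W_h = [[0.5, -0.3], [0.2, 0.4]]   # hidden-to-hidden weights (made up)
W_x = [[1.0], [-1.0]]             # input-to-hidden weights (made up)
b   = [0.0, 0.0]

h = [0.0, 0.0]                    # the memory starts empty
for x in [[1.0], [0.0], [1.0]]:
    h = rnn_step(h, x, W_h, W_x, b)
# h now summarizes the whole sequence in just two numbers
```

Notice that the sequence can be arbitrarily long, but the memory stays the same size; everything the network "knows" about the past has to fit in h.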
LSTM Gates
"Should I throw away this old memory?"
"Is this new information worth saving?"
"What is the new content I might add?"
"What should I tell the next layer right now?"
This is the real memory line of the LSTM. It is designed to carry useful information farther through time.
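The four gate questions above translate directly into code. This is a sketch of a single-number (scalar) LSTM step with hypothetical toy weights, just to show how the gates and the cell state interact; real LSTMs do the same thing with vectors and learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x, w):
    """One scalar LSTM step; w holds toy (made-up) weights.
    Each gate is a sigmoid in (0, 1) that scales how much flows through."""
    f = sigmoid(w["wf_h"] * h_prev + w["wf_x"] * x + w["bf"])    # forget gate
    i = sigmoid(w["wi_h"] * h_prev + w["wi_x"] * x + w["bi"])    # input gate
    g = math.tanh(w["wg_h"] * h_prev + w["wg_x"] * x + w["bg"])  # candidate values
    o = sigmoid(w["wo_h"] * h_prev + w["wo_x"] * x + w["bo"])    # output gate
    c = f * c_prev + i * g         # cell state: keep some old memory, add some new
    h = o * math.tanh(c)           # hidden state: what we expose right now
    return h, c

# Toy weights (all made up) and a short input sequence.
w = {k: 0.5 for k in ["wf_h", "wf_x", "bf", "wi_h", "wi_x", "bi",
                      "wg_h", "wg_x", "bg", "wo_h", "wo_x", "bo"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:
    h, c = lstm_step(h, c, x, w)
```

The key line is `c = f * c_prev + i * g`: the cell state is updated by simple scaling and adding, not repeated squashing, which is what lets information survive across many steps.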
Interactive State Trace
HMMs and RNNs — what is the actual difference?
Students often learn HMMs and RNNs as if they belong to different worlds, but they are actually related. Both keep a hidden state that summarizes the past. The big difference is how that state is represented and updated.
What they fundamentally share
Both models rely on the same core idea: the current hidden state should summarize the important parts of the past. That means the next step can be computed from the previous state plus the current input, instead of storing the whole history directly.
In an HMM, the next hidden state comes from a probability table.
In an RNN, the next hidden state comes from learned weights instead of a small probability table.
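The two update rules can be put side by side in a few lines. The numbers here are purely illustrative (nothing is learned), but they show the structural difference: the HMM updates a probability distribution with a fixed stochastic table, while the RNN updates a real-valued state with weights and a nonlinearity.

```python
import math

# HMM: the "state of knowledge" is a probability distribution over K
# discrete states, updated by a fixed transition table (rows sum to 1).
A = [[0.9, 0.1],
     [0.2, 0.8]]                 # toy transition table
belief = [1.0, 0.0]              # start certain we are in state 0
belief = [sum(belief[i] * A[i][j] for i in range(2)) for j in range(2)]
# belief is now [0.9, 0.1]

# RNN: the state is a real-valued number (a vector in general), updated
# by weights and a nonlinearity (toy weights here).
W_h, W_x, x = 0.5, 1.0, 1.0
h = 0.0
h = math.tanh(W_h * h + W_x * x)   # equals tanh(1.0)
```

Same shape of idea, new state from old state plus input, but the HMM update must stay a valid probability distribution, while the RNN state can be any real values at all.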
The three biggest differences
| Axis | HMM | RNN |
|---|---|---|
| Hidden state type | Discrete: a point mass on one of K states | Continuous: a dense vector of real values |
| Transition mechanism | Lookup table: a fixed stochastic matrix (rows sum to 1) | Learned weight matrix: an arbitrary real matrix plus a nonlinearity |
| Inference at test time | Required: Viterbi or the forward pass must marginalise over K hidden states | None: the state is deterministic; just compute the forward pass (h_t *is* the state) |
| Learning algorithm | Baum-Welch (EM): expectation over hidden states | Backprop through time (BPTT): gradients through the unrolled graph |
The key insight
A good way to think about it is this: HMMs use a tidy probability table for transitions, while RNNs replace that table with flexible learned weights. That extra flexibility is why RNNs can model richer patterns.
The cost of that flexibility is interpretability. In an HMM, the hidden state can often be named clearly. In an RNN, the hidden state is a vector, so the meaning is spread across many numbers at once.
Why HMMs need inference
In an HMM, you never directly see the hidden state. So at each step you must reason over several possibilities and keep track of their probabilities.
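That "reason over several possibilities" step is exactly what the forward algorithm does. Here is a sketch with toy probabilities: at every step it keeps one number per candidate hidden state and marginalises over all the ways the chain could have arrived there.

```python
# Toy HMM with 2 hidden states and 2 possible observations (made-up numbers).
A  = [[0.9, 0.1],
      [0.2, 0.8]]                   # transition probabilities
B  = [[0.7, 0.3],
      [0.1, 0.9]]                   # emission probabilities: B[state][obs]
pi = [0.5, 0.5]                     # initial state distribution

def forward(observations):
    """Probability of the observation sequence, summing over hidden paths."""
    alpha = [pi[s] * B[s][observations[0]] for s in range(2)]
    for obs in observations[1:]:
        # For each state s, sum over every state r we could have come from.
        alpha = [B[s][obs] * sum(alpha[r] * A[r][s] for r in range(2))
                 for s in range(2)]
    return sum(alpha)

p = forward([0, 1, 0])
```

The cost of this bookkeeping is modest here (K = 2), but the point stands: the HMM must carry uncertainty over states forward, while an RNN just computes its one state and moves on.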
Why RNNs need no inference
In an RNN, the hidden state is just computed directly. There is no extra uncertainty calculation over several candidate states. You simply run the network forward.
The continuous spectrum
You can think of these models as one family with increasing flexibility: HMM → vanilla RNN → LSTM/GRU. As the state representation becomes richer, the model becomes better at handling complex sequence patterns.
GRU — Simpler Alternative
A GRU is a lighter version of an LSTM. It uses fewer moving parts, but still tries to control memory with gates.
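To make "fewer moving parts" concrete, here is a scalar GRU step in the same toy style as before (weights are hypothetical): two gates instead of the LSTM's three, and no separate cell state, the hidden state itself is the memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x, w):
    """One scalar GRU step with toy (made-up) weights in w."""
    z = sigmoid(w["wz_h"] * h_prev + w["wz_x"] * x)           # update gate
    r = sigmoid(w["wr_h"] * h_prev + w["wr_x"] * x)           # reset gate
    g = math.tanh(w["wg_h"] * (r * h_prev) + w["wg_x"] * x)   # candidate
    return (1.0 - z) * h_prev + z * g   # blend old memory with new content

w = {k: 0.5 for k in ["wz_h", "wz_x", "wr_h", "wr_x", "wg_h", "wg_x"]}
h = 0.0
for x in [1.0, -1.0, 1.0]:
    h = gru_step(h, x, w)
```

The update gate z plays the combined role of the LSTM's forget and input gates: when z is near 0 the old memory passes through untouched, and when z is near 1 it is overwritten with new content.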
Modern trend: Transformers now dominate many language tasks, but RNN-style models still matter when streaming, low latency, or limited hardware is important.