Sequence Models · Intermediate

🔄RNNs & LSTMs

Neural networks with memory

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

30 min · Explore at your own pace

Before We Begin

What we are learning today

Networks with memory. They read one word at a time, carrying a running “thought” about the sequence. LSTMs add gates that decide what to keep and what to forget.

How this lesson fits

Time matters. Language, music, weather—they all happen in a sequence. These models learn to remember the past to predict the future.

The big question

How can a model use the past to make sense of what comes next in a sequence?

  • Explain why sequence order changes meaning
  • Compare probabilistic and neural approaches to sequences
  • Track memory and hidden state across time

Why You Should Care

This shows the evolution from explicit probability tables to learned neural memory. It’s the bridge to modern sequence models.

Where this is used today

  • Predicting the next word in a sentence
  • Stock market time-series analysis
  • Music generation

Think of it like this

Like reading a story—you don’t forget the previous sentences every time you see a new word.

Easy mistake to make

LSTMs don’t remember everything forever. They simply manage memory better than basic RNNs.

By the end, you should be able to say:

  • Explain hidden state as a form of memory
  • Describe why vanilla RNNs struggle with long sequences
  • Explain what LSTM gates are trying to control

Think about this first

Why is it tough to make sense of the last word in a sentence if you forgot the first few?

Words we will keep using

sequence · hidden state · gate · memory · vanishing gradient

Recurrent Neural Networks & LSTMs

Standard neural networks have amnesia—they treat every input as brand new. RNNs have a memory. They read one word at a time, carrying a "thought" forward that summarizes everything they've seen so far.

Vanilla RNN
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

Simple, but it forgets quickly. Good for short sentences, bad for paragraphs.
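To make the update rule concrete, here is a minimal sketch of one vanilla RNN step in numpy. The sizes and randomly initialised weights are illustrative assumptions, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4  # toy input and hidden sizes (assumed for illustration)

# Randomly initialised parameters, standing in for learned weights
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_in))
b = np.zeros(d_h)

def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x + b)

# Carry the hidden state forward across a short sequence
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # 5 time steps of fake input
    h = rnn_step(h, x)
```

Notice that the whole "memory" is the single vector `h`, overwritten at every step — which is exactly why long-range information tends to get washed out.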

LSTM (Long Short-Term Memory)

The pro version. It has special "gates" that let it choose what to remember and what to forget, so it can track ideas over long distances.

LSTM Gates

Forget gate:
f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)
"Should I throw away this old memory?"

Input gate:
i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)
"Is this new information worth saving?"

Candidate:
\tilde{c}_t = \tanh(W_g[h_{t-1}, x_t] + b_g)
"What is the new content I might add?"

Output gate:
o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)
"What should I tell the next layer right now?"

Cell state update:
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

This is the real memory line of the LSTM. It is designed to carry useful information farther through time.

Hidden state:
h_t = o_t \odot \tanh(c_t)
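The equations above can be sketched as a single LSTM step in numpy. Sizes and weights are made-up placeholders; the point is to see all four gates and the cell-state line working together:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
d_cat = d_h + d_in  # each gate reads the concatenation [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate (random, for illustration only)
W_f, W_i, W_g, W_o = (rng.normal(scale=0.1, size=(d_h, d_cat)) for _ in range(4))
b_f, b_i, b_g, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    v = np.concatenate([h_prev, x])   # [h_{t-1}, x_t]
    f = sigmoid(W_f @ v + b_f)        # forget gate: keep or drop old memory
    i = sigmoid(W_i @ v + b_i)        # input gate: is the new content worth saving?
    g = np.tanh(W_g @ v + b_g)        # candidate content
    o = sigmoid(W_o @ v + b_o)        # output gate: what to expose right now
    c = f * c_prev + i * g            # cell state: the memory line
    h = o * np.tanh(c)                # hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):  # 6 time steps of fake input
    h, c = lstm_step(h, c, x)
```

Note that `c` is updated additively (`f * c_prev + i * g`) rather than being overwritten — that additive path is what lets gradients and information travel farther through time.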


HMMs and RNNs — what is the actual difference?

Students often learn HMMs and RNNs as if they belong to different worlds, but they are actually related. Both keep a hidden state that summarizes the past. The big difference is how that state is represented and updated.

What they fundamentally share

Both models rely on the same core idea: the current hidden state should summarize the important parts of the past. That means the next step can be computed from the previous state plus the current input, instead of storing the whole history directly.

HMM — state update
P(z_t \mid z_{t-1}) = A_{z_{t-1},\, z_t}

In an HMM, the next hidden state comes from a probability table.

RNN — state update
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

In an RNN, the next hidden state comes from learned weights instead of a small probability table.
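The contrast is easy to see side by side in code. Below, the HMM's transition is literally a row lookup in a table, while the RNN's is a matrix multiply through a nonlinearity (all numbers are toy values, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# HMM: the next-state distribution is a row of a stochastic matrix
K = 3
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])   # rows sum to 1
z_prev = 0
next_dist = A[z_prev]             # P(z_t | z_{t-1} = 0): just a table lookup
z_t = rng.choice(K, p=next_dist)  # the state itself is random

# RNN: the next state is a deterministic function with learned weights
d = 3
W_h = rng.normal(scale=0.1, size=(d, d))
W_x = rng.normal(scale=0.1, size=(d, 1))
h_prev = np.zeros(d)
x_t = np.array([1.0])
h_t = np.tanh(W_h @ h_prev + W_x @ x_t)  # no randomness, no lookup
```

Same recipe — "previous state in, next state out" — but one mechanism is a K×K table of probabilities and the other is an unconstrained learned function.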

The four biggest differences

Hidden state type
  HMM: discrete — z_t \in \{1, \ldots, K\}, a point mass on one of K states
  RNN: continuous — h_t \in \mathbb{R}^d, a dense vector of arbitrary real values

Transition mechanism
  HMM: lookup table A \in \mathbb{R}^{K \times K}, a fixed stochastic matrix (rows sum to 1)
  RNN: learned weight matrix W_h \in \mathbb{R}^{d \times d}, an arbitrary real matrix plus a nonlinearity

Inference at test time
  HMM: required — Viterbi or the forward algorithm, marginalising over K hidden states
  RNN: none — the state is deterministic; just compute the forward pass (h_t is the state)

Learning algorithm
  HMM: Baum-Welch (EM), taking an expectation over hidden states
  RNN: backpropagation through time (BPTT), with gradients through the unrolled graph

The key insight

A good way to think about it is this: HMMs use a tidy probability table for transitions, while RNNs replace that table with flexible learned weights. That extra flexibility is why RNNs can model richer patterns.

The cost of that flexibility is interpretability. In an HMM, the hidden state can often be named clearly. In an RNN, the hidden state is a vector, so the meaning is spread across many numbers at once.

Why HMMs need inference

In an HMM, you never directly see the hidden state. So at each step you must reason over several possibilities and keep track of their probabilities.
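That "reasoning over several possibilities" is exactly what the forward algorithm does: it maintains a probability distribution over the hidden states and updates it after every observation. Here is a minimal sketch for a toy 2-state HMM (all probabilities are invented for illustration):

```python
import numpy as np

# Tiny 2-state, 2-symbol HMM with made-up parameters
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # A[z, z'] = P(z_t = z' | z_{t-1} = z)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])        # B[z, x] = P(observe x | state z)
pi = np.array([0.5, 0.5])         # initial state distribution

def forward(obs):
    """Return the belief over hidden states after each observation."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    beliefs = [alpha]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # propagate beliefs, weight by likelihood
        alpha /= alpha.sum()           # renormalise to a distribution
        beliefs.append(alpha)
    return beliefs

beliefs = forward([0, 0, 1])  # a short observation sequence
```

At every step the model carries a whole distribution over K states — that bookkeeping is the inference cost an RNN simply does not pay.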

Why RNNs need no inference

In an RNN, the hidden state is just computed directly. There is no extra uncertainty calculation over several candidate states. You simply run the network forward.

The continuous spectrum

HMM → Soft HMM (continuous states) → Linear RNN (no nonlinearity) → Vanilla RNN → LSTM / GRU

You can think of these models as one family with increasing flexibility. As you move right, the state representation becomes richer and the model becomes better at handling complex sequence patterns.

GRU — Simpler Alternative

A GRU is a lighter version of an LSTM: it merges the cell state and hidden state into a single vector and uses two gates (update and reset) instead of three, but it still controls memory with gates.
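A minimal sketch of one GRU step, with the same toy-sized random weights as before (note that libraries differ slightly in which direction the update gate blends old and new state; this follows one common convention):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h = 3, 4
d_cat = d_h + d_in  # gates read the concatenation [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights: update gate, reset gate, candidate
W_z, W_r, W_n = (rng.normal(scale=0.1, size=(d_h, d_cat)) for _ in range(3))

def gru_step(h_prev, x):
    v = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ v)                                 # update gate
    r = sigmoid(W_r @ v)                                 # reset gate
    n = np.tanh(W_n @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * n                      # blend old and new

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(h, x)
```

One state vector, two gates, no separate cell line — fewer parameters than an LSTM, which is why GRUs are a popular default when compute or data is limited.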

Modern trend: Transformers now dominate many language tasks, but RNN-style models still matter when streaming, low latency, or limited hardware is important.