Neural Architectures Decoded: FFNN, RNN, and Transformers

Feedforward nets, RNNs, and transformers are three different ways of teaching machines to notice patterns: layers for shape, recurrence for memory, and attention for selective focus. This guide compares them without losing the math.

10 min read · March 17, 2026
FFNN · RNN · Transformers · Attention

Deterministic layers

Feedforward networks — stacked linear transforms with nonlinear hinges

Imagine stacking transparent sheets on a projector. Each sheet applies a linear transform (rotate, scale) followed by a squishing function (nonlinearity). After enough sheets, the projection can draw almost any boundary in the input space. Backpropagation adjusts the sheet orientations to fit the data. The universal approximation theorem guarantees this works in principle; careful regularization makes it work in practice.
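To make the sheet metaphor concrete, here is a minimal NumPy sketch of two stacked "sheets" (the layer sizes, random weights, and ReLU choice are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Two "sheets": linear transform + nonlinearity, then a linear readout.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    h = relu(x @ W1 + b1)   # sheet 1: rotate/scale, then squish
    return h @ W2 + b2      # sheet 2: linear readout

x = rng.normal(size=(4, 2))   # four 2-D input points
print(forward(x).shape)       # (4, 1): one score per point
```

Training would adjust `W1`, `b1`, `W2`, `b2` by backpropagation; here we only run the forward pass.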

Capacity

Width vs. depth trade-off

Wide networks capture richer features within a layer (more neurons per layer); deep networks compose higher-order abstractions across layers. ResNet skip connections let you go very deep without vanishing gradients.

Stability

Regularization toolkit

Dropout (trains an implicit ensemble of thinned networks), weight decay (an L2 penalty, equivalent to a Gaussian prior on the weights), batch normalization (normalizes internal activations). Combine based on dataset size: small data → heavy regularization.
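A sketch of two of those tools, assuming the standard inverted-dropout formulation and a plain L2 penalty added to the loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: zero units with probability p, rescale survivors."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def l2_penalty(weights, lam=1e-4):
    """Weight decay term added to the loss (Gaussian-prior view of L2)."""
    return lam * sum((W ** 2).sum() for W in weights)

h = np.ones((2, 4))
print(dropout(h).mean())   # expectation is 1.0; exact value depends on the mask
```

At inference time dropout is a no-op, which is why the rescale happens during training.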

Visual

Decision boundary animation

Add layers one by one to a 2D scatter plot — watch the linear decision boundary curve into complex manifolds. This is the best single visualization for explaining universal approximation to newcomers.

$$h^{(l)} = \sigma\!\left(W^{(l)}\,h^{(l-1)} + b^{(l)}\right)$$

Layer l applies weights W, shifts with bias b, then squishes through σ (ReLU keeps positives, zeros negatives). Animate as a 2D point cloud rotating and reshaping at each depth.

$$\hat{y} = \text{softmax}\!\left(W^{(L)}\,h^{(L-1)} + b^{(L)}\right)$$

Final layer converts raw logit scores into a probability distribution over classes — animate as a radial bar chart filling toward 1 for the predicted class.
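A numerically stable softmax makes the "filling toward 1" behavior easy to verify (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])
probs = softmax(logits)
print(probs.sum())     # 1.0: a proper probability distribution
print(probs.argmax())  # 0: the predicted class
```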

Temporal memory

Recurrent networks — a hidden state that reads the past

The RNN insight: feed the network its own previous output. This creates a rolling memory vector that compresses all prior context into a fixed-size representation. The downside: gradients traveling backward through T steps multiply a weight matrix T times, causing them to vanish (shrink to zero) or explode (grow unboundedly). LSTM and GRU solve this with differentiable gates — learned switches that control what to remember, forget, and output. Visualize the hidden state as a pulsing color bar that evolves token by token.
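The rolling-memory idea fits in a few lines; the dimensions and random weights below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5

W_xh = rng.normal(scale=0.5, size=(d_in, d_h))
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
b_h = np.zeros(d_h)

def rnn(xs):
    """Roll a fixed-size hidden state over the whole sequence."""
    h = np.zeros(d_h)
    for x_t in xs:                                  # one step per token
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h                                        # compressed summary of all prior tokens

xs = rng.normal(size=(T, d_in))
print(rnn(xs).shape)   # (4,): fixed size, regardless of sequence length
```

Note that the output size never grows with `T`, which is both the appeal (constant memory) and the bottleneck (lossy compression of long contexts).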

Gated

LSTM gates

Input, forget, and output gates provide differentiable memory control; they learn what to retain vs. erase.

Lean

GRU trade-off

Fewer gates, faster inference, slightly less expressive but easier to tune.

$$h_t = \tanh\!\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right)$$

Vanilla RNN: new hidden state blends current input and previous state through tanh (−1 to +1). Animate h_t as a bar chart morphing after each token is consumed.

$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_T}\,\prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

Gradient is a product across T steps — if each factor < 1 the product → 0 (vanishing); if > 1 it → ∞ (exploding). Heatmap each factor across time to show where gradients collapse.
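A scalar stand-in for each Jacobian factor shows the collapse and blow-up directly (the factor values 0.9 and 1.1 are illustrative):

```python
def product_over_time(factor, T=50):
    """Multiply T copies of a scalar stand-in for ∂h_t/∂h_{t-1}."""
    grad = 1.0
    history = []
    for _ in range(T):
        grad *= factor
        history.append(grad)
    return history

print(product_over_time(0.9)[-1])   # ~0.005: vanishing
print(product_over_time(1.1)[-1])   # ~117: exploding
```

Plotting `history` over `t` gives exactly the heatmap-style view described above: where the curve flattens to zero is where learning signal stops reaching early timesteps.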

Checklist

  • Exploding gradients → clip at 1.0 or switch to gated cells.
  • Teacher forcing speeds convergence but masks exposure bias at inference; scheduled sampling pays down that debt during training.
  • Profile sequential latency; batching timesteps amortizes framework overhead.
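Gradient clipping from the first bullet can be sketched as a global-norm rescale (a simplified stand-in for framework utilities such as PyTorch's `clip_grad_norm_`):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale all gradients together if their global norm exceeds max_norm."""
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.full(3, 10.0)]          # a deliberately exploded gradient
clipped = clip_grad_norm(grads)
print(np.linalg.norm(clipped[0]))   # ≈ 1.0 after clipping
```

Clipping the global norm (rather than each tensor separately) preserves the gradient's direction while bounding its magnitude.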

Parallel sequence understanding

Transformers — every token votes on every other token

The Transformer breakthrough: replace sequential state with a single operation that lets any two tokens interact directly, regardless of distance. Each token generates three vectors — a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information to pass?). The attention score between two tokens is their query-key dot product, scaled and softmaxed. The result is fully parallelizable during training and scales predictably with compute — which is why every frontier model today is Transformer-based.

Architecture

Multi-head attention

Run H attention heads in parallel, each in a d_k = d_model/H subspace. Different heads specialize — some capture syntactic dependencies, others semantic ones. Animate each head as a separate heatmap layer.
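The subspace split is just a reshape; a sketch assuming d_model = 12 and H = 3 heads:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (seq, d_model) into (n_heads, seq, d_k) subspaces."""
    seq, d_model = x.shape
    d_k = d_model // n_heads
    return x.reshape(seq, n_heads, d_k).transpose(1, 0, 2)

x = np.zeros((5, 12))             # 5 tokens, d_model = 12
heads = split_heads(x, n_heads=3)
print(heads.shape)                # (3, 5, 4): 3 heads, each with d_k = 4
```

Each head then runs the same attention operation independently in its own d_k-dimensional subspace, and the outputs are concatenated back to d_model.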

Serving

KV cache for fast inference

During autoregressive generation, reuse previously computed Key and Value vectors. Without the cache, each token step recomputes the entire sequence — latency scales as O(n²). With the cache, each step is O(n).
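A toy cache sketch, assuming per-token projections `W_k` and `W_v` (real implementations cache per layer and per head):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))

class KVCache:
    """Store K/V per generated token so each step is O(n), not O(n²)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t):
        # Only the new token's K/V are computed; earlier entries are reused.
        self.keys.append(x_t @ W_k)
        self.values.append(x_t @ W_v)
        return np.stack(self.keys), np.stack(self.values)

cache = KVCache()
for t in range(4):
    K, V = cache.step(rng.normal(size=d))
print(K.shape, V.shape)   # (4, 8) (4, 8): grows by one row per step
```

The memory cost is the flip side: the cache grows linearly with context length, which is why long-context serving is memory-bound.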

Scaling

Chinchilla scaling law

Compute-optimal training: N parameters require ~20N training tokens. Smaller, well-trained models consistently outperform larger undertrained ones. Plot as a compute-optimal frontier curve.
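The rule of thumb is one line; the 7-billion-parameter count below is a hypothetical example:

```python
def chinchilla_tokens(n_params):
    """Compute-optimal token budget: roughly 20 training tokens per parameter."""
    return 20 * n_params

# A hypothetical 7e9-parameter model would want ~1.4e11 training tokens.
print(f"{chinchilla_tokens(7e9):.1e}")   # 1.4e+11
```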

$$\text{Attn}(Q,K,V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Scale by √d_k to prevent softmax from saturating in high dimensions. Animate as a heatmap grid — rows = query tokens, columns = key tokens, cell brightness = attention weight.
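A direct transcription of the formula, assuming single-head attention over made-up Q, K, V matrices:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps the softmax unsaturated
    weights = softmax(scores)          # each row is one query's attention distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query tokens, d_k = 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(w.sum(axis=-1))   # each row sums to 1
```

The `weights` matrix is exactly the heatmap described above: rows are query tokens, columns are key tokens.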

$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)\,W_2 + b_2$$

Per-token feedforward block (applied identically to each position) keeps local nonlinear capacity. Typically 4× wider than d_model.

$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Sinusoidal positional encoding injects sequence order without learned parameters. Different frequencies encode short- vs. long-range position — animate as wave patterns overlaying the embedding matrix.
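A sketch of the encoding for an even embedding width d (even dimensions get the sine term above; odd dimensions get the companion cosine):

```python
import numpy as np

def positional_encoding(n_pos, d):
    """Sinusoidal PE: sin on even dims, cos on odd dims, per the formula."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(16, 8)
print(pe.shape)      # (16, 8)
print(pe[0, 0::2])   # position 0: all sine entries are 0
```

Plotting `pe` as an image gives the wave-pattern visualization mentioned above: low dimensions oscillate quickly (short-range position), high dimensions slowly (long-range).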

Choosing the right architecture: a decision tree

Start from constraints and work toward architecture — not the other way. The three families are mutually exclusive in their core design assumptions (MECE): fixed-input → FFNN, sequential/streaming → RNN, relational/long-context → Transformer. Map your problem to exactly one branch, then optimize within it.

Static

FFNN

Best for tabular or fixed-size signals with limited context. Simple, deterministic, cheap.

Sequential

RNN

Streaming or small-sequence problems needing temporal awareness without huge hardware.

Context-rich

Transformer

Large context, transfer learning, multimodal modeling. Heavy but unmatched flexibility.

Latency target

< 10 ms

Context window

64 → 128k tokens