Neural Architectures Decoded: FFNN, RNN, and Transformers
Feedforward nets, RNNs, and transformers are three different ways of teaching machines to notice patterns: layers for shape, recurrence for memory, and attention for selective focus. This guide compares them without losing the math.
Deterministic layers
Feedforward networks — stacked linear transforms with nonlinear hinges
Imagine stacking transparent sheets on a projector. Each sheet applies a linear transform (rotate, scale) followed by a squishing function (nonlinearity). After enough sheets, the projection can draw almost any boundary in the input space. Backpropagation nudges the sheet orientations, step by step, to reduce error. The universal approximation theorem guarantees this works in principle; careful regularization makes it work in practice.
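The sheet-stacking picture can be sketched in a few lines of NumPy. This is a minimal forward pass only (no training); the layer sizes and random weights are illustrative, not from the text:

```python
import numpy as np

def relu(x):
    # The "squishing function": keep positives, zero out negatives
    return np.maximum(0.0, x)

def forward(x, layers):
    """Apply each (W, b) 'sheet': linear transform, then nonlinearity."""
    for W, b in layers:
        x = relu(x @ W + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 8)), np.zeros(8)),   # sheet 1: 4 -> 8
          (rng.standard_normal((8, 3)), np.zeros(3))]   # sheet 2: 8 -> 3
out = forward(rng.standard_normal((2, 4)), layers)       # two inputs of size 4
print(out.shape)  # (2, 3)
```

Backpropagation would then adjust each `W` and `b` against a loss; only the stacked transform-and-squish structure is shown here.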
Width vs. depth trade-off
Wide networks capture richer features per layer (more neurons mean more parallel feature detectors). Deep networks compose those features into higher-order abstractions across layers. ResNet skip connections give gradients a shortcut around each layer, letting you go very deep without vanishing gradients.
Regularization toolkit
Dropout (an ensemble of thinned networks), weight decay (an L2 penalty, equivalent to a Gaussian prior on the weights), batch normalization (normalizes internal activations). Combine based on dataset size: small data → heavy regularization.
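The first tool in that kit, dropout, is easy to sketch. This is a hedged illustration of the standard "inverted dropout" trick: survivors are rescaled by 1/(1−p) at train time so the expected activation is unchanged, and inference needs no correction. The shapes and drop rate are arbitrary examples:

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with prob p, rescale survivors by 1/(1-p)."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p          # keep-mask, True with prob 1-p
    return x * mask / (1.0 - p)              # rescale so E[output] == input

rng = np.random.default_rng(0)
x = np.ones((1000, 10))
y = dropout(x, p=0.5, rng=rng)               # roughly half zeroed, rest doubled
```

At evaluation time (`train=False`) the function is the identity, which is why the rescaling happens during training rather than at inference.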
Decision boundary animation
Add layers one by one to a 2D scatter plot — watch the linear decision boundary curve into complex manifolds. This is the best single visualization for explaining universal approximation to newcomers.
Layer l computes h_l = σ(W_l h_{l−1} + b_l): weights W_l, a shift by bias b_l, then a squish through σ (ReLU keeps positives, zeros negatives). Animate as a 2D point cloud rotating and reshaping at each depth.
Final layer converts raw logit scores into a probability distribution over classes — animate as a radial bar chart filling toward 1 for the predicted class.
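The logits-to-probabilities conversion is the softmax. A minimal, numerically stable sketch (subtracting the max before exponentiating avoids overflow without changing the result):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1, largest logit wins
```

Each bar in the radial chart is one entry of `probs`; together they always fill to exactly 1.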
Temporal memory
Recurrent networks — a hidden state that reads the past
The RNN insight: feed the network its own previous hidden state. This creates a rolling memory vector that compresses all prior context into a fixed-size representation. The downside: gradients traveling backward through T steps multiply a weight matrix T times, causing them to vanish (shrink to zero) or explode (grow unboundedly). LSTM and GRU solve this with differentiable gates — learned switches that control what to remember, forget, and output. Visualize the hidden state as a pulsing color bar that evolves token by token.
LSTM gates
Input, forget, and output gates provide differentiable memory control; they learn what to retain vs. erase.
GRU trade-off
Fewer gates, faster inference, slightly less expressive but easier to tune.
Vanilla RNN: new hidden state blends current input and previous state through tanh (−1 to +1). Animate h_t as a bar chart morphing after each token is consumed.
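The vanilla recurrence can be sketched directly. One step blends the current input and previous state, then squashes into (−1, +1) with tanh; the dimensions and small random weights are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """Blend current input and previous state, squash into (-1, 1) with tanh."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W_xh = rng.standard_normal((d_in, d_h)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)                              # memory starts empty
for x_t in rng.standard_normal((4, d_in)):     # consume four "tokens"
    h = rnn_step(x_t, h, W_xh, W_hh, b)        # the bar chart morphs here
```

Each loop iteration is one frame of the bar-chart animation: the same fixed-size vector `h` absorbs another token of context.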
Gradient is a product across T steps — if each factor < 1 the product → 0 (vanishing); if > 1 it → ∞ (exploding). Heatmap each factor across time to show where gradients collapse.
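The product-across-T-steps failure mode needs nothing beyond arithmetic to demonstrate. Fifty factors slightly below 1 collapse toward zero; fifty slightly above 1 blow up:

```python
vanish, explode = 1.0, 1.0
for _ in range(50):        # T = 50 timesteps
    vanish *= 0.9          # each backward factor slightly < 1
    explode *= 1.1         # each backward factor slightly > 1

print(vanish)  # ~0.005: the gradient has all but disappeared
print(explode)  # ~117: the gradient has blown up
```

This is exactly what the per-factor heatmap visualizes: a run of sub-unit factors shows where the gradient signal dies.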
Checklist
- Exploding gradients → clip at 1.0 or switch to gated cells.
- Teacher forcing speeds convergence but hides exposure bias; scheduled sampling pays that debt down during training.
- Profile sequential latency; batching timesteps amortizes framework overhead.
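The first checklist item, clipping at 1.0, usually means clipping by global norm: all gradients are rescaled together so their joint L2 norm never exceeds the threshold. A minimal sketch (the example gradient values are made up for illustration):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))  # no-op if already under the cap
    return [g * scale for g in grads]

grads = [np.full(4, 3.0), np.full(4, 4.0)]       # joint norm = sqrt(36 + 64) = 10
clipped = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling jointly (rather than per-tensor) preserves the gradient's direction while bounding its magnitude.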
Parallel sequence understanding
Transformers — every token votes on every other token
The Transformer breakthrough: replace sequential state with a single operation that lets any two tokens interact directly, regardless of distance. Each token generates three vectors — a Query (what am I looking for?), a Key (what do I offer?), and a Value (what information to pass?). The attention score between two tokens is their query-key dot product, scaled and softmaxed. The result is fully parallelizable during training and scales predictably with compute — which is why every frontier model today is Transformer-based.
Multi-head attention
Run H attention heads in parallel, each in a d_k = d_model/H subspace. Different heads specialize — some capture syntactic dependencies, others semantic ones. Animate each head as a separate heatmap layer.
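The subspace split is a pure reshape. A hedged sketch of how a (seq, d_model) activation is carved into H independent head views, with toy dimensions chosen for readability:

```python
import numpy as np

def split_heads(x, H):
    """Reshape (seq, d_model) into (H, seq, d_k) so each head gets its own subspace."""
    seq, d_model = x.shape
    d_k = d_model // H                       # per-head width
    return x.reshape(seq, H, d_k).transpose(1, 0, 2)

x = np.arange(24.0).reshape(6, 4)            # seq = 6, d_model = 4
heads = split_heads(x, H=2)                  # two heads, d_k = 2 each
print(heads.shape)  # (2, 6, 2)
```

Each `heads[i]` is one of the heatmap layers in the animation: the same tokens, attended to through a different d_k-dimensional lens.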
KV cache for fast inference
During autoregressive generation, reuse previously computed Key and Value vectors. Without the cache, each new token recomputes attention over the entire sequence, so per-step cost grows as O(n²). With the cache, each step is O(n): compute one new K/V pair and attend the new query over the stored keys.
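A minimal sketch of the cached decode loop, assuming single-head attention and a plain dict as the cache (real implementations use preallocated tensors, but the shape of the idea is the same):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def decode_step(q_t, k_t, v_t, cache):
    """Append this token's key/value, then attend q_t over ALL cached keys: O(n) per step."""
    cache["K"].append(k_t)
    cache["V"].append(v_t)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    w = softmax(K @ q_t / np.sqrt(len(q_t)))   # scores against every cached key
    return w @ V

rng = np.random.default_rng(0)
cache = {"K": [], "V": []}
for _ in range(5):                             # five generation steps
    q, k, v = rng.standard_normal((3, 8))      # this step's Q, K, V (d = 8)
    out = decode_step(q, k, v, cache)
```

Only one new K/V pair is computed per step; everything earlier is read back from the cache instead of being recomputed.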
Chinchilla scaling law
Compute-optimal training: N parameters require ~20N training tokens. Smaller, well-trained models consistently outperform larger undertrained ones. Plot as a compute-optimal frontier curve.
Scale by √d_k to prevent softmax from saturating in high dimensions. Animate as a heatmap grid — rows = query tokens, columns = key tokens, cell brightness = attention weight.
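Scaled dot-product attention, with the √d_k division in place, fits in a few lines. A sketch with illustrative shapes (4 tokens, d_k = 8); the returned `weights` matrix is exactly the heatmap grid described above:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: divide scores by sqrt(d_k) to keep softmax soft."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # rows = queries, cols = keys
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # each row sums to 1
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4, 8))                  # 4 tokens, d_k = 8
out, weights = attention(Q, K, V)
```

Without the √d_k scale, dot products grow with dimension and the softmax rows collapse toward one-hot spikes; the division keeps the attention distribution usable.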
Per-token feedforward block (applied identically to each position) keeps local nonlinear capacity. Typically 4× wider than d_model.
Sinusoidal positional encoding injects sequence order without learned parameters. Different frequencies encode short- vs. long-range position — animate as wave patterns overlaying the embedding matrix.
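The sinusoidal encoding can be built in closed form, no learned parameters. A sketch following the standard sin/cos-on-alternating-dimensions construction, with geometrically spaced frequencies (the 10000 base is the conventional choice):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoids at geometrically spaced frequencies: even dims sin, odd dims cos."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / (10000 ** (i / d_model))           # low i = fast waves (short range)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=16)      # overlay on embedding matrix
```

Plotting each column of `pe` over position gives the wave patterns in the animation: fast waves resolve nearby positions, slow waves encode long-range order.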
Choosing the right architecture: a decision tree
Start from constraints and work toward architecture — not the other way. The three families are mutually exclusive in their core design assumptions (MECE): fixed-input → FFNN, sequential/streaming → RNN, relational/long-context → Transformer. Map your problem to exactly one branch, then optimize within it.
FFNN
Best for tabular or fixed-size signals with limited context. Simple, deterministic, cheap.
RNN
Streaming or small-sequence problems needing temporal awareness without huge hardware.
Transformer
Large context, transfer learning, multimodal modeling. Heavy but unmatched flexibility.
Latency target: < 10 ms
Context window: 64 → 128k tokens