
🎮 Reinforcement Learning

Learning by trying, failing, and improving

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

30 min · Explore at your own pace

Before We Begin

What we are learning today

Training by trial and error. The agent acts, earns cookies or zaps, and slowly learns to maximize future cookies. It’s about long-term strategy, not just instant gratification.

How this lesson fits

Here we peek over the horizon: agents that learn by doing, and systems that train together without spilling secrets.

The big question

How can AI learn from its own experience and still respect privacy and real-world limits?

  • Interpret reward-driven learning and long-term payoff
  • Explain the exploration vs. exploitation balance
  • Describe privacy-aware training across many devices

Why You Should Care

This is a different learning setup—no labeled answers. It powers games, robotics, and complex decision-making.

Where this is used today

  • AlphaGo beating human champions
  • Robots learning to walk
  • Optimizing datacenter cooling

Think of it like this

Like training a puppy or mastering a video game—feedback guides better choices over time.

Easy mistake to make

Watching an agent stumble around early on, it's tempting to conclude RL is just random flailing. It isn't: the random moves are deliberate exploration, and structured feedback steadily shapes them into better choices over time.

By the end, you should be able to say:

  • Identify agent, state, action, and reward
  • Explain Q-values as estimates of future usefulness
  • Describe the exploration-versus-exploitation trade-off

Think about this first

If you coached a robot in a maze, is it enough to tell it only when it wins? What feedback would speed it up?

Words we will keep using

agent · state · action · reward · policy

Reinforcement Learning

Reinforcement Learning is "learning by doing." No one gives the agent an answer key. It has to try things, fail, get a reward (or a penalty), and figure out the rules on its own. It's how you learned to ride a bike.

Agent: The player. The AI making the choices.
Environment: The game. The world that reacts to the agent.
Policy (π): The strategy. The rulebook the agent writes for itself.
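The loop these three pieces form can be sketched in a few lines of Python. This is a toy one-dimensional environment invented purely for illustration (it is not the lesson's gridworld demo), and the policy here is the dumbest one possible: pure random choice.

```python
import random

# A toy agent-environment loop. The agent picks actions, the
# environment answers with a new state, a reward, and a "done" flag.

def random_policy(state, actions):
    """The simplest policy: ignore the state, pick uniformly at random."""
    return random.choice(actions)

def step(state, action):
    """Toy 1-D environment: 'right' moves toward the goal and pays +1."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1 if action == "right" else 0
    done = next_state >= 5  # reaching position 5 ends the episode
    return next_state, reward, done

state, total_reward = 0, 0
for _ in range(20):  # one episode, capped at 20 steps
    action = random_policy(state, ["left", "right"])
    state, reward, done = step(state, action)
    total_reward += reward
    if done:
        break
```

Everything that follows in this lesson is about replacing `random_policy` with something that actually learns from those rewards.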

Q-Learning

Q-Learning is basically a cheat sheet. The agent keeps a table of every possible situation and writes down a score for every possible move. Good move? Score goes up. Bad move? Score goes down.

Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]
α (learning rate): How fast the agent changes its mind.
γ (discount factor): Patience. Does it want the cookie now, or two cookies later?
ε (exploration): Curiosity. How often does it try a random move just to see what happens?
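Here is that update rule and an ε-greedy choice written out as code, a minimal sketch using the demo's default values for α, γ, and ε. The tuple states and action names are made up for illustration.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.3, 0.9, 0.2  # match the demo's defaults
ACTIONS = ["up", "down", "left", "right"]

Q = defaultdict(float)  # the "cheat sheet": Q[(state, action)] -> score

def choose_action(state):
    """ε-greedy: with probability ε try a random move, else exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Apply Q(s,a) <- Q(s,a) + alpha[r + gamma max Q(s',.) - Q(s,a)]."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One hand-checked step: from an empty table, a reward of 10 moves the
# estimate from 0 toward 10 at rate alpha: 0 + 0.3 * (10 + 0.9*0 - 0) = 3.0
update((0, 0), "right", 10, (0, 1))
```

Notice how α shows up as the step size toward the new estimate, and γ decides how much the best next move's score counts.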

Gridworld Q-Learning Demo

🤖 Agent starts at (0,0). 🏆 Goal at (4,4) → +10. 💀 Trap at (3,3) → -5. 🧱 Walls block movement. Arrows show best action per cell.

Episodes: 0 · Last reward: 0.00 · LR (α): 0.3 · Discount (γ): 0.9 · Exploration (ε): 0.2

Deep RL & Modern Applications

Deep Q-Network (DQN): Uses a neural network instead of a simple table, which makes RL work on larger problems such as video games.
Policy Gradient / PPO: These methods learn the action policy more directly and are common in robotics and complex control tasks.
RLHF: Reinforcement learning from human feedback helps align chatbots and assistants with human preferences.
Bigger picture: RL ideas show up whenever a system must make a series of choices and learn from consequences instead of answer keys.