🎮 Reinforcement Learning
Learning by trying, failing, and improving
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Training by trial and error. The agent acts, earns rewards ("cookies") or penalties ("zaps"), and gradually learns to maximize total future reward. It's about long-term strategy, not just instant gratification.
How this lesson fits
Here we peek over the horizon: agents that learn by doing, and systems that train together without spilling secrets.
The big question
How can AI learn from its own experience and still respect privacy and real-world limits?
Why You Should Care
This is a different learning setup: no labeled answers, just feedback from the environment. It powers games, robotics, and complex decision-making.
Where this is used today
- ✓ AlphaGo beating human champions
- ✓ Robots learning to walk
- ✓ Optimizing datacenter cooling
Think of it like this
Like training a puppy or mastering a video game—feedback guides better choices over time.
Easy mistake to make
RL isn’t just random flailing. Good RL uses structured feedback to improve choices over time.
By the end, you should be able to:
- Identify agent, state, action, and reward
- Explain Q-values as estimates of future usefulness
- Describe the exploration-versus-exploitation trade-off
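The exploration-versus-exploitation trade-off in the last point is often handled with an ε-greedy rule: mostly pick the best-known move, but occasionally try a random one. A minimal sketch (the function name and defaults are illustrative, not from the lesson):

```python
import random

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        # Explore: pick any action at random
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest current score
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the agent always exploits; with `epsilon=1` it always explores. Most training starts with high exploration and lowers it over time.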
Think about this first
If you coached a robot in a maze, is it enough to tell it only when it wins? What feedback would speed it up?
Words we will keep using
Reinforcement Learning
Reinforcement Learning is "learning by doing." No one gives the agent an answer key. It has to try things, fail, get a reward (or a penalty), and figure out the rules on its own. It's how you learned to ride a bike.
Q-Learning
Q-Learning is basically a cheat sheet. The agent keeps a table with a score for every situation-move pair. Good move? Score goes up. Bad move? Score goes down. Crucially, each score also borrows from the best score in the next situation, so a move that leads to cookies later still gets credit now.
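That "score goes up, score goes down" bookkeeping is the Q-learning update rule. A minimal sketch using a dictionary as the cheat sheet; the state and action names, learning rate, and discount are placeholder choices:

```python
from collections import defaultdict

# The cheat sheet: maps (state, action) -> estimated future reward
Q = defaultdict(float)

def update(state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Nudge the score for (state, action) toward
    reward + discounted best score from the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```

For example, `update('s0', 'up', 10, 's1', ['up', 'down'])` moves the score for `('s0', 'up')` halfway toward 10, since all future scores start at zero.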
Gridworld Q-Learning Demo
🤖 Agent starts at (0,0). 🏆 Goal at (4,4) → +10. 💀 Trap at (3,3) → -5. 🧱 Walls block movement. Arrows show best action per cell.
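The demo above can be reproduced in a few dozen lines. A sketch under some assumptions: the wall positions are made up (the demo doesn't list them), and the step cost, learning rate, discount, and exploration rate are illustrative choices:

```python
import random

random.seed(0)
SIZE = 5
GOAL, TRAP = (4, 4), (3, 3)            # +10 at goal, -5 at trap
WALLS = {(1, 1), (2, 3)}               # hypothetical wall cells, for illustration
ACTIONS = {'↑': (-1, 0), '↓': (1, 0), '←': (0, -1), '→': (0, 1)}
Q = {(r, c): {a: 0.0 for a in ACTIONS}
     for r in range(SIZE) for c in range(SIZE)}

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE) or nxt in WALLS:
        nxt = state                    # walls and grid edges block movement
    if nxt == GOAL:
        return nxt, 10, True
    if nxt == TRAP:
        return nxt, -5, True
    return nxt, -0.1, False            # small step cost encourages short paths

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        if random.random() < 0.1:      # explore
            action = random.choice(list(ACTIONS))
        else:                          # exploit best-known move
            action = max(Q[state], key=Q[state].get)
        nxt, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[nxt].values())
        Q[state][action] += 0.5 * (reward + 0.9 * best_next - Q[state][action])
        state = nxt

# Show the best action (arrow) per cell, like the demo's overlay
for r in range(SIZE):
    print(' '.join('🧱' if (r, c) in WALLS
                   else max(Q[(r, c)], key=Q[(r, c)].get)
                   for c in range(SIZE)))
```

After training, the arrows from (0,0) point toward the goal while routing around the trap, mirroring what the interactive demo displays.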