TheSequence

The Sequence Opinion #730: Reinforcement Learning: a Street-Smart Guide from Go Boards to GPT Alignment

A walkthrough of the history of RL.

Oct 02, 2025

Reinforcement learning (RL) is the part of AI that learns by doing. Not from a teacher with answer keys (supervised learning), and not by free-associating the web (self-supervised pretraining), but by poking the world, seeing what happens, and tweaking itself to do better next time. Think of a curious agent in a loop:

  1. look at the world

  2. pick an action

  3. get a little pat on the back (or a slap on the wrist)

  4. update itself

  5. repeat forever

That’s RL. It’s simultaneously powerful and annoying: powerful because it can discover strategies nobody wrote down; annoying because it’s sample-hungry, finicky, and loves to “hack” whatever score you give it.
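To make the loop concrete, here is a minimal sketch in Python: a tabular Q-learning agent on a Gymnasium toy environment. The environment, hyperparameters, and the Q-learning update are illustrative assumptions, not anything specific to the systems discussed below; the point is just the shape of the loop.

```python
# A minimal sketch of the look / act / reward / update loop, assuming a
# Gymnasium-style environment and tabular Q-learning as the update rule.
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")            # small discrete toy world
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # learning rate, discount, exploration

for episode in range(5_000):
    state, _ = env.reset()                 # 1. look at the world
    done = False
    while not done:
        if np.random.rand() < epsilon:     # 2. pick an action (explore...)
            action = env.action_space.sample()
        else:                              #    ...or exploit what we already know
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated     # 3. pat on the back (or slap on the wrist)
        # 4. update: nudge Q toward reward + discounted best future value
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state                 # 5. repeat
```

Everything interesting in RL lives inside steps 2 and 4: how to pick actions when you are unsure, and how to turn delayed, noisy rewards into a better policy.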

Below is a guided tour: where RL came from, the big algorithmic building blocks, the AlphaGo → AlphaZero → MuZero leap, the migration from games to reality, and how today’s frontier models use RL after pretraining (RLHF, RLAIF) to become useful and safe(-ish).

Origins: trial-and-error with a feedback loop

Early psychology noticed a simple rule: actions that lead to good outcomes get repeated. Control theory made it operational: if you can estimate how “good” a situation is, you can plan and optimize decisions over time. Computer science then wrapped this into the agent loop above and asked: how do we learn a good way to act when the world is messy, delayed, and sometimes random?
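That control-theory idea, estimate how good a situation is and then act to improve it, is the seed of the value function. As a hedged sketch (the MDP below is made up, just small random tables), value iteration turns that estimate into a plan by repeatedly backing up the best achievable future value:

```python
import numpy as np

# Value iteration on a tiny hypothetical MDP, just to show the backup:
#   V(s) <- max_a  sum_s' P(s'|s,a) * [ R(s,a,s') + gamma * V(s') ]
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs P(s'|s,a)
R = np.random.rand(n_states, n_actions, n_states)                        # rewards R(s,a,s')
V = np.zeros(n_states)

for _ in range(100):
    # Q[s, a] = expected immediate reward + discounted value of wherever you land
    Q = np.einsum("san,san->sa", P, R + gamma * V[None, None, :])
    V = Q.max(axis=1)   # a state is worth whatever its best action is worth
```

RL's central difficulty is that, unlike in this sketch, the agent usually does not know P and R; it has to estimate values from experience while acting.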

Two concepts emerged and never went away:
