The Sequence Opinion #815: The End of RLHF? The Rise of Verifiable Rewards
The rise of RLVR. Inside one of the most influential techniques in modern frontier models.
Welcome to this edition of TheSequence! Today, we are exploring a massive architectural shift in how artificial intelligence models are trained, moving away from human bottlenecks and toward autonomous, verifiable reasoning.
Here is a full breakdown of the transition to what many are calling one of the most influential post-training techniques in frontier AI models.
The Ceiling of RLHF
A few years ago, we witnessed the transition to AI-powered software. Instead of humans explicitly writing instructions, we started defining goals and letting backpropagation find the optimal program within the continuous weight space of a neural network. It was a radical shift that completely transformed the software landscape.
However, the standard training pipeline for Large Language Models (LLMs) has recently hit a conceptual ceiling. The prevailing method—pretraining on massive internet corpora followed by Reinforcement Learning from Human Feedback (RLHF)—produces fast, intuitive pattern matchers. These models act as “System 1” thinkers that associate inputs with statistically likely outputs. Unfortunately, they lack the capacity to pause, deliberate, and engage in “System 2” deep reasoning.
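To anchor what "RLHF" means mechanically, here is a minimal sketch of its preference-modeling step: human raters pick a chosen response over a rejected one, and a small reward model is fit with the standard Bradley-Terry pairwise loss. The TinyRewardModel class and the random stand-in embeddings are illustrative assumptions, not any lab's actual pipeline.

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Stand-in embeddings for (chosen, rejected) response pairs
# that human raters have already labeled.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for _ in range(100):
    # Bradley-Terry objective: maximize P(chosen beats rejected)
    # = sigmoid(r_chosen - r_rejected); the loss is its negative log.
    loss = -torch.nn.functional.logsigmoid(
        model(chosen) - model(rejected)
    ).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned scalar score then stands in for human judgment during reinforcement learning, which is exactly where the pipeline inherits every bias and blind spot of the raters who produced the pairs.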
The core issue is that RLHF is bottlenecked by humans. Human preference is a remarkably noisy proxy for absolute truth; raters are expensive, slow, and frequently disagree. If a model generates a brilliant but unorthodox mathematical proof, human labelers might reject it simply because they do not understand it. Ultimately, RLHF improves only as fast as human raters can work, which means it fundamentally does not scale.
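To make the contrast with verifiable rewards concrete, here is a minimal sketch of the alternative: a program, not a rater, scores the output. The names extract_final_answer and verifiable_reward and the last-number extraction heuristic are assumptions for illustration, not an established API.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number out of a model completion, e.g. 'so x = 42'."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the ground truth.

    No human in the loop, so scoring runs at machine speed, and a
    correct-but-unorthodox derivation is never penalized for being
    hard to follow.
    """
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

print(verifiable_reward("The proof is unusual, but the result is 42.", "42"))  # 1.0
print(verifiable_reward("I believe the answer is 41.", "42"))                  # 0.0
```

A checker like this only works in domains with an objective ground truth, but within those domains it removes the human bottleneck entirely: the reward signal scales with compute, not with rater headcount.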

