The Sequence Research #558: The New Reinforcement Learning from Internal Feedback Method Allows LLMs to Reason Without External Rewards
The new method from UC Berkeley provides an interesting complement to traditional RLHF methods.
Reinforcement learning has established itself as a key technique for enhancing the capabilities of large language models (LLMs), particularly in complex reasoning tasks. Approaches such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) have delivered impressive results, aligning models with human preferences and improving factual correctness through verifiable reward structures. Yet both come with inherent limitations: RLHF demands labor-intensive and costly human annotation, while RLVR is confined to domains where answers can be objectively checked against test suites or gold-standard outputs.
In response to these limitations, the paper "Learning to Reason without External Rewards" proposes a radically different paradigm: Reinforcement Learning from Internal Feedback (RLIF). This approach enables LLMs to learn from their own internal signals, without relying on any form of external supervision. The central idea is instantiated through a new algorithm called INTUITOR, which uses a model's internal measure of confidence, termed self-certainty, as the only reward signal for policy optimization.
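To make the reward concrete, the sketch below shows one way a self-certainty score could be computed with PyTorch and Hugging Face Transformers, assuming the definition reported for INTUITOR: the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, taken over the generated tokens. The model name, the `self_certainty` helper, and the masking details are illustrative assumptions for this sketch, not the authors' implementation.

```python
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer


def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average per-token KL(U || p) over the generated (response) tokens.

    logits:        (seq_len, vocab_size) next-token logits for one sequence
    response_mask: (seq_len,) 1.0 at positions whose prediction is a response token
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # KL(U || p) = -log|V| - (1/|V|) * sum_j log p(j); larger values mean a more
    # confident (peaked) next-token distribution.
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return (kl_per_token * response_mask).sum() / response_mask.sum()


# Illustrative usage; the model choice is arbitrary for this sketch.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

prompt = "What is 17 * 24? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=True)
    logits = model(generated).logits[0, :-1, :]  # logits at position t predict token t + 1

prompt_len = inputs["input_ids"].size(1)
mask = torch.zeros(generated.size(1) - 1)
mask[prompt_len - 1:] = 1.0  # score only the generated continuation, not the prompt

reward = self_certainty(logits, mask)
print(f"self-certainty reward: {reward.item():.4f}")
```

In INTUITOR, a scalar of this kind stands in for the verifier or preference reward attached to each sampled response, so policy optimization (the paper builds on GRPO) is driven by the model's own confidence alone.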