Why Less Human = More AI Progress
Reinforcement Learning with verifiable rewards (RLVR) vs. Supervised Fine-Tuning (SFT)
The Problem: Why Traditional AI Training Hits a Wall
Imagine teaching a child math by giving them answers to memorize instead of letting them solve problems. That’s essentially how most AI models are trained today, using supervised fine-tuning (SFT). They’re fed labeled data (questions paired with “correct” answers) and learn to mimic those answers. But this approach has flaws:
- It’s expensive: Curating massive labeled datasets is slow and costly
- It’s rigid: Models overfit to specific examples and fail at new challenges
- It’s biased: Human-labeled data inherits human limitations and blind spots
What if AI could learn like humans do, by trying, failing, and refining strategies autonomously? Enter Reinforcement Learning with verifiable rewards (RLVR), which offers a different approach: the AI interacts with an environment, tries out various actions, and receives rewards or penalties based on how well it does.
However, the question then becomes: how do we ensure the reward signals are correct and trustworthy? That’s where verifiable rewards come in. With verifiable rewards, the system can confirm the correctness of the reward itself, removing the guesswork and the potential for bias.
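To make that concrete, here is a minimal sketch of a verifiable reward for a math question, where the reward is granted only if the model’s final answer matches a known ground truth (the function name and the plain string comparison are illustrative assumptions, not any particular system’s API):

# Minimal sketch: a verifiable reward for a math question.
# The signal is trustworthy because it comes from checking the model's final
# answer against a known ground truth, not from human judgment.
def math_reward(model_answer: str, ground_truth: str) -> float:
    # Real verifiers often parse numbers or symbolic expressions;
    # a plain string comparison keeps the sketch simple.
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

print(math_reward("42", "42"))  # 1.0 -> correct
print(math_reward("41", "42"))  # 0.0 -> incorrect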
RLVR: The “Self-Taught” AI
RLVR flips the script. Instead of memorizing answers, AI agents learn by interacting with an environment and receiving binary feedback (1 for “correct,” 0 for “incorrect”) based on objective rules.
Think of it as a teacher who only says “yes” or “no” but never gives the answer.
How It Works: Simple Rules, Big Wins
1. Define Verifiable Criteria:
- For math problems: Does the final answer match the ground truth?
- For code generation: Does the code pass all test cases?
- For logical reasoning: Is the argument internally consistent?
2. Let the AI Explore:
The model generates multiple solutions (e.g., code snippets or math proofs) and receives instant feedback. Over time, it learns to prioritize strategies that maximize rewards.
# Reward function for code generation: the reward is verifiable because it
# comes from running the candidate code against known test cases.
# (Assumes the generated snippet defines a function named `solution`.)
def code_reward(code, test_cases):
    try:
        namespace = {}
        exec(code, namespace)                  # run the generated snippet
        solution = namespace["solution"]
        if all(solution(*args) == expected for args, expected in test_cases):
            return 1.0   # Perfect: passes every test case
        return 0.0       # Runs, but fails at least one test
    except Exception:
        return -0.2      # Invalid syntax or runtime error
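Under the assumption above that the generated snippet defines a function named solution, scoring a candidate looks like this (the snippet and test cases are purely illustrative):

candidate = "def solution(x):\n    return x * 2"
tests = [((2,), 4), ((5,), 10)]        # (arguments, expected result) pairs
print(code_reward(candidate, tests))   # 1.0 -> passes every test case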
Test-Time Compute: Why More “Thinking Time” = Smarter AI
Test-time compute refers to the computational resources allocated during inference (when the AI generates outputs). RLVR leverages this by:
- Generating multiple candidate solutions (e.g., 8–64 variations of a math proof)
- Selecting the best one based on verifiable rewards, as sketched in the snippet below.
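As a rough illustration of that best-of-N idea, here is a minimal sketch that samples several candidates and keeps the one a verifiable reward scores highest (the toy generator and verifier are made up for this example):

import random

def best_of_n(generate, reward_fn, n=8):
    # Spend extra test-time compute: sample n candidates, verify each one,
    # and return the candidate the verifier scores highest.
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy demo: candidates are random guesses at an answer, and the verifier
# rewards only the correct one (purely illustrative).
guess = lambda: str(random.randint(40, 44))
verify = lambda answer: 1.0 if answer == "42" else 0.0
print(best_of_n(guess, verify, n=16))

The more candidates you sample, the better the odds that at least one of them passes the verifier, which is exactly why extra “thinking time” pays off.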
DeepSeek R1’s Breakthrough:
By combining RLVR with test-time compute, DeepSeek R1 achieved state-of-the-art performance on math and coding tasks at a fraction of the cost. For example, a smaller 2B-parameter model fine-tuned with RLVR outperformed a 72B model trained traditionally, all in under 30 minutes and for $3.
AlphaGo’s Secret Sauce: Self-Play & Scalable Learning
AlphaGo didn’t become a Go champion by studying human games alone. It mastered the game through self-play:
- Play Against Itself: Generate thousands of game variations.
- Learn from Mistakes: Update strategies based on wins/losses.
- Repeat: No human intervention needed.
This mirrors RLVR’s philosophy: let the AI explore, fail, and refine autonomously. The result? Strategies humans never imagined, like AlphaGo’s famous “Move 37.”
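As a toy illustration of that loop (deliberately tiny, and not AlphaGo’s actual algorithm), here is a sketch in which one policy plays a simple Nim-like game against itself and nudges its move preferences toward whatever ended up on the winning side:

import random

# Toy self-play sketch (illustrative only, not AlphaGo's method).
# Game: start with 10 stones; players alternate taking 1 or 2 stones;
# whoever takes the last stone wins.
policy = {}  # stones_left -> preference scores for taking 1 or 2 stones

def choose(stones, explore=0.2):
    prefs = policy.setdefault(stones, {1: 0.0, 2: 0.0})
    legal = [m for m in (1, 2) if m <= stones]
    if random.random() < explore:
        return random.choice(legal)              # explore a new variation...
    return max(legal, key=lambda m: prefs[m])    # ...or play the current best move

def self_play_game():
    stones, history, player, winner = 10, {0: [], 1: []}, 0, None
    while stones > 0:
        move = choose(stones)
        history[player].append((stones, move))
        stones -= move
        if stones == 0:
            winner = player                      # this player took the last stone
        player = 1 - player
    return winner, history

for _ in range(20000):                           # repeat: no human data needed
    winner, history = self_play_game()
    for player, moves in history.items():
        outcome = 1.0 if player == winner else -1.0
        for stones, move in moves:               # learn from wins and losses
            policy[stones][move] += 0.01 * outcome

print({s: max(p, key=p.get) for s, p in sorted(policy.items())})  # learned move per pile size

Nothing in this loop relies on human games or labels; the win/loss outcome plays the same role as a verifiable reward.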
The Future: Autonomous, Generalizable Intelligence
RLVR isn’t just about efficiency; it’s a paradigm shift. It might seem risky to remove human guidance altogether, but in practice, given more autonomy, AI systems:
- Learn Faster: No waiting for labeled data.
- Think Differently: Discover solutions outside human intuition (e.g., AlphaGo’s unconventional moves).
- Adapt Broadly: Excel at tasks where “correctness” is definable but answers aren’t obvious (e.g., theorem proving, ethical reasoning).
Unlocking a New Level of Intelligence
By combining RL with verifiable rewards and robust test-time compute, we’re seeing AI:
- Adapt to complex tasks (like strategy games or robotics) at a pace that’s hard to match with supervised methods.
- Achieve incredible accuracy and creativity thanks to tireless self-play and iterative improvement.
- Become more reliable: since the reward signals it relies on are validated, the AI avoids “learning” from noisy or mislabeled data.
This approach scales easily. As we pour more compute and better reward verification strategies into the mix, the AI keeps improving. This paves the way for visionary applications in areas like healthcare, robotics, and energy management, where reliable, unbiased decision-making is crucial.
In the End
Reinforcement Learning with verifiable rewards stands out as a practical, powerful method to scale AI’s intelligence. Projects like DeepSeek R1 showed us we can achieve tremendous gains at relatively low cost by ensuring the AI is free to explore and learn effectively. AlphaGo demonstrated how AI can surpass human expertise through self-play, given enough compute and a clear sense of what winning looks like.
As we push the boundaries of AI, investing in test-time compute will only become more important, amplifying the impact of RL training. And while Supervised Fine-Tuning still has its place, especially for smaller tasks, there’s little doubt that the future of truly innovative, scalable AI lies in methods that minimize human bias, verify the correctness of rewards, and set the AI free to keep learning.