Reinforcement Learning for LLMs

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards (RLVR) teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they could already solve.

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? Do reasoning traces show how models actually think? What does reward learning actually do to model reasoning?

The strongest empirical challenge to the "RL teaches reasoning" narrative comes from pass@k analysis. At small k (e.g., k=1), RLVR models outperform their base models — they produce correct answers more reliably on any given attempt. But as k increases, base models consistently surpass RLVR models across all benchmarks and model families. The reasoning paths that RLVR models generate are already present in the base model's sampling distribution.
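The crossover can be sketched with a toy model (all numbers hypothetical, assuming i.i.d. samples per problem): an RLVR model that reliably solves a subset of problems beats the base model at pass@1, but a base model whose support covers every problem overtakes it as k grows.

```python
def pass_at_k(p: float, k: int) -> float:
    # Probability that at least one of k i.i.d. samples is correct,
    # given per-sample success probability p.
    return 1.0 - (1.0 - p) ** k

# Hypothetical benchmark: the RLVR model solves 60% of problems
# reliably (p = 0.9) and the rest never (p = 0.0); the base model
# can solve every problem, but only rarely per attempt (p = 0.05).
def benchmark_pass_at_k(k: int) -> tuple[float, float]:
    rlvr = 0.6 * pass_at_k(0.9, k) + 0.4 * pass_at_k(0.0, k)
    base = 1.0 * pass_at_k(0.05, k)
    return rlvr, base
```

At k = 1 the RLVR model wins (0.54 vs 0.05), but its score plateaus at 0.6 — the fraction of problems inside its narrowed distribution — while the base model's pass@k approaches 1.0.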

This reframes what RLVR actually does. Rather than expanding the frontier of solvable problems, RLVR narrows the sampling distribution toward correct solutions that were already accessible. The model learns to find correct paths more efficiently, not to reason in fundamentally new ways. Manual inspection confirms: for most problems where RLVR models succeed, the base model can produce at least one correct chain-of-thought.
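The "narrowing" intuition has a clean mathematical form: the optimum of KL-regularized RL is the base distribution reweighted by exponentiated reward, so probability mass shifts toward already-sampleable correct paths while the support cannot grow. A minimal sketch over a hypothetical three-path problem:

```python
import math

def rl_reweight(p_base: dict[str, float], reward: dict[str, float],
                beta: float) -> dict[str, float]:
    """KL-regularized RL optimum: p_rl(y) ∝ p_base(y) · exp(R(y)/beta).
    A path with zero base probability stays at zero — mass only moves
    within the base model's existing support."""
    w = {y: p * math.exp(reward[y] / beta) for y, p in p_base.items()}
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}

# Hypothetical reasoning paths: two correct (reward 1), one wrong.
p_base = {"path_a": 0.05, "path_b": 0.15, "path_c": 0.80}
reward = {"path_a": 1.0, "path_b": 1.0, "path_c": 0.0}
p_rl = rl_reweight(p_base, reward, beta=0.2)
# Mass concentrates on correct paths the base model could already sample.
```

Sampling from `p_rl` is far more reliable than from `p_base`, yet every path it can produce was already in the base model's distribution — exactly the pass@k picture.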

Six popular RLVR algorithms (including GRPO and several PPO variants) perform similarly, and all remain far from optimal in leveraging the base model's potential — they converge on similar subsets of the base model's capability space. This suggests the bottleneck is not algorithmic but structural: on-policy RL with verifiable rewards optimizes sampling, not capability.
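One concrete way to see the structural bottleneck is in GRPO's group-relative advantage (a minimal sketch; normalization details vary across implementations): with binary verifiable rewards, a group where every rollout fails — i.e., a problem outside the base model's sampling reach — yields all-zero advantages and thus no gradient signal at all.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage as in GRPO: each rollout's reward is
    normalized by the mean and std of its sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

# A problem the base model never solves: every rollout gets reward 0,
# so every advantage is 0 and the policy gradient vanishes.
dead_group = grpo_advantages([0.0, 0.0, 0.0, 0.0])

# Learning only happens where the base model already succeeds sometimes.
mixed_group = grpo_advantages([1.0, 0.0])
```

On-policy RL can only reinforce what it occasionally samples — which is why it sharpens existing capability rather than creating new capability.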

The contrast with distillation is sharp. Distillation from a stronger teacher can transfer genuinely new reasoning patterns, expanding the student's reasoning scope beyond what the base model could sample. Seen alongside "Does RL teach reasoning or just when to use it?", the RLVR finding fits: activation is not creation. But distillation is creation — it writes new patterns into the model's distribution.
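The mechanism difference is visible in the loss itself. A toy per-token sketch (names and probabilities hypothetical): distillation trains against teacher-written targets, so a token the student currently assigns near-zero probability still produces a large loss — and hence a gradient pushing mass onto it — whereas on-policy RL would almost never visit that token at all.

```python
import math

def distill_step_loss(student_probs: dict[str, float],
                      teacher_token: str) -> float:
    """Per-token distillation loss: cross-entropy against the teacher's
    chosen token. The target comes from outside the student's own
    distribution, which is how new patterns get written in."""
    eps = 1e-12  # avoid log(0) for tokens outside student support
    return -math.log(student_probs.get(teacher_token, 0.0) + eps)

# The student almost never samples "b", yet the teacher's trace uses it:
# the loss is large, so training actively moves mass onto the new path.
loss = distill_step_loss({"a": 0.999, "b": 0.001}, teacher_token="b")
```

This off-policy target source is what lets distillation expand the distribution's effective reach, while RLVR only redistributes mass within it.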

The practical implication: if you need capabilities the base model doesn't have, distillation from a stronger model is the path. If the base model can already solve the problem (given enough samples), RLVR makes it reliable. These are different tools for different gaps.




rlvr does not expand reasoning capability boundaries beyond the base model — it improves sampling efficiency within existing boundaries