Reinforcement Learning for LLMs

Why does RLVR work with completely random rewards?

RLVR (reinforcement learning with verifiable rewards) improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? How does reinforcement learning reshape what models can reason about?

Post angle for Medium. The counterintuitive finding: RLVR — the training method behind o1, R1, and the reasoning revolution — works nearly as well with random, incorrect, or format-only rewards as it does with correct rewards. For specific model families, the reward signal barely matters.

The hook: Random rewards yield a 21.4% improvement. Incorrect labels yield 24.6%. One training example yields a 37-point jump. These numbers shouldn't be possible if RLVR were teaching reasoning. They make perfect sense if RLVR is activating reasoning that already exists.

Three converging lines of evidence:

  1. Spurious rewards — Qwen2.5-Math improves nearly as much with random, wrong, or format-only rewards as with ground truth. But Llama3 and OLMo2 fail with the same spurious rewards. The variable is pretraining, not the reward. Qwen's "code reasoning" strategy surfaces under any optimization pressure (see the reward-function sketch after this list).

  2. 1-shot RLVR — One training example raises MATH500 from 36% to 73.6%. Post-saturation generalization: the model perfectly memorizes its single example but keeps improving on test problems for 1,400 more steps. The data is exhausted but the learning continues.

  3. Capability boundary collapse — pass@k analysis shows that the set of problems an RLVR model can solve is a subset of what its base model can solve. At high k, base models outperform RLVR models. RLVR narrows the distribution toward reliability, not toward new capability. Six RLVR algorithms all converge on the same subset (see the pass@k sketch below).
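What "spurious" means concretely: a minimal sketch of the kinds of reward functions the comparison contrasts, from fully informative down to pure noise. The function names and the \boxed{} answer extraction are my own illustrative assumptions, not the paper's implementation; the point is only that these signals differ wildly in information content, yet (for Qwen2.5-Math) the downstream gains are similar.

```python
import random
import re

def extract_boxed(response: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in a response, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1] if matches else None

def ground_truth_reward(response: str, answer: str) -> float:
    """Standard RLVR reward: 1 if the final boxed answer matches the label."""
    return float(extract_boxed(response) == answer)

def format_reward(response: str, answer: str) -> float:
    """Spurious variant: reward any response that produces a boxed answer,
    regardless of whether it is correct."""
    return float(extract_boxed(response) is not None)

def random_reward(response: str, answer: str) -> float:
    """Spurious variant: a coin flip, carrying no information about the task.
    (An "incorrect label" variant is just ground_truth_reward computed
    against deliberately corrupted labels.)"""
    return float(random.random() < 0.5)
```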
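And a small illustration of the pass@k argument. pass_at_k below is the standard unbiased estimator (1 - C(n-c, k)/C(n, k)); the per-problem correct counts are hypothetical, chosen only to show how a model can win at k=1 yet lose at large k once its solvable-problem set collapses to a subset of the base model's.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (c of them correct) solves the problem."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical per-problem correct counts out of n=256 samples. The RLVR
# model concentrates mass on a problem the base model already solves and
# drops to zero on a problem the base model solves only rarely.
n = 256
base_counts = [20, 4]   # base model: weak but nonzero on both problems
rlvr_counts = [200, 0]  # RLVR model: strong on one, dead on the other

for k in (1, 128):
    base = np.mean([pass_at_k(n, c, k) for c in base_counts])
    rlvr = np.mean([pass_at_k(n, c, k) for c in rlvr_counts])
    print(f"k={k:3d}  base pass@k={base:.2f}  RLVR pass@k={rlvr:.2f}")
# RLVR wins at k=1 (reliability); the base model wins at k=128 (coverage).
```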

The synthesis: RLVR is not a teacher. It is a catalyst that triggers a phase transition in the model's output distribution — from exploring all of its pretraining space to efficiently sampling from the subset that produces correct answers. The specific catalyst (reward type, training data volume) matters far less than the quality of the pretraining space being catalyzed.

The practical stakes: If RLVR effectiveness is determined by pretraining, then the massive investments in reward engineering, verification infrastructure, and training data curation for RLVR may be misallocated. The real investment should be in pretraining data composition — the foundation that RLVR can only activate, not create.

Connects to: "RL teaches when not how" thesis, the knowing-doing gap, the self-improvement mirage


Source: RLVR

Original note title: "The reward that doesn't matter: why RLVR works even when the reward signal is wrong"