What makes a task at the edge of competence optimal for RL?
This explores why RL works best on tasks a model can *sometimes* solve — not the easy ones it always gets, nor the hard ones it never does — and what the corpus says makes that middle band the productive zone.
The corpus offers a surprisingly clean reason for this. Several lines of work converge on the idea that RL does not teach new capabilities so much as sharpen the deployment of capabilities the base model already has. Pass@k analysis shows base models actually *outperform* their RL-trained versions at high sampling budgets, meaning RL narrows the model toward solutions already in its distribution rather than expanding the set of solvable problems (Does RLVR actually expand what models can reason about?). Related work frames verifiable rewards as catalysts that surface pretraining strategies, with updates that are structurally sparse and bounded by the prior (How does RL training reshape reasoning and what gets lost?), and argues that RL teaches a model *when* to reason rather than *how* (Does RL post-training create reasoning or just deploy it?).
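The pass@k comparison described above can be made concrete with a small sketch. It assumes the standard unbiased pass@k estimator (the fraction of size-k subsets of n samples that contain at least one correct sample); the notes do not say which estimator the cited analysis used. The function names and the per-problem counts are hypothetical, invented only to show how a crossover between a base model and an RL-trained model can arise, and are not results from any cited work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: sampling budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical toy data: correct samples out of n = 100 for four problems.
# The "base" model solves every problem rarely; the "RL" model solves two
# problems almost always and the other two never.
N_SAMPLES = 100
base_counts = [5, 5, 5, 5]
rl_counts = [90, 90, 0, 0]

for k in (1, 50):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style model wins at k = 1 because it concentrates probability on problems it can already solve, while the base-style model wins at k = 50 because its rare successes cover more problems. That is the shape of the narrowing effect the paragraph describes, under the stated assumptions.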
Sources (7 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not as teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior rather than by algorithmic sophistication.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Task-oriented RL incentivizes premature exploitation of prior knowledge. Training exploration and execution as distinct objectives with separate verifiable rewards yields better downstream performance.
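The note above on binary correctness rewards and the Brier score can be illustrated with a short sketch. It assumes one plausible form of the combined reward: the binary correctness term minus the squared gap between the model's stated confidence and the outcome, with unit weight. The notes do not give the exact formula, so the function names, the weighting, and the grid search below are all hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores how confident the model was."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty (assumed unit weight).

    confidence is the model's stated probability that its answer is right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))


# A confident wrong answer costs nothing under the binary reward,
# but is penalized once the Brier term is added.
print(binary_reward(False), calibrated_reward(False, 1.0))
# The best stated confidence tracks the true success probability.
print(best_confidence(0.3))
```

Because the Brier score is a proper scoring rule, the expected penalty is smallest when the stated confidence equals the true probability of being correct, so under this assumed form the confidence term does not pull against the accuracy term.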