What does reward learning actually do to model reasoning?

How RLVR mechanics reshape model behavior, exploration dynamics, and capability boundaries across verifiable and non-verifiable domains.

Topic Hub · 25 linked notes · 9 sections

What RLVR Actually Does

4 notes

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.
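
This question is often made measurable with pass@k: if RLVR only sharpens sampling rather than expanding capability, the base model should catch up to the RLVR model at large k. Below is a minimal sketch of the standard unbiased pass@k estimator from the code-generation evaluation literature; the sample counts in the example are illustrative, not results from any note.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c were correct. Assumes k <= n."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: base model solves 3/100 samples, RLVR model 20/100.
# At k=1 the RLVR model looks far better; at large k the gap can close.
for k in (1, 8, 64):
    print(k, round(pass_at_k(100, 3, k), 3), round(pass_at_k(100, 20, k), 3))
```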


Can a single training example unlock mathematical reasoning?

Does minimal data suffice to activate latent reasoning capabilities in language models? This explores whether one example can produce dramatic performance gains comparable to much larger datasets.


Why do random rewards improve reasoning for some models but not others?

Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters, or if pretraining strategy determines what RLVR can unlock.
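
Part of the puzzle is mechanical: in group-relative schemes such as GRPO, rewards are normalized within a group of samples, so even an uninformative reward yields nonzero advantages and therefore nonzero gradients; which behaviors that signal ends up amplifying then depends on what pretraining made likely. A minimal sketch, with group size and reward values chosen for illustration:

```python
import random

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(var ** 0.5, 1e-8)  # avoid division by zero on constant rewards
    return [(r - mean) / std for r in rewards]

# Verifiable reward: 1 if the answer checks out, else 0.
verified = [1.0, 0.0, 0.0, 1.0]
# Spurious reward: a coin flip, uncorrelated with correctness.
spurious = [float(random.random() < 0.5) for _ in range(4)]

print(group_advantages(verified))  # pushes probability toward correct samples
print(group_advantages(spurious))  # still nonzero: pushes toward random samples
```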


Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.


Exploration and Entropy Dynamics

3 notes

Do only 20 percent of tokens actually matter for reasoning?

Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates?
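
The proposed mechanism is concrete enough to sketch: compute the per-token entropy of the policy's next-token distribution, keep only the highest-entropy positions, and mask everything else out of the policy-gradient loss. A rough PyTorch sketch; the 20% keep fraction, per-sequence thresholding, and plain REINFORCE-style objective are illustrative choices, not a specific paper's recipe.

```python
import torch
import torch.nn.functional as F

def entropy_masked_pg_loss(logits, actions, advantages, keep_frac=0.2):
    """Policy-gradient loss restricted to the highest-entropy tokens.

    logits:     (batch, seq, vocab) policy outputs
    actions:    (batch, seq) sampled token ids
    advantages: (batch, seq) per-token advantage estimates
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1)                # (batch, seq)

    # Keep only the top keep_frac most uncertain positions per sequence.
    k = max(1, int(keep_frac * entropy.shape[-1]))
    threshold = entropy.topk(k, dim=-1).values[..., -1:]  # per-sequence cutoff
    mask = (entropy >= threshold).float()

    taken = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # REINFORCE-style objective on the surviving "forking" tokens only.
    return -(mask * advantages * taken).sum() / mask.sum().clamp(min=1.0)
```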


Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but could hidden-state analysis reveal that they coexist? Understanding how measurement granularity shapes perceived trade-offs matters for scaling reasoning systems.


Why does RLVR training narrow a model's problem-solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.
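
One standard lever for the exploration incentives this note asks about is an entropy bonus added to the policy objective, which penalizes collapsing onto a single reasoning path. A minimal sketch; the coefficient beta is an illustrative hyperparameter that real pipelines tune or anneal.

```python
import torch
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss minus an entropy bonus.

    Larger beta keeps the sampling distribution wider, trading
    exploitation of known reasoning paths for continued exploration.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg = -(advantages * taken).mean()

    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
    return pg - beta * entropy  # maximizing entropy resists path collapse
```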


Training Efficiency

1 note

Extending RLVR Beyond Math/Code

4 notes

Can breaking down instructions into checklists enable better reinforcement learning?

Explores whether decomposing instruction quality into verifiable yes/no criteria allows RL systems to improve on tasks that lack clear correctness signals, like creative writing or social reasoning.
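
The decomposition itself is easy to sketch: some judge (typically an LLM prompted per criterion) answers each yes/no question, and the reward is the fraction satisfied. A minimal sketch assuming a hypothetical `judge(criterion, response) -> bool` helper; the criteria shown are invented examples.

```python
def checklist_reward(response: str, criteria: list[str], judge) -> float:
    """Turn a fuzzy instruction into a dense scalar reward.

    judge: callable (criterion, response) -> bool; hypothetical here,
    usually an LLM prompted to answer yes/no for each criterion.
    """
    if not criteria:
        return 0.0
    passed = sum(judge(c, response) for c in criteria)
    return passed / len(criteria)

criteria = [
    "Is the story told in second person?",
    "Does it stay under 200 words?",
    "Does it mention the lighthouse from the prompt?",
]
# reward = checklist_reward(draft, criteria, judge=llm_yes_no)
```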


What makes rubric-based reward learning resistant to exploitation?

Rubric-based RL systems face reward hacking vulnerabilities. This explores what design patterns, architectural mechanisms, and iterative defenses enable rubrics to remain robust against model exploitation across diverse tasks.


Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
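
The simplest version of this idea replaces the verifier with a statistic of the model's own output distribution. The sketch below uses the mean log-probability the policy assigned to its own tokens; this is one of several certainty proxies, and treating it directly as the reward is exactly the assumption this line of work tests.

```python
import torch
import torch.nn.functional as F

def self_certainty_reward(logits, actions):
    """Reward a sampled answer by the model's own confidence in it.

    Returns the mean log-probability the policy assigned to its own
    tokens: no external verifier, only internal certainty.
    """
    log_probs = F.log_softmax(logits, dim=-1)            # (batch, seq, vocab)
    taken = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return taken.mean(dim=-1)                            # (batch,) rewards
```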


Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.


Process Reward Models

1 note

Novel Architecture

1 note

Metacognitive Process Supervision

1 note

RLVR Side Effects

3 notes

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
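
One way to quantify the failure is distributional: sample the model many times, build an empirical answer distribution, and compare it to the human annotator vote distribution. A sketch using total variation distance; both the sampling-based estimate and the choice of distance are illustrative.

```python
from collections import Counter

def tv_distance(model_answers: list[str], human_votes: dict[str, int]) -> float:
    """Total variation distance between the model's empirical answer
    distribution and the human annotator vote distribution."""
    model = Counter(model_answers)
    n_model = sum(model.values())
    n_human = sum(human_votes.values())
    labels = set(model) | set(human_votes)
    return 0.5 * sum(
        abs(model[l] / n_model - human_votes.get(l, 0) / n_human)
        for l in labels
    )

# Humans split 60/40; a model collapsed onto one answer scores poorly.
print(tv_distance(["yes"] * 100, {"yes": 6, "no": 4}))  # 0.4
```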


Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely retrieval of memorized data.


Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR shows both real behavioral changes and inflated metrics. Can these contradictory findings actually describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?
