
Reinforcement Learning for LLMs

Research on training and improving LLM reasoning capabilities through reinforcement learning, reward modeling, and iterative self-refinement. Covers how RL-based fine-tuning reshapes model behavior, how inference-time compute can be scaled, and how to better evaluate reasoning quality.

171 notes (primary) · 562 papers · 9 sub-topics

Reinforcement Learning

16 notes

Can chain-of-thought reasoning emerge during pretraining itself?

Does treating reasoning as an exploratory action during pretraining, rather than deferring it to post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
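
A minimal sketch of what such a tightening schedule could look like, assuming a linear decay and a soft over-length penalty; both the schedule shape and the `length_shaped_reward` helper are illustrative assumptions, not a published recipe:

```python
def token_budget(step: int, total_steps: int,
                 start_budget: int = 4096, end_budget: int = 1024) -> int:
    """Linearly anneal the per-response token budget over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start_budget + frac * (end_budget - start_budget))

def length_shaped_reward(correct: bool, n_tokens: int, budget: int,
                         penalty: float = 0.5) -> float:
    """Reward correctness, but softly penalize exceeding the current budget."""
    reward = 1.0 if correct else 0.0
    if n_tokens > budget:
        reward -= penalty * (n_tokens - budget) / budget
    return reward

# Example: the budget tightens from 4096 to 1024 tokens across training.
for step in (0, 500, 1000):
    print(step, token_budget(step, total_steps=1000))
```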

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Can judges that reason about reasoning outperform step classifiers?

Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.

Can adversarial training replace task-specific verifiers for reasoning?

Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.

Can cumulative rewards teach LLMs multi-step decision making?

Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Can extended RL training discover reasoning strategies base models cannot?

Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.

Can machines learn what makes research worth doing?

Can AI systems trained on community citation patterns learn to recognize high-impact research directions the way human scientists do? The research explores whether 'scientific taste'—judgment about what to pursue—is learnable from collective community signals.

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

How does thinking emerge from policy selection in RL?

Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.

Can vanilla PPO match specialized reasoning algorithms with just two techniques?

Does a minimalist combination of advantage normalization and token-level loss aggregation enable critic-free PPO to compete with more complex algorithms like GRPO and DAPO in language model reasoning tasks?
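
A rough numpy sketch of those two ingredients, assuming one group of rollouts per prompt with scalar rewards and per-token probability ratios; the names, shapes, and toy data are illustrative, not the paper's implementation:

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray) -> np.ndarray:
    """Critic-free advantage: standardize rewards across rollouts sampled
    from the same prompt, so no value network is required."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def token_level_pg_loss(prob_ratios: list[np.ndarray],
                        advantages: np.ndarray,
                        clip_eps: float = 0.2) -> float:
    """PPO clipped objective aggregated over all tokens in the batch, rather
    than averaging per sequence first, so long and short responses contribute
    in proportion to their token counts."""
    per_token_losses, n_tokens = [], 0
    for ratio, adv in zip(prob_ratios, advantages):
        clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
        per_token_losses.append(-np.minimum(ratio * adv, clipped * adv))
        n_tokens += ratio.size
    return float(np.concatenate(per_token_losses).sum() / n_tokens)

# Toy group of 4 rollouts with binary rewards and variable lengths.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
ratios = [np.random.uniform(0.8, 1.2, size=n) for n in (30, 120, 45, 200)]
adv = group_normalized_advantages(rewards)
print(token_level_pg_loss(ratios, adv))
```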

RL with Verifiable Rewards (RLVR)

16 notes

Why does RLVR training narrow a model's problem-solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Can breaking down instructions into checklists enable better reinforcement learning?

Explores whether decomposing instruction quality into verifiable yes/no criteria allows RL systems to improve on tasks that lack clear correctness signals, like creative writing or social reasoning.
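
A minimal sketch of the idea, assuming the checklist is a list of yes/no questions and `judge` is a placeholder for whatever verifier answers each one (an LLM judge, a regex, or a script); everything here is illustrative:

```python
def checklist_reward(response: str, checklist: list[str], judge) -> float:
    """Score a response as the fraction of yes/no criteria it satisfies.
    `judge(criterion, response) -> bool` is a stand-in for a real verifier."""
    verdicts = [judge(criterion, response) for criterion in checklist]
    return sum(verdicts) / len(checklist)

def toy_judge(criterion: str, response: str) -> bool:
    # Trivial keyword check standing in for a real LLM judge or script.
    if "deadline" in criterion.lower():
        return "deadline" in response.lower()
    return True

# Illustrative checklist for a writing instruction.
checklist = [
    "Is the response written in second person?",
    "Does the response mention a deadline?",
    "Is the response under 100 words?",
]
print(checklist_reward("Please send it before the deadline.", checklist, toy_judge))
```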

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Can generative reasoning improve process reward model efficiency?

Do process reward models that generate reasoning before judging outperform traditional discriminative approaches? This explores whether letting verifiers think—not just score—changes what test-time scaling can achieve.

Do only 20 percent of tokens actually matter for reasoning?

Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training that focuses only on these critical tokens match or exceed full-gradient updates?
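
As a hedged illustration, a selective update could mask the loss to the highest-entropy positions; the `keep_frac` value and the toy distributions below are assumptions for demonstration only:

```python
import numpy as np

def entropy_mask(token_probs: np.ndarray, keep_frac: float = 0.2) -> np.ndarray:
    """Keep only the top `keep_frac` highest-entropy token positions.

    token_probs: (seq_len, vocab) next-token distributions from the policy.
    Returns a boolean mask over positions selecting the 'decision point' tokens.
    """
    entropy = -(token_probs * np.log(token_probs + 1e-12)).sum(axis=-1)
    k = max(1, int(keep_frac * len(entropy)))
    threshold = np.sort(entropy)[-k]
    return entropy >= threshold

# Toy example: the RL loss (or gradient) would be applied only where the mask is True.
probs = np.random.dirichlet(np.ones(50), size=128)   # 128 tokens, vocabulary of 50
mask = entropy_mask(probs, keep_frac=0.2)
print(mask.sum(), "of", mask.size, "tokens selected for the update")
```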

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
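
One simple instantiation of that idea is picking the best-of-N sample by the model's own average token log-probability rather than by an external verifier; the sketch below assumes per-token log-probs are available for each candidate and is purely illustrative:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Average token log-probability as a crude self-confidence score."""
    return sum(token_logprobs) / len(token_logprobs)

def select_by_confidence(candidates: list[tuple[str, list[float]]]) -> str:
    """Pick the sampled answer the model is most confident in, instead of
    asking an external verifier which one is correct."""
    return max(candidates, key=lambda c: sequence_confidence(c[1]))[0]

# Toy candidates: (answer, per-token log-probs of the answer span).
candidates = [
    ("42", [math.log(0.9), math.log(0.8)]),
    ("41", [math.log(0.4), math.log(0.5)]),
]
print(select_by_confidence(candidates))   # -> "42"
```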

Can agents learn to reason better without just chasing rewards?

Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.

Can a single training example unlock mathematical reasoning?

Does minimal data suffice to activate latent reasoning capabilities in language models? This explores whether one example can produce dramatic performance gains comparable to much larger datasets.

Can pretraining corpora themselves provide verifiable RL rewards?

Does framing next-token prediction as a reasoning task with ground-truth verification eliminate the need for human feedback or domain-specific rewards during language model pretraining?

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

What makes rubric-based reward learning resistant to exploitation?

Rubric-based RL systems face reward hacking vulnerabilities. This explores what design patterns, architectural mechanisms, and iterative defenses enable rubrics to remain robust against model exploitation across diverse tasks.

Why do random rewards improve reasoning for some models but not others?

Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters, or if pretraining strategy determines what RLVR can unlock.

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Test-Time Compute

16 notes

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
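
A toy sketch of difficulty-aware allocation, assuming agreement within a small pilot batch of samples is used as the difficulty proxy; the budget bounds and the linear mapping are illustrative assumptions:

```python
from collections import Counter

def allocate_samples(pilot_answers: list[str],
                     min_samples: int = 4, max_samples: int = 64) -> int:
    """Spend more samples on prompts where a small pilot batch disagrees.

    Full agreement in the pilot gets the minimum budget; full disagreement
    gets the maximum.
    """
    top_count = Counter(pilot_answers).most_common(1)[0][1]
    agreement = top_count / len(pilot_answers)          # 1.0 = unanimous pilot
    budget = min_samples + (1.0 - agreement) * (max_samples - min_samples)
    return int(round(budget))

print(allocate_samples(["8", "8", "8", "8"]))   # easy prompt -> small budget
print(allocate_samples(["8", "12", "7", "9"]))  # hard prompt -> large budget
```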

Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
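
A small sketch of the contrast, assuming a trace is a list of steps and each step carries its token log-probabilities; the threshold and numbers are made up for illustration:

```python
def step_confidences(trace: list[list[float]]) -> list[float]:
    """trace: one list of token log-probabilities per reasoning step.
    Returns the mean token log-prob of each step."""
    return [sum(step) / len(step) for step in trace]

def keep_trace(trace: list[list[float]], step_threshold: float) -> bool:
    """Filter on the weakest step rather than the trace-wide average, so a
    single low-confidence step is enough to discard the trace."""
    return min(step_confidences(trace)) >= step_threshold

# A trace whose overall average looks fine but whose second step is shaky.
trace = [[-0.05, -0.05], [-2.0, -2.2], [-0.05, -0.05], [-0.05, -0.05]]
per_step = step_confidences(trace)
print(sum(per_step) / len(per_step))           # ~ -0.56: passes a -1.0 cut-off on average
print(keep_trace(trace, step_threshold=-1.0))  # False: the step-level check rejects it
```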

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Do iterative refinement methods suffer from overthinking?

Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
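
For reference, the baseline being compared against is just this, a few lines of standard-library Python:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent final answer across independent samples."""
    return Counter(answers).most_common(1)[0][0]

samples = ["17", "17", "21", "17", "19", "17", "21", "17"]
print(majority_vote(samples))   # -> "17"
```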

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
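
A hedged sketch of one common monitoring and counter-measure pattern: track the mean policy entropy and add a small entropy bonus to the loss. The coefficient and toy distributions are illustrative, not a prescribed fix:

```python
import numpy as np

def policy_entropy(token_probs: np.ndarray) -> float:
    """Mean per-token entropy of the policy's next-token distributions;
    a steadily collapsing value warns that exploration is drying up."""
    ent = -(token_probs * np.log(token_probs + 1e-12)).sum(axis=-1)
    return float(ent.mean())

def loss_with_entropy_bonus(pg_loss: float, token_probs: np.ndarray,
                            beta: float = 0.01) -> float:
    """Subtract a small entropy bonus from the policy-gradient loss so that
    higher-entropy (more exploratory) policies are preferred."""
    return pg_loss - beta * policy_entropy(token_probs)

probs = np.random.dirichlet(np.ones(50), size=256)   # 256 tokens, vocabulary of 50
print(policy_entropy(probs))
print(loss_with_entropy_bonus(pg_loss=0.35, token_probs=probs))
```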

Does revising your own reasoning actually help or hurt?

Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

Can inference compute replace scaling up model size?

Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.
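
A minimal sketch of the pseudo-label construction, assuming exact string match between rollout answers and the majority answer; a real setup would normalize answers and feed these rewards into a standard RL update:

```python
from collections import Counter

def pseudo_label_rewards(answers: list[str]) -> list[float]:
    """Treat the majority-voted answer over a group of rollouts as a
    pseudo-label and reward each rollout for agreeing with it."""
    pseudo_label = Counter(answers).most_common(1)[0][0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

# No ground-truth labels are involved at any point.
rollout_answers = ["x=3", "x=3", "x=5", "x=3"]
print(pseudo_label_rewards(rollout_answers))   # -> [1.0, 1.0, 0.0, 1.0]
```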

What makes test-time training actually work in practice?

Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

How can we predict the optimal thinking token threshold?

Researchers are exploring what determines when a model should stop reasoning on a given task, since accuracy degrades beyond a critical threshold but no principled prediction method exists yet.

Self-Refinement and Self-Consistency

11 notes

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?

Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Can models reliably improve themselves without external feedback?

Explores whether self-improvement alone can sustain progress or if structural limits—like the generation-verification gap and diversity collapse—require external anchoring to work reliably.

Can AI systems improve their own learning strategies?

Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.

Reward Models

10 notes

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses in light of the prompt or on response quality alone. This matters because if models ignore prompts, they will fail to align with what users actually want.

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.

Can reasoning RL work without verifying generated answers?

Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?

Training and Fine-Tuning

9 notes

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Can models trained on many imperfect experts outperform each one?

Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
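
A toy sketch of gradient-similarity selection, assuming per-example gradient features have already been projected to a low dimension; the projection itself and all names below are illustrative assumptions:

```python
import numpy as np

def select_by_influence(train_grads: np.ndarray,
                        target_grads: np.ndarray,
                        k: int) -> np.ndarray:
    """Rank training examples by cosine similarity between their gradient
    features and the mean gradient of a small target set, then keep the top-k."""
    def normalize(x: np.ndarray) -> np.ndarray:
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    target_direction = normalize(target_grads.mean(axis=0, keepdims=True))
    scores = (normalize(train_grads) @ target_direction.T).ravel()
    return np.argsort(scores)[::-1][:k]   # indices of the most influential examples

# Toy data: 1000 candidate examples, 8 target examples, 64-dim gradient features.
rng = np.random.default_rng(0)
train_grads = rng.normal(size=(1000, 64))
target_grads = rng.normal(size=(8, 64))
print(select_by_influence(train_grads, target_grads, k=10))
```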

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Can we decouple what pretraining and fine-tuning each improve?

Does scaling at different training stages produce distinct capability improvements? This matters because it could reveal whether knowledge and behavioral alignment are truly separate properties we can optimize independently.

Does training on AI-generated content permanently degrade model quality?

When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.

Inference-Time Scaling

3 notes

Can models learn to internalize search as reasoning?

Does training on linearized search traces teach models to implement search algorithms internally, expanding what they can discover during reasoning? This matters because it could unlock entirely new problem-solving modes beyond standard chain-of-thought.

Does prompt optimization fail when it ignores the inference strategy?

Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.

Training Data

2 notes

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

How do quality, diversity, and complexity affect synthetic data differently?

When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
