Training, RL, and Test-Time Scaling

Research on how language models are trained and improved — reinforcement learning and RLVR, reward models, fine-tuning, self-refinement and feedback, and inference- and test-time scaling.

208 notes (primary) · 636 papers · 9 sub-topics

View as

Reinforcement Learning

30 notes

Can LLMs design reward functions for reinforcement learning?

Can language models help automate the notoriously difficult task of designing reward shaping functions for sparse-reward RL, and if so, how might we structure that collaboration to work around LLMs' weaknesses in stochastic control?

Can an agent's own beliefs guide credit assignment without critics?

Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.

Can chain-of-thought reasoning be learned during pretraining itself?

Explores whether reasoning emerges more effectively when models treat thinking as an exploratory action during next-token prediction, rather than only after pretraining through reinforcement learning.

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Can judges that reason about reasoning outperform classifier rewards?

Can process reward models generate explanations about why steps are correct rather than simply classifying them? This explores whether meta-reasoning about reasoning improves both accuracy and generalization in step-level evaluation.

Can adversarial critics replace task-specific verifiers for reasoning?

Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.

Can models express calibrated confidence in long-form text?

Can language models be trained to emit extended passages with confidence statements that actually help readers make accurate probabilistic predictions? This matters because confident hallucinations mislead users into bad decisions.

How should multiple reward objectives be weighted during training?

When training on multiple objectives at once, how can we automatically balance their contributions without manual tuning? This explores whether reward variance within rollouts reveals which objectives carry real learning signal.

Can full episode rewards per step enable better credit assignment?

Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Can online LLM feedback improve direct preference optimization during training?

Direct alignment methods like DPO use fixed preference data from older models, creating off-policy training. Could sampling fresh responses from the current model and using an LLM judge to pick preferences in real time reduce overfitting and improve alignment?

Can general process reward models catch factual errors in finance?

General process reward models assess logical coherence but may miss factual hallucinations in high-stakes domains like finance. Does domain specialization with knowledge grounding improve accuracy where logical flow alone fails?

Can reinforcement learning discover reasoning strategies base models cannot?

Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.

Can reinforcement learning improve models during general pretraining?

Can RL work during standard pretraining on unverified text like Wikipedia, without reward models or labeled data? This matters because it would remove the data bottleneck that currently limits RL-based training to small verified domains.

Can models learn what makes research worth doing?

Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste'—the judgment of what work matters—is a learnable skill separate from execution.

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Can environment feedback replace scalar rewards in policy learning?

Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Can models learn to judge themselves without external rewards?

Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Can a single reward model represent diverse human preferences?

Standard RLHF assumes one shared preference signal. But what happens when human values genuinely conflict? This question explores whether aggregating preferences into one model fundamentally fails at fairness.

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Can two simple techniques match complex RL algorithms?

Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.

Can reward vectors be the hidden source of solution diversity?

Standard RL collapses multi-dimensional rewards into scalars before training, losing the natural structure that could drive diverse specialization. What if that vector structure itself is the diversity axis?

Can language models replace reward models with internal signals?

Recent RL research shows three independent patterns—self-judgment, belief-shift, and rich feedback—that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?

Should training maximize diversity when models feed into search?

If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?

RL with Verifiable Rewards (RLVR)

21 notes

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Can breaking down instructions into checklists improve AI reward signals?

Exploring whether decomposing subjective instruction quality into verifiable yes/no criteria enables reinforcement learning on tasks without clear correctness signals, like writing and reasoning.

What reasoning features does each difficulty level reinforce?

When models train on problems of different difficulty, do they build the same internal reasoning machinery or different kinds? This matters because accuracy gains alone hide what's actually being learned.

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Can generative reasoning beat discriminative models with less training data?

Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.

Do high-entropy tokens drive reasoning model improvements?

Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

Can RL agents learn to reason better, not just succeed?

Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?

Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Do overly hard RLVR samples actually harm model capabilities?

Explores whether training on problems beyond a model's competence band causes active regression rather than mere learning failures. Investigates whether group-relative normalization amplifies accidental successes into harmful shortcuts.

Can next-token prediction become a reasoning task with RL?

Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

How can rubric-based rewards resist reward hacking attacks?

Single rubrics are easily exploited by models, and simply adding more rubrics yields diminishing returns. What design patterns and defensive mechanisms actually prevent reward hacking in rubric-based RL systems?

Why do medium-difficulty problems teach reasoning better than hard ones?

Does harder always mean better for learning? This explores why easy and extremely hard samples produce weak training signals in RLVR, while medium-difficulty problems drive the strongest improvements.

How does model ability change what samples teach?

Does a sample's learning value stay fixed, or does it shift as the model improves? Understanding whether informativeness is a moving target could explain why fixed difficulty filters underperform adaptive ones during training.

Why do random rewards improve reasoning for some models but not others?

When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?

Why do RL agents exploit before exploring enough?

Standard task-oriented RL rewards immediate task completion over environment discovery. This may systematically under-train the exploration skills needed for unfamiliar environments.

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

Training and Fine-Tuning

16 notes

Does sequencing imitation then exploration training improve reasoning?

Can combining Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Does fine-tuning on new facts increase hallucination risk?

When LLMs learn unfamiliar facts through fine-tuning, do they become more prone to hallucinating about things they already knew? Understanding this matters for safe knowledge updates.

Does repeated sensitive data in fine-tuning cause memorization?

When language models train on the same private or proprietary data multiple times, how much do they end up memorizing and leaking that information at inference time? Understanding this risk is critical for organizations fine-tuning on confidential datasets.

How should finetuning scale with model and data size?

What scaling laws govern finetuning performance across model size, pretraining data, and finetuning data? Understanding these relationships could guide resource allocation in real-world tuning scenarios.

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Why does teacher-student information asymmetry enable learning signals?

What role does privileged answer access play in making social meta-learning training work? Without asymmetric information, can a conversation between teacher and student function as pedagogy or only as parallel speculation?

Does staying close to the base model preserve learning ability?

Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.

Can post-training objectives preserve reasoning style alongside correctness?

Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Can splitting adaptation into two channels reduce forgetting?

When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Test-Time Compute

16 notes

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Can verifiers monitor reasoning without slowing generation down?

Explores whether asynchronous verification can catch reasoning errors while keeping token costs near parity with unmonitored reasoning. Matters because current approaches trade between catching early errors and computational overhead.

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

How do internal and external test-time scaling compare?

Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?

Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

Can inference compute replace scaling up model size?

Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.

What makes test-time training actually work in practice?

Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

How can we predict the optimal thinking token threshold?

Researchers are exploring what determines when a model should stop reasoning on a given task, since accuracy degrades beyond a critical threshold but no principled prediction method exists yet.

Reward Models

8 notes

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.

Can reasoning improvement work without answer verification?

Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.

Inference-Time Scaling

4 notes

Can architecture choices improve inference efficiency without sacrificing accuracy?

Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.