What actually changes inside a model during RL training?

How RL training mechanically reshapes model parameters, training dynamics, and reasoning strategies through sparse parameter updates and the suppression of existing behaviors.

Topic Hub · 40 linked notes · 12 sections

What RL Modifies

5 notes

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
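
A minimal sketch of how such a sparsity measurement could be run, assuming the base and post-RL checkpoints are PyTorch state dicts with matching keys; the function name update_sparsity and the tolerance handling are illustrative, not taken from the linked note.

    import torch

    def update_sparsity(base_path: str, rl_path: str, tol: float = 0.0) -> float:
        """Fraction of parameters the RL run actually changed, checkpoint vs. checkpoint."""
        base = torch.load(base_path, map_location="cpu")
        tuned = torch.load(rl_path, map_location="cpu")
        changed, total = 0, 0
        for name, w0 in base.items():
            w1 = tuned[name]
            # A small tol > 0 ignores numerically tiny drift that is not a real update.
            changed += ((w1 - w0).abs() > tol).sum().item()
            total += w0.numel()
        return changed / total  # values around 0.05-0.30 would match the 5-30% range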

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.
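
A minimal sketch of what a penalty-only signal could look like in an advantage-based update; this construction (zeroing out positive advantages so that only wrong answers produce gradient) is illustrative rather than the linked note's exact method.

    import torch

    def negative_only_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # Standardize rewards within the sampled group, then keep only penalties:
        # responses scoring above the group mean contribute no gradient at all.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        return torch.clamp(adv, max=0.0)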

Can extended RL training discover reasoning strategies base models cannot?

Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.

Does RL training create new reasoning skills or activate existing ones?

Understanding whether reinforcement learning actually builds novel capabilities or simply teaches models when to use reasoning they already possess. This matters for predicting RL's value across different task types.

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

RL Formalization and Architecture

5 notes

How does thinking emerge from policy selection in RL?

Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.

Can vanilla PPO match specialized reasoning algorithms with just two techniques?

Does a minimalist combination of advantage normalization and token-level loss aggregation enable critic-free PPO to compete with more complex algorithms like GRPO and DAPO in language model reasoning tasks?
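
A minimal sketch of the two techniques named above, assuming a group of G sampled responses per prompt with one scalar reward each and per-token probability ratios; the tensor shapes and function names are illustrative, not the paper's implementation.

    import torch

    def normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
        # Critic-free advantage: standardize the group's rewards (zero mean, unit std).
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    def token_level_ppo_loss(ratios: torch.Tensor,      # (G, T) new/old policy ratios
                             advantages: torch.Tensor,  # (G,) one advantage per response
                             mask: torch.Tensor,        # (G, T) 1 on response tokens
                             clip_eps: float = 0.2) -> torch.Tensor:
        adv = advantages.unsqueeze(-1)                   # broadcast one advantage over tokens
        unclipped = ratios * adv
        clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * adv
        per_token = -torch.minimum(unclipped, clipped) * mask
        # Token-level aggregation: average over every valid token in the batch,
        # rather than averaging each sequence first and then across sequences.
        return per_token.sum() / mask.sum()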

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Training Dynamics

4 notes

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Does gradually tightening token budgets beat fixed-budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
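
A minimal sketch of one way such a schedule could look; the linear shape and the specific budgets are hypothetical, not taken from the linked note.

    def token_budget(step: int, total_steps: int,
                     start_budget: int = 4096, final_budget: int = 1024) -> int:
        """Maximum generation length at a given step under a linearly tightening schedule."""
        frac = min(step / max(total_steps, 1), 1.0)
        return int(start_budget + frac * (final_budget - start_budget))

    # token_budget(0, 10_000) -> 4096; token_budget(10_000, 10_000) -> 1024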

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Can chain-of-thought reasoning emerge during pretraining itself?

Does treating reasoning as an exploratory action within the pretraining phase, rather than post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.

Novel Reward Paradigms

1 note

Process Rewards and Judges

2 notes

Can judges that reason about reasoning outperform step classifiers?

Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.

Can adversarial training replace task-specific verifiers for reasoning?

Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.

Multi-Turn and Sequential RL

2 notes

Can cumulative rewards teach LLMs multi-step decision making?

Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.
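
A minimal sketch of the credit-assignment scheme described above, assuming an episode is a list of (state, action) steps with per-turn rewards; the helper name is illustrative.

    def broadcast_episode_return(steps, turn_rewards):
        """Credit the full episode return to every step, not just the final turn."""
        episode_return = sum(turn_rewards)
        # Every (state, action) pair now carries the same cumulative learning signal.
        return [(state, action, episode_return) for state, action in steps]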

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Scaling and Methodology

2 notes

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2–5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Does RL training follow predictable scaling curves?

Can we forecast where RL training will plateau before committing full compute? ScaleRL tests whether sigmoid curves reliably predict performance ceilings across 200+ models.
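
A minimal sketch of the kind of extrapolation this implies, assuming (compute, performance) pairs from the early part of a run; the three-parameter sigmoid and the function names are illustrative, not ScaleRL's actual fit.

    import numpy as np
    from scipy.optimize import curve_fit

    def sigmoid(x, ceiling, midpoint, scale):
        return ceiling / (1.0 + np.exp(-(x - midpoint) / scale))

    def predicted_ceiling(compute: np.ndarray, performance: np.ndarray) -> float:
        # Fit the curve on early-training points and read off its asymptote,
        # i.e. the performance level the run is forecast to plateau at.
        p0 = [performance.max(), np.median(compute), compute.std() + 1e-8]
        (ceiling, _, _), _ = curve_fit(sigmoid, compute, performance, p0=p0, maxfev=10_000)
        return float(ceiling)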

Alignment and Personalization

2 notes

Can text summaries condition reward models better than embeddings?

Exploring whether learning interpretable text-based summaries of user preferences outperforms embedding vectors for training personalized reward models in language model alignment.

Why do language models fail to act on their own reasoning?

LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?

Fine-Tuning Side Effects

3 notes

Does fine-tuning weaken how reasoning steps influence answers?

When models are fine-tuned on domain-specific tasks, do their chain-of-thought reasoning steps actually causally drive the final answer, or do they become decorative? This matters because accurate outputs can mask unfaithful reasoning.

Can we decouple what pretraining and fine-tuning each improve?

Does scaling at different training stages produce distinct capability improvements? This matters because it could reveal whether knowledge and behavioral alignment are truly separate properties we can optimize independently.

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.
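
A minimal sketch of the kind of asymmetric objective in question, with hypothetical costs: a binary classifier where missing a positive is treated as five times as expensive as a false alarm.

    import torch
    import torch.nn.functional as F

    def utility_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                              miss_cost: float = 5.0, false_alarm_cost: float = 1.0) -> torch.Tensor:
        # Weight each example's loss by the assumed real-world cost of getting it wrong.
        weights = torch.where(targets == 1.0,
                              torch.full_like(targets, miss_cost),
                              torch.full_like(targets, false_alarm_cost))
        return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)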

Parameter-Efficient and Alternative Tuning

5 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Can context playbooks prevent knowledge loss during iteration?

When AI systems iteratively refine their instructions and memories, do structured incremental updates better preserve domain knowledge than traditional rewriting? This matters because context degradation undermines long-term agent performance.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Data Selection and Reasoning Architecture

2 notes

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
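
A minimal sketch of the selection step being asked about, assuming per-example gradient features and a target-task gradient are already available as flat vectors; the shapes and the function name are illustrative.

    import torch
    import torch.nn.functional as F

    def select_by_influence(example_grads: torch.Tensor,  # (N, D) one gradient per example
                            target_grad: torch.Tensor,    # (D,) gradient on the target capability
                            k: int) -> torch.Tensor:
        # Score each instruction example by how well its gradient aligns with the
        # target task, then keep only the indices of the top-k most influential ones.
        scores = F.cosine_similarity(example_grads, target_grad.unsqueeze(0), dim=-1)
        return scores.topk(k).indices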

Can abstractions guide exploration better than depth alone?

Does training a model to propose reasoning abstractions as intermediate subgoals help it explore diverse solution strategies more effectively than simply extending chain-of-thought depth?
