INQUIRING LINE

What makes pretraining composition more important than reward engineering?

This explores a claim several papers in the corpus converge on: that what a model learned during pretraining sets the ceiling on reasoning, while reward design during RL mostly selects and amplifies what's already there rather than adding new capability.


This reads the question as asking whether the heavy lifting in modern reasoning models happens before reward engineering ever starts, and the corpus makes a surprisingly strong case that it does. The clearest statement comes from work on RLVR dynamics, which finds that reinforcement learning with verifiable rewards (RLVR) improves how efficiently a model samples good answers but doesn't expand its capability boundary: a single training example can be enough to 'activate' a reasoning strategy, and even spurious rewards work nearly as well as correct ones, as long as the model was pretrained appropriately (see "What does reward learning actually do to model reasoning?"). If a wrong reward signal gets you most of the way there, the reward isn't where the reasoning lives. Pretraining is.
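The gap between "sampling efficiency" and "capability boundary" is usually read off pass@k curves: pass@1 tracks how often a single sample is right, while pass@k at large k asks whether the model can produce a correct answer at all. Below is a minimal sketch using the standard unbiased pass@k estimator. The sample counts are invented for illustration and are not results from any of the cited notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k samples include a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for one problem, 100 samples per model (illustrative).
base = {"n": 100, "c": 5}    # base model: rarely correct, but not never
tuned = {"n": 100, "c": 60}  # after RL: correct far more often

for k in (1, 10, 100):
    print(k, round(pass_at_k(k=k, **base), 3), round(pass_at_k(k=k, **tuned), 3))
```

In this toy setting the tuned model is far ahead at k=1, yet both models reach 1.0 once k equals the sample budget, which is the pattern you would expect if RL sharpened sampling without adding problems the base model could never solve.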

A second study makes the mechanism concrete. When you apply RL on top of a pretrained model, it doesn't invent a new way of formatting answers: within the first epoch it locks onto one dominant format that already existed in the pretraining distribution and collapses the alternatives. Which format wins depends on model scale, not necessarily on which format performs best, and this whole dynamic is hidden when you start from a proprietary base model whose pretraining mix you can't see (see "Does RL training collapse format diversity in pretrained models?"). So reward engineering is less like teaching and more like choosing which pre-existing voice gets amplified, and the menu of voices was written during pretraining.

Look laterally and the reward-design papers themselves keep bumping into this ceiling. Negative reinforcement alone, which only suppresses wrong trajectories, often matches full PPO and GRPO, partly because it preserves the answer diversity that pretraining produced instead of collapsing probability mass onto a few modes (see "Does negative reinforcement alone outperform full reinforcement learning?"). And when models plateau, the fix that breaks the plateau isn't a cleverer numerical reward but natural-language critiques that carry information the scalar reward never could (see "Can natural language feedback overcome numerical reward plateaus?"), a sign that reward signals are an impoverished channel for actually changing what a model knows.
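Why would suppressing wrong answers preserve the pretrained distribution? A toy softmax makes the intuition visible. The sketch below takes one gradient step that lowers the log-probability of a sampled wrong answer. It is an illustration under simplifying assumptions (a four-way categorical policy, hypothetical logits, a single update on raw logits), not the training procedure used in the cited work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(logits, wrong, lr=1.0):
    """One gradient-descent step on log p(wrong) with respect to the logits.
    Since d log p_w / d z_j = 1[j == w] - p_j, the wrong answer's logit
    drops and every other logit rises by lr times its current probability."""
    p = softmax(logits)
    return [z - lr * ((1.0 if j == wrong else 0.0) - p[j])
            for j, z in enumerate(logits)]

# Hypothetical prior over four candidate answers; index 0 is the wrong one.
logits = [2.0, 1.0, 0.5, 0.0]
before = softmax(logits)
after = softmax(penalize(logits, wrong=0))

print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

The freed probability mass is handed to the remaining answers according to the prior the model already had, so their ranking is unchanged. Rewarding a single correct answer, by contrast, pulls mass toward that one mode.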

The interesting tension is that the corpus doesn't say reward engineering is useless. It says reward engineering is mostly *steering*, and steering is bounded by what you're steering. Training order matters because structured tasks shrink output entropy while creative tasks grow it, so scheduling reshapes which capabilities survive (see "Does training order reshape how models handle different task types?"); again, this rearranges existing capacity rather than minting new capacity. The one place RL genuinely seems to embed new knowledge, rather than activate old knowledge, is RLAG, which works by rewarding explanation quality and cycling between augmented and plain generation. It's notable that this requires going beyond token-level correctness rewards to get there (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").
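"Output entropy" here can be read as the Shannon entropy of the model's distribution over outputs. A small sketch, with made-up distributions standing in for a model after structured-task training (peaked) and after creative-task training (flat):

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical output distributions over four candidate continuations.
peaked = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic output
flat = [0.25, 0.25, 0.25, 0.25]    # maximally spread output

print(round(entropy(peaked), 3), round(entropy(flat), 3))
```

A schedule that drives the distribution toward the peaked case leaves little probability mass for open-ended generation to draw on, which is one way to picture why task ordering matters.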

The thing you didn't know you wanted to know: the effort researchers pour into reward hacking, calibration, and multi-objective weighting may be partly misplaced. If pretraining composition sets the boundary and reward mostly picks which pretrained behavior to surface, then the highest-leverage decisions were made before the reward function was ever written, and the field's inability to see inside proprietary pretraining mixes means we're often tuning the steering wheel while blindfolded to the engine.


Sources (6 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. Scheduling guided by backward transfer (BWT), with structured tasks trained first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Next inquiring lines