How do training objectives shape what a world model actually learns?

This explores how the choice of training objective — what you reward or predict — determines whether a model builds a genuine model of how the world works, or just shortcuts that happen to score well.

The corpus has a recurring, slightly uncomfortable answer: objectives rarely teach what we assume they teach. When you reward prediction accuracy, you tend to get prediction accuracy, not understanding. Transformers trained on orbital mechanics or board games hit high accuracy by stitching together task-specific heuristics rather than recovering the underlying laws. Probe them off-distribution and the 'laws' turn out to be nonsensical and slice-dependent, and arithmetic resolves into range-matching tricks rather than algorithms (see "Do foundation models learn world models or task-specific shortcuts?"). The objective got optimized; the world model never showed up. That gap is exactly why some researchers argue a world model's objective should be reframed entirely: not 'predict the next observation' but 'simulate the space of actions you could actually take,' so that what's learned supports reasoning about interventions and counterfactuals instead of surface regularities (see "What makes a world model actually useful for reasoning?" and "What should a world model actually be designed to do?").

The same lesson shows up away from world models, which is what makes it feel structural rather than incidental. Instruction tuning is supposed to teach task understanding, but models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones. What the objective actually transfers is knowledge of the output format, not comprehension of the task (see "Does instruction tuning teach task understanding or output format?"). Reinforcement learning tells a parallel story: verifiable rewards act less like teachers and more like catalysts, surfacing reasoning strategies that already lived in the pretrained prior rather than building new ones, with updates that are structurally sparse and bounded by what pretraining already knew (see "How does RL training reshape reasoning and what gets lost?" and "Does reinforcement learning update only a small fraction of parameters?"). So the objective doesn't write capability onto a blank slate; it selects, amplifies, and routes what's already there.
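To make "structurally sparse" concrete, one way to measure it is to compare tuned weights against base weights and count how many entries actually moved. The sketch below is illustrative only: the helper `update_fraction`, the tolerance `tol`, and the toy weight lists are assumptions, not anything taken from the cited notes.

```python
def update_fraction(base, tuned, tol=1e-8):
    """Return the fraction of parameters whose value changed by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("weight vectors must have the same length")
    if not base:
        raise ValueError("weight vectors must be non-empty")
    changed = sum(1 for b, t in zip(base, tuned) if abs(t - b) > tol)
    return changed / len(base)


# Toy flattened weights (hypothetical values): only 2 of 10 entries move.
base_weights = [0.10, -0.20, 0.30, 0.00, 0.50, -0.10, 0.25, 0.40, -0.35, 0.05]
tuned_weights = list(base_weights)
tuned_weights[2] += 0.01
tuned_weights[7] -= 0.02

fraction = update_fraction(base_weights, tuned_weights)
print(f"{fraction:.0%} of parameters updated")
```

On a real model the same comparison would run over every tensor in the two checkpoints. The tolerance is a judgment call, since the measured fraction depends on what counts as "unchanged".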

Whatever the objective amplifies comes at the expense of something it suppresses. RL post-training tends to collapse onto a single dominant format from pretraining within the first epoch, killing off alternatives, and the winning format depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). Different domains respond in opposite directions to the same pressure: structured tasks drive output entropy down while creative ones drive it up, so the *order* in which you apply objectives mechanically reshapes what survives. Train structured tasks first and you avoid entropy collapse crushing open-ended ability (see "Does training order reshape how models handle different task types?"). The objective even has internal phases: RL first consolidates procedural execution, then shifts the bottleneck to strategic planning (see "Does RL training follow a predictable two-phase learning sequence?").
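"Output entropy" here can be read as the Shannon entropy of the model's next-token distribution: low when probability mass piles onto one option, high when it is spread out. A minimal sketch, assuming made-up probability vectors (the names `shannon_entropy`, `peaked`, and `spread` are illustrative and not from the cited work):

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]   # collapsed onto one continuation
spread = [0.25, 0.25, 0.25, 0.25]   # uniform over all continuations

peaked_bits = shannon_entropy(peaked)
spread_bits = shannon_entropy(spread)
print(f"peaked: {peaked_bits:.3f} bits, spread: {spread_bits:.3f} bits")
```

The uniform case is the maximum for four outcomes (2 bits). Entropy collapse in the sense used above corresponds to a distribution drifting from something like `spread` toward something like `peaked`.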

Here's the less obvious part: the objective can quietly change what *kind of thing* the model is. Post-training appears to shift models from passive prediction to a form of enaction. They begin treating their own outputs as actions that shape their future inputs, closing an action-perception loop that pretraining never had (visible as 3-4x lower on-policy output entropy) (see "Do models recognize their own outputs as actions shaping future inputs?"). That's a striking echo of the 'simulate actionable possibilities' framing for world models: the objective doesn't just fill in content, it can install a stance toward the world. And what the model is fed structurally matters as much as what it's scored on. In-context learning of sequential decisions requires bursts of full trajectories from the same environment, not isolated examples, suggesting the 'objective' is really the whole training distribution's shape, not just the loss (see "Why do trajectories matter more than individual examples for in-context learning?").

The practical throughline: because objectives select rather than teach, *how hard* you push matters. Keeping KL drift from the base model low preserves the model's plasticity, meaning its ability to keep learning later tasks, whereas aggressive parameter-only optimization stalls when the domain shifts (see "Does staying close to the base model preserve learning ability?"). So shaping what a world model learns is less about specifying the right answer and more about choosing what to amplify, what order to amplify it in, and how much room to leave for everything you didn't optimize for.
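KL drift can be illustrated as the KL divergence between a tuned model's output distribution and the base model's, taken over the same vocabulary. The sketch below uses toy distributions. `kl_divergence` and all the numbers are assumptions for illustration, and the direction KL(tuned || base) is one common convention, not something the cited note specifies.

```python
import math


def kl_divergence(p, q):
    """KL(p || q), in nats, for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical next-token distributions over a 3-token vocabulary.
base = [0.50, 0.30, 0.20]
gentle = [0.55, 0.28, 0.17]      # small nudge away from the base model
aggressive = [0.98, 0.01, 0.01]  # mass collapsed onto one token

gentle_drift = kl_divergence(gentle, base)
aggressive_drift = kl_divergence(aggressive, base)
print(f"gentle: {gentle_drift:.4f} nats, aggressive: {aggressive_drift:.4f} nats")
```

In practice the quantity would be averaged over many prompts and token positions. The toy numbers only show how much faster the divergence grows once the tuned distribution collapses onto one option.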

Sources 12 notes

Do foundation models learn world models or task-specific shortcuts?

Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.

What makes a world model actually useful for reasoning?

Research shows LLMs may achieve high prediction accuracy through task-specific heuristics without developing coherent generative models of how the world works. True world models must enable reasoning about interventions and counterfactuals, not surface regularities.

What should a world model actually be designed to do?

Drawing on hypothetical thinking in psychology, world models are most useful when designed to simulate all actionable possibility spaces—physical, embodied, emotional, social, mental, counterfactual, and evolutionary—grounded in agent decision-making rather than passive prediction.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions (43% versus 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily on performance, and this effect is largely hidden when starting from proprietary pretrained models.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
