How much does pretraining contribute to ToM performance versus task-specific training?

This reads the question as asking where a model's underlying capability lives: is the reasoning behind Theory-of-Mind-style tasks built during pretraining, with task-specific training mostly teaching the model how to express it? The corpus has no ToM papers directly, but it has a sharp, recurring answer to exactly that capability-versus-expression split.

This explores where the actual competence behind a skill like Theory of Mind comes from: the broad pretraining run or the later task-specific tuning. The corpus doesn't contain ToM studies by name, so I can't give you a number for ToM specifically. But several notes converge on a striking decomposition that plausibly governs ToM too: the *capability* is laid down in pretraining, while task-specific training mostly teaches the model the *format* of the answer, not the underlying understanding.

The cleanest statement of this is the finding that pretraining scale drives factual knowledge while fine-tuning scale drives behavioral helpfulness. Crucially, the split has an architectural basis: pretraining enriches lower-layer knowledge storage, while fine-tuning adjusts upper-layer behavior expression (Do pretraining and fine-tuning scale independently in language models?). If ToM behaves like other knowledge-laden capabilities, that predicts the heavy lifting happens before any task-specific training begins. The most provocative companion result is that instruction tuning teaches output format rather than task understanding: models trained on deliberately *wrong* or semantically empty instructions match those trained on correct ones, because what transfers is knowledge of the output space, not the task itself (Does instruction tuning teach task understanding or output format?). Read together, these suggest task-specific training may contribute far less to genuine ToM reasoning than its benchmark gains imply; it could be unlocking a way of answering rather than a way of reasoning.

Reinforcement-style post-training tells the same story from another angle: RL doesn't invent new behavior so much as amplify one format already present in the pretraining distribution while suppressing the alternatives, all within the first epoch (Does RL training collapse format diversity in pretrained models?). That is a selection-from-priors picture, not a teach-new-skills picture, which would mean ToM ability is being *surfaced* from pretraining rather than created by the task tuning.

There's a real counterweight, though. A line of work argues you can plant reasoning capability *earlier*, during pretraining itself: treating chain-of-thought as an exploratory action rewarded by information gain lifts reasoning benchmarks by ~19% (Can chain-of-thought reasoning be learned during pretraining itself?), and augmenting pretraining data with generated thinking traces yields 3x data efficiency (Can training data augmentation match test-time compute scaling benefits?). The implication for ToM is subtle: if reasoning depth is malleable during pretraining, then "pretraining contribution" isn't a fixed quantity; it depends on whether the pretraining mix contained the right kind of reasoning trajectories in the first place. A reliability caveat sits underneath all of this: chain-of-thought reasoning degrades predictably once you leave the training distribution, producing fluent but logically broken output (Does chain-of-thought reasoning actually generalize beyond training data?), so apparent ToM gains from task training may be distribution-bound imitation rather than transferable competence.

The thing worth walking away with: the question "pretraining vs. task training" may be the wrong frame. The corpus repeatedly shows these aren't two sources of the same skill competing for credit. They operate on different layers and do different jobs (knowledge storage vs. behavior expression), and a benchmark score can't tell you which one moved. To probe ToM honestly you'd want to separate "does the model *know* how minds work" from "did tuning teach it the shape of a ToM answer", and the instruction-tuning result is the evidence that those two can come apart completely.

Sources (6 notes)

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
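
To make the decoupling concrete, here is a minimal sketch of the log-probability arithmetic usually associated with Emulated Fine-Tuning: take a large base model's next-token distribution (the "knowledge" term) and add the behavioral shift that fine-tuning produced in a small model. The note above does not spell out this formula, so treat it as an assumption about the method; the function names and toy logits are hypothetical, and real use would operate on model outputs rather than hand-written lists.

```python
import math


def log_softmax(logits):
    """Convert raw scores to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def eft_combine(base_large, base_small, tuned_small):
    """Assumed EFT-style combination over one next-token distribution.

    log p = log p_base_large + (log p_tuned_small - log p_base_small),
    renormalised so the result is a valid distribution.
    """
    lp_large = log_softmax(base_large)
    lp_small = log_softmax(base_small)
    lp_tuned = log_softmax(tuned_small)
    # Behavioural delta learned by fine-tuning, measured at small scale.
    delta = [t - s for t, s in zip(lp_tuned, lp_small)]
    combined = [l + d for l, d in zip(lp_large, delta)]
    return log_softmax(combined)


if __name__ == "__main__":
    # Toy three-token vocabulary; all numbers are illustrative only.
    large_base = [2.0, 1.0, 0.1]
    small_base = [1.0, 1.0, 1.0]
    small_tuned = [1.0, 2.0, 1.0]  # fine-tuning favours token 1
    probs = [math.exp(lp) for lp in eft_combine(large_base, small_base, small_tuned)]
    print(probs)
```

If the small tuned model is identical to the small base model, the delta is zero and the result is just the large base distribution, which is the sense in which the two scales can be varied independently.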

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
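
The "log-likelihood improvement" reward can be sketched as follows. The note only says the reward is verifier-free and based on log-likelihood improvement; how the no-thought baseline is computed is not stated here, so the two-probability interface and the function name below are assumptions for illustration.

```python
import math


def information_gain_reward(p_with_thought, p_without_thought):
    """Reward a thought by how much it raises the log-likelihood of the
    token that actually came next in the pretraining text.

    Both arguments are probabilities the model assigns to the observed
    next token: one conditioned on the sampled chain-of-thought, one not.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


if __name__ == "__main__":
    # Illustrative values only: a helpful thought, then a harmful one.
    print(information_gain_reward(0.50, 0.25))
    print(information_gain_reward(0.10, 0.25))
```

A positive value means the thought made the real continuation more predictable, and a negative value means it made it less so. No external checker is needed, which is what makes the signal usable on ordinary pretraining text.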

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
