Topics: LLM Reasoning and Architecture · Language Understanding and Pragmatics · Reinforcement Learning for LLMs

Can reconstructing expert thinking improve reasoning transfer?

Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?

Note · 2026-05-03 · sourced from Data

Standard reasoning training uses supervised fine-tuning or reinforcement learning, both of which require task-specific signals (math correctness, code execution) and therefore cannot scale to domains where verifiable feedback is unavailable. Continual pretraining (CPT) avoids this constraint but provides no reasoning signal: the model just sees more text. Reasoning CPT proposes a third path. Every expert text (a math proof, a legal opinion) is the visible result of an underlying thought process involving trial, hypothesis, recall, and verification, and that hidden process can be reconstructed as synthetic training data. This is the same surface-vs-process distinction that drives "Why do language models need so much more text than humans?".

The reconstruction targets four characteristic aspects of expert thinking: spontaneous, human-like expressions ("Hmm...", "Aha!"), background-knowledge recall (internally retrieving relevant rules), decision-making (weighing which action to take next), and self-verification (checking for omissions). The synthetic training sequence concatenates the original text with its reconstructed hidden thoughts, giving the model both the visible result and the implicit process behind it.
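A minimal sketch of how such a synthetic sequence might be assembled, assuming a generic LLM completion function. The prompt wording, the delimiter tags, and the function names here are illustrative assumptions, not the paper's exact pipeline; the note only specifies the four target aspects and the concatenation order.

```python
# Sketch: assemble one Reasoning CPT training example by appending
# reconstructed hidden thoughts to an expert text. Prompt wording and
# delimiters are assumptions; the note specifies only the four aspects
# to elicit and that the original text comes first in the sequence.

RECONSTRUCTION_PROMPT = """Given the expert text below, write the hidden
thought process that could have produced it. Include spontaneous expressions
("Hmm...", "Aha!"), recall of relevant background knowledge, decision-making
about what to do next, and self-verification.

Expert text:
{text}

Hidden thoughts:"""


def build_reasoning_cpt_example(expert_text: str, generate) -> str:
    """Return one synthetic sequence: original text + reconstructed thoughts.

    `generate` is any callable mapping a prompt string to a completion,
    e.g. a thin wrapper around an instruction-tuned model.
    """
    hidden_thoughts = generate(RECONSTRUCTION_PROMPT.format(text=expert_text))
    # The model is then trained with the ordinary next-token objective on
    # this sequence, so it sees both the visible result and the process.
    return (
        expert_text
        + "\n\n<hidden_thoughts>\n"
        + hidden_thoughts
        + "\n</hidden_thoughts>"
    )
```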

Three findings distinguish this from standard CPT. First, cross-domain transfer: training on hidden thoughts reconstructed from legal texts improves not only MMLU social sciences but also MMLU-STEM, by 4.3 points, because the reasoning skill transfers rather than the domain knowledge. Second, the gap widens with difficulty: on the hardest MMLU problems, Reasoning CPT reaches 51.8-52.5% accuracy versus 43.9-44.6% for CPT, a roughly 8-point advantage. Third, models automatically adjust reasoning length to problem difficulty, short for easy and long for hard, without explicit instruction.

A plausible mechanism for the adaptive reasoning length: the training corpus shows a positive correlation between original-text length and hidden-thought length (Spearman ρ = 0.348 for STEM, 0.486 for Law). The model learns a heuristic, continue thinking until enough evidence accumulates to confidently predict the next token, which produces short chains for easy questions and long chains for hard ones. The implication is that overthinking and underthinking are both consequences of training on text that does not reveal its own thinking-effort calibration.
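The length correlation this mechanism relies on is straightforward to measure on a corpus of (original text, hidden thoughts) pairs. A sketch using scipy, with a whitespace split standing in for a real tokenizer; the pairing format is an assumption:

```python
# Sketch: Spearman correlation between original-text length and
# hidden-thought length over (text, thoughts) pairs. Whitespace token
# counts stand in for a real tokenizer.
from scipy.stats import spearmanr

def length_correlation(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (rho, p-value) for text length vs. hidden-thought length."""
    text_lens = [len(text.split()) for text, _ in pairs]
    thought_lens = [len(thoughts.split()) for _, thoughts in pairs]
    rho, p_value = spearmanr(text_lens, thought_lens)
    return rho, p_value

# The note reports rho = 0.348 on the STEM corpus and 0.486 on Law.
```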




The core claim: expert texts are surface residues of hidden thought processes, and reconstructing those processes for pretraining produces cross-domain reasoning transfer that standard CPT cannot achieve.