Topics: LLM Reasoning and Architecture · Language Understanding and Pragmatics · Reinforcement Learning for LLMs

Can reconstructing expert thinking improve reasoning transfer?

Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?

Note · 2026-05-03 · sourced from Data

Standard reasoning training uses supervised fine-tuning or reinforcement learning, both of which require task-specific signals (math correctness, code execution) and therefore cannot scale to domains where verifiable feedback is unavailable. Continual pretraining (CPT) avoids this constraint but provides no reasoning signal: the model just sees more text. Reasoning CPT proposes a third path. Every expert text (a math proof, a legal opinion) is the visible result of an underlying thought process involving trial, hypothesis, recall, and verification, and that hidden process can be reconstructed as synthetic training data. This is the same surface-vs-process distinction that drives "Why do language models need so much more text than humans?".

The reconstruction targets four characteristic aspects of expert thinking: spontaneous, human-like expressions ("Hmm...", "Aha!"), background-knowledge recall (internally retrieving relevant rules), decision-making (weighing which action to take next), and self-verification (checking for omissions). The synthetic training sequence concatenates the original text with its reconstructed hidden thoughts, giving the model both the visible result and the implicit process behind it.
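A minimal sketch of how such a synthetic sequence might be assembled, assuming a generic LLM completion function. The prompt wording, the delimiter tags, and the function names here are illustrative assumptions, not the paper's exact pipeline; the note only specifies the four target aspects and the concatenation order.

```python
# Sketch: assemble one Reasoning CPT training example by appending
# reconstructed hidden thoughts to an expert text. Prompt wording and
# delimiters are assumptions; the note specifies only the four aspects
# to elicit and that the original text comes first in the sequence.

RECONSTRUCTION_PROMPT = """Given the expert text below, write the hidden
thought process that could have produced it. Include spontaneous expressions
("Hmm...", "Aha!"), recall of relevant background knowledge, decision-making
about what to do next, and self-verification.

Expert text:
{text}

Hidden thoughts:"""


def build_reasoning_cpt_example(expert_text: str, generate) -> str:
    """Return one synthetic sequence: original text + reconstructed thoughts.

    `generate` is any callable mapping a prompt string to a completion,
    e.g. a thin wrapper around an instruction-tuned model.
    """
    hidden_thoughts = generate(RECONSTRUCTION_PROMPT.format(text=expert_text))
    # The model is then trained with the ordinary next-token objective on
    # this sequence, so it sees both the visible result and the process.
    return (
        expert_text
        + "\n\n<hidden_thoughts>\n"
        + hidden_thoughts
        + "\n</hidden_thoughts>"
    )
```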

Three findings distinguish this from standard CPT. First, cross-domain transfer: training on hidden thoughts reconstructed from legal texts improves not only MMLU social sciences but also MMLU-STEM, by 4.3 points, because the reasoning skill transfers rather than the domain knowledge. Second, the gap widens with difficulty: on the hardest MMLU problems, Reasoning CPT reaches 51.8-52.5% accuracy versus 43.9-44.6% for CPT, a roughly 8-point advantage. Third, models automatically adjust reasoning length to problem difficulty, short for easy and long for hard, without explicit instruction.

A plausible mechanism for the adaptive reasoning length: the training corpus shows a positive correlation between original-text length and hidden-thought length (Spearman ρ = 0.348 for STEM, 0.486 for Law). The model learns a heuristic, continue thinking until enough evidence accumulates to confidently predict the next token, which produces short chains for easy questions and long chains for hard ones. The implication is that overthinking and underthinking are both consequences of training on text that does not reveal its own thinking-effort calibration.
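The length correlation this mechanism relies on is straightforward to measure on a corpus of (original text, hidden thoughts) pairs. A sketch using scipy, with a whitespace split standing in for a real tokenizer; the pairing format is an assumption:

```python
# Sketch: Spearman correlation between original-text length and
# hidden-thought length over (text, thoughts) pairs. Whitespace token
# counts stand in for a real tokenizer.
from scipy.stats import spearmanr

def length_correlation(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (rho, p-value) for text length vs. hidden-thought length."""
    text_lens = [len(text.split()) for text, _ in pairs]
    thought_lens = [len(thoughts.split()) for _, thoughts in pairs]
    rho, p_value = spearmanr(text_lens, thought_lens)
    return rho, p_value

# The note reports rho = 0.348 on the STEM corpus and 0.486 on Law.
```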




The core claim: expert texts are surface residues of hidden thought processes, and reconstructing those processes for pretraining produces cross-domain reasoning transfer that standard CPT cannot achieve.