How much do compressed reasoning traces transfer across different models?
This note asks whether reasoning traces, once shortened or distilled into compact form, stay useful when moved from the model that produced them to a different model. The corpus addresses this indirectly rather than head-on.
Concretely: when you compress a reasoning trace (strip it down, or use the raw 'thinking' as shortened context), how much of its value survives being handed to a different model? The honest framing first: the corpus has almost no paper that directly benchmarks one model's compressed trace inside another model. What it does have is a set of findings about *what a trace actually is*, and those findings strongly shape what you'd expect transfer to look like.
The first surprise is that a reasoning trace can act as its own compressor. A model's raw thinking, used directly as shortened context, beats most purpose-built compression methods with no special training (Can a reasoning model's thinking trace compress context effectively?). And you can squeeze hard: 'Chain of Draft' matches full chain-of-thought accuracy while keeping only 7.6% of the tokens, because the other 92.4% was style and documentation, not computation (Can minimal reasoning chains match full explanations?). So compression isn't lossy in the way you'd fear: much of a trace is padding.
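To make the token arithmetic concrete, here is a minimal sketch. The 7.6% figure comes from the Chain of Draft note; the 1,000-token trace length and the helper name `compression_stats` are assumptions chosen purely for illustration.

```python
def compression_stats(original_tokens: int, kept_fraction: float) -> dict:
    """Summarize what keeping `kept_fraction` of a trace's tokens implies.

    Illustrative helper only; it is not taken from any of the cited work.
    """
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if not 0 < kept_fraction <= 1:
        raise ValueError("kept_fraction must be in (0, 1]")
    return {
        "kept_tokens": round(original_tokens * kept_fraction),
        "removed_fraction": 1 - kept_fraction,
        "compression_ratio": 1 / kept_fraction,
    }


# Hypothetical 1,000-token trace, compressed to the reported 7.6%.
stats = compression_stats(1000, 0.076)
print(stats)
```

Under these assumptions the trace shrinks to 76 tokens, 92.4% of the tokens are removed, and the compression ratio is a little over 13 to 1.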
Now the twist that bears on transfer. Several papers argue a trace's *semantic content* isn't what carries the performance. Models trained on deliberately corrupted, irrelevant traces stay just as accurate and sometimes generalize better, suggesting traces work as computational scaffolding rather than meaning (Do reasoning traces need to be semantically correct?). Invalid logical steps perform nearly as well as valid ones, which suggests traces are persuasive appearances rather than verified reasoning (Do reasoning traces show how models actually think?). Format and demonstration position shape outcomes more than logical content does (What makes chain-of-thought reasoning actually work?). If the active ingredient is structure-as-scaffold rather than portable propositions, then transfer becomes a question about whether the *receiving* model can use that scaffold, which depends heavily on its own training. And training regime is decisive: non-reasoning models never catch up to reasoning models no matter how much inference compute you give them, because reasoning is a learned protocol, not free-floating tokens (Can non-reasoning models catch up with more compute?).
The sharpest caution comes from generalization work: chain-of-thought degrades predictably once task, length, or format shifts away from the training distribution, producing fluent-but-inconsistent reasoning (Does chain-of-thought reasoning actually generalize beyond training data?). A compressed trace from model A is plausibly, to model B, that kind of distribution shift, so you'd expect transfer to hold on familiar territory and fray on unfamiliar problem structures. Two findings hint at what *would* survive. First, the load-bearing parts of a trace are sparse: planning and backtracking 'thought anchors' that pivot the reasoning (Which sentences actually steer a reasoning trace?). Second, step-level confidence can flag where a trace breaks down, so the trace need not be trusted wholesale (Does step-level confidence outperform global averaging for trace filtering?). The takeaway: if traces are scaffolding rather than meaning, transferring them across models is less about preserving the words and more about whether the receiving model was trained to climb that kind of scaffold at all.
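The difference between step-level and global confidence can be sketched as follows. This is an illustration under stated assumptions, not the method from the cited note: the sliding-window mean, the window size of 3, the 0.5 threshold, and the per-step confidence values are all hypothetical.

```python
from statistics import mean


def global_confidence(step_conf: list[float]) -> float:
    """Average confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf: list[float], window: int = 3) -> float:
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if not step_conf:
        raise ValueError("step_conf must not be empty")
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def first_breakdown(step_conf: list[float], window: int = 3,
                    threshold: float = 0.5):
    """Index of the first step whose trailing window falls below threshold.

    Returns None if no window does. A caller could stop generating here.
    """
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1
    return None


# Hypothetical per-step confidences for two traces.
steady = [0.8, 0.8, 0.75, 0.8, 0.8, 0.75, 0.8, 0.8]
collapsing = [0.95, 0.95, 0.95, 0.3, 0.2, 0.3, 0.95, 0.95, 0.95, 0.95]

for name, trace in [("steady", steady), ("collapsing", collapsing)]:
    print(name,
          round(global_confidence(trace), 3),
          round(lowest_window_confidence(trace), 3),
          first_breakdown(trace))
```

With these made-up numbers both traces have a global average above 0.7, but only the windowed score exposes the mid-trace collapse in the second one, and `first_breakdown` locates it before the trace finishes. For cross-model transfer, the same kind of local signal could in principle be computed by the receiving model over a borrowed trace, though nothing in the corpus tests that directly.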
Sources (9 notes)
A reasoning model's raw thinking trace, used directly as shortened context, outperforms most dedicated compression methods without requiring specialized modules or compression-specific training. The mechanism that enables reasoning also produces usable input compression.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, indicating that semantic correctness is not what produces the performance gains.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.