Why does long CoT training optimize for structural coherence over content correctness?

This explores why training on long chain-of-thought traces seems to teach models the *shape* of reasoning — how steps connect and sequence — rather than whether the facts inside those steps are actually right.

The sharpest evidence comes from controlled ablations: models tolerate having 50% of the numbers in their training traces corrupted (only a 3.2% accuracy loss), but fall apart when the order of the steps is shuffled (a 13.3% loss) [What do models actually learn from chain-of-thought training?]. In other words, what distills from a reasoning demonstration is its logical architecture, the scaffolding of how one move leads to the next, not the factual accuracy of any given move. Keep the scaffolding intact and the model copes with wrong facts; break it and performance drops, even when every fact is perfect.
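To make the two ablations concrete, here is a minimal sketch of the perturbations described above, applied to a toy trace. It is an illustration under stated assumptions, not the procedure used in the cited experiments: the function names `corrupt_numbers` and `shuffle_steps`, the example trace, and the choice to corrupt a number by adding a small random offset are all hypothetical.

```python
import itertools
import random
import re

NUMBER = re.compile(r"\d+")

def corrupt_numbers(steps, fraction=0.5, seed=0):
    """Content ablation: make a fixed fraction of the numbers wrong, keep step order."""
    rng = random.Random(seed)
    # Index every numeric token as (step index, occurrence within that step).
    positions = [
        (i, j)
        for i, step in enumerate(steps)
        for j, _ in enumerate(NUMBER.finditer(step))
    ]
    chosen = set(rng.sample(positions, round(fraction * len(positions))))

    corrupted = []
    for i, step in enumerate(steps):
        counter = itertools.count()

        def replace(match, i=i, counter=counter):
            j = next(counter)
            if (i, j) not in chosen:
                return match.group(0)
            # Assumption: "corrupting" means shifting the value by 1..9,
            # so the replacement is always a different number.
            return str(int(match.group(0)) + rng.randint(1, 9))

        corrupted.append(NUMBER.sub(replace, step))
    return corrupted

def shuffle_steps(steps, seed=0):
    """Structure ablation: keep every step verbatim, randomize their order."""
    rng = random.Random(seed)
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical toy trace; real training traces are far longer.
trace = [
    "There are 12 apples and 7 pears.",
    "Adding them gives 12 + 7 = 19 pieces of fruit.",
    "So the answer is 19.",
]

content_ablated = corrupt_numbers(trace, fraction=0.5)
structure_ablated = shuffle_steps(trace)
```

The point of the contrast is that `content_ablated` keeps the sequence of moves while the facts go wrong, whereas `structure_ablated` keeps every fact while the sequence is destroyed. According to the ablation results above, training on the first kind of trace costs far less accuracy than training on the second.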

Why would training reward structure over content? Because CoT, at bottom, is imitation of a *form*. Models pattern-match reasoning structure rather than perform genuine inference [Does chain-of-thought reasoning reveal genuine inference or pattern matching?] [Why does chain-of-thought reasoning fail in predictable ways?]. The most startling demonstration of this: logically *invalid* CoT exemplars perform about as well as valid ones on BIG-Bench Hard [Does logical validity actually drive chain-of-thought gains?]. If the model were learning to reason, broken logic should hurt, but it barely does, because the gains were never coming from validity. They were coming from the recognizable choreography of step-by-step text. Training optimizes for whatever produces the reward, and the reward signal turns out to be carried by structure, not truth.

The cost of this shows up the moment you leave familiar territory. CoT degrades predictably under distribution shift in task, length, and format, producing fluent but logically inconsistent reasoning [Does chain-of-thought reasoning actually generalize beyond training data?]. That's the signature of imitation rather than capability: a model that learned the *content* of reasoning would generalize; a model that learned the *form* reproduces the form smoothly while the content quietly goes wrong off-distribution. The fluency is exactly what makes it dangerous: the structure stays coherent even as correctness evaporates.

There's a deeper structural pull here too. Post-training tends to collapse toward dominant patterns: RL amplifies a single pretraining format within the first epoch while suppressing the alternatives, and which format wins depends on model scale, not necessarily on performance [Does RL training collapse format diversity in pretrained models?]. So the optimization pressure isn't even neutral toward structure; it actively narrows toward whatever formal pattern is most reinforced. The same root failure recurs elsewhere in the corpus: in grammar too, models lean on surface heuristics rather than structural rules, handling simple sentences well but failing on recursion and deep embedding [Does LLM grammatical performance decline with structural complexity?]. The thread connecting these is that gradient descent finds the cheapest correlate of the reward, and "looks like valid reasoning" is far cheaper to learn than "is valid reasoning."

The thing worth carrying away: this isn't a bug you can patch by adding more correct examples, because the training objective itself can't distinguish a correct trace from a structurally identical wrong one. If you want models that track content, you may need a different lever than imitation entirely. Note that decoding-time proxy-tuning, which leaves base weights untouched, preserves knowledge precisely because it shifts *style and reasoning* without corrupting the lower-layer storage where content lives [Can decoding-time tuning preserve knowledge better than weight fine-tuning?]. That contrast hints at the real fault line: structure and content appear to live in different parts of the model, and CoT training may have been tuning the wrong one.
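For readers unfamiliar with proxy-tuning, the sketch below shows the logit arithmetic as the technique is usually described: a small tuned model and its untuned counterpart supply an offset that is added to the large base model's next-token logits at decoding time, so the base weights are never updated. This is a toy illustration, not the implementation behind the cited numbers; the function names and the three-token vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift the base model's logits by the (tuned - untuned) small-model offset.

    base_logits:       next-token logits from the large, untouched base model
    expert_logits:     logits from a small tuned model
    antiexpert_logits: logits from the same small model before tuning
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Hypothetical logits over a three-token vocabulary.
base = [2.0, 1.0, 0.5]
expert = [1.0, 0.0, 0.0]
antiexpert = [0.0, 0.0, 0.0]

probs = proxy_tuned_distribution(base, expert, antiexpert)
```

Because the adjustment is applied to output logits rather than to parameters, whatever the base model stores in its weights is left as it was; only the distribution over next tokens moves.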

Sources (8 notes)

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
