Does chain-of-thought reasoning actually generalize beyond training data?
Explores whether CoT's strong benchmark performance reflects genuine reasoning ability or merely learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning-chain lengths shift away from the training data.
Chain-of-Thought prompting performs well on in-distribution problems and fails predictably as distributional discrepancy increases. This is not a bug — it is the fundamental nature of what CoT is.
The DataAlchemy experiments train LLMs from scratch in controlled environments and probe them along three distributional-shift dimensions (a toy sketch of the setup follows the list):
- Task distribution shift — novel tasks whose elements or underlying logical structure were not seen during training
- Length distribution shift — reasoning chains substantially longer or shorter than the lengths covered in training
- Format distribution shift — prompt formulations (even minor syntactic variations) that fall outside the training distribution
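For concreteness, here is a minimal Python sketch of how such controlled splits can be constructed on a toy symbolic task (composed letter rotations). Everything in it (the `rot` operation, the function names, the specific split parameters) is an illustrative assumption in the spirit of the setup, not DataAlchemy's actual code.

```python
import random
import string

# Toy stand-in for a DataAlchemy-style controlled setup: a symbolic task
# (composed letter rotations) whose train/test splits can be shifted along
# each axis independently. Illustrative only; not the paper's code.

def rot(s: str, k: int) -> str:
    """Rotate each letter of s forward k positions in the alphabet."""
    return "".join(
        string.ascii_lowercase[(string.ascii_lowercase.index(c) + k) % 26]
        for c in s
    )

def make_example(word: str, ops: list[int]) -> dict:
    """Build one (prompt, reasoning chain, answer) triple."""
    steps, cur = [], word
    for k in ops:
        cur = rot(cur, k)
        steps.append(f"rot{k} -> {cur}")
    return {"prompt": f"apply {ops} to '{word}'", "chain": steps, "answer": cur}

random.seed(0)
words = ["".join(random.choices(string.ascii_lowercase, k=4)) for _ in range(100)]

# In-distribution training data: 2-step chains using only ops {1, 2}.
train = [make_example(w, [random.choice([1, 2]) for _ in range(2)]) for w in words]

# Task shift: an operation (13) never seen during training.
task_shift = [make_example(w, [13, 13]) for w in words]

# Length shift: 5-step chains when training only ever saw 2 steps.
length_shift = [make_example(w, [random.choice([1, 2]) for _ in range(5)]) for w in words]

# Format shift: identical task, perturbed prompt phrasing.
format_shift = [{**ex, "prompt": ex["prompt"].replace("apply", "Please perform")}
                for ex in train]
```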
In all three dimensions, the pattern is the same: CoT works within distribution, fails outside it. Under moderate shifts, models generate fluent yet logically inconsistent reasoning — the form holds, the logic breaks. This is the "mirage" phenomenon: outputs look like reasoning while producing wrong conclusions.
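That form/logic split can be made operational. Below is a hedged sketch that scores a generated chain separately for surface form (does each step match the expected pattern?) and logical validity (does each step actually follow?), on the same toy rotation task as above. The step format, regex, and function names are assumptions for illustration, not the paper's evaluation code.

```python
import re
import string

def rot(s: str, k: int) -> str:
    """Same toy rotation op as the sketch above."""
    return "".join(
        string.ascii_lowercase[(string.ascii_lowercase.index(c) + k) % 26]
        for c in s
    )

# A step is well-formed if it matches "rot<k> -> <letters>".
STEP_RE = re.compile(r"rot(\d+) -> ([a-z]+)")

def score_chain(word: str, steps: list[str]) -> dict:
    """Score a generated chain separately for surface form and logic."""
    form_ok = all(STEP_RE.fullmatch(s) for s in steps)
    logic_ok, cur = True, word
    for s in steps:
        m = STEP_RE.fullmatch(s)
        if m is None:
            logic_ok = False
            break
        cur = rot(cur, int(m.group(1)))
        logic_ok = logic_ok and (m.group(2) == cur)  # claimed vs. actual result
    return {"form_ok": form_ok, "logic_ok": logic_ok}

# A "mirage" output: perfectly formatted, logically wrong on step 2.
print(score_chain("abcd", ["rot1 -> bcde", "rot1 -> zzzz"]))
# {'form_ok': True, 'logic_ok': False}
```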
The interpretive frame: CoT reflects a structured inductive bias learned from training data, not a generalizable reasoning capability. When a test query falls within this inductive bias, CoT activates the appropriate reasoning schema and produces good outputs. When it falls outside, the schema mismatch produces confident-sounding nonsense.
The practical implication: CoT is not a plug-and-play solution. Performance on CoT benchmarks measures in-distribution capability; extrapolating to novel tasks, unusual prompt formulations, or unusually long or short reasoning chains is unjustified. Benchmark scores do not predict performance under distribution shift.
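One cheap consequence is a pre-deployment probe: measure the gap between accuracy on a benchmark's original prompts and on lightly perturbed variants of them. A minimal sketch, assuming `model` is any prompt-to-answer callable you supply; the specific perturbations are illustrative.

```python
from typing import Callable

# Cheap format perturbations; each one is a small distribution shift.
PERTURBATIONS = [
    lambda p: p.replace("Q:", "Question:"),      # minor rewording
    lambda p: p.lower(),                         # casing shift
    lambda p: "Please answer carefully. " + p,   # added preamble
]

def shift_gap(model: Callable[[str], str],
              dataset: list[tuple[str, str]]) -> dict:
    """Accuracy on original prompts vs. mean accuracy under perturbations."""
    def acc(perturb):
        hits = sum(model(perturb(p)).strip() == a for p, a in dataset)
        return hits / len(dataset)

    shifted = [acc(f) for f in PERTURBATIONS]
    return {"in_distribution": acc(lambda p: p),
            "shifted_mean": sum(shifted) / len(shifted)}

# Toy demo: a "model" that only recognizes the exact training phrasing.
brittle = lambda p: "4" if p == "Q: 2+2?" else "?"
print(shift_gap(brittle, [("Q: 2+2?", "4")]))
# {'in_distribution': 1.0, 'shifted_mean': 0.0}
```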
This provides the empirical grounding for "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" The mirage emerges from imitation under distribution shift: the model continues imitating the form of reasoning while having no schema to produce valid content.
Source: Reasoning Critiques
Related concepts in this collection
- Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly. Connection: DataAlchemy provides the empirical confirmation; imitation fails under distribution shift because no learned schema matches.
- Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations. Connection: distribution-bounded CoT is neither sufficient (it fails under shift) nor necessary (in-distribution performance may not require the chain).
- Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures. Connection: the same pattern holds; surface patterns work in-distribution and fail under structural change.
- Does training data format shape reasoning strategy more than domain? What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more. Connection: format dependency is part of distribution-boundedness; changing the format is itself a distribution shift.
Original note title: cot reasoning is distribution-bounded — effectiveness degrades predictably with distributional discrepancy