
Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong benchmark performance reflects genuine reasoning ability or merely learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning lengths shift away from the training data.

Note · 2026-02-22 · sourced from Reasoning Critiques

Chain-of-Thought prompting performs well on in-distribution problems and fails predictably as distributional discrepancy increases. This is not a bug — it is the fundamental nature of what CoT is.

The DataAlchemy experiments train LLMs from scratch in controlled environments and probe them along three dimensions of distributional shift (a toy sketch of this setup follows the list):

  1. Task distribution shift — novel tasks whose elements or underlying logical structure were not seen during training
  2. Length distribution shift — reasoning chains substantially longer or shorter than the length range of the training data
  3. Format distribution shift — prompt formulations (even minor syntactic variations) that fall outside the training distribution
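
To make these axes concrete, here is a minimal sketch of the kind of controlled probe this implies, built around a toy symbolic task (alphabet rotations). The task and all names are illustrative assumptions, not the DataAlchemy codebase:

```python
import random
import string

ALPHABET = string.ascii_lowercase

def rot(s, k):
    """Shift every letter of s forward k places in the alphabet."""
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in s)

def make_example(ops, fmt="Apply {ops} to '{x}':"):
    """Compose atomic rotations into one (prompt, chain, answer) triple."""
    x = "".join(random.choices(ALPHABET, k=4))
    chain, cur = [], x
    for k in ops:
        cur = rot(cur, k)
        chain.append(cur)          # one intermediate CoT step per op
    return fmt.format(ops=ops, x=x), chain, cur

# In-distribution regime: one op composition, one prompt format.
train = [make_example([1, 2]) for _ in range(1000)]

# The three probe dimensions, mirroring the list above:
task_shift   = make_example([3, 5])                # unseen op composition
length_shift = make_example([1, 2, 1, 2, 1, 2])    # chain longer than trained
format_shift = make_example([1, 2],
                            fmt="x = '{x}'; ops = {ops}; result?")
```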

In all three dimensions, the pattern is the same: CoT works within distribution, fails outside it. Under moderate shifts, models generate fluent yet logically inconsistent reasoning — the form holds, the logic breaks. This is the "mirage" phenomenon: outputs look like reasoning while producing wrong conclusions.
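
In a controlled setting like the sketch above, the mirage is directly measurable, because every intermediate step has exactly one valid successor: surface form and logical validity can be scored separately, and they come apart under shift. A hedged continuation of the toy sketch (reusing rot from above; this is not the paper's actual metric):

```python
def chain_validity(x, chain, ops):
    """Fraction of CoT steps that follow logically from the previous one.

    A fluent output can have the right number of steps and the right
    vocabulary (form holds) while this score collapses (logic breaks).
    """
    prev, valid = x, 0
    for step, k in zip(chain, ops):
        valid += (step == rot(prev, k))   # does step actually apply op k?
        prev = step
    return valid / len(ops)
```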

The interpretive frame: CoT reflects a structured inductive bias learned from training data, not a generalizable reasoning capability. When a test query is within this inductive bias, CoT activates the appropriate reasoning schema and produces good outputs. When the query falls outside it, the schema mismatch produces confident-sounding nonsense.

As a plug-and-play solution, CoT falls short: performance on CoT benchmarks measures in-distribution capability, so extrapolating to novel tasks, unusual prompt formulations, or unusually long or short reasoning chains is unjustified. Benchmark scores do not predict performance under distribution shift.
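
One cheap, actionable probe follows from this: before trusting a benchmark number for your own use case, perturb the surface form of a prompt and measure answer agreement. A minimal sketch, where query_model and answer_of are hypothetical stand-ins for your inference call and answer extractor:

```python
def format_sensitivity(query_model, question, paraphrases,
                       answer_of=str.strip):
    """Ask semantically identical rewrites; return agreement with the original.

    query_model and answer_of are hypothetical hooks, not a specific API.
    """
    answers = [answer_of(query_model(q)) for q in [question, *paraphrases]]
    return sum(a == answers[0] for a in answers[1:]) / max(len(paraphrases), 1)
```

High agreement does not prove robustness, but low agreement on pure surface rewrites is direct evidence of the format-shift failure mode described above.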

This provides the empirical grounding for "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" — the mirage emerges from imitation under distribution shift: the model keeps imitating the form of reasoning while lacking a schema that could produce valid content.


Source: Reasoning Critiques

Original note: CoT reasoning is distribution-bounded — effectiveness degrades predictably with distributional discrepancy