What makes some bottlenecks invisible to chain-of-thought training?
This note explores why certain limits in a model's reasoning never get fixed during chain-of-thought (CoT) training. The short version the corpus keeps circling back to: training rewards the *form* of reasoning and proximity to familiar examples, not the work of reasoning itself, so the places where reasoning actually breaks never produce a signal the training loop can act on. They are invisible because nothing in the reward is watching them.
The foundational reason is that CoT is closer to imitation than inference. Several notes converge here: chain-of-thought reproduces familiar reasoning *schemata* from training rather than performing genuine symbolic steps (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"); format and spatial structure shape the output far more than logical content, with training format mattering 7.5× more than domain and even logically invalid prompts working (see "What makes chain-of-thought reasoning actually work?"); and fluent-but-wrong reasoning appears exactly when you push outside the training distribution (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). If a model is rewarded for matching a pattern, then a bottleneck that only shows up off-distribution is structurally invisible: in-distribution the pattern fires and the answer is right, so training sees success where the underlying capability is actually missing.
The clearest demonstration of this blindness is what trace length turns out to measure. You'd assume longer reasoning means the model is working harder on a harder problem, but controlled maze experiments show trace length tracks difficulty only *inside* the training distribution and decouples completely outside it (see "Does longer reasoning actually mean harder problems?"). Length is mostly recall of a familiar schema, not adaptive computation. So a problem that genuinely needs more steps but sits outside the training distribution can get a *short* trace and a confident wrong answer, and training has no way to tell that apart from genuine ease. Out of distribution, the bottleneck (composition the model has not seen) and the signal (trace length) can point in opposite directions.
There's a second, subtler source: where errors actually come from. A token-level analysis finds that *local* memorization (leaning on the immediately preceding tokens rather than the problem) drives up to 67% of reasoning errors, and gets worse precisely as complexity and distribution shift increase (see "Where do memorization errors arise in chain-of-thought reasoning?"). This is a bottleneck that hides inside correct-looking chains: the model isn't reasoning forward, it's autocompleting locally, and the chain reads fine. Relatedly, attention maps reveal that verification and backtracking steps, exactly the self-checking that would catch errors, receive almost no downstream attention; you can prune 75% of reasoning steps with no accuracy loss because most of them were never load-bearing (see "Can reasoning steps be dynamically pruned without losing accuracy?"). The model writes the verification but doesn't *use* it, so training can't reward repairing a check it isn't relying on.
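The pruning idea above can be sketched as a simple top-k selection over per-step attention scores. This is a hypothetical illustration: the source does not say how downstream attention is aggregated per step, so `downstream_attention` is assumed to be a precomputed list with one score per reasoning step, and `prune_steps` and `keep_fraction` are names invented here. A `keep_fraction` of 0.25 corresponds to pruning 75% of steps.

```python
import math


def prune_steps(steps, downstream_attention, keep_fraction=0.25):
    """Keep the steps that later tokens attend to most, in original order.

    steps: list of reasoning-step strings.
    downstream_attention: one score per step (assumed precomputed; how it
        is aggregated from attention maps is outside this sketch).
    keep_fraction: share of steps to keep, in (0, 1].
    """
    if len(steps) != len(downstream_attention):
        raise ValueError("need exactly one attention score per step")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not steps:
        return []
    n_keep = max(1, math.ceil(len(steps) * keep_fraction))
    # Indices of the highest-scoring steps.
    top = sorted(
        range(len(steps)),
        key=lambda i: downstream_attention[i],
        reverse=True,
    )[:n_keep]
    # Restore chain order so the pruned trace still reads sequentially.
    return [steps[i] for i in sorted(top)]


# Toy chain and scores, invented for illustration only.
chain = [
    "restate problem",
    "set up equation",
    "verify: recheck setup",
    "solve for x",
    "backtrack: try alternative",
    "verify: plug x back in",
    "compute final value",
    "state answer",
]
scores = [0.05, 0.30, 0.02, 0.25, 0.01, 0.03, 0.22, 0.12]

print(prune_steps(chain, scores, keep_fraction=0.25))
print(prune_steps(chain, scores, keep_fraction=0.5))
```

In the toy scores the verification and backtracking steps carry the lowest attention, so they are the first to go at any `keep_fraction`. Whether accuracy survives the pruning has to be measured separately by re-running the model on the shortened chain.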
What ties this together is that training's own pressures push *away* from the bottlenecks. RL training drifts toward shorter chains as models improve, because reward favors simplicity: optimal CoT length follows an inverted-U and the productive middle is easy to overshoot (see "Why does chain of thought accuracy eventually decline with length?"). Most of a verbose chain is documentation rather than computation anyway: Chain of Draft matches accuracy at 7.6% of the tokens (see "Can minimal reasoning chains match full explanations?"). The genuinely hard bottlenecks live in the rare cases where a problem *needs* sequential accumulation, such as graph-connectivity-style tasks where sequential CoT achieves exponentially higher accuracy than parallel voting (see "When does sequential reasoning beat parallel voting?"), and those are exactly the cases a length-minimizing, pattern-matching reward is least equipped to protect. The provocative thread worth pulling: some work suggests these blind spots may be addressable by where reasoning is *planted*, treating CoT as an exploratory action rewarded during pretraining rather than bolted on afterward (see "Can chain-of-thought reasoning be learned during pretraining itself?"). That would only matter if standard post-hoc CoT training genuinely can't see what it's missing.
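The inverted-U claim is easy to probe empirically: bucket evaluation results by chain length and look at where accuracy peaks. The following is a minimal sketch under stated assumptions. The `(tokens, correct)` record shape, the function names, and the bucket size are hypothetical, and the toy records are invented to show the shape of the output rather than to reproduce any reported result.

```python
from collections import defaultdict


def accuracy_by_length(records, bucket_size=100):
    """Map each length bucket (keyed by its starting token count) to accuracy.

    records: iterable of (tokens, correct) pairs, where tokens is the chain
        length and correct is a bool for the final answer.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [n_correct, n_total]
    for tokens, correct in records:
        start = (tokens // bucket_size) * bucket_size
        totals[start][0] += int(correct)
        totals[start][1] += 1
    return {start: c / n for start, (c, n) in sorted(totals.items())}


def best_length_bucket(curve):
    """Return the bucket start with the highest accuracy."""
    return max(curve, key=curve.get)


# Toy records, invented for illustration only.
toy_records = [
    (50, False), (80, True),     # very short chains: mixed
    (150, True), (170, True),    # mid-length chains: best
    (240, False), (260, True),   # longer chains: accuracy falls off
    (350, False), (390, False),  # longest chains: worst
]
curve = accuracy_by_length(toy_records)
print(curve)
print(best_length_bucket(curve))
```

If the peak bucket moves toward shorter lengths across training checkpoints, that matches the drift described above. Running the same analysis only on tasks that require sequential accumulation would show whether the shorter chains are costing accuracy where length is actually load-bearing.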
Sources (9 notes)
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.