Do language models fail at reasoning due to complexity or novelty?
Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.
The standard narrative around reasoning-model failures — from Shojaee et al.'s Illusion of Thinking onward — frames the phenomenon as a "complexity threshold" or "step threshold": models handle short reasoning chains but break on long ones. Something about the quantity of reasoning breaks down past some limit. The Chollet-Kambhampati exchange reframes this at the instance level, and the reframing matters for what "improving reasoning" can mean.
Chollet's claim: "Many people assume that LRM reasoning breaks down past a certain 'complexity' or 'number of steps' threshold. This is incorrect. It breaks down past an unfamiliarity threshold. And that threshold is very low. There is no limit to the complexity of tasks you can solve with these models, no limit to the number of steps in the reasoning chains they can master — as long as they have been covered during training/tuning. However, show them something unfamiliar, even very simple and requiring just a handful of reasoning steps (e.g., an ARC 2 task), and they will fail." The apparent complexity threshold in Tower of Hanoi exists because Tower of Hanoi is a familiar problem — the step count at which models fail corresponds to the step count at which instances stop appearing in their training data. Scaling step count is an indirect way of generating novelty, not an independent difficulty axis.
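For concreteness, the step-count blow-up is easy to see: the optimal Tower of Hanoi solution takes 2^N - 1 moves, so raising the disc count is an exponential novelty knob rather than an independent difficulty axis. A minimal sketch (standard move-count arithmetic, not code from the exchange):

```python
# Minimal sketch: optimal Tower of Hanoi solution length is 2^n - 1 moves,
# so scaling the disc count scales the required reasoning chain exponentially
# and quickly pushes instances past anything plausibly covered in training.
def hanoi_moves(n_discs: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n_discs discs."""
    return 2 ** n_discs - 1

for n in range(3, 13):
    print(f"{n:2d} discs -> {hanoi_moves(n):5d} optimal moves")
# 7 discs already require 127 moves; 12 discs require 4095.
```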
Kambhampati adds the systematic observation: LRMs lose accuracy as familiar-problem instances grow because they don't learn algorithms — they fit instance-based patterns. The two agree on the substantive claim even though they initially disagreed over terminology: "We don't actually disagree, we all know that Transformers don't fit generalizable algorithms, they fit instance-based patterns. It doesn't change the fact that the crux of the problem is familiar vs unfamiliar (at the instance level, not at the abstract 'task' level)."
The reframing has sharp implications. First, the intuition that "just scale more reasoning tokens" will fix reasoning failures is structurally misguided. If reasoning failure is driven by instance novelty, then scaling tokens, which only lengthens the reasoning chain, helps insofar as the longer chain covers more familiar instance territory; it does nothing for a genuinely unfamiliar instance, no matter how few steps it requires. Second, the natural evaluation target shifts. Benchmarks that scale complexity (Tower of Hanoi with larger N, River Crossing with more pairs) generate instance novelty indirectly, through size. ARC 2 and similar benchmarks generate instance novelty directly, through changes in task structure. The latter is a better measure of whether the model is fitting algorithms or fitting patterns. Third, the definition of "familiarity" matters, and Chollet makes it precise: "outside of the classroom, in the real world, you are never exposed to neatly defined 'tasks' and step-by-step algorithms, you are only exposed to situations. Intelligence is the ability to infer generalizable algorithms from situations (instances) only. So the only reasonable definition of familiarity/novelty is at the situation/instance level. If you define it with respect to algorithms you are assuming the problem has already been solved."
This aligns with and sharpens several existing notes. Do foundation models learn world models or task-specific shortcuts? identified task-specific heuristics as the mechanism; Chollet and Kambhampati identify the corresponding failure condition: the heuristics work where they have instance coverage and fail where they do not. Do transformers actually learn systematic compositional reasoning? provides the mathematical substrate: if compositional reasoning is subgraph matching, then novelty at the subgraph level is what breaks the mechanism. Does chain-of-thought reasoning reveal genuine inference or pattern matching? extends this to the performance-vs-reasoning gap: CoT imitates the form of abstract reasoning without performing it, which is exactly why it handles familiar problems at scale but fails on unfamiliar problems at low complexity.
The reframing also creates a tension with some optimistic RL results. Can extended RL training discover reasoning strategies base models cannot? shows that extended RL can produce strategies not present in the base model. If reasoning is purely instance-pattern-fitting, where does the novelty in ProRL come from? A reconciliation: RL-discovered "novel strategies" may still be instance-family novelty — the model learns to combine previously separate instance patterns in new ways, producing what looks like strategy but is still pattern composition. This would be genuine progress within the instance-pattern regime without escaping it. A test: take a ProRL-extended model and evaluate it on ARC 2. If the instance-novelty thesis is right, ProRL gains should not transfer to instance-level novelty challenges.
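A minimal sketch of that transfer test, with the caveat that everything named here is a hypothetical placeholder rather than a real API or dataset: `base_solve` and `prorl_solve` stand in for the base and ProRL-extended models, and the two task sets stand in for complexity-scaled familiar tasks versus ARC-2-style fixed-complexity novel tasks. The point is only the shape of the comparison the instance-novelty thesis predicts.

```python
# Hedged sketch of the proposed transfer test, not a real harness.
# base_solve / prorl_solve are stand-ins for the base and ProRL-extended models;
# familiar_scaled / novel_fixed_size are stand-ins for complexity-scaled familiar
# tasks and fixed-complexity novel tasks (ARC-2-style), respectively.
from typing import Callable, Iterable, Tuple

Task = Tuple[str, str]  # (prompt, expected answer)

def accuracy(solve: Callable[[str], str], tasks: Iterable[Task]) -> float:
    """Exact-match accuracy of a solver over (prompt, answer) pairs."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    return sum(solve(p).strip() == a for p, a in tasks) / len(tasks)

def transfer_report(base_solve: Callable[[str], str],
                    prorl_solve: Callable[[str], str],
                    familiar_scaled: Iterable[Task],
                    novel_fixed_size: Iterable[Task]) -> dict:
    """Compare ProRL gains on the two axes: the instance-novelty thesis
    predicts a gain on familiar_scaled and little or none on novel_fixed_size."""
    familiar_scaled = list(familiar_scaled)
    novel_fixed_size = list(novel_fixed_size)
    return {
        "familiar_scaled_gain":
            accuracy(prorl_solve, familiar_scaled) - accuracy(base_solve, familiar_scaled),
        "novel_fixed_size_gain":
            accuracy(prorl_solve, novel_fixed_size) - accuracy(base_solve, novel_fixed_size),
    }
```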
The practical implication for evaluation design is straightforward. Current benchmarks that scale complexity to induce failure are indirectly measuring instance coverage in training data. Benchmarks that induce instance novelty at fixed short complexity — ARC 2, held-out reasoning tasks with genuinely new structure — measure what matters: whether the model is doing anything other than pattern lookup.
Source: Flaws
Related concepts in this collection
- Do foundation models learn world models or task-specific shortcuts?
When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
the mechanism beneath the phenomenon; heuristics work within instance coverage and fail outside
- Do transformers actually learn systematic compositional reasoning?
Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.
the mathematical substrate: subgraph matching is instance-level pattern matching
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
CoT imitates form without performing inference; unfamiliarity reveals the imitation
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
the apparent threshold may be unfamiliarity not tokens
- Why do reasoning LLMs fail at deeper problem solving?
Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
wandering may be the novelty response
- Does the reasoning cliff depend on how we test models?
If language models hit a capability wall in text-only reasoning tasks, does that wall disappear when they can use tools? What does this reveal about what we're actually measuring?
complementary reframing at the execution layer
- Can extended RL training discover reasoning strategies base models cannot?
Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.
apparent tension; possibly resolved as instance-family novelty rather than algorithm novelty
- Can neural networks learn compositional skills without symbolic mechanisms?
Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.
partial counterpoint: scaling data closes some generalization gaps, but instance novelty remains the boundary
- Can identical outputs hide broken internal representations?
Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
FER is the representation-level parallel; identical benchmark scores can mask different instance coverage
- Can transformers improve exponentially by learning from their own correct solutions?
Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.
subtle counterpoint: length generalization within a familiar task family (addition at longer digit counts) still extends beyond initial instance coverage through iteration; but the instance type stays familiar, so this may be "same-algorithm novelty" that the thesis accommodates
- Are reasoning model failures really about reasoning ability?
Explores whether the performance collapse in language reasoning models reflects actual reasoning limitations or merely execution constraints. Tests whether tool access changes the picture.
alternative diagnosis at the execution layer
Original note title: LRM reasoning breakdown is driven by instance-level unfamiliarity not task-level complexity — there is no limit to reasoning chain length as long as the instances were covered during training