LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Note · 2026-04-07 · sourced from Flaws
How should we allocate compute budget at inference time? What kind of thing is an LLM really?

The standard narrative around reasoning-model failures — from Shojaee et al.'s Illusion of Thinking onward — frames the phenomenon as a "complexity threshold" or "step threshold": models handle short reasoning chains but break on long ones. Something about the quantity of reasoning breaks down past some limit. The Chollet-Kambhampati exchange reframes this at the instance level, and the reframing matters for what "improving reasoning" can mean.

Chollet's claim: "Many people assume that LRM reasoning breaks down past a certain 'complexity' or 'number of steps' threshold. This is incorrect. It breaks down past an unfamiliarity threshold. And that threshold is very low. There is no limit to the complexity of tasks you can solve with these models, no limit to the number of steps in the reasoning chains they can master — as long as they have been covered during training/tuning. However, show them something unfamiliar, even very simple and requiring just a handful of reasoning steps (e.g., an ARC 2 task), and they will fail." The apparent complexity threshold in Tower of Hanoi exists because Tower of Hanoi is a familiar problem — the step count at which models fail corresponds to the step count at which instances stop appearing in their training data. Scaling step count is an indirect way of generating novelty, not an independent difficulty axis.
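The Tower of Hanoi case makes the coupling between size and novelty concrete: the optimal solution for N disks takes 2^N - 1 moves, so instance size maps directly onto reasoning-chain length, and fully worked traces of large instances become vanishingly rare in training text. A minimal sketch (illustrative only, not from the exchange):

```python
# Illustration of why scaling N scales novelty: the shortest Tower of Hanoi
# solution for n disks has 2**n - 1 moves, so a 7-disk puzzle needs 127 worked
# moves and a 15-disk puzzle needs 32,767. Complete traces of the latter are
# far rarer in any corpus than traces of the former.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move sequence for n disks from src to dst."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller disks on the spare peg
    moves.append((src, dst))             # move the largest disk to its target
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
    return moves

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        seq = hanoi(n)
        assert len(seq) == 2**n - 1      # closed form for the optimal length
        print(f"{n:>2} disks -> {len(seq):>6} moves in the shortest solution")
```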

Kambhampati adds the systematic observation: LRMs lose accuracy as instances of familiar problems grow larger because they don't learn algorithms — they fit instance-based patterns. The two agree on the substantive claim even while they initially disagreed on terminology: "We don't actually disagree, we all know that Transformers don't fit generalizable algorithms, they fit instance-based patterns. It doesn't change the fact that the crux of the problem is familiar vs unfamiliar (at the instance level, not at the abstract 'task' level)."

The reframing has sharp implications. First, the intuition that "just scale more reasoning tokens" can fix reasoning failures is structurally misguided. If reasoning failure is instance-novelty-driven, then scaling tokens — which extends the reasoning chain — helps only if the longer chain covers more familiar instance territory. It does nothing for a genuinely unfamiliar instance, no matter how short the required chain. Second, the natural evaluation target shifts. Benchmarks that scale complexity (Tower of Hanoi with larger N, River Crossing with more pairs) are generating instance novelty indirectly through size. ARC 2 and similar benchmarks generate instance novelty directly through task structure change. The latter is a better measure of whether the model is fitting algorithms or fitting patterns. Third, the definition of "familiarity" matters, and Chollet makes it precise: "outside of the classroom, in the real world, you are never exposed to neatly defined 'tasks' and step-by-step algorithms, you are only exposed to situations. Intelligence is the ability to infer generalizable algorithms from situations (instances) only. So the only reasonable definition of familiarity/novelty is at the situation/instance level. If you define it with respect to algorithms you are assuming the problem has already been solved."

This aligns with and sharpens several existing notes. Do foundation models learn world models or task-specific shortcuts? identified task-specific heuristics as the mechanism; Chollet and Kambhampati identify the corresponding failure condition — the heuristics work where they have instance coverage and fail where they do not. Do transformers actually learn systematic compositional reasoning? provides the mathematical substrate: if compositional reasoning is subgraph matching, then novelty at the subgraph level is what breaks the mechanism. Does chain-of-thought reasoning reveal genuine inference or pattern matching? extends this to the performance-vs-reasoning gap: CoT imitates the form of abstract reasoning without performing it, which is exactly why it handles familiar problems at scale but fails on unfamiliar problems at low complexity.

The reframing also creates a tension with some optimistic RL results. Can extended RL training discover reasoning strategies base models cannot? shows that extended RL can produce strategies not present in the base model. If reasoning is purely instance-pattern-fitting, where does the novelty in ProRL come from? A reconciliation: RL-discovered "novel strategies" may still be instance-family novelty — the model learns to combine previously separate instance patterns in new ways, producing what looks like strategy but is still pattern composition. This would be genuine progress within the instance-pattern regime without escaping it. A test: take a ProRL-extended model and evaluate it on ARC 2. If the instance-novelty thesis is right, ProRL gains should not transfer to instance-level novelty challenges.
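The proposed test is easy to phrase as a harness. A hypothetical sketch, assuming an inference function query_model(model_name, prompt) and two task suites (scaled-up familiar problems versus ARC-2-style novel-structure problems); all names here are placeholders rather than an actual API:

```python
# Hypothetical transfer test for the ProRL question above. `query_model` and
# the task suites are stand-ins you would replace with real inference and data.

from statistics import mean

def accuracy(model_name, tasks, query_model):
    """Fraction of tasks answered correctly under exact-match grading."""
    return mean(query_model(model_name, t["prompt"]) == t["answer"] for t in tasks)

def transfer_report(base, prorl, familiar_scaled, novel_structure, query_model):
    """Compare where the ProRL gains show up.

    familiar_scaled:  familiar tasks pushed to larger sizes (e.g. bigger Hanoi)
    novel_structure:  short tasks with genuinely new structure (ARC-2-like)
    """
    report = {}
    for suite_name, tasks in (("familiar-scaled", familiar_scaled),
                              ("novel-structure", novel_structure)):
        b = accuracy(base, tasks, query_model)
        p = accuracy(prorl, tasks, query_model)
        report[suite_name] = {"base": b, "prorl": p, "gain": p - b}
    # Instance-novelty prediction: a clear gain on the familiar-scaled suite,
    # little or none on the novel-structure suite, even at short chain lengths.
    return report
```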

The practical implication for evaluation design is straightforward. Current benchmarks that scale complexity to induce failure are indirectly measuring instance coverage in training data. Benchmarks that induce instance novelty at fixed short complexity — ARC 2, held-out reasoning tasks with genuinely new structure — measure what matters: whether the model is doing anything other than pattern lookup.


Source: Flaws

Original note title: LRM reasoning breakdown is driven by instance-level unfamiliarity not task-level complexity — there is no limit to reasoning chain length as long as the instances were covered during training