Agentic and Multi-Agent Systems · LLM Reasoning and Architecture

Are reasoning model failures really about reasoning ability?

Explores whether the performance collapse in large reasoning models (LRMs) reflects actual reasoning limitations or merely execution constraints. Tests whether tool access changes the picture.

Note · 2026-02-23 · sourced from Flaws
Where exactly do reasoning models break?

The "reasoning cliff" — where LRM performance collapses beyond certain complexity thresholds — is reframed as an execution failure, not a reasoning failure. When models are confined to text-only generation, they are forced into the role of "human simulator" (transcribing thousands of discrete steps) rather than "problem solver" (offloading procedural execution to appropriate tools).

The evidence: providing models with explicit algorithms for Tower of Hanoi does not prevent collapse. The model knows the algorithm but cannot execute it autoregressively at scale. This is a tool-use problem, not a reasoning problem. When given code execution access, models solve problems far beyond the supposed cliff.
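
To make the scale concrete, here is a minimal sketch of the standard recursive Hanoi algorithm (textbook code, not taken from the source). Stating the algorithm takes a few lines; solving n disks takes 2^n − 1 moves, and it is that move transcript, not the algorithm, that a text-only model is asked to produce.

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Standard recursive Tower of Hanoi: move n disks from src to dst."""
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
        return moves
    hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on the spare peg
    moves.append((src, dst))            # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top
    return moves

# The algorithm is trivial to state; the transcript is not.
# A solution for n disks is 2**n - 1 moves:
for n in (7, 10, 15):
    print(n, len(hanoi(n)))  # 7 -> 127, 10 -> 1023, 15 -> 32767
```

A single code-execution call yields all 32,767 moves for n = 15; emitting them one token at a time, while tracking peg state across thousands of steps, is where the autoregressive transcript degrades.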

Tool-enabled evaluation reveals an agentic hierarchy:

First-Order Agency — GPT-4o uses tools for straightforward procedural execution. It implements a strategy and runs it. When the strategy fails, it doesn't recover.

Second-Order Agency — o4-mini uses tools for verification and metacognitive self-correction. It begins with a flawed hypothesis, detects the failure through self-generated simulation, discards the failed strategy, and selects an entirely new correct approach. This plan-test-fail-revise loop mirrors deliberate practice.
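
A hypothetical sketch of that plan-test-fail-revise loop (the function names and structure are illustrative assumptions, not an API or method from the source):

```python
from typing import Callable

def solve_with_revision(
    problem: dict,
    strategies: list[Callable[[dict], list]],
    verify: Callable[[dict, list], bool],
) -> list | None:
    """Try candidate strategies, verifying each by simulation.

    First-order agency stops after the first attempt; the loop is what
    marks second-order agency: a failed verification discards the
    strategy, not the problem.
    """
    for strategy in strategies:         # plan: commit to a hypothesis
        candidate = strategy(problem)   # test: execute it via a tool
        if verify(problem, candidate):  # check with self-generated simulation
            return candidate            # verified solution
        # fail -> revise: fall through to the next strategy
    return None  # search exhausted; not evidence the problem is impossible
```

The closing comment matters: exhausting the strategy list says nothing about the problem itself, which is exactly the distinction the text-only failure mode below collapses.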

The most revealing failure mode: when confined to text-only generation, models that can neither maintain state nor exhaust the search space declare solvable problems "logically impossible." They mistake their own execution limitations for fundamental impossibilities, a phenomenon analogous to learned helplessness.

The reframe has practical implications. The question shifts from "Can models reason?" to "What kind of reasoners are they, and under what conditions can they ascend the agentic hierarchy?" Evaluations that prohibit tool use are measuring execution bandwidth, not reasoning capability.



Original note title: reasoning model performance collapses are execution failures not reasoning failures — tool use reveals an agentic hierarchy