How does separating decomposition from execution improve multi-step reasoning?

This explores why splitting the 'figure out the steps' part of reasoning from the 'actually do each step' part makes LLMs solve multi-step problems better — and what that separation buys you across architectures.

The corpus has a surprisingly consistent answer: planning and executing interfere with each other inside a single model, and pulling them apart removes that interference while producing skills that travel. The cleanest evidence is that a separate decomposer-plus-solver beats a monolithic model, and the interesting twist is that the *decomposition* ability transfers across domains while the *solving* ability does not (see "Does separating planning from execution improve reasoning accuracy?"). That asymmetry is the real prize: knowing how to break a problem down is a general skill worth isolating and reusing, whereas execution stays task-bound.
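The decomposer-plus-solver split can be sketched in a few lines. This is a minimal illustration, not the setup from the cited note: `solve_modular`, `toy_decompose`, and `toy_solve` are hypothetical names, and the toy stubs stand in for two separately prompted or trained models.

```python
from typing import Callable, List

# Hypothetical interfaces for two separate models (assumptions, not a real API).
Decomposer = Callable[[str], List[str]]
Solver = Callable[[str, List[str]], str]

def solve_modular(question: str, decompose: Decomposer, solve: Solver) -> List[str]:
    """Plan once, then execute each sub-question with only prior answers as context."""
    sub_questions = decompose(question)   # planning stage: nothing is solved here
    answers: List[str] = []
    for sub_q in sub_questions:           # execution stage: nothing is re-planned here
        answers.append(solve(sub_q, answers))
    return answers

# Toy stubs so the sketch runs without any model.
def toy_decompose(question: str) -> List[str]:
    return ["12 * 3", "prev + 4"]

def toy_solve(sub_q: str, prior: List[str]) -> str:
    # Substitute the previous answer for the "prev" marker, then do one operation.
    expr = sub_q.replace("prev", prior[-1]) if prior else sub_q
    left, op, right = expr.split()
    return str(int(left) * int(right) if op == "*" else int(left) + int(right))

result = solve_modular("What is 12 * 3, plus 4?", toy_decompose, toy_solve)
print(result)
```

Because the decomposer only ever sees the question and the solver only ever sees one sub-question plus earlier answers, either side can be swapped out independently, which is the property the transfer result depends on.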

The same logic shows up wherever 'execution' means calling a tool. When reasoning and tool observations are interleaved in one stream, cumulative prompt tokens grow quadratically with the number of steps and every step waits on the last; decoupling the plan from the tool responses (ReWOO's plan-before-execute, Chain-of-Abstraction's placeholder variables) kills the redundancy and unlocks parallelism without hurting quality (see "Can reasoning and tool execution be truly decoupled?"). A related move treats the algorithm, not the model, as the planner: LLM Programs put each step inside explicit control flow and feed the model only step-relevant context, turning a tangled reasoning task into modular, debuggable sub-calls (see "Can algorithms control LLM reasoning better than LLMs alone?"). The shared insight is that a model forced to hold the whole plan *and* the current step in its head does both worse.
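A rough sketch of the plan-before-execute idea follows. The `#E1`-style placeholder syntax, the toy tools, and the token cost model are all assumptions made for illustration; they are not taken from the ReWOO or Chain-of-Abstraction implementations.

```python
import re
from typing import Callable, Dict, List, Tuple

# A plan is written once, up front: (variable, tool name, argument template).
Plan = List[Tuple[str, str, str]]

def execute_plan(plan: Plan, tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Fill placeholders such as #E1 with earlier tool outputs; the planner is never re-invoked."""
    evidence: Dict[str, str] = {}
    for var, tool_name, template in plan:
        arg = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], template)
        evidence[var] = tools[tool_name](arg)
    return evidence

# Assumed cost model: every prompt re-sends a fixed context, and each
# observation costs the same number of tokens.
def interleaved_tokens(steps: int, context: int, obs: int) -> int:
    # Step k re-sends the context plus all k earlier observations.
    return sum(context + k * obs for k in range(steps))

def decoupled_tokens(steps: int, context: int, obs: int) -> int:
    # One planner prompt, then one solver prompt that sees every observation once.
    return context + (context + steps * obs)

# Toy tools (hypothetical) so the sketch runs offline.
tools = {"length": lambda s: str(len(s)), "square": lambda s: str(int(s) ** 2)}
plan = [("#E1", "length", "decompose"), ("#E2", "square", "#E1")]
evidence = execute_plan(plan, tools)
print(evidence, interleaved_tokens(8, 100, 50), decoupled_tokens(8, 100, 50))
```

Under this simplified model the interleaved total grows with the square of the step count while the decoupled total grows linearly. Plan steps that share no placeholders could also be dispatched concurrently, which this sequential sketch does not attempt.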

There's a deeper, almost counterintuitive theme here about memory. Several notes argue that accumulated history is the enemy, and decomposition is what lets you throw it away safely. Atom of Thoughts breaks problems into a DAG and contracts it so each state depends only on the current sub-problem, not the trail behind it: a 'memoryless' reasoning that stays coherent (see "Can reasoning systems forget history without losing coherence?"). Recursive subtask trees push this further, pruning the KV cache aggressively so a single model can sustain reasoning well past its context limit and even stand in for a multi-agent system (see "Can recursive subtask trees overcome context window limits?"). Separation isn't just cleaner; it's what makes forgetting non-destructive.
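The contraction idea can be illustrated with a toy DAG of arithmetic sub-questions. This is only loosely modeled on the description above and is not the Atom of Thoughts algorithm itself: the `Dag` layout, the `Q1`-style names, the single-final-node assumption, and the arithmetic `toy_solve` are all invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# node name -> (expression text, names of nodes it still depends on)
Dag = Dict[str, Tuple[str, List[str]]]

def contract(dag: Dag, solve: Callable[[str], str]) -> Dag:
    """Solve every dependency-free node, inline its answer, and drop the node."""
    solved = {name: solve(text) for name, (text, deps) in dag.items() if not deps}
    next_dag: Dag = {}
    for name, (text, deps) in dag.items():
        if name in solved:
            continue
        for dep in deps:
            if dep in solved:
                text = text.replace(dep, solved[dep])
        next_dag[name] = (text, [d for d in deps if d not in solved])
    return next_dag

def run(dag: Dag, solve: Callable[[str], str]) -> str:
    # Assumes the DAG has exactly one final node that nothing else depends on.
    while any(deps for _, deps in dag.values()):
        dag = contract(dag, solve)   # the new state replaces the old one entirely
    (text, _), = dag.values()
    return solve(text)

def toy_solve(text: str) -> str:
    left, op, right = text.split()
    a, b = int(left), int(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])

dag: Dag = {
    "Q1": ("2 + 3", []),
    "Q2": ("4 + 2", []),
    "Q3": ("Q1 * Q2", ["Q1", "Q2"]),
    "Q4": ("Q3 - 1", ["Q3"]),
}
final = run(dag, toy_solve)
print(final)
```

Each call to `contract` returns a self-contained state: solved nodes vanish and their answers live on only inside the remaining expressions, so nothing from earlier rounds needs to be carried forward.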

Decomposition also fixes a specific failure mode of single-stream reasoning: wandering. Reasoning models tend to explore like tourists, abandoning good paths too early ('underthinking') and chasing invalid ones (see "Why do reasoning models abandon promising solution paths?"). Generating explicit *abstractions* first and then solving against them enforces structured breadth-first exploration, and spending test-time compute on diverse abstractions beats just sampling more solutions (see "Can abstractions guide exploration better than depth alone?"). The plan layer becomes a scaffold that keeps execution from drifting.
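One way to picture the abstraction-first budget split is below. It is a sketch under stated assumptions: the function names are hypothetical, the stubs replace real generators, and the majority vote at the end is an aggregation choice made for this example, not something the cited note specifies.

```python
from collections import Counter
from typing import Callable

def answer_with_abstractions(
    problem: str,
    propose: Callable[[str, int], str],      # hypothetical abstraction generator
    solve: Callable[[str, str, int], str],   # hypothetical abstraction-conditioned solver
    n_abstractions: int,
    samples_each: int,
) -> str:
    """Spread a fixed sample budget across several abstractions, then majority-vote."""
    votes: Counter = Counter()
    for i in range(n_abstractions):
        abstraction = propose(problem, i)    # breadth: a new high-level approach
        for j in range(samples_each):        # depth: solutions under that approach
            votes[solve(problem, abstraction, j)] += 1
    return votes.most_common(1)[0][0]

# Deterministic toy stubs so the sketch runs without any model.
HINTS = ["use parity", "bound the sum", "induct on n"]

def toy_propose(problem: str, i: int) -> str:
    return HINTS[i % len(HINTS)]

def toy_solve(problem: str, abstraction: str, j: int) -> str:
    return "41" if abstraction == "bound the sum" else "42"

best = answer_with_abstractions("toy problem", toy_propose, toy_solve, 3, 2)
print(best)
```

With a budget of six samples, the same loop covers plain solution sampling (`n_abstractions=1, samples_each=6`) and the breadth-first allocation (`n_abstractions=3, samples_each=2`), which makes the two strategies easy to compare on equal compute.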

One caveat worth carrying away: separation helps because a lot of what fills a single reasoning trace isn't computation at all. Chain of Draft matches verbose chain-of-thought accuracy at 7.6% of the tokens; the other 92.4% was style and documentation, not work (see "Can minimal reasoning chains match full explanations?"). Dynamic intervention can prune ~75% of steps (the verification and backtracking ones almost nothing downstream attends to) with accuracy intact (see "Can reasoning steps be dynamically pruned without losing accuracy?"). If much of in-line reasoning is padding, it makes sense that promoting the genuinely structural part, the decomposition, into its own stage is where the real leverage lives.
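Attention-based pruning of this kind can be sketched as a simple top-k filter. The step texts and attention scores below are made up for illustration, and `prune_steps` is a hypothetical helper, not the PI framework's actual procedure.

```python
from typing import List, Tuple

def prune_steps(steps: List[Tuple[str, float]], keep_fraction: float) -> List[str]:
    """Keep the steps with the highest downstream attention, in their original order."""
    n_keep = max(1, round(len(steps) * keep_fraction))
    ranked = sorted(range(len(steps)), key=lambda i: steps[i][1], reverse=True)
    kept = sorted(ranked[:n_keep])           # restore original ordering
    return [steps[i][0] for i in kept]

# (step text, assumed downstream attention score); the values are invented.
trace = [
    ("set up equation", 0.90),
    ("double-check setup", 0.05),
    ("solve for x", 0.80),
    ("backtrack and retry", 0.02),
]
pruned = prune_steps(trace, keep_fraction=0.5)
print(pruned)
```

In this toy trace the verification and backtracking steps carry the lowest scores, so they are the ones dropped, mirroring the pattern the note describes.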

Sources (9 notes)

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can reasoning and tool execution be truly decoupled?

ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The removed 92.4% of tokens served style and documentation, not computation.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
