Can reasoning and tool execution run in parallel?
Standard LLM tool use halts for each response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?
Standard tool-augmented LLM architectures interleave reasoning and tool calls: the model halts for each tool response, then resumes with the full prior context re-fed into the prompt (because black-box LLM APIs are stateless). This creates two compounding costs — prompt redundancy that grows quadratically with reasoning steps, and sequential inference latency that accumulates tool response delays.
Two architectures converge on the same solution from different angles:
ReWOO (Planner/Worker/Solver): The Planner produces a complete reasoning blueprint — all planned tool calls — before any tool is executed. The Worker executes the plan in batch. The Solver synthesizes plan + evidence into an answer. No tool-response-dependent re-feeding occurs between steps. Token usage drops dramatically because prior context is not re-fed on each API call.
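The three-role split can be sketched in a few lines. The plan format, tool names, and evidence-slot labels (`#E1`, `#E2`) below are illustrative assumptions, not ReWOO's exact serialization:

```python
# Minimal ReWOO-style sketch (hypothetical plan format and stand-in tools).
# Planner emits the whole plan up front; Worker executes every tool call
# without re-invoking the LLM; Solver sees plan + evidence in one final call.

def search(query: str) -> str:      # stand-in for a retrieval tool
    return f"<results for {query!r}>"

def calculator(expr: str) -> str:   # stand-in for a math tool
    return str(eval(expr))          # demo only; don't eval untrusted input

TOOLS = {"Search": search, "Calculator": calculator}

# Planner output: each step names a tool, its input, and an evidence slot.
plan = [
    ("#E1", "Search", "GDP of France 2023"),
    ("#E2", "Calculator", "2 + 2"),
]

# Worker: run all tool calls in batch -- no LLM call between steps.
evidence = {slot: TOOLS[tool](arg) for slot, tool, arg in plan}

# Solver: a single final LLM call would receive plan + evidence together.
solver_prompt = "\n".join(f"{slot}: {out}" for slot, out in evidence.items())
print(solver_prompt)
```

The key structural point is that `evidence` is built without any intermediate LLM round trips: the only two model calls are the Planner (before the plan) and the Solver (after all evidence is in).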
Chain-of-Abstraction (CoA): The LLM generates reasoning chains with abstract placeholders (y1, y2, y3) rather than concrete values. Tools fill in the placeholders in parallel. Crucially: the LLM can start generating the next abstract reasoning chain while the tool fills the current one. Sequential waiting is replaced by pipeline parallelism.
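A toy version of the placeholder-filling step, with an assumed `[yN]` bracket syntax and a stand-in `fill` tool; the thread pool stands in for the pipeline parallelism, since the placeholders resolve concurrently while the model could already be decoding the next abstract chain:

```python
# Chain-of-Abstraction sketch (hypothetical chain format): the model emits
# abstract chains containing [yN] placeholders, and tools fill them in
# parallel rather than blocking generation on each tool response.
from concurrent.futures import ThreadPoolExecutor
import re

def fill(placeholder: str) -> str:
    # Stand-in for a real tool call (search, calculator, ...).
    return {"y1": "42", "y2": "7"}[placeholder]

chain = "result is [y1] plus [y2]"

with ThreadPoolExecutor() as pool:
    slots = re.findall(r"\[(y\d+)\]", chain)
    # All placeholders are resolved concurrently; meanwhile the LLM could
    # already be generating the next abstract reasoning chain.
    values = dict(zip(slots, pool.map(fill, slots)))

# Reify the abstract chain into a concrete one.
concrete = re.sub(r"\[(y\d+)\]", lambda m: values[m.group(1)], chain)
print(concrete)  # "result is 42 plus 7"
```

Because the chain refers to `y1` and `y2` symbolically, its text never depends on when the tools return, which is exactly the decoupling the note describes.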
The synthesis: both architectures achieve the same goal — removing the dependency between reasoning steps and tool responses — but through different mechanisms. ReWOO separates by planning horizon; CoA separates by abstracting over content.
This is distinct from the framing in "How should we balance parallel versus sequential compute at test time?", which concerns token-budget allocation. Architectural decoupling reduces both prompt redundancy (cost) and execution latency (speed) regardless of the total token budget.
The implication for agentic system design: sequential tool-call loops are an architectural default, not a necessity. Planning-before-execution and abstract-placeholder approaches each demonstrate that reasoning and retrieval/computation can be parallelized, dramatically reducing inference costs in production.
Source: Reasoning Architectures
Related concepts in this collection
- How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem? (Architectural decoupling is a third option that changes the terms of the trade-off.)
- Can retrieval be scaled like reasoning at test time? Standard RAG retrieves once, but multi-hop tasks need adaptive retrieval. Can we train models to plan retrieval chains and vary their length at test time to improve accuracy, the way test-time scaling works for reasoning? (CoRAG interleaves retrieval and generation iteratively; contrast with CoA, which separates them.)
- When should retrieval actually help versus hurt reasoning? Retrieval augmentation seems universally beneficial, but does it always improve reasoning? This explores whether some reasoning steps benefit from internal knowledge alone, and when external retrieval introduces harmful noise rather than useful information. (DeepRAG makes sequential decisions per step; contrast with CoA's parallel approach.)
- Can reasoning stay grounded without external feedback loops? Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation. (ReAct is the sequential baseline these architectures improve upon.)
Original note title: decoupling reasoning from tool observations eliminates prompt redundancy and enables parallel tool execution