Can smaller LLMs perform tool-use tasks through modular decomposition?
This explores whether smaller, cheaper LLMs can handle tool-use and agent tasks when the work is broken into modular pieces — rather than asking one big model to do everything end to end.
The corpus says yes, and it points to two different reasons why: one about cost, one about architecture. The cost argument is the bluntest. Most of what an agent actually does is repetitive, well-defined language work, and small models handle those subtasks at roughly 10–30× lower cost, which makes a heterogeneous design (small models by default, large ones only when needed) the economically rational pattern rather than a compromise (see: Can small language models handle most agent tasks?). The architectural argument is more interesting: modularity isn't just cheaper, it changes what a small model can do at all.
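The "small by default, large only when needed" pattern can be sketched as a router that tries the cheap model first and escalates when a validation check fails. This is a minimal illustration, not a method taken from the cited note: `route`, `Result`, the stub models, and the validator are hypothetical names, and the 20× price gap is just an assumed point inside the 10–30× range quoted above.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed relative prices per call; 20x is an illustrative value inside
# the 10-30x range quoted in the text, not a measured figure.
SMALL_COST = 1.0
LARGE_COST = 20.0


@dataclass
class Result:
    output: str
    model: str
    cost: float


def route(task: str,
          small: Callable[[str], str],
          large: Callable[[str], str],
          is_valid: Callable[[str], bool]) -> Result:
    """Try the small model first; escalate only if validation fails."""
    draft = small(task)
    if is_valid(draft):
        return Result(draft, "small", SMALL_COST)
    # The failed small call is still paid for before escalating.
    return Result(large(task), "large", SMALL_COST + LARGE_COST)


# Stub models standing in for real API calls (hypothetical behaviour:
# the small stub gives up on long, open-ended requests).
def small_stub(task: str) -> str:
    return task.upper() if len(task) < 20 else ""


def large_stub(task: str) -> str:
    return task.upper()


def non_empty(output: str) -> bool:
    return bool(output)


tasks = ["format date", "extract name", "a long, open-ended planning request"]
results = [route(t, small_stub, large_stub, non_empty) for t in tasks]
total = sum(r.cost for r in results)
all_large = LARGE_COST * len(tasks)
print([r.model for r in results], total, all_large)
```

The saving depends on how often the small model's output passes validation. Every escalation pays for both calls, so a weak validator or a hard task mix can erase the advantage.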
The key mechanism is separation. When you split the model that *plans* a task from the model that *executes* each step, accuracy improves. Notably, the decomposition skill transfers across domains while the solving skill doesn't (see: Does separating planning from execution improve reasoning accuracy?). LLM Programs push this further by wrapping models inside explicit algorithms that hand each call only the context relevant to its step and hide everything else, which directly addresses the capability and context-window limits that hit small models hardest (see: Can algorithms control LLM reasoning better than LLMs alone?). Cognitive tools take the same isolation idea and show its power vividly: four reasoning operations implemented as sandboxed calls lifted GPT-4.1's AIME 2024 score from 26.7% to 43.3% without any RL training, because enforced isolation elicits reasoning that pure prompting can't reliably trigger (see: Can modular cognitive tools unlock reasoning without training?).
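A minimal sketch of the plan/execute split with information hiding might look like the following. It is an illustration under stated assumptions, not the implementation from any of the cited notes: `run_program`, the `;`-separated plan format, the `facts` dictionary, and both stub models are hypothetical. The point it shows is that ordinary code, not the model, owns the control flow, and each solver call receives only the context for its own step.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]


def run_program(question: str, facts: Dict[str, str],
                planner: Model, solver: Model) -> List[str]:
    """Plan once, then solve each step with only that step's context."""
    # The planner sees the question but none of the supporting facts.
    steps = [s.strip() for s in planner(question).split(";") if s.strip()]
    answers = []
    for step in steps:
        # Information hiding: pass only the fact keyed by this step.
        context = facts.get(step, "")
        answers.append(solver(f"step: {step}\ncontext: {context}"))
    return answers


# Stubs standing in for a planner model and a solver model.
seen_prompts: List[str] = []


def planner_stub(question: str) -> str:
    return "price; quantity"


def solver_stub(prompt: str) -> str:
    seen_prompts.append(prompt)  # record exactly what the solver saw
    return prompt.split("context: ", 1)[1]


facts = {"price": "4 dollars each", "quantity": "3 items", "colour": "red"}
answers = run_program("What is the order total?", facts,
                      planner_stub, solver_stub)
print(answers)
```

Because the loop records every prompt, it is easy to check that the solver never saw the unrelated `colour` fact, which is the property that keeps a small model's context short.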
For tool use specifically, decoupling pays off twice. ReWOO and Chain-of-Abstraction both pull reasoning apart from tool responses, by planning before execution or by using abstract placeholders, which eliminates the quadratic prompt growth and sequential latency that otherwise crush a small model's limited context (see: Can reasoning and tool execution be truly decoupled?). And on the raw skill of calling functions correctly, small models can be trained to match large ones: DPO (direct preference optimization) on a teacher's correct and incorrect examples beats plain supervised fine-tuning precisely because the negative examples target the rigid output-format mistakes where small models stumble (see: Can small models match large models on function calling?). Externalizing reasoning into a structure helps too. GPT-4o mini improved by 29% on GAIA Level 3 tasks by building knowledge-graph triples instead of holding everything in its head (see: Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?), and recursive subtask trees with cache pruning let a single model sustain reasoning past its context limit, replacing what used to need a multi-agent system (see: Can recursive subtask trees overcome context window limits?).
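The plan-before-execution idea can be sketched as a fixed plan whose later steps refer to earlier tool outputs through placeholders. This is a simplified illustration in the spirit of that decoupling, not ReWOO's or Chain-of-Abstraction's actual code: the `#E1`-style placeholder syntax, `execute_plan`, and both toy tools are assumptions, and the plan is hard-coded here where a planner model would normally emit it.

```python
import re
from typing import Callable, Dict, List, Tuple

# A plan step: (evidence id, tool name, argument template).
Step = Tuple[str, str, str]


def execute_plan(plan: List[Step],
                 tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Run every tool call in a fixed plan, filling #E placeholders."""
    evidence: Dict[str, str] = {}
    for eid, tool, template in plan:
        # Replace references such as #E1 with the earlier tool output.
        arg = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], template)
        evidence[eid] = tools[tool](arg)
    return evidence


# Toy tools (hypothetical): a key-value lookup and a multiplier.
def lookup(key: str) -> str:
    return {"width": "6", "height": "7"}[key]


def multiply(arg: str) -> str:
    left, right = arg.split("*")
    return str(int(left) * int(right))


tools = {"lookup": lookup, "multiply": multiply}

# In a real system a planner model would produce this in a single call.
plan: List[Step] = [
    ("#E1", "lookup", "width"),
    ("#E2", "lookup", "height"),
    ("#E3", "multiply", "#E1 * #E2"),
]

evidence = execute_plan(plan, tools)
print(evidence["#E3"])
```

No model call happens between tool calls, so the prompt does not grow with each observation. The trade-off is that the plan cannot adapt to a surprising tool result.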
Modular decomposition isn't free, though, and the corpus quietly marks its limits. Long delegated workflows compound silent errors: frontier models corrupt about 25% of document content over extended relay tasks, and the damage doesn't plateau (see: Do frontier LLMs silently corrupt documents in long workflows?). So chaining many small steps trades one risk (a weak model overwhelmed) for another (errors accumulating across hand-offs). And decomposition can't manufacture capability that isn't there: LLMs plateau around 55–60% constraint satisfaction on genuine constrained optimization regardless of scale (see: Do larger language models solve constrained optimization better?), and they fall back to pattern-matching memorized templates rather than actually running iterative numerical methods (see: Do large language models actually perform iterative optimization?). The takeaway: modularity works because most tool-use 'reasoning' is really orchestration, meaning routing, formatting, and step management that a small model does fine once the hard cognitive load is structured away from it. Where a task needs a capability the model simply lacks, decomposition reorganizes the failure; it doesn't remove it.
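To see why errors across hand-offs matter, here is a back-of-the-envelope model. It rests on a strong assumption that the source does not make: each hand-off independently corrupts the same small fraction of what is still intact. The function names are hypothetical, and the 50 round-trips figure is taken from the source note on document corruption listed below.

```python
def preserved_fraction(per_hop_error: float, hops: int) -> float:
    """Fraction of content intact after `hops` hand-offs (assumed model)."""
    return (1.0 - per_hop_error) ** hops


def implied_per_hop_error(total_loss: float, hops: int) -> float:
    """Per-hop error that would compound to `total_loss` over `hops`."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / hops)


# Roughly 25% total corruption over 50 round-trips, as in the source note.
e = implied_per_hop_error(0.25, 50)
print(f"implied per-hop error: {e:.4f}")
print(f"intact after 50 hops:  {preserved_fraction(e, 50):.3f}")
print(f"intact after 100 hops: {preserved_fraction(e, 100):.3f}")
```

Under this assumed model, a per-hop error of well under 1% is enough to reach the reported loss, and doubling the chain length leaves only about 56% intact. This is a property of the toy model, not a measurement, but it shows why adding more small steps is not automatically safer.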
Sources (11 notes)
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.