How do agents discover and select which tools to invoke?
This explores how agents figure out which tools exist and pick the right one to call — and the corpus reveals a live debate over whether that choice should happen up front, on the fly, or be driven by the model itself.
The corpus frames tool selection less as a solved retrieval problem and more as an open design question with several competing answers. The most interesting split is *when* selection happens. The traditional approach pre-retrieves a fixed tool set before the task starts, but DeepAgent argues that discovering tools dynamically during execution works better for long, multi-step tasks: the agent keeps a global view and can change strategy mid-run instead of being locked into whatever it grabbed at the outset (see "Can agents discover tools dynamically instead of pre-selecting them?"). Because the tool space is often too large to enumerate in advance, deferring the choice becomes a feature, not a compromise.
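To make the timing difference concrete, here is a minimal sketch. It is not DeepAgent's implementation: the registry, the tool names, and the word-overlap scorer are all hypothetical stand-ins for a real retriever. It only shows what changes when the lookup runs once on the task description versus once per sub-goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical registry; a real tool space would be far too large to list here.
REGISTRY = [
    Tool("flight_search", "search flights between two cities"),
    Tool("hotel_search", "search hotels in a city"),
    Tool("currency_convert", "convert an amount between currencies"),
]


def search_tools(query: str, k: int = 1) -> list[Tool]:
    """Rank tools by word overlap with the query (toy stand-in for a retriever)."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in REGISTRY]
    scored = [(score, tool) for score, tool in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


def pre_retrieve(task: str, k: int = 1) -> list[Tool]:
    """Up-front selection: one lookup on the task text, fixed for the whole run."""
    return search_tools(task, k)


def discover_per_step(subgoals: list[str]) -> list[str]:
    """Dynamic selection: a fresh lookup for each sub-goal as it comes up."""
    chosen = []
    for goal in subgoals:
        hits = search_tools(goal, k=1)
        chosen.append(hits[0].name if hits else "none")
    return chosen


task = "book flights to lisbon"
subgoals = [
    "search flights to lisbon",
    "search hotels in lisbon",
    "convert the amount to euros",
]
print([t.name for t in pre_retrieve(task)])  # only the flight tool is ever visible
print(discover_per_step(subgoals))           # a different tool surfaces at each step
```

The up-front lookup never sees the hotel or currency sub-goals, because they are not in the original task text. The per-step lookup finds them only once the run reaches that point.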
A second thread shifts *who* does the selecting. Rather than a passive retriever matching the user's phrasing to tool descriptions, MCP-Zero lets the model emit structured tool requests itself, refining what it needs across turns as its reasoning unfolds (see "Can models decide better than retrievers which tools to use?"). This sidesteps a quiet failure mode: the mismatch between how a user casually describes a need and the formal vocabulary a tool is registered under. Mid-reasoning, the model knows better than a one-shot semantic match what it is actually after.
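The vocabulary mismatch is easy to reproduce in a toy setting. The sketch below is an assumption-heavy illustration, not MCP-Zero's actual request format or matcher: the `ToolRequest` fields, the JSON shape, the tool names, and the word-overlap scoring are all invented for the example, and the model's output is a hard-coded string.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolRequest:
    domain: str      # hypothetical field: broad area the model thinks it needs
    capability: str  # hypothetical field: what the tool should do, in formal terms


# Hypothetical registry: tool name -> formally worded description.
TOOLS = {
    "get_forecast": "retrieve weather forecast for a location",
    "get_stock_quote": "retrieve latest stock quote for a ticker symbol",
}


def overlap(text: str, description: str) -> int:
    """Count shared words (toy stand-in for semantic similarity)."""
    return len(set(text.lower().split()) & set(description.lower().split()))


def best_match(text: str) -> Optional[str]:
    """Return the best-scoring tool name, or None if nothing overlaps at all."""
    best = max(TOOLS, key=lambda name: overlap(text, TOOLS[name]))
    return best if overlap(text, TOOLS[best]) > 0 else None


def parse_request(model_output: str) -> ToolRequest:
    """Parse the structured request the model is assumed to emit as JSON."""
    data = json.loads(model_output)
    return ToolRequest(domain=data["domain"], capability=data["capability"])


user_utterance = "will I need an umbrella tomorrow"
# Stubbed model output; a real system would get this from the LLM mid-reasoning.
model_output = '{"domain": "weather", "capability": "weather forecast for a location"}'

print(best_match(user_utterance))                          # retriever on raw phrasing
print(best_match(parse_request(model_output).capability))  # match on model's request
```

Matching on the raw utterance finds nothing, because "umbrella" never appears in any tool description. The model's own request is already phrased in the registry's vocabulary, so the same matcher succeeds.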
But more selection freedom cuts both ways. A production-side note pushes back hard: protocol-mediated tool access (such as MCP) introduced non-deterministic failures precisely through *ambiguous* tool selection and shaky parameter inference, and teams restored reliability by switching to explicit direct function calls with a single tool per agent (see "Why do protocol-based tool integrations fail in production workflows?"). So the corpus contains a genuine tension: the same flexibility that helps long-horizon exploration is what production engineers strip out to get predictable behavior.
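The deterministic alternative can be sketched in a few lines. This is an illustration of the single-tool-per-agent idea under assumed names (`SingleToolAgent`, `lookup_order`, and the order data are hypothetical), not code from the note. The agent has no selection step to get wrong, and parameters are checked explicitly instead of inferred.

```python
from dataclasses import dataclass
from typing import Callable


def lookup_order(order_id: int) -> str:
    """Hypothetical tool: return an order's status from toy in-memory data."""
    orders = {1001: "shipped", 1002: "processing"}
    return orders.get(order_id, "unknown")


@dataclass(frozen=True)
class SingleToolAgent:
    """An agent bound to exactly one function, called directly."""
    name: str
    tool: Callable[[int], str]

    def run(self, order_id: int) -> str:
        # Explicit parameter check instead of model-side parameter inference.
        if not isinstance(order_id, int) or isinstance(order_id, bool):
            raise TypeError("order_id must be an int")
        return self.tool(order_id)


order_agent = SingleToolAgent(name="order_status", tool=lookup_order)
print(order_agent.run(1001))
```

The trade-off is visible in the type signature: the agent cannot pick the wrong tool, but it also cannot discover a new one.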
Two adjacent ideas reframe the question entirely. One is that an agent shouldn't always reach for a tool at all: conversation analysis offers a formal account of when an agent should pause and *ask the user* instead of silently chaining tool calls and drifting from intent (see "When should AI agents ask users instead of just searching?"). The other is memory: agents can learn and store reusable sub-task routines from past runs, so 'which tool' becomes 'which proven workflow,' with reported relative gains of 24.6% on Mind2Web and 51.1% on WebArena (see "Can agents learn reusable sub-task routines from past experience?"). Selection here is partly a learning problem, not just a retrieval one.
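The memory idea can be sketched as "abstract a past trace, then replay it with new values." The code below is a toy illustration, not Agent Workflow Memory's induction procedure: the function names, the in-memory dictionary, and the string-replacement approach to abstracting values are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical store: sub-task name -> routine with placeholders.
memory: dict[str, list[str]] = {}


def abstract(trace: list[str], values: dict[str, str]) -> list[str]:
    """Replace example-specific values in a trace with named placeholders."""
    routine = []
    for step in trace:
        for slot, value in values.items():
            step = step.replace(value, "{" + slot + "}")
        routine.append(step)
    return routine


def remember(subtask: str, trace: list[str], values: dict[str, str]) -> None:
    """Store the abstracted routine under its sub-task name."""
    memory[subtask] = abstract(trace, values)


def recall(subtask: str, values: dict[str, str]) -> Optional[list[str]]:
    """Instantiate a stored routine with new values, or None if nothing is stored."""
    routine = memory.get(subtask)
    if routine is None:
        return None
    # Assumes steps contain no literal braces other than the placeholders.
    return [step.format(**values) for step in routine]


remember(
    "search_destination",
    ["open search page", "type 'Lisbon' into destination", "click search"],
    {"city": "Lisbon"},
)
print(recall("search_destination", {"city": "Porto"}))
```

Once a routine exists, the agent's first question is whether a stored workflow applies. Per-step tool selection becomes the fallback when `recall` returns `None`.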
If you want the deeper structural view, two notes zoom out. Decoupling reasoning from tool observations (planning the tool sequence before executing, à la ReWOO) changes the selection dynamic by separating *what to call* from *what came back* (see "Can reasoning and tool execution be truly decoupled?"). Representing agents as optimizable computational graphs suggests tool-routing decisions could be tuned automatically rather than hand-designed (see "Can we automatically optimize both prompts and agent coordination?"). The thing you might not have expected: there's no consensus that more autonomy in tool choice is better. The field is actively pulling between adaptive discovery and deterministic constraint.
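The plan-before-execute idea is simple to show in miniature. This sketch is in the spirit of ReWOO but is not its implementation: the plan is written by hand where a planner model would produce it, and the `#E1`-style evidence labels, the tool names, and the toy tools are assumptions for illustration.

```python
from typing import Callable

# Hypothetical toy tools; each takes one string argument and returns a string.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_capital": lambda country: {"Portugal": "Lisbon"}.get(country, "unknown"),
    "count_letters": lambda text: str(len(text)),
}

# The whole tool sequence is fixed before anything runs. Later steps refer to
# earlier results by placeholder, so no tool output is needed to write the plan.
plan = [
    ("#E1", "lookup_capital", "Portugal"),
    ("#E2", "count_letters", "#E1"),
]


def execute(steps: list[tuple[str, str, str]]) -> dict[str, str]:
    """Run the plan in order, substituting earlier evidence into later arguments."""
    evidence: dict[str, str] = {}
    for var, tool_name, arg in steps:
        # Naive substitution; fine for fewer than ten steps in this sketch.
        for ref, value in evidence.items():
            arg = arg.replace(ref, value)
        evidence[var] = TOOLS[tool_name](arg)
    return evidence


print(execute(plan))
```

Selection happens entirely in the plan. The executor never chooses a tool, which is what separates *what to call* from *what came back*.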
Sources (7 notes)
DeepAgent demonstrates that discovering tools as needed—rather than pre-retrieving a fixed set—enables agents to maintain global task perspective and adapt strategy mid-execution. This approach scales better for long-horizon tasks where the tool space is too large to enumerate.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.