What training method supports dynamic tool discovery in long-horizon agents?

This explores how you'd actually *train* an agent to find tools on the fly during a long task — rather than handing it a fixed toolbox up front — and what the corpus says about which learning approaches make that work.

The corpus points somewhere surprising: the most promising methods barely touch the model's weights at all. The starting point is the observation that dynamic discovery beats predefined tool sets. DeepAgent shows that finding tools as needed, rather than pre-retrieving a fixed set, lets an agent keep a global view of the task and change strategy partway through, which matters most when the tool space is too large to enumerate in advance (see "Can agents discover tools dynamically instead of pre-selecting them?"). So the question becomes: what training regime produces an agent that can do that?
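To make the contrast with a fixed toolbox concrete, here is a minimal sketch of discovery as a per-step lookup. It is not DeepAgent's implementation: `Tool`, `search_tools`, and `run_task` are hypothetical names, and the word-overlap ranking is a stand-in for whatever retrieval a real system would use.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]


def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def search_tools(registry: list[Tool], query: str, k: int = 1) -> list[Tool]:
    """Rank tools by word overlap between the query and each description."""
    q = tokenize(query)
    scored = [(len(q & tokenize(t.description)), t) for t in registry]
    scored = [(s, t) for s, t in scored if s > 0]       # drop non-matches
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in order
    return [t for _, t in scored[:k]]


def run_task(registry: list[Tool], subgoals: list[str]) -> list[tuple[str, str | None]]:
    """Look up a tool for each subgoal as it arises, instead of preloading a fixed set."""
    trace = []
    for goal in subgoals:
        hits = search_tools(registry, goal)
        if not hits:
            trace.append((goal, None))  # nothing found: a real agent could replan here
            continue
        hits[0].run(goal)
        trace.append((goal, hits[0].name))
    return trace


registry = [
    Tool("web_search", "search the web for pages", lambda q: f"results for {q}"),
    Tool("currency", "convert currency amounts", lambda q: "converted"),
    Tool("pdf_reader", "read pdf files", lambda q: "text"),
]
trace = run_task(registry, [
    "search the web for exchange rates",
    "convert currency from usd to eur",
    "translate text",
])
```

The point of the shape is that the registry can grow without the agent's prompt growing with it: only the tools matched for the current subgoal are ever surfaced.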

The most direct answer is memory-based online reinforcement learning. AgentFly reframes the problem as a memory-augmented MDP with three separate memory modules (one for past cases, one for subtasks, and, crucially, one for tools) and improves its policy entirely through memory operations, with no parameter updates, reaching 87.88% on GAIA validation (see "Can agents learn continuously from experience without updating weights?"). The tool memory is the part that supports discovery: the agent accumulates and retrieves tool experience rather than having it baked into frozen weights. This sidesteps a real hazard the corpus documents: RL training on search agents tends to collapse behavioral diversity, narrowing exploration onto a few reward-maximizing strategies through the same entropy collapse seen in reasoning models, which is the opposite of what you want when the next useful tool is one you haven't tried (see "Does reinforcement learning squeeze exploration diversity in search agents?").
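A rough sketch of what "policy improvement through memory operations" could look like, under heavy assumptions: the three stores mirror the case, subtask, and tool split described above, but `AgentMemory`, `ToolStats`, and the smoothed success-rate rule are hypothetical illustrations, not AgentFly's actual mechanism.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToolStats:
    successes: int = 0
    attempts: int = 0

    def score(self) -> float:
        # Laplace-smoothed success rate: an untried tool scores 0.5,
        # so it is not starved by tools that merely have a track record.
        return (self.successes + 1) / (self.attempts + 2)


@dataclass
class AgentMemory:
    cases: list = field(default_factory=list)     # (task, tool, success) episodes
    subtasks: dict = field(default_factory=dict)  # task -> list of subtask strings
    tools: dict = field(default_factory=lambda: defaultdict(ToolStats))

    def remember_plan(self, task: str, steps: list) -> None:
        self.subtasks[task] = list(steps)

    def record(self, task: str, tool: str, success: bool) -> None:
        """The only 'learning' step: write the outcome into memory."""
        self.cases.append((task, tool, success))
        stats = self.tools[tool]
        stats.attempts += 1
        stats.successes += int(success)

    def pick_tool(self, candidates: list) -> str:
        """The policy is a read over tool memory; no model weights are involved."""
        return max(candidates, key=lambda name: self.tools[name].score())


memory = AgentMemory()
memory.record("find paper", "pdf_reader", success=False)
memory.record("find paper", "web_search", success=True)
choice = memory.pick_tool(["pdf_reader", "web_search"])
```

Behavior changes between episodes only because the stores changed, which is the property the paragraph above attributes to memory-based learning.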

The deeper pattern the corpus keeps circling is *externalizing the learning* into a library the agent reads and writes, rather than into gradients. VOYAGER stores executable skills in an embedding-indexed library and composes complex skills from simpler ones, learning continuously without the forgetting that weight updates cause (see "Can agents learn new skills without forgetting old ones?"). Agent Workflow Memory does the same at finer grain: it induces reusable sub-task routines, abstracts away example-specific values, and compounds the routines hierarchically, yielding a 24.6% relative gain on Mind2Web and 51.1% on WebArena, with gains that *grow* as the gap between training and test situations widens (see "Can agents learn reusable sub-task routines from past experience?"). Both treat 'discovery' as a retrieval-and-composition problem over an evolving store, which is why they scale to long horizons.
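The retrieval-and-composition idea can be sketched in a few lines. This is an illustration only, not VOYAGER's or Agent Workflow Memory's code: `SkillLibrary` is a hypothetical class, and the bag-of-words `embed` function is a toy stand-in for a learned text encoder.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: a real system uses a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SkillLibrary:
    def __init__(self):
        self._skills = {}  # name -> (description embedding, callable)

    def add(self, name, description, fn):
        self._skills[name] = (embed(description), fn)

    def get(self, name):
        return self._skills[name][1]

    def retrieve(self, query, k=1):
        """Return the names of the k skills whose descriptions best match the query."""
        q = embed(query)
        ranked = sorted(self._skills.items(),
                        key=lambda item: cosine(q, item[1][0]), reverse=True)
        return [name for name, _ in ranked[:k]]

    def compose(self, name, description, part_names):
        """Store a new skill that chains existing ones; the old skills stay untouched."""
        parts = [self.get(p) for p in part_names]

        def composed(state):
            for fn in parts:
                state = fn(state)
            return state

        self.add(name, description, composed)


lib = SkillLibrary()
lib.add("mine_wood", "chop tree to collect wood", lambda inv: inv + ["wood"])
lib.add("craft_planks", "craft planks from wood", lambda inv: inv + ["planks"])
lib.compose("make_planks", "collect wood then craft planks", ["mine_wood", "craft_planks"])
inventory = lib.get("make_planks")([])
```

Adding `make_planks` never overwrites `mine_wood` or `craft_planks`, which is the structural reason a library of this kind does not forget in the way a weight update can.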

A less obvious question is whether the library should be passive or actively trained. SkillOS splits the system in two: a *frozen* executor and a *trainable* curator whose only job is to evolve the repository. The trained curator shifts the library away from generic, verbose entries toward actionable execution logic and cross-task meta-strategies, and it generalizes across different executor backbones and domains (see "Can a separate trained curator improve skill libraries better than frozen agents?"). That's a striking inversion of the usual instinct: you don't train the agent to use tools better, you train a separate component to curate *what's discoverable*. A related move at the multi-agent level treats discovery itself as a first-class indexed operation: versioned capability vectors stored in HNSW indices couple semantic matching with policy and budget constraints, so the right capability surfaces by semantic match instead of manual wiring (see "Can semantic capability vectors replace manual agent routing?").
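The executor/curator split can be sketched as two components where only one ever changes. Everything here is an assumption made for illustration: `Curator`, `Entry`, and `frozen_executor` are hypothetical names, and the running-mean utility update is a simple stand-in, not the training procedure SkillOS uses.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str
    utility: float = 0.0
    uses: int = 0


@dataclass
class Curator:
    """The only component that learns; it shapes what the executor can find."""
    library: dict = field(default_factory=dict)
    min_utility: float = -0.5

    def add(self, key, text):
        self.library[key] = Entry(text)

    def feedback(self, key, reward):
        entry = self.library[key]
        entry.uses += 1
        entry.utility += (reward - entry.utility) / entry.uses  # running mean of rewards

    def prune(self):
        """Drop entries whose observed utility has fallen below the threshold."""
        self.library = {k: e for k, e in self.library.items()
                        if e.utility >= self.min_utility}


def frozen_executor(task, library):
    """Fixed policy: pick the highest-utility entry. This function is never updated."""
    if not library:
        return None
    return max(library, key=lambda k: library[k].utility)


curator = Curator()
curator.add("generic", "Be careful and thorough.")
curator.add("specific", "Retry the API call with backoff on HTTP 429.")
before = frozen_executor("call rate-limited API", curator.library)
curator.feedback("generic", -1.0)
curator.feedback("specific", 1.0)
curator.prune()
after = frozen_executor("call rate-limited API", curator.library)
```

The executor's code is identical before and after; its choice changes only because the curator changed the store it reads from.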

The through-line: for long-horizon agents, the corpus favors memory-and-curation methods over weight-update training. Discovery happens at inference time through retrieval over a growing store, and the 'training' that helps is either online memory updates (AgentFly) or a dedicated curator learning to shape the store (SkillOS). Both dodge the forgetting that weight updates cause and the diversity collapse that RL training induces.

Sources (7 notes)

Can agents discover tools dynamically instead of pre-selecting them?

DeepAgent demonstrates that discovering tools as needed—rather than pre-retrieving a fixed set—enables agents to maintain global task perspective and adapt strategy mid-execution. This approach scales better for long-horizon tasks where the tool space is too large to enumerate.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
