Can tool adaptation work without freezing the agent in the loop?

This explores whether an agent's tools and skills can keep evolving while the agent stays live in its loop — rather than being paused, retrained, or held frozen while adaptation happens elsewhere.

The question can be read two ways, and the corpus answers both: can tools change while the agent keeps running, and does keeping the agent's weights frozen actually block adaptation? The cleanest map of the territory is a 2x2 that splits adaptation by what you optimize (the agent vs. its tools) and what feedback you use (execution signals vs. final output) [How do agentic AI systems decompose into adaptation paradigms?]. That framing matters here because it shows 'adapt the agent' and 'adapt the tools' are separable optimization targets: you don't have to touch the agent to get better behavior.
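A minimal sketch of that 2x2 as a data structure, to make the separation concrete. The names `Target`, `Signal`, `Paradigm`, and `modifies_agent` are illustrative assumptions, not terminology taken from the source note.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    """What gets optimized (hypothetical labels for the first axis)."""
    AGENT = "agent"
    TOOL = "tool"


class Signal(Enum):
    """Which feedback drives the optimization (second axis)."""
    EXECUTION = "execution"
    OUTPUT = "output"


@dataclass(frozen=True)
class Paradigm:
    target: Target
    signal: Signal

    def modifies_agent(self) -> bool:
        # Only the AGENT row of the grid changes the agent itself.
        return self.target is Target.AGENT


# Enumerate all four cells, then keep the two that leave the agent untouched.
ALL_PARADIGMS = [Paradigm(t, s) for t in Target for s in Signal]
TOOL_ONLY = [p for p in ALL_PARADIGMS if not p.modifies_agent()]
```

Half of the grid (`TOOL_ONLY`) never modifies the agent, which is the region the rest of this note is about.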

The strongest 'yes' comes from work that deliberately freezes the executor and moves all the learning into the tool layer. SkillOS trains a separate curator that reshapes the skill repository while the executing agent stays fixed, and the curator's improvements transfer across different executor backbones, which suggests the gains live in the skill repository rather than in the frozen agent [Can a separate trained curator improve skill libraries better than frozen agents?]. VOYAGER makes the same bet with a different mechanism: it stores executable skills in an external, composable library so the agent learns continuously without weight updates, and it specifically dodges the catastrophic forgetting that comes with gradient-based learning [Can agents learn new skills without forgetting old ones?]. Several memory-centric results push further, showing frozen models improve purely through the *shape* of what's stored: causal-form memory that preserves applicability conditions beats generic reflection by 23 points on repeated trials and transfers to new environments [Can frozen language models continually improve through memory structure alone?], and extracting natural-language rules into reusable skills lifts a frozen GPT-4.1 from 11.1% to 16.5% on CL-bench without any retraining [Can frozen models learn better by extracting context into skills?].
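The frozen-executor pattern can be sketched in a few lines. This is a toy under stated assumptions, not the SkillOS or VOYAGER implementation: `SkillLibrary` and `frozen_executor` are hypothetical names, and retrieval uses plain word overlap as a stand-in for the embedding index described in the VOYAGER note.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SkillLibrary:
    """External skill store; the only thing that changes over time."""
    skills: Dict[str, Callable[..., object]] = field(default_factory=dict)
    descriptions: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, description: str, fn: Callable[..., object]) -> None:
        self.skills[name] = fn
        self.descriptions[name] = description

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        # Assumption: word overlap replaces embedding similarity here.
        words = set(query.lower().split())
        ranked = sorted(
            self.descriptions,
            key=lambda n: len(words & set(self.descriptions[n].lower().split())),
            reverse=True,
        )
        return ranked[:k]


def frozen_executor(library: SkillLibrary, task: str, *args: object) -> object:
    """Never modified: it only retrieves the best-matching skill and runs it."""
    names = library.retrieve(task, k=1)
    if not names:
        return None
    return library.skills[names[0]](*args)


lib = SkillLibrary()
lib.add("double", "double a number", lambda x: 2 * x)
task = "double then increment a number"
before = frozen_executor(lib, task, 5)  # only 'double' exists, so it is used

# A new skill composed from an existing one, added without touching the executor.
lib.add("double_then_increment", task, lambda x: lib.skills["double"](x) + 1)
after = frozen_executor(lib, task, 5)
```

The executor's code is identical in both calls. Behavior changes only because the library gained a better-matching, composed skill.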

But there's a sharper reading of 'in the loop': adapting tools *while the agent is mid-task*, not in some offline curation window. Here the corpus says the offline/in-loop split is itself a quality problem. MUSE-Autoskill argues that authoring skills outside the loop creates a 'situated context' mismatch, and that invoking skill creation from inside the reasoning loop, grounded in exact task state and immediate feedback, closes that gap and transfers to other agents with minimal loss [Does creating skills inside the agent loop eliminate mismatches?]. DeepAgent makes the parallel case for tool *selection*: discovering tools dynamically during execution beats pre-retrieving a fixed set, because the agent keeps a global view and can change strategy mid-flight [Can agents discover tools dynamically instead of pre-selecting them?]. So adaptation in the loop isn't just possible; for long-horizon work, the corpus suggests it's the better design.

The most direct answer to 'without freezing the agent' is MetaClaw, which refuses to choose: deployed agents run two adaptation timescales at once, fast skill injection from failures with zero downtime (seconds) and slower gradient optimization during idle windows (minutes to hours), and the two reinforce each other, since better policies surface more informative failures and richer skills enable higher-reward runs [Can agents adapt without pausing service to users?]. That same continuous-feedback logic shows up in memory that grows and prunes its own links from closed-loop execution [Should agent memory adapt dynamically based on execution feedback?] and in workflow memory that induces reusable sub-task routines on the fly, with relative gains of 24.6% on Mind2Web and 51.1% on WebArena [Can agents learn reusable sub-task routines from past experience?].
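A rough sketch of the two-timescale idea, assuming a single agent object with a fast path and a slow path. `DualTimescaleAgent`, `serve`, and `idle_update` are hypothetical names, and the slow path only bumps a version counter where a real system would run gradient optimization. It is not MetaClaw's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DualTimescaleAgent:
    skills: List[str] = field(default_factory=list)
    pending_failures: List[str] = field(default_factory=list)
    policy_version: int = 0

    def serve(self, task: str, succeeded: bool) -> None:
        """Fast path: runs on every request and never pauses serving."""
        if not succeeded:
            # Inject a skill immediately and queue the failure for later.
            self.skills.append(f"avoid failure seen in: {task}")
            self.pending_failures.append(task)

    def idle_update(self) -> bool:
        """Slow path: placeholder for gradient optimization in idle windows."""
        if not self.pending_failures:
            return False  # nothing to learn from, policy unchanged
        self.pending_failures.clear()
        self.policy_version += 1
        return True
```

The design point is that `serve` has no dependency on `idle_update`. Skill injection takes effect on the next request, while the policy update waits for whenever an idle window appears.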

The quiet surprise running underneath all of this: freezing the agent often *helps*. The papers that hold weights fixed and let tools, skills, and memory evolve aren't accepting a limitation — they're avoiding catastrophic forgetting, getting cross-backbone transfer for free, and keeping service live. The thing you'd assume is the bottleneck (a frozen model) turns out to be the feature that makes safe, continuous tool adaptation possible.

Sources 10 notes

How do agentic AI systems decompose into adaptation paradigms?

A 2x2 taxonomy based on optimization target (agent vs tool) and feedback signal (execution vs output) unifies dispersed adaptation research. This framework directly maps to implementation decisions and explains trade-offs like query quality versus final answer quality.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can frozen language models continually improve through memory structure alone?

Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4-17 points transferring to new environments, showing memory shape matters more than parameter updates.

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Does creating skills inside the agent loop eliminate mismatches?

MUSE-Autoskill demonstrates that invoking skill creation from within the agent's reasoning loop grounds new skills in exact task context, immediate feedback, and runtime validation. In-loop skills reach 87.94% task accuracy and transfer to other agents with minimal loss, eliminating the situated context problem of offline authoring.

Can agents discover tools dynamically instead of pre-selecting them?

DeepAgent demonstrates that discovering tools as needed—rather than pre-retrieving a fixed set—enables agents to maintain global task perspective and adapt strategy mid-execution. This approach scales better for long-horizon tasks where the tool space is too large to enumerate.

Can agents adapt without pausing service to users?

MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.

Should agent memory adapt dynamically based on execution feedback?

FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
