Can agents adapt without pausing service to users?
Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
Deployed LLM agents face a fundamental tension. They must serve users continuously without interruption, yet their capabilities grow stale as the real-world task distribution drifts. The three existing approaches each address only part of the problem. Memory-based methods store raw trajectories but cannot extract transferable behavioral patterns. Skill-based methods compress experience into reusable instructions but treat the skill library as a static database that is never coordinated with weight optimization. RL-based methods update model weights but require service downtime during retraining.
MetaClaw (2603.17187) names the structural fix: two fundamentally different timescales of adaptation are naturally complementary, and existing systems address only one. Behavioral heuristics ("always verify a file path before reading," "confirm before destructive commands") can be distilled within seconds from a single failed conversation and injected immediately. Improving the model's underlying policy across diverse task types requires gradient-based optimization over many trajectories, on a timescale of minutes to hours. The complementarity is missed when systems pick one timescale.
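The fast timescale can be sketched in a few lines. This is a minimal illustration, assuming skills are plain strings appended to the system prompt; `SkillLibrary` and `render_system_prompt` are hypothetical names, not MetaClaw's API, and the LLM evolver that would distill a skill from a failed conversation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Hypothetical in-memory skill pool (an assumption, not MetaClaw's data structure)."""

    skills: list[str] = field(default_factory=list)

    def add(self, skill: str) -> None:
        # Skip exact duplicates so repeated failures do not bloat the prompt.
        if skill not in self.skills:
            self.skills.append(skill)

    def render_system_prompt(self, base_prompt: str) -> str:
        # Injection is plain string assembly: no weights change, so the
        # next request sees the new skill immediately.
        if not self.skills:
            return base_prompt
        bullets = "\n".join(f"- {s}" for s in self.skills)
        return f"{base_prompt}\n\nLearned skills:\n{bullets}"


library = SkillLibrary()
library.add("Always verify a file path before reading.")
library.add("Confirm before destructive commands.")
prompt = library.render_system_prompt("You are a helpful agent.")
print(prompt)
```

Because the update is only a change to the prompt text, it takes effect on the next request; the slow timescale would change the weights behind that prompt.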
The architecture has two mutually reinforcing mechanisms. Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skills via an LLM evolver; the new skills take effect immediately with zero service downtime, simply by being added to the system prompt or skill retrieval pool. Opportunistic policy optimization performs gradient-based LoRA fine-tuning using a process reward model, but it is triggered only during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors configurable sleep hours, system keyboard inactivity, and Google Calendar occupancy. The agent never pauses serving; weight updates happen entirely during natural downtime.
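An idle-window gate of the kind OMLS implies might look like the sketch below. The note names the three signals but not how they are combined, so the rule here (any one signal indicating user absence opens a training window) is an assumption, as are the names `IdleSignals` and `should_train` and the default thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class IdleSignals:
    """Hypothetical snapshot of the three signals the scheduler monitors."""

    now: datetime
    seconds_since_last_input: float  # system keyboard inactivity
    calendar_busy: bool              # user is in a calendar event


def in_sleep_window(now: time, start: time, end: time) -> bool:
    # A sleep window may wrap past midnight (e.g. 23:00 to 07:00).
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_train(
    signals: IdleSignals,
    sleep_start: time = time(23, 0),
    sleep_end: time = time(7, 0),
    idle_threshold_s: float = 1800.0,
) -> bool:
    # Assumption: any single signal of user absence is enough to train.
    asleep = in_sleep_window(signals.now.time(), sleep_start, sleep_end)
    keyboard_idle = signals.seconds_since_last_input >= idle_threshold_s
    return asleep or keyboard_idle or signals.calendar_busy


night = IdleSignals(datetime(2025, 1, 1, 2, 0), 5.0, False)
afternoon = IdleSignals(datetime(2025, 1, 1, 14, 0), 5.0, False)
print(should_train(night), should_train(afternoon))
```

A stricter deployment could require several signals to agree before starting a LoRA update; the gate only decides when training may run and never touches the serving path.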
The virtuous cycle is the key claim: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. The two mechanisms feed each other across timescales.
The under-noticed contribution is the stale reward contamination problem and its fix. Once skills have evolved, trajectories collected under the old skill context carry stale rewards that would contaminate gradient updates if reused. MetaClaw introduces skill generation versioning: support data (failure trajectories consumed by skill evolution) is strictly separated from query data (post-adaptation trajectories used for RL updates). This design requirement is non-obvious and becomes visible only once you commit to the dual-timescale architecture.
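The support/query separation can be illustrated with a generation counter stamped on each trajectory. This is a sketch under stated assumptions, not MetaClaw's implementation: `VersionedBuffer`, `Trajectory`, and the rule that the query set is exactly the trajectories collected under the current skill generation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    skill_generation: int  # skill-library generation active at collection time
    reward: float
    failed: bool


class VersionedBuffer:
    """Hypothetical buffer that tags trajectories with a skill generation."""

    def __init__(self) -> None:
        self.generation = 0
        self._trajectories: list[Trajectory] = []

    def record(self, reward: float, failed: bool) -> Trajectory:
        traj = Trajectory(self.generation, reward, failed)
        self._trajectories.append(traj)
        return traj

    def evolve_skills(self) -> list[Trajectory]:
        # Support data: current-generation failures handed to skill evolution.
        support = [
            t for t in self._trajectories
            if t.failed and t.skill_generation == self.generation
        ]
        # Bumping the generation makes every earlier trajectory stale.
        self.generation += 1
        return support

    def query_set(self) -> list[Trajectory]:
        # Query data: only trajectories collected after the latest skill update.
        return [
            t for t in self._trajectories
            if t.skill_generation == self.generation
        ]


buf = VersionedBuffer()
buf.record(reward=0.1, failed=True)
buf.record(reward=0.9, failed=False)
support = buf.evolve_skills()
buf.record(reward=0.8, failed=False)
print(len(support), [t.reward for t in buf.query_set()])
```

In this sketch the pre-evolution success (reward 0.9) is excluded from the query set along with the failure, since its reward was also earned under the old skill context.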
The deployment context — single agent on OpenClaw connecting to 20+ messaging channels — clarifies why the no-downtime constraint matters: the same agent must remain available across user time zones and conversational habits. Idle-window detection turns the constraint into an opportunity.
Paper: MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild
Related concepts in this collection
-
Can agents learn continuously from experience without updating weights?
This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
AgentFly addresses continual adaptation via memory only (one timescale); MetaClaw adds the gradient-update timescale alongside
-
Should successful and failed episodes be processed differently?
Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
SkillRL operates at a single timescale within RL training; MetaClaw separates the skill-update timescale from the weight-update timescale to remove the downtime constraint
-
Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
MetaClaw's versioning of skill generations is the engineering response to a related fragility: stale skill contexts contaminate weight updates if reused
-
How does treating LLMs as multi-step agents change what we can optimize?
Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
MetaClaw instantiates the POMDP framing with two RL-optimizable subsystems (skills and weights) at different timescales
-
Can splitting adaptation into two channels reduce forgetting?
When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
MetaClaw exemplifies the same fast/slow dual-timescale architecture in the agent setting
Original note title
continual agent adaptation requires two complementary timescales — fast skill injection from failures plus slow gradient updates during user-inactive windows