Agentic Systems and Planning

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This note explores whether fast behavioral updates and slow policy learning can coexist on different timescales.

Note · 2026-05-18 · sourced from Agents Multi Architecture

Deployed LLM agents face a fundamental tension. They must serve users continuously without interruption, yet their capabilities grow stale as the real-world task distribution drifts. Each of the three existing approaches addresses only part of the problem. Memory-based methods store raw trajectories but cannot extract transferable behavioral patterns. Skill-based methods compress experience into reusable instructions but treat the skill library as a static database that is never coordinated with weight optimization. RL-based methods update model weights but require service downtime during retraining.

MetaClaw (2603.17187) names the structural fix: two fundamentally different timescales of adaptation are naturally complementary, and existing systems address only one. Behavioral heuristics ("always verify a file path before reading," "confirm before destructive commands") can be distilled within seconds from a single failed conversation and injected immediately. Improving the model's underlying policy across diverse task types requires gradient-based optimization over many trajectories, on a timescale of minutes to hours. The complementarity is missed when systems pick one timescale.
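The fast timescale can be illustrated with a minimal sketch. It assumes skills are plain-text heuristics appended to the system prompt; the `SkillLibrary` class and its method names are hypothetical and are not taken from MetaClaw.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Hypothetical in-memory skill pool (an assumption, not MetaClaw's structure)."""

    skills: list[str] = field(default_factory=list)

    def add(self, skill: str) -> bool:
        """Add a distilled heuristic; return False if it is empty or a duplicate."""
        normalized = skill.strip()
        if not normalized or normalized in self.skills:
            return False
        self.skills.append(normalized)
        return True

    def render_system_prompt(self, base_prompt: str) -> str:
        """Append the current skills to the base prompt as a bulleted list."""
        if not self.skills:
            return base_prompt
        bullets = "\n".join(f"- {s}" for s in self.skills)
        return f"{base_prompt}\n\nLearned skills:\n{bullets}"
```

Because the prompt is rebuilt on each request, a skill added after one failed conversation applies to the very next one, with no weight update and no restart.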

The architecture has two mutually reinforcing mechanisms. Skill-driven fast adaptation analyzes failure trajectories and synthesizes new skills via an LLM evolver; the new skills take effect immediately, with zero service downtime, simply by being added to the system prompt or skill retrieval pool. Opportunistic policy optimization performs gradient-based LoRA fine-tuning using a process reward model, but it is triggered only during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors configurable sleep hours, system keyboard inactivity, and Google Calendar occupancy. The agent never pauses serving; weight updates happen entirely during natural downtime.
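A rough sketch of an idle-window gate in the spirit of OMLS follows. The note does not say how the three signals are combined, so this sketch makes two assumptions: any single signal is enough to open a training window, and a busy calendar means the user is away from the agent. All names and default thresholds here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class IdleSignals:
    """Hypothetical snapshot of the three signals the scheduler monitors."""

    now: datetime
    last_keyboard_input: datetime
    calendar_busy: bool  # True if a calendar event overlaps `now`


def in_sleep_window(now: time, start: time, end: time) -> bool:
    """True if `now` is in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def is_idle_window(
    signals: IdleSignals,
    sleep_start: time = time(23, 0),
    sleep_end: time = time(7, 0),
    min_keyboard_idle: timedelta = timedelta(minutes=30),
) -> bool:
    """Assumed policy: any one signal indicating absence permits training."""
    asleep = in_sleep_window(signals.now.time(), sleep_start, sleep_end)
    keyboard_idle = signals.now - signals.last_keyboard_input >= min_keyboard_idle
    return asleep or keyboard_idle or signals.calendar_busy
```

A fine-tuning job would check `is_idle_window` before each update step and yield as soon as it returns False, so serving always takes priority over training.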

The virtuous cycle is the key claim: a better policy produces more informative failures for skill synthesis, and richer skills yield higher-reward trajectories for policy optimization. The two mechanisms feed each other across timescales.

The under-noticed contribution is the stale-reward contamination problem and its fix. Once skills have evolved, trajectories collected under the old skill context carry stale rewards that would contaminate gradient updates if reused. MetaClaw introduces skill generation versioning: support data (failure trajectories consumed by skill evolution) is strictly separated from query data (post-adaptation trajectories used for RL updates). This non-obvious design requirement becomes visible only once you commit to the dual-timescale architecture.
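One way to picture skill generation versioning is a buffer that stamps every trajectory with the skill generation active when it was collected. This is a sketch under assumptions: the `Trajectory` and `VersionedBuffer` names, the integer generation counter, and the rule that only current-generation trajectories are eligible for RL are all illustrative, not MetaClaw's implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Trajectory:
    task_id: str
    reward: float
    failed: bool
    skill_generation: int  # generation of the skill library at collection time


@dataclass
class VersionedBuffer:
    """Hypothetical buffer separating support data from query data."""

    generation: int = 0
    trajectories: list[Trajectory] = field(default_factory=list)

    def record(self, task_id: str, reward: float, failed: bool) -> Trajectory:
        """Store a trajectory tagged with the current skill generation."""
        traj = Trajectory(task_id, reward, failed, self.generation)
        self.trajectories.append(traj)
        return traj

    def evolve_skills(self) -> list[Trajectory]:
        """Return current-generation failures as support data, then bump the generation."""
        support = [
            t for t in self.trajectories
            if t.failed and t.skill_generation == self.generation
        ]
        self.generation += 1
        return support

    def rl_batch(self) -> list[Trajectory]:
        """Query data: only trajectories collected under the current generation."""
        return [t for t in self.trajectories if t.skill_generation == self.generation]
```

Bumping the generation at evolution time excludes every earlier trajectory from `rl_batch`, including the failures just consumed as support data, so rewards earned under an outdated skill context never reach the gradient update.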

The deployment context, a single agent on OpenClaw connected to 20+ messaging channels, clarifies why the no-downtime constraint matters: the same agent must remain available across user time zones and conversational habits. Idle-window detection turns the constraint into an opportunity.


Paper: MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild

Original note title

continual agent adaptation requires two complementary timescales — fast skill injection from failures plus slow gradient updates during user-inactive windows