Can a separate trained curator improve skill libraries better than frozen agents?
Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
Reusable skills distilled from experience provide a natural substrate for self-evolving agents. The bottleneck is not whether to maintain a skill library but who curates it well. Manual curation demands expertise that does not scale to task diversity. Heuristic/prompting-based curation lacks downstream feedback. Existing RL approaches train short-horizon skill operations and miss the long-term curation policy needed for skill update and deletion.
SkillOS (2605.06614) makes two architectural decisions that combine into a third surprising result. First, decouple the trainable skill curator from the agent executor — the executor stays frozen, retrieves and applies skills, while a separate trainable curator updates the SkillRepo from accumulated experience. This makes the curator a modular component that can be optimized without retraining the underlying agent.
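A minimal sketch of the curator/executor split described above, assuming a toy in-memory repository. The class names `SkillRepo`, `FrozenExecutor`, and `Curator`, the name-matching retrieval rule, and the "None means delete" convention are all hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SkillRepo:
    """Shared store of named skill documents (hypothetical structure)."""
    skills: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class FrozenExecutor:
    """Reads the repo and acts; nothing here updates the executor itself."""
    policy: Callable[[str, list[str]], str]  # (task, retrieved skills) -> trajectory

    def run(self, task: str, repo: SkillRepo) -> str:
        # Toy retrieval (assumption): any skill whose name appears in the task text.
        retrieved = [text for name, text in repo.skills.items() if name in task]
        return self.policy(task, retrieved)


@dataclass
class Curator:
    """The only component that writes to the repo; this is the trainable part."""
    propose: Callable[[str, SkillRepo], dict[str, Optional[str]]]

    def update(self, trajectory: str, repo: SkillRepo) -> None:
        # None means "delete this skill"; a string means add or overwrite it.
        for name, text in self.propose(trajectory, repo).items():
            if text is None:
                repo.skills.pop(name, None)
            else:
                repo.skills[name] = text


# Stand-in callables replace the real LLM executor and the trained curator.
executor = FrozenExecutor(policy=lambda task, skills: f"{task} | used {len(skills)} skill(s)")
curator = Curator(propose=lambda traj, repo: {"login": "Retry once on timeout."})

repo = SkillRepo()
first = executor.run("login to portal", repo)   # repo is empty on the first task
curator.update(first, repo)                     # curator writes a skill from experience
second = executor.run("login to portal", repo)  # same executor, now with one skill
```

The point of the shape is that swapping in a different `Curator` changes what the executor sees without touching the executor, which is what lets the curator be optimized as a separate module.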
Second, group related tasks into training streams to provide long-horizon learning signals. Earlier trajectories update the SkillRepo; later related tasks evaluate those updates. The grouping exploits skill-relevant task dependencies — what was learned on one task is tested on adjacent tasks. Composite rewards combine downstream executor feedback with intermediate signals to attribute outcomes to specific curation decisions.
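The stream idea can be sketched as follows. The source does not give the reward formula, so the specific choices here are assumptions for illustration: the downstream term is the mean executor score on later tasks in the same stream, the intermediate signal is a caller-supplied function of the repo before and after an edit, and `weight` mixes the two. The function name `stream_reward` is hypothetical.

```python
from typing import Callable, Sequence

Repo = dict[str, str]


def stream_reward(
    stream: Sequence[str],
    run_executor: Callable[[str, Repo], tuple[str, float]],
    curate: Callable[[str, Repo], Repo],
    intermediate: Callable[[Repo, Repo], float],
    weight: float = 0.5,
) -> list[float]:
    """Return one composite reward per curation step in a stream of related tasks."""
    repo: Repo = {}
    scores: list[float] = []
    shaping: list[float] = []
    for task in stream:
        trajectory, score = run_executor(task, repo)  # frozen executor reads the current repo
        scores.append(score)
        new_repo = curate(trajectory, repo)           # curator edits after the episode
        shaping.append(intermediate(repo, new_repo))
        repo = new_repo

    rewards: list[float] = []
    for i in range(len(stream)):
        later = scores[i + 1:]                        # tasks that ran after curation step i
        downstream = sum(later) / len(later) if later else 0.0
        rewards.append(downstream + weight * shaping[i])
    return rewards


# Toy stream: the executor succeeds only once a "verify" skill exists.
def toy_executor(task: str, repo: Repo) -> tuple[str, float]:
    return f"ran {task}", 1.0 if "verify" in repo else 0.0

def toy_curate(trajectory: str, repo: Repo) -> Repo:
    return {**repo, "verify": "Check the output before submitting."}

def toy_intermediate(old: Repo, new: Repo) -> float:
    return 1.0 if old != new else 0.0

rewards = stream_reward(["task-a", "task-b", "task-c"], toy_executor, toy_curate, toy_intermediate)
```

In this toy run the first curation step receives the largest reward because both later tasks benefit from it, while the final step gets no downstream credit since nothing in the stream evaluates it.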
The surprising result is what the skill repository evolves into. Early in training, the curator introduces generic sections — additional guidance, tips, recommendations — that make skills more verbose without operational improvement. As training progresses, the additions shift toward actionable structures: failure-handling logic, conditional branches specifying when to deviate from defaults. Even more notably, the global organization evolves: early repositories contain narrow task-specific skills, later repositories contain meta-strategy skills covering verification, fallback planning, system search, and strategy adjustment. The curator does not merely accumulate skills — it progressively expands the repository's strategic space toward compositional cross-task control knowledge.
The most consequential downstream finding is curator generalization. The trained skill curator outperforms frontier models' zero-shot curation ability and also generalizes across different executor backbones and task domains. The curator-as-module hypothesis is empirically validated: skill curation is a distinct learnable skill, transferable independently of the executor it was trained against.
This pairs structurally with the SkillRL note, "Should successful and failed episodes be processed differently?" Both are RL-for-skill approaches, but along different axes. SkillRL differentiates what gets stored (success demos vs failure lessons). SkillOS differentiates who learns from the storage (curator vs executor). The two are complementary: SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator. Both contribute to the condition-preservation hypothesis: the right architecture for trajectory-based learning preserves applicability conditions through structural choices rather than relying on consolidation correctness.
The architectural implication: agent self-evolution decomposes into at least three subsystems: the executor (frozen, rarely retrained), the skill curator (RL-trained), and the skill repository (the artifact the curator edits). The Agentic RL survey's claim that "memory becomes RL-optimizable" extends here to "skill curation becomes RL-optimizable" as a distinct optimizable axis.
Paper: SkillOS: Learning Skill Curation for Self-Evolving Agents
SkillOpt arrives at the same frozen-executor / trainable-skill decomposition from the optimization side, and tightens it. Rather than an RL curation policy, SkillOpt treats the skill document as the external state of a frozen agent and runs a text-space optimizer: a separate optimizer model converts scored rollouts into bounded add/delete/replace edits, gated by a held-out validation score — the same curator-executor split, but disciplined like weight-space training (textual learning rate, rejected-edit buffer as negative feedback, epoch-wise slow/meta update). It also strengthens the cross-harness transfer SkillOS gestures at: across six benchmarks, seven models, and three execution harnesses (direct chat, Codex, Claude Code), a Codex-trained spreadsheet skill transfers to Claude Code for a +59.7 point gain, and the deployed skill adds zero inference-time model calls. SkillOS = RL-learned curation policy; SkillOpt = validation-gated text optimization — two routes to the same frozen-agent-with-trainable-skills architecture.
Source: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills", https://arxiv.org/abs/2605.23904
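A minimal sketch of validation-gated text optimization as described in the SkillOpt paragraph above, assuming the skill document is a list of lines. The `Edit` type, `apply_edit`, `gated_step`, and the use of `max_edits` as a rough stand-in for a textual learning rate are hypothetical illustrations, not SkillOpt's actual interface. The proposals would come from a separate optimizer model, and `validate` would score rollouts on held-out tasks. Both are replaced by toy stand-ins here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Edit:
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position the edit applies to
    text: Optional[str] = None   # new line content for add/replace


def apply_edit(lines: list[str], edit: Edit) -> list[str]:
    """Return a new skill document with one edit applied."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def gated_step(
    skill: list[str],
    proposals: list[Edit],
    validate: Callable[[list[str]], float],
    max_edits: int = 2,
) -> tuple[list[str], list[Edit]]:
    """Accept an edit only if it raises the validation score; collect the rest."""
    best, best_score = list(skill), validate(skill)
    rejected: list[Edit] = []
    for edit in proposals[:max_edits]:   # bound on how many edits one step may try
        candidate = apply_edit(best, edit)
        score = validate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        else:
            rejected.append(edit)        # kept as negative feedback for later proposals
    return best, rejected


# Toy validation score (assumption): reward "verify" steps, penalize generic tips.
def toy_validate(lines: list[str]) -> float:
    lowered = [line.lower() for line in lines]
    return sum("verify" in l for l in lowered) - sum("tip" in l for l in lowered)

skill = ["Open the file.", "General tip: be careful."]
proposals = [
    Edit("add", 1, "Verify the sheet name before writing."),
    Edit("replace", 0, "General tip: stay focused."),
    Edit("delete", 1),  # never tried in this step because max_edits=2
]
new_skill, rejected = gated_step(skill, proposals, toy_validate)
```

Because the executor is never touched, the output of the loop is just a revised document, which is consistent with the note's point that the deployed skill adds no inference-time model calls.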
Related concepts in this collection
- Should successful and failed episodes be processed differently?
  Explores whether asymmetric treatment of trajectories (preserving successes as full demonstrations while abstracting failures into lessons) could improve both the utility and efficiency of memory in reinforcement learning agents.
  SkillRL is the asymmetric-trajectory variant; SkillOS is the curator-decoupling variant; complementary axes of skill-RL design
- How does treating LLMs as multi-step agents change what we can optimize?
  Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
  SkillOS is one specific instantiation of the "capabilities become RL-optimizable subsystems" pattern, with skill curation as the optimized capability
- Can agents learn reusable sub-task routines from past experience?
  Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
  AWM provides the workflow-extraction mechanism; SkillOS provides the curation-policy training; the two together describe both extraction and selection of skills
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  MetaClaw decomposes adaptation across timescales; SkillOS decomposes it across roles (curator/executor); both extract subsystems that can be independently optimized
- Can skill documents be optimized like neural network weights?
  Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
  synthesizes: same frozen-executor/trainable-skill split reached via text-space optimizer rather than RL curation
Original note title
RL-trained skill curation decoupled from frozen executor produces repositories that evolve from generic guidance toward execution-oriented refinement and meta-strategy skills