Agentic Systems and Planning

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Note · 2026-05-18 · sourced from Agents Multi Architecture

Reusable skills distilled from experience provide a natural substrate for self-evolving agents. The bottleneck is not whether to maintain a skill library but who curates it well. Manual curation demands expertise that does not scale to task diversity. Heuristic/prompting-based curation lacks downstream feedback. Existing RL approaches train short-horizon skill operations and miss the long-term curation policy needed for skill update and deletion.

SkillOS (2605.06614) makes two architectural decisions that together produce a surprising result. First, it decouples the trainable skill curator from the agent executor: the executor stays frozen and only retrieves and applies skills, while a separate trainable curator updates the SkillRepo from accumulated experience. This makes the curator a modular component that can be optimized without retraining the underlying agent.
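The curator/executor split can be made concrete with a minimal sketch. Everything here is a hypothetical illustration rather than SkillOS code: the class names `FrozenExecutor` and `Curator`, the substring-based retrieval, and the placeholder curation rule are all assumptions. The only point is the ownership boundary, where the executor reads the repository and the curator is the only component that writes to it.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRepo:
    """Shared artifact mapping skill name -> skill text (hypothetical structure)."""
    skills: dict[str, str] = field(default_factory=dict)

    def retrieve(self, task: str) -> list[str]:
        # Toy retrieval (assumption): return skills whose name appears in the task text.
        return [text for name, text in self.skills.items() if name in task]


class FrozenExecutor:
    """Stands in for the frozen agent: reads the repo, never writes to it."""

    def run(self, task: str, repo: SkillRepo) -> dict:
        used = repo.retrieve(task)
        # Placeholder outcome (assumption): succeed if any skill was retrieved.
        return {"task": task, "skills_used": used, "success": bool(used)}


class Curator:
    """Stands in for the trainable curator: the only component that edits the repo."""

    def update(self, repo: SkillRepo, trajectory: dict) -> None:
        # Placeholder policy (assumption): on failure, add a stub skill
        # keyed by the first word of the task description.
        if not trajectory["success"]:
            key = trajectory["task"].split()[0]
            repo.skills[key] = f"Notes distilled from a failed attempt at: {trajectory['task']}"


repo = SkillRepo()
executor, curator = FrozenExecutor(), Curator()

first = executor.run("spreadsheet merge columns", repo)   # no skills yet
curator.update(repo, first)                               # curator edits the repo
second = executor.run("spreadsheet sort rows", repo)      # related task reuses the skill
```

In this toy run the executor object is never modified between the two tasks; any change in its behavior comes from the repository the curator edited, which is the property that lets the curator be optimized on its own.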

Second, it groups related tasks into training streams to provide long-horizon learning signals. Earlier trajectories update the SkillRepo; later related tasks evaluate those updates. The grouping exploits skill-relevant task dependencies: what was learned on one task is tested on adjacent tasks. Composite rewards combine downstream executor feedback with intermediate signals to attribute outcomes to specific curation decisions.
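A rough sketch of the stream idea follows. The note does not specify how tasks are grouped or how the reward terms are combined, so the `family` label, the update/evaluation split point, and the linear blend with `weight` are all assumptions made for illustration.

```python
from collections import defaultdict


def group_into_streams(tasks: list[dict]) -> dict[str, list[dict]]:
    """Group tasks by a hypothetical 'family' label, preserving arrival order."""
    streams: dict[str, list[dict]] = defaultdict(list)
    for task in tasks:
        streams[task["family"]].append(task)
    return dict(streams)


def split_stream(stream: list[dict], n_update: int) -> tuple[list[dict], list[dict]]:
    """Earlier tasks drive repo updates; later related tasks evaluate those updates."""
    return stream[:n_update], stream[n_update:]


def composite_reward(downstream_success: list[bool], intermediate: float,
                     weight: float = 0.2) -> float:
    """Blend later-task success rate with an intermediate signal.

    The linear blend and the default weight are assumptions, not the paper's formula.
    """
    if not downstream_success:
        return weight * intermediate
    rate = sum(downstream_success) / len(downstream_success)
    return (1 - weight) * rate + weight * intermediate


tasks = [
    {"id": "t1", "family": "sheet"},
    {"id": "t2", "family": "web"},
    {"id": "t3", "family": "sheet"},
    {"id": "t4", "family": "sheet"},
]
streams = group_into_streams(tasks)
update_part, eval_part = split_stream(streams["sheet"], n_update=1)
reward = composite_reward([True, False], intermediate=0.5)
```

Here the curation decision made after `t1` is credited according to how the executor fares on `t3` and `t4`, which is the long-horizon signal a single-task reward would not provide.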

The surprising result is what the skill repository evolves into. Early in training, the curator introduces generic sections — additional guidance, tips, recommendations — that make skills more verbose without operational improvement. As training progresses, the additions shift toward actionable structures: failure-handling logic, conditional branches specifying when to deviate from defaults. Even more notably, the global organization evolves: early repositories contain narrow task-specific skills, later repositories contain meta-strategy skills covering verification, fallback planning, system search, and strategy adjustment. The curator does not merely accumulate skills — it progressively expands the repository's strategic space toward compositional cross-task control knowledge.

The most consequential downstream finding is curator generalization. The trained skill curator both outperforms zero-shot curation by frontier models and generalizes across different executor backbones and task domains. This supports the curator-as-module hypothesis: skill curation is a distinct learnable skill, transferable independently of the executor it was trained against.

This pairs structurally with "Should successful and failed episodes be processed differently?": both are RL-for-skill approaches, but along different axes. SkillRL differentiates what gets stored (success demos vs failure lessons). SkillOS differentiates who learns from the storage (curator vs executor). The two are complementary: SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator. Both contribute to the condition-preservation hypothesis: the right architecture for trajectory-based learning preserves applicability conditions through structural choices rather than relying on consolidation correctness.

The architectural implication: agent self-evolution decomposes into at least three separable components: the executor (rarely retrained), the skill curator (RL-trained), and the skill repository (the artifact the curator edits). The Agentic RL survey's claim that "memory becomes RL-optimizable" extends here to "skill curation becomes RL-optimizable" as a distinct optimizable axis.


Paper: SkillOS: Learning Skill Curation for Self-Evolving Agents

SkillOpt arrives at the same frozen-executor / trainable-skill decomposition from the optimization side, and tightens it. Rather than an RL curation policy, SkillOpt treats the skill document as the external state of a frozen agent and runs a text-space optimizer: a separate optimizer model converts scored rollouts into bounded add/delete/replace edits, gated by a held-out validation score — the same curator-executor split, but disciplined like weight-space training (textual learning rate, rejected-edit buffer as negative feedback, epoch-wise slow/meta update). It also strengthens the cross-harness transfer SkillOS gestures at: across six benchmarks, seven models, and three execution harnesses (direct chat, Codex, Claude Code), a Codex-trained spreadsheet skill transfers to Claude Code for a +59.7 point gain, and the deployed skill adds zero inference-time model calls. SkillOS = RL-learned curation policy; SkillOpt = validation-gated text optimization — two routes to the same frozen-agent-with-trainable-skills architecture.

Source: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills", https://arxiv.org/abs/2605.23904
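The validation-gated edit loop described for SkillOpt can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the `Edit` type, `gated_step`, the `max_edits` bound (standing in for a textual learning rate), and the toy `toy_validate` scorer are all hypothetical. In the described method a separate optimizer model proposes edits from scored rollouts; here the proposals are simply passed in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new text for add/replace


def apply_edit(lines: list[str], edit: Edit) -> list[str]:
    """Return a new skill document with one bounded edit applied."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def gated_step(lines: list[str], proposals: list[Edit],
               validate: Callable[[list[str]], float], max_edits: int = 2):
    """Try at most max_edits proposals; keep one only if validation improves."""
    rejected: list[Edit] = []  # rejected edits are kept as negative feedback
    best = validate(lines)
    for edit in proposals[:max_edits]:
        candidate = apply_edit(lines, edit)
        score = validate(candidate)
        if score > best:
            lines, best = candidate, score
        else:
            rejected.append(edit)
    return lines, best, rejected


def toy_validate(lines: list[str]) -> float:
    # Toy held-out score (assumption): reward verification steps, penalize generic tips.
    return (sum("verify" in l.lower() for l in lines)
            - sum("tip" in l.lower() for l in lines))


skill = ["Open the file.", "General tip: be careful."]
proposals = [
    Edit("replace", 1, "Verify the output before saving."),
    Edit("add", 0, "General tip: think step by step."),
]
new_skill, score, rejected = gated_step(skill, proposals, toy_validate)
```

The first proposal raises the toy validation score and is kept; the second lowers it and lands in the rejected buffer. The frozen agent is not involved in this loop except through whatever `validate` measures, so the deployed skill document adds no extra model calls at inference time.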

Original note title

RL-trained skill curation decoupled from frozen executor produces repositories that evolve from generic guidance toward execution-oriented refinement and meta-strategy skills