Can skill libraries prevent redundant narrow artifacts from proliferating?
This reads 'skill libraries' as externalized stores of reusable agent skills, and asks whether they actually curb the buildup of one-off, overly specific entries, or just become dumping grounds for them.
This explores whether storing skills outside a model's weights can stop an agent's toolkit from bloating with narrow, single-use artifacts. The corpus suggests the library structure alone doesn't prevent this — what prevents it is the abstraction and curation discipline layered on top. A skill library is only as lean as the policy that decides what gets to live in it.
The foundational case for libraries is that they let agents learn continuously without overwriting what they already know. VOYAGER (see 'Can agents learn new skills without forgetting old ones?') stores executable skills in an embedding-indexed store and builds complex skills out of simpler ones, sidestepping the catastrophic forgetting you get from weight updates. Composition is also the mechanism that fights redundancy: if a new task can be assembled from existing primitives, you don't need to mint a fresh narrow skill for it. Composition, in other words, shrinks the set of things worth storing.
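To make the embedding-indexed store and the composition idea concrete, here is a minimal sketch. It is not VOYAGER's implementation: the `SkillLibrary` class, the toy bag-of-words `embed` function, and the `duplicate_threshold` value are all hypothetical stand-ins chosen so the example runs without external dependencies. A real system would presumably use a learned text encoder.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Hypothetical skill store indexed by description embeddings."""

    def __init__(self, duplicate_threshold: float = 0.8):
        self.skills = {}  # name -> (description, embedding, callable)
        self.duplicate_threshold = duplicate_threshold

    def retrieve(self, query: str, k: int = 3):
        """Return up to k (similarity, name) pairs, most similar first."""
        q = embed(query)
        scored = sorted(
            ((cosine(q, emb), name) for name, (_, emb, _) in self.skills.items()),
            reverse=True,
        )
        return scored[:k]

    def add(self, name: str, description: str, fn) -> bool:
        """Store the skill unless a near-duplicate description already exists."""
        hits = self.retrieve(description, k=1)
        if hits and hits[0][0] >= self.duplicate_threshold:
            return False
        self.skills[name] = (description, embed(description), fn)
        return True


library = SkillLibrary()
library.add("mine_wood", "collect wood logs from a tree", lambda: ["wood"])
library.add("craft_planks", "craft planks from wood logs", lambda wood: ["plank"] * 4)


def planks_from_scratch():
    """A complex skill composed from two stored primitives."""
    wood = library.skills["mine_wood"][2]()
    return library.skills["craft_planks"][2](wood)


library.add("planks_from_scratch", "gather wood then craft planks", planks_from_scratch)

# A reworded copy of an existing skill is refused instead of stored.
accepted = library.add(
    "mine_wood_v2", "collect wood logs from a tree nearby", lambda: ["wood"]
)
```

In this sketch the duplicate check lives in `add`, which is a design choice of the example rather than something the source attributes to VOYAGER. The point is only that retrieval-before-insert plus composition gives the library a way to decline a redundant entry.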
The sharper answer to your question comes from how artifacts get generalized before they're filed away. Agent Workflow Memory (see 'Can agents learn reusable sub-task routines from past experience?') induces routines at *finer granularity than whole tasks* and strips out example-specific values, so instead of saving 'book a flight from NYC to LA on Tuesday,' it saves the reusable booking sub-routine. That abstraction step is the difference between a library of reusable patterns and a pile of redundant narrow snapshots, and it's where the reported gains come from (24.6% relative on Mind2Web, 51.1% on WebArena). Without it, you'd just be caching specific solved instances.
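The abstraction step can be illustrated with a small sketch. It is an assumption-laden toy, not Agent Workflow Memory's actual induction procedure: the functions `abstract_step` and `induce_workflow` are hypothetical, and the mapping from concrete values to slot names is supplied by hand here, whereas a real system would have to infer it.

```python
import re


def abstract_step(step: str, bindings: dict[str, str]) -> str:
    """Replace each concrete value in a step with a named placeholder."""
    for value, slot in bindings.items():
        pattern = r"\b" + re.escape(value) + r"\b"
        step = re.sub(pattern, lambda _m, s=slot: "{" + s + "}", step)
    return step


def induce_workflow(trajectory: list[str], bindings: dict[str, str]) -> tuple[str, ...]:
    """Turn one solved trajectory into a value-free, hashable routine."""
    return tuple(abstract_step(step, bindings) for step in trajectory)


# Two solved instances that differ only in their example-specific values.
trajectory_a = [
    "open flight search",
    "type NYC into the From field",
    "type LA into the To field",
    "select Tuesday as the date",
    "click search",
]
trajectory_b = [
    "open flight search",
    "type Boston into the From field",
    "type Denver into the To field",
    "select Friday as the date",
    "click search",
]

# Storing abstracted routines in a set collapses the two instances into one entry.
workflows = {
    induce_workflow(trajectory_a, {"NYC": "origin", "LA": "destination", "Tuesday": "date"}),
    induce_workflow(trajectory_b, {"Boston": "origin", "Denver": "destination", "Friday": "date"}),
}
```

Stored verbatim, the two trajectories would be two narrow entries. After abstraction they are the same routine, so the memory holds one reusable item instead of one snapshot per solved task.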
But the most direct evidence that libraries *don't* self-clean comes from SkillOS (see 'Can a separate trained curator improve skill libraries better than frozen agents?'). It found that, left to a frozen agent, repositories drift toward 'generic verbose additions': low-value entries that accumulate much as redundant narrow artifacts do. The fix was a *separate trained curator* whose whole job is to evolve the repository toward actionable execution logic and cross-task meta-strategies. Proliferation is the default failure mode; preventing it required a dedicated learning process, not just a place to put things. This echoes a broader limit described in 'What stops large language models from improving themselves?': a system can't reliably improve itself without something external validating the changes, so a library can't curate itself any more than a model can verify itself.
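The idea that additions need external validation can be sketched as an admission gate. To be clear about what this is: a hand-written, rule-based stand-in, not SkillOS's trained curator. The names `success_rate` and `curate`, the toy integer "skills", and the held-out task list are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[int], int]
Task = Tuple[int, int]  # (input, expected output)


def success_rate(skills: Dict[str, Skill], tasks: List[Task]) -> float:
    """Fraction of held-out tasks solved by at least one skill in the repository."""
    solved = sum(
        any(skill(x) == expected for skill in skills.values())
        for x, expected in tasks
    )
    return solved / len(tasks)


def curate(
    repo: Dict[str, Skill], candidates: Dict[str, Skill], tasks: List[Task]
) -> Dict[str, Skill]:
    """Admit a candidate only if it raises the held-out success rate."""
    repo = dict(repo)  # do not mutate the caller's repository
    for name, skill in candidates.items():
        baseline = success_rate(repo, tasks)
        trial = {**repo, name: skill}
        if success_rate(trial, tasks) > baseline:
            repo = trial
    return repo


held_out = [(2, 4), (5, 10), (3, 9), (4, 16)]
repository = {"double": lambda x: 2 * x}
proposed = {
    "double_again": lambda x: x + x,   # redundant with "double"
    "square": lambda x: x * x,         # covers tasks nothing else solves
    "square_of_three": lambda x: 9,    # narrow one-off, adds nothing after "square"
}

curated = curate(repository, proposed, held_out)
```

Note that this greedy gate is order-dependent: had `square_of_three` been proposed before `square`, it would have been admitted, and nothing here would later prune it. That limitation is one reason a fixed rule is a weak substitute for a curator that learns across task streams.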
Worth knowing as a contrast: proliferation isn't always the enemy. Sometimes the opposite failure, collapse, is. 'Does RL training collapse format diversity in pretrained models?' shows RL training crushing format diversity down to one dominant winner, and 'Can isolating task-specific parameters prevent multi-task fine-tuning interference?' shows that deliberately *isolating* task-specific structure beats letting everything merge. Read together, the corpus frames a real tension: you want enough abstraction and curation to stop redundant narrow skills from piling up, but not so much convergence that you lose the distinct capabilities that made separate skills worth keeping in the first place.
Sources (6 notes)
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.