Can individual skills improve through reuse and accumulate experience across tasks?
This explores whether an agent can get better at reusable skills by repeating them and folding what it learns from each task back into a growing library — rather than treating every task as a fresh start.
This reads the question as: can a skill itself sharpen through reuse, and can experience pile up across tasks instead of evaporating when each one ends? The corpus says yes, but the interesting part is where the improvement actually lives. The dominant answer is that skills accumulate *outside* the model's weights, in an external library the agent reads and rewrites. VOYAGER stores executable skills in an embedding-indexed library and builds complex skills out of simpler ones, which lets it keep learning without the catastrophic forgetting that weight updates cause [Can agents learn new skills without forgetting old ones?]. Agent Workflow Memory pushes the same idea to a finer grain: instead of memorizing whole tasks, it extracts reusable *sub-task* routines, strips out the example-specific details, and compounds them. The payoff is a 24.6% relative gain on Mind2Web and 51.1% on WebArena, and the gains grow as test tasks drift further from the training ones [Can agents learn reusable sub-task routines from past experience?]. So reuse isn't just retrieval; abstraction is what makes a skill portable.
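The library-plus-composition idea can be sketched in a few lines. This is an illustration under stated assumptions, not VOYAGER's or Agent Workflow Memory's actual code: `embed` is a toy bag-of-words stand-in for a learned embedding model, and `SkillLibrary`, `buy`, and the example skills are hypothetical names invented here.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real system uses a learned encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class SkillLibrary:
    """Hypothetical external skill store: executable skills indexed by description."""
    skills: Dict[str, Callable[..., List[str]]] = field(default_factory=dict)
    index: Dict[str, Counter] = field(default_factory=dict)

    def add(self, name: str, description: str, fn: Callable[..., List[str]]) -> None:
        self.skills[name] = fn
        self.index[name] = embed(description)

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return the names of the k skills whose descriptions best match the query."""
        q = embed(query)
        ranked = sorted(self.index, key=lambda n: cosine(q, self.index[n]), reverse=True)
        return ranked[:k]


library = SkillLibrary()

# Simple skills take the example-specific value (the item) as a parameter,
# so the routine itself stays abstract and reusable.
library.add("search", "search the site for a product",
            lambda item: [f"type {item} in search box", "press enter"])
library.add("add_to_cart", "add the selected product to the cart",
            lambda: ["click add to cart"])


def buy(item: str) -> List[str]:
    """A complex skill composed from simpler skills already in the library."""
    return library.skills["search"](item) + library.skills["add_to_cart"]() + ["click checkout"]


library.add("buy", "buy purchase a product and check out", buy)
```

Nothing here touches model weights: adding `buy` extends what the agent can do without altering `search` or `add_to_cart`, which is the property that sidesteps forgetting.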
The sharper finding is that *how* you fold experience back in matters as much as *that* you do. SkillRL shows the accumulation shouldn't be uniform: successes get stored as concrete demonstrations, failures get abstracted into lessons. That asymmetry mirrors how human experts reason and avoids the degradation you get from treating every episode the same [Should successful and failed episodes be processed differently?]. AgentFly takes the no-weight-update stance to its logical end: it formalizes learning as a memory-augmented decision process where credit assignment and policy improvement happen entirely through memory operations, and still hits 87.88% on GAIA validation without touching the model's parameters [Can agents learn continuously from experience without updating weights?].
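The asymmetric treatment of successes and failures can be sketched as a small consolidation routine. This is a minimal illustration, not SkillRL's implementation: `Episode`, `Memory`, and `consolidate` are hypothetical names, and `summarize_failure` is a template stand-in for whatever abstraction step a real system would use (most likely a model call).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Episode:
    task: str
    actions: List[str]
    success: bool
    error: str = ""  # what went wrong, if anything


@dataclass
class Memory:
    demonstrations: List[Tuple[str, List[str]]] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)


def summarize_failure(ep: Episode) -> str:
    """Stand-in abstraction step (assumption): drop the trajectory, keep the takeaway."""
    return f"When attempting '{ep.task}', avoid: {ep.error}"


def consolidate(memory: Memory, ep: Episode) -> None:
    """Store successes verbatim as demonstrations; compress failures into lessons."""
    if ep.success:
        memory.demonstrations.append((ep.task, list(ep.actions)))
    else:
        lesson = summarize_failure(ep)
        if lesson not in memory.lessons:  # repeated failures add no new context
            memory.lessons.append(lesson)


memory = Memory()
consolidate(memory, Episode("export report", ["open settings", "click export"], True))
consolidate(memory, Episode("delete account", ["open profile", "click delete"], False,
                            "confirming before re-entering the password"))
consolidate(memory, Episode("delete account", ["open menu", "click delete"], False,
                            "confirming before re-entering the password"))
```

Two failed attempts at the same task collapse into one lesson while the success keeps its full action sequence, which is one way the asymmetric scheme can use less context than storing every episode uniformly.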
What you didn't ask but probably want to know: a frozen agent curating its own library isn't the best curator. SkillOS separates a *trainable* curator from a frozen executor and finds the repository shifts on its own from generic, verbose entries toward actionable execution logic and cross-task meta-strategies. The learned curator also generalizes across different model backbones [Can a separate trained curator improve skill libraries better than frozen agents?]. SkillClaw extends accumulation past the single agent entirely, aggregating interaction trajectories across many users so siloed individual learning becomes shared capability [How can agent systems share learned skills across users?]. Experience, in other words, can compound not just across one agent's tasks but across a whole population of them.
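To make cross-user aggregation concrete, here is a deliberately simple sketch. The sources say only that trajectories are pooled and an evolver identifies patterns; counting action n-grams shared by several users is a substitute technique chosen here for illustration, and `shared_patterns`, `min_users`, and the sample logs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Trajectory = List[str]


def shared_patterns(
    trajectories: Dict[str, List[Trajectory]],
    n: int = 2,
    min_users: int = 2,
) -> List[Tuple[str, ...]]:
    """Return action n-grams seen in trajectories from at least `min_users` distinct users."""
    users_by_pattern: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for user, runs in trajectories.items():
        for run in runs:
            # Slide a window of n consecutive actions over each trajectory.
            for i in range(len(run) - n + 1):
                users_by_pattern[tuple(run[i:i + n])].add(user)
    return sorted(p for p, users in users_by_pattern.items() if len(users) >= min_users)


logs = {
    "alice": [["open settings", "click export", "choose csv"]],
    "bob": [["open settings", "click export", "choose pdf"]],
    "carol": [["open inbox", "click compose"]],
}
shared = shared_patterns(logs)
```

Here the only pattern promoted to the shared pool is the one two users independently performed; the user-specific endings (`choose csv`, `choose pdf`) stay out, which echoes the abstraction step that makes a skill portable.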
Two cautions sit underneath all of this. The first is *why* externalized libraries are the favored route at all: weight-update approaches forget. That's the whole motivation behind keeping skills in a library rather than fine-tuning them into the weights [Can agents learn new skills without forgetting old ones?]. Transformer² offers a weight-side counterpoint, though: it composes task-specific expert vectors at inference time so specialization can continue without interference [Can models dynamically activate expert skills at inference time?]. The second is a ceiling: imitation-style shortcuts can make a model *look* improved while closing no real capability gap, because base-model competence sets the limit [Can imitating ChatGPT fool evaluators into thinking models improved?]. Reuse and accumulated experience genuinely raise performance, but they compound what the model can already do, not what it fundamentally can't.
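The weight-side counterpoint above can be written out compactly. This is a sketch based only on the description in the sources (tune singular values, then mix expert vectors at inference); the symbols $z_k$ and $\alpha_k$ are notation chosen here, and how the mixing weights are obtained is not specified in these notes.

```latex
\begin{align}
W   &= U \,\mathrm{diag}(\sigma)\, V^{\top}
      && \text{(SVD of a frozen weight matrix)} \\
W_k &= U \,\mathrm{diag}(\sigma \odot z_k)\, V^{\top}
      && \text{(expert } k \text{ rescales singular values only)} \\
W'  &= U \,\mathrm{diag}\!\Big(\sigma \odot \sum_{k} \alpha_k z_k\Big) V^{\top}
      && \text{(experts mixed at inference)}
\end{align}
```

Because every expert shares the same $U$ and $V$ and differs only in a vector the size of $\sigma$, adding a new expert leaves the existing ones untouched, which is the sense in which specialization continues without interference.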
Sources (8 notes)
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces a 24.6% relative gain on Mind2Web and a 51.1% relative gain on WebArena, with larger gains as train-test gaps widen.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.