INQUIRING LINE

Can individual skills improve through reuse and accumulate experience across tasks?

This explores whether an agent can get better at reusable skills by repeating them and folding what it learns from each task back into a growing library — rather than treating every task as a fresh start.


This reads the question as: can a skill itself sharpen through reuse, and can experience pile up across tasks instead of evaporating when each one ends? The corpus says yes, but the interesting part is where the improvement actually lives. The dominant answer is that skills accumulate *outside* the model's weights, in an external library the agent reads and rewrites. VOYAGER stores executable skills in an embedding-indexed library and builds complex skills out of simpler ones, which lets it keep learning without the catastrophic forgetting that weight updates cause (see "Can agents learn new skills without forgetting old ones?"). Agent Workflow Memory pushes the same idea finer-grained: instead of memorizing whole tasks, it extracts reusable *sub-task* routines, strips out the example-specific details, and compounds them. The payoff is a 24.6% relative gain on Mind2Web and 51.1% on WebArena, and it grows as new tasks drift further from the training ones (see "Can agents learn reusable sub-task routines from past experience?"). So reuse isn't just retrieval; abstraction is what makes a skill portable.
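To make the library idea concrete, here is a minimal sketch of an embedding-indexed skill store with composition. It is an illustration under stated assumptions, not VOYAGER's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `Skill`, `SkillLibrary`, `compose`, and the example skill names are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model (assumption): bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Skill:
    name: str
    description: str
    run: Callable[[], list]  # returns the list of steps the skill executes


class SkillLibrary:
    def __init__(self) -> None:
        self._skills = {}  # name -> Skill
        self._index = {}   # name -> embedding of the description

    def add(self, skill: Skill) -> None:
        # Re-adding under an existing name overwrites it: that is a refinement.
        self._skills[skill.name] = skill
        self._index[skill.name] = embed(skill.description)

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def retrieve(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self._index, key=lambda n: cosine(q, self._index[n]), reverse=True)
        return [self._skills[n] for n in ranked[:k]]

    def compose(self, name: str, description: str, parts: list, extra_steps: list) -> Skill:
        # Sub-skills are looked up at run time, so refining a part
        # automatically improves every composite built on it.
        def run() -> list:
            steps = []
            for part in parts:
                steps.extend(self.get(part).run())
            return steps + extra_steps

        skill = Skill(name, description, run)
        self.add(skill)
        return skill


library = SkillLibrary()
library.add(Skill("chop_tree", "collect wood logs from a tree", lambda: ["find tree", "chop trunk"]))
library.add(Skill("craft_planks", "craft wooden planks from logs", lambda: ["craft planks"]))
library.compose("build_table", "build a crafting table from planks",
                ["chop_tree", "craft_planks"], ["craft table"])

best = library.retrieve("collect wood logs")[0]
table_steps = library.get("build_table").run()
```

Nothing here touches model weights: adding or overwriting an entry is the whole learning step, which is why old skills cannot be forgotten by learning new ones.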

The sharper finding is that *how* you fold experience back in matters as much as *that* you do. SkillRL shows the accumulation shouldn't be uniform: successes get stored as concrete demonstrations, failures get abstracted into lessons. That asymmetry mirrors how human experts reason and avoids the degradation you get from treating every episode the same (see "Should successful and failed episodes be processed differently?"). AgentFly takes the no-weight-update stance to its logical end: it formalizes learning as a memory-augmented decision process where credit assignment and policy improvement happen entirely through memory operations, and still hits 87.88% on GAIA validation without touching the model's parameters (see "Can agents learn continuously from experience without updating weights?").
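The asymmetry can be sketched in a few lines. This is a hedged illustration of the idea only, not SkillRL's method: `Episode`, `ExperienceMemory`, and the `failure_note` field (a short diagnosis that some other component, such as an LLM, is assumed to have written) are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    actions: list
    success: bool
    failure_note: str = ""  # hypothetical: a one-line diagnosis of what went wrong


@dataclass
class ExperienceMemory:
    demonstrations: dict = field(default_factory=dict)  # task -> concrete action list
    lessons: list = field(default_factory=list)         # abstract, task-independent notes

    def fold(self, episode: Episode) -> None:
        if episode.success:
            # Successes are kept verbatim as a concrete demonstration.
            self.demonstrations[episode.task] = list(episode.actions)
        elif episode.failure_note and episode.failure_note not in self.lessons:
            # Failures drop the trajectory and keep only the abstracted lesson.
            self.lessons.append(episode.failure_note)

    def context_for(self, task: str) -> str:
        lines = [f"Lesson: {lesson}" for lesson in self.lessons]
        demo = self.demonstrations.get(task)
        if demo:
            lines.append("Demo: " + " -> ".join(demo))
        return "\n".join(lines)


memory = ExperienceMemory()
memory.fold(Episode("book flight", ["open site", "search", "pay"], success=True))
memory.fold(Episode("book hotel", ["open site", "pay"], success=False,
                    failure_note="Confirm dates before paying."))
prompt_context = memory.context_for("book flight")
```

Storing failures as one-line lessons rather than full trajectories is also what keeps the context short: the failed action list never reaches the prompt.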

What you didn't ask but probably want to know: a frozen agent curating its own library isn't the best curator. SkillOS separates a *trainable* curator from a frozen executor and finds the repository shifts on its own from generic, verbose entries toward actionable execution logic and cross-task meta-strategies, and that learned curator generalizes across different executor backbones and domains (see "Can a separate trained curator improve skill libraries better than frozen agents?"). SkillClaw extends accumulation past the single agent entirely, aggregating interaction trajectories across many users so siloed individual learning becomes shared capability (see "How can agent systems share learned skills across users?"). Experience, in other words, can compound not just across one agent's tasks but across a whole population of them.
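As a rough picture of what cross-user aggregation means, the sketch below pools trajectories from several users and promotes any short action sequence that shows up for at least two of them. The frequency rule is a deliberately simple stand-in chosen for illustration; SkillClaw's evolver is described only as identifying patterns and refining skills, and the function name, parameters, and sample data here are all hypothetical.

```python
from collections import defaultdict


def shared_routines(trajectories: dict, length: int = 2, min_users: int = 2) -> list:
    """Return action sequences of `length` steps seen for at least `min_users` distinct users."""
    users_by_routine = defaultdict(set)
    for user, runs in trajectories.items():
        for actions in runs:
            # Slide a window over each run and record who performed each sequence.
            for start in range(len(actions) - length + 1):
                routine = tuple(actions[start:start + length])
                users_by_routine[routine].add(user)
    return sorted(r for r, users in users_by_routine.items() if len(users) >= min_users)


trajectories = {
    "alice": [["login", "search", "add_to_cart"]],
    "bob": [["login", "search", "checkout"]],
    "carol": [["search", "add_to_cart"]],
}
promoted = shared_routines(trajectories)
```

Counting distinct users rather than raw occurrences means one user repeating a habit does not get it promoted to everyone.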

Two cautions sit underneath all of this. The first is *why* externalized libraries are the favored route at all: weight-update approaches forget. That's the whole motivation behind keeping skills in a library rather than fine-tuning them in (see "Can agents learn new skills without forgetting old ones?"), though Transformer2 offers a weight-side counterpoint, composing task-specific expert vectors at inference time so specialization can continue without interference (see "Can models dynamically activate expert skills at inference time?"). The second is a ceiling: imitation-style shortcuts can make a model *look* improved while closing no real capability gap, because base-model competence sets the limit (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). Reuse and accumulated experience genuinely raise performance, but they compound what the model can already do, not what it fundamentally can't.
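The weight-side counterpoint can be written down compactly. The following is a sketch based on the note's description (tuning only singular values, then mixing expert vectors at inference); the symbols $z_k$ for an expert vector and $\alpha_k$ for its mixing weight are notation assumed here for illustration.

```latex
% Decompose a frozen weight matrix, then rescale only its singular values.
W = U \Sigma V^{\top}, \qquad
W'(z) = U \,\bigl(\Sigma \,\operatorname{diag}(z)\bigr)\, V^{\top}

% At inference, K task-specific expert vectors are mixed into one.
z' = \sum_{k=1}^{K} \alpha_k \, z_k, \qquad
W_{\text{adapted}} = W'(z')
```

Because each expert only rescales the singular values of the same frozen $U$ and $V$, adding a new expert vector leaves the existing ones untouched, which is the sense in which specialization continues without interference.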


Sources (8 notes)

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

How can agent systems share learned skills across users?

SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.

Can models dynamically activate expert skills at inference time?

Transformer2 demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating whether individual agent skills sharpen through reuse and accumulate across tasks. This question remains open despite recent work.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as time-stamped, not current.
- External skill libraries (not weight updates) prevent catastrophic forgetting and enable compound learning (VOYAGER); Agent Workflow Memory shows relative gains of 24.6% on Mind2Web and 51.1% on WebArena, larger as the train-test gap widens, with abstraction as the key to portability (2023–2024).
- Asymmetric experience folding works: successes stored as concrete demos, failures as abstracted lessons, mirrors expert reasoning and outperforms uniform storage (SkillRL, ~2024).
- Memory-augmented frozen agents can hit 87.88% on GAIA without parameter updates, pushing credit assignment and policy improvement entirely into memory operations (AgentFly, ~2025).
- Trainable curators (decoupled from frozen executors) shift repositories from verbose to actionable logic and generalize across model backbones; learned curation outperforms static design (SkillOS, 2026).
- Cross-user skill aggregation compounds individual learning into shared population-level capability (SkillClaw, 2026).

Anchor papers (verify; mind their dates):
- arXiv:2409.07429 (Agent Workflow Memory, 2024)
- arXiv:2605.06614 (SkillOS, 2026)
- arXiv:2604.08377 (SkillClaw, 2026)
- arXiv:2501.06252 (Transformer2, 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For weight-update approaches: does Transformer2's inference-time composition of expert vectors, or recent parameter-efficient fine-tuning, actually relax the "forgetting" ceiling that justified externalized libraries? For memory-based agents: have new evaluation harnesses, long-horizon benchmarks (GAIA variants), or multi-agent orchestration changed what frozen agents can achieve? Has the imitation-style ceiling (model-imitation gaps persisting despite superficial improvement) been tightened or dissolved by stronger base models or better abstraction? Cite what relaxed or held each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially papers showing weight updates *do* accumulate skills durably, or external libraries *fail* to generalize, or that skill reuse trades capability for illusion.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If base-model competence truly sets the ceiling, does learned curation (SkillOS) actually relax it or merely optimize within it? (b) Can population-level skill aggregation (SkillClaw) escape the single-agent imitation ceiling if agents themselves hallucinate?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
