Should successful and failed episodes be processed differently?
Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
Existing memory-based RL methods primarily store raw trajectories. Raw trajectories are token-heavy and noise-saturated; storing them indiscriminately produces context pollution that degrades policy improvement. The alternative — uniform abstraction across all trajectories — destroys the specificity that makes the experience useful.
SkillRL (2602.08234) introduces differential processing as the load-bearing architectural choice. Successful episodes are preserved as full demonstrations — their specific action sequences are exactly what should be reused. Failed episodes are synthesized into concise failure lessons — the specifics of what went wrong don't transfer, but the abstracted lesson does. The asymmetry mirrors how human experts treat experience: remember concrete successes vividly, generalize failures into rules.
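The routing step described above can be sketched roughly as follows. This is a minimal illustration, not SkillRL's implementation: the `Trajectory`, `Demonstration`, and `FailureLesson` records and the `summarize_failure` placeholder are all hypothetical names, and the placeholder stands in for whatever model-based synthesis the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Trajectory:
    """Hypothetical episode record; field names are illustrative only."""
    task: str
    steps: List[str]  # action/observation strings, in order
    success: bool


@dataclass
class Demonstration:
    task: str
    steps: List[str]  # kept verbatim, since the specifics are the value


@dataclass
class FailureLesson:
    task: str
    lesson: str  # short abstracted rule, specifics discarded


def summarize_failure(traj: Trajectory) -> str:
    """Placeholder for an LLM-based synthesis step (assumption)."""
    last = traj.steps[-1] if traj.steps else "no action taken"
    return f"Avoid ending with '{last}' when attempting: {traj.task}"


def process(
    traj: Trajectory,
    summarizer: Callable[[Trajectory], str] = summarize_failure,
) -> Union[Demonstration, FailureLesson]:
    """Route by outcome: successes stay raw, failures are abstracted."""
    if traj.success:
        return Demonstration(traj.task, list(traj.steps))
    return FailureLesson(traj.task, summarizer(traj))
```

The point of the sketch is only the branch in `process`: the two outcomes produce different record types with different storage costs, rather than passing through one shared compression step.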
The two trajectory types feed a hierarchical SkillBank, partitioned into general skills (universal strategic guidance) and task-specific skills (task-level heuristics). The skill library co-evolves with the agent's policy through recursive failure analysis — each new RL iteration both refines the policy and updates the skill library based on what worked and what didn't.
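One way to picture the two-level partition is the sketch below. It assumes a hypothetical `SkillBank` class with a flat list of general skills and a per-task-type dictionary; the method names, the string representation of skills, and the `evolve` helper are illustrative assumptions, and the policy-update half of each RL iteration is omitted entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


class SkillBank:
    """Hypothetical two-level store: general skills plus per-task skills."""

    def __init__(self) -> None:
        self.general: List[str] = []
        self.task_specific: Dict[str, List[str]] = defaultdict(list)

    def add(self, skill: str, task_type: Optional[str] = None) -> None:
        """Add a skill to the general tier, or to one task type's tier."""
        bucket = self.general if task_type is None else self.task_specific[task_type]
        if skill not in bucket:  # skip exact duplicates
            bucket.append(skill)

    def retrieve(self, task_type: str) -> List[str]:
        """General guidance first, then heuristics for this task type."""
        return self.general + self.task_specific.get(task_type, [])


def evolve(bank: SkillBank, episodes: Iterable[Tuple[str, bool, str]]) -> None:
    """One illustrative library update: lessons from failed episodes are added.

    Each episode is (task_type, success, lesson). The policy update that
    would accompany this step in an RL iteration is not modeled here.
    """
    for task_type, success, lesson in episodes:
        if not success:
            bank.add(lesson, task_type)
```

Calling `bank.retrieve("webshop")` returns the general skills followed by the WebShop-specific ones, which is the context the agent would see for that task type under these assumptions.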
The differential-processing claim resolves a tension across the agent-memory literature. The note "Does agent memory degrade when continuously consolidated?" shows that uniform consolidation regresses below baseline because the consolidation step strips applicability conditions. SkillRL's asymmetric treatment is the proposed fix: preserve raw episodes where the specifics matter (successes), and abstract where they don't (failures-as-lessons). This is the third positive case for the condition-preservation hypothesis, alongside ReasoningBank (strategy-level distillation with conditions) and CLIN (causal abstractions preserving "may be necessary"). See ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md.
The conceptual move is that abstraction is the right operation for some trajectory types and the wrong operation for others. Treating all experience the same — uniformly raw OR uniformly abstracted — is the failure mode. The right architecture differentiates by trajectory type, with the differentiation being driven by what each type actually contributes to future decision-making.
Empirically, SkillRL achieves state-of-the-art results on ALFWorld and WebShop while using substantially less context than raw-trajectory-based memory approaches. The compression comes from abstracting failures; the performance comes from preserving successes as demonstrations. Both halves of the asymmetry are doing work.
Update (2026-05-28) — the topological expression of the success-side operation. FluxMem (2605.28773, "Rethinking Memory as Continuously Evolving Connectivity") performs the differential-processing principle's success-side step as graph topology rather than a skill library. Its Long-Term Consolidation stage clusters recurring successful trajectories and crystallizes them into stable procedural circuits — high-utility pathways that mature (monitored by a convergence metric) so that recurring tasks bypass redundant retrieval and directly activate the mature subgraph. This is SkillRL's "preserve successes as reusable demonstrations" claim recast on a heterogeneous memory graph: where SkillRL stores successful episodes as full demonstrations in a SkillBank, FluxMem stores them as crystallized connections between co-activated units. The convergence is informative — two independently developed systems land on the same operation (durably encode recurring successes for direct reuse) through different data structures, which strengthens the case that the success/failure asymmetry is a structural requirement of self-evolving agent memory, not an artifact of one architecture.
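The crystallization idea can be illustrated with a toy graph. This is a loose sketch under stated assumptions, not FluxMem's algorithm: the `CircuitMemory` class is hypothetical, clustering is reduced to counting how often consecutive units co-occur in successful trajectories, and a fixed count threshold stands in for the convergence metric mentioned above.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str]


class CircuitMemory:
    """Toy memory graph: edges strengthen each time a success reuses them."""

    def __init__(self, maturity_threshold: int = 3) -> None:
        self.edge_counts: Counter = Counter()
        # Assumption: a simple count threshold replaces the convergence metric.
        self.maturity_threshold = maturity_threshold

    def record_success(self, units: List[str]) -> None:
        """Reinforce the edge between each pair of consecutive units."""
        for a, b in zip(units, units[1:]):
            self.edge_counts[(a, b)] += 1

    def mature_edges(self) -> List[Edge]:
        """Edges reinforced often enough to count as crystallized."""
        return sorted(
            e for e, n in self.edge_counts.items() if n >= self.maturity_threshold
        )

    def circuit_from(self, start: str) -> List[str]:
        """Follow the strongest mature edge at each step, with no retrieval."""
        path, seen = [start], {start}
        while True:
            candidates = [
                (n, b)
                for (a, b), n in self.edge_counts.items()
                if a == path[-1] and n >= self.maturity_threshold and b not in seen
            ]
            if not candidates:
                return path
            _, nxt = max(candidates)  # highest count; ties broken by name
            path.append(nxt)
            seen.add(nxt)
```

With a threshold of 2, recording the trajectory `["find", "pick", "heat"]` twice makes `circuit_from("find")` return the full path, while a path seen only once stays immature and is not followed.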
Papers: SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning (2602.08234); FluxMem, "Rethinking Memory as Continuously Evolving Connectivity", https://arxiv.org/abs/2605.28773
Related concepts in this collection
-
Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
diagnoses the failure mode SkillRL's differential processing addresses
-
Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank also distills from successes AND failures but treats both as strategies; SkillRL treats successes as demonstrations (raw) and failures as lessons (abstracted) — same idea applied with different granularity
-
Can frozen language models continually improve through memory structure alone?
If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
CLIN preserves applicability conditions via causal form; SkillRL preserves them by treating success-trajectories as raw
-
Can agents learn reusable sub-task routines from past experience?
Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
AWM compounds workflows; SkillRL compounds skills hierarchically — same compositional principle, asymmetric trajectory processing as added axis
-
Can agents learn new skills without forgetting old ones?
Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
VOYAGER is the predecessor; SkillRL adds the success-failure asymmetry and online RL refinement
-
Can a separate trained curator improve skill libraries better than frozen agents?
Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
SkillOS is the complementary axis: SkillRL differentiates *what gets stored* (success demos vs failure lessons); SkillOS differentiates *who learns from the storage* (curator vs executor). SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator
-
Can agents adapt without pausing service to users?
Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
MetaClaw decomposes adaptation across timescales using SkillRL-like failure-distillation as its fast-timescale mechanism; MetaClaw's contribution is adding the slow-timescale weight-update channel
-
Does creating skills inside the agent loop eliminate mismatches?
Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
synthesizes: both ground skills in the agent's own situated trajectory rather than out-of-loop authoring; that note via in-loop creation, SkillRL via differential trajectory processing
Original note title
recursive skill-augmented RL applies differential processing to trajectories — successful episodes preserved as demonstrations while failures distilled into concise lessons