Should successful and failed episodes be processed differently?

Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.

Note · 2026-05-18 · sourced from Reinforcement Learning

Existing memory-based RL methods primarily store raw trajectories. Raw trajectories are token-heavy and noise-saturated; storing them indiscriminately produces context pollution that degrades policy improvement. The alternative — uniform abstraction across all trajectories — destroys the specificity that makes the experience useful.

SkillRL (2602.08234) introduces differential processing as the load-bearing architectural choice. Successful episodes are preserved as full demonstrations — their specific action sequences are exactly what should be reused. Failed episodes are synthesized into concise failure lessons — the specifics of what went wrong don't transfer, but the abstracted lesson does. The asymmetry mirrors how human experts treat experience: remember concrete successes vividly, generalize failures into rules.
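A minimal sketch of the asymmetric routing described above, not SkillRL's actual implementation. The class names, `process_trajectory`, and `naive_summarizer` are hypothetical, and the summarizer is a trivial stand-in for whatever synthesis step produces the failure lesson.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Trajectory:
    task: str
    steps: List[str]  # observation/action records, simplified to strings
    success: bool


@dataclass
class Demonstration:
    task: str
    steps: List[str]  # full action sequence, preserved verbatim


@dataclass
class FailureLesson:
    task: str
    lesson: str  # short abstracted rule; the raw steps are not stored


MemoryEntry = Union[Demonstration, FailureLesson]


def process_trajectory(
    traj: Trajectory,
    summarize_failure: Callable[[Trajectory], str],
) -> MemoryEntry:
    """Route a trajectory by outcome: keep successes whole, abstract failures."""
    if traj.success:
        return Demonstration(task=traj.task, steps=list(traj.steps))
    return FailureLesson(task=traj.task, lesson=summarize_failure(traj))


def naive_summarizer(traj: Trajectory) -> str:
    """Hypothetical placeholder for the lesson-synthesis step."""
    last = traj.steps[-1] if traj.steps else "no action"
    return f"Avoid ending '{traj.task}' with: {last}"


good = Trajectory("heat egg", ["take egg", "heat egg"], success=True)
bad = Trajectory("heat egg", ["open fridge", "go to sink"], success=False)
memory = [process_trajectory(t, naive_summarizer) for t in (good, bad)]
```

The point of the sketch is the branch on `success`: the stored record has a different shape depending on the outcome, so the failure entry costs one short string regardless of how long the failed episode was.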

The two trajectory types feed a hierarchical SkillBank, partitioned into general skills (universal strategic guidance) and task-specific skills (task-level heuristics). The skill library co-evolves with the agent's policy through recursive failure analysis — each new RL iteration both refines the policy and updates the skill library based on what worked and what didn't.
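The two-level partition can be sketched as a small data structure. This is an illustration under assumptions: the `SkillBank` class below, its method names, and the task-type keys are hypothetical, and the note does not specify how SkillRL actually stores or retrieves skills.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SkillBank:
    # Universal strategic guidance, returned for every task.
    general: List[str] = field(default_factory=list)
    # Task-level heuristics, keyed by a (hypothetical) task-type label.
    task_specific: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, skill: str, task_type: Optional[str] = None) -> None:
        """Store a skill in the general tier, or under one task type."""
        if task_type is None:
            bucket = self.general
        else:
            bucket = self.task_specific.setdefault(task_type, [])
        if skill not in bucket:  # skip exact duplicates across iterations
            bucket.append(skill)

    def retrieve(self, task_type: str) -> List[str]:
        """Return general skills followed by skills for this task type."""
        return self.general + self.task_specific.get(task_type, [])


bank = SkillBank()
bank.add("Verify the goal state before finishing")
bank.add("Check the fridge first for food items", task_type="alfworld/heat")
context = bank.retrieve("alfworld/heat")
```

In a co-evolving loop, each RL iteration would call `add` with skills derived from that iteration's outcomes, so the bank and the policy change together.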

The differential-processing claim resolves a tension across the agent-memory literature. The note "Does agent memory degrade when continuously consolidated?" shows that uniform consolidation regresses below baseline because the consolidation step strips applicability conditions. SkillRL's asymmetric treatment is the proposed fix: preserve raw episodes where the specifics matter (successes), abstract where they don't (failures-as-lessons). This is the third positive case for the condition-preservation hypothesis, alongside ReasoningBank (strategy-level distillation with conditions) and CLIN (causal abstractions preserving "may be necessary"). See ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md.

The conceptual move is that abstraction is the right operation for some trajectory types and the wrong operation for others. Treating all experience the same — uniformly raw OR uniformly abstracted — is the failure mode. The right architecture differentiates by trajectory type, with the differentiation being driven by what each type actually contributes to future decision-making.

Empirically, SkillRL achieves state-of-the-art results on ALFWorld and WebShop while using substantially less context than raw-trajectory-based memory approaches. The compression comes from abstracting failures into lessons; the performance comes from preserving successes as full demonstrations. Both halves of the asymmetry are doing work.

Update (2026-05-28) — the topological expression of the success-side operation. FluxMem (2605.28773, "Rethinking Memory as Continuously Evolving Connectivity") performs the differential-processing principle's success-side step as graph topology rather than a skill library. Its Long-Term Consolidation stage clusters recurring successful trajectories and crystallizes them into stable procedural circuits — high-utility pathways that mature (monitored by a convergence metric) so that recurring tasks bypass redundant retrieval and directly activate the mature subgraph. This is SkillRL's "preserve successes as reusable demonstrations" claim recast on a heterogeneous memory graph: where SkillRL stores successful episodes as full demonstrations in a SkillBank, FluxMem stores them as crystallized connections between co-activated units. The convergence is informative — two independently developed systems land on the same operation (durably encode recurring successes for direct reuse) through different data structures, which strengthens the case that the success/failure asymmetry is a structural requirement of self-evolving agent memory, not an artifact of one architecture.
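A loose sketch of the "crystallize recurring successes into connections" idea, under stated assumptions. It is not FluxMem's algorithm: the note does not describe the clustering procedure or the convergence metric, so a plain co-activation count with a fixed threshold (`min_count`, a hypothetical parameter) is swapped in for both.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def consolidate(successful_runs: Iterable[List[str]], min_count: int = 2) -> Set[Edge]:
    """Return edges between memory units that co-occur in enough successful runs.

    Each run is the list of memory-unit ids activated during one successful
    trajectory. An edge is treated as "mature" once its co-activation count
    reaches min_count (a stand-in for a real convergence metric).
    """
    counts: Counter = Counter()
    for units in successful_runs:
        # Deduplicate and sort so each undirected pair has one canonical key.
        for a, b in combinations(sorted(set(units)), 2):
            counts[(a, b)] += 1
    return {edge for edge, n in counts.items() if n >= min_count}


runs = [
    ["find", "pick", "place"],
    ["find", "pick"],
    ["open", "pick"],
]
mature_edges = consolidate(runs)
```

Here only the pair that recurs across successful runs survives the threshold; one-off pairings stay out of the stable subgraph, which is the behavior the consolidation stage is described as aiming for.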


Papers: SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning (2602.08234); FluxMem: "Rethinking Memory as Continuously Evolving Connectivity", https://arxiv.org/abs/2605.28773

Original note title

recursive skill-augmented RL applies differential processing to trajectories — successful episodes preserved as demonstrations while failures distilled into concise lessons