INQUIRING LINE

Does self-play feedback improve skills created from the agent's own experience?

This explores whether an agent can bootstrap better skills purely from feedback it generates against itself — using its own play and experience as the training signal — or whether that loop needs something from outside to actually improve.


The corpus says yes, with a sharp caveat: self-play works, but only when something inside the loop plays the role that external feedback normally would. Ctx2Skill is the cleanest example. It manufactures the missing reward by splitting the agent into three roles: a Challenger that escalates difficulty into a curriculum, a Judge that hands down binary verdicts, and a learner that rewrites its own skills in plain language. Skills co-evolve unsupervised, but the system holds together only because a generalization safeguard keeps adversarial pressure from collapsing it (see "Can language models learn skills without human supervision?"). That safeguard is the tell: unconstrained self-revision is the failure mode, not the goal.
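The three-role loop can be sketched in a few lines. This is a toy under stated assumptions, not Ctx2Skill's implementation: tasks are bare difficulty levels, the skill library is just the set of levels the learner can handle, and the names challenger, judge, held_out_score, and self_play are hypothetical. The generalization safeguard is modeled here as a held-out check that rejects any revision lowering held-out performance; the source does not say how the real safeguard is built.

```python
from typing import Callable


def challenger(level: int) -> int:
    """Hypothetical Challenger: the task is simply its difficulty level."""
    return level


def judge(task: int, skills: frozenset) -> bool:
    """Hypothetical Judge: binary verdict, pass iff some skill covers the task."""
    return task in skills


def held_out_score(skills: frozenset, held_out: list[int]) -> int:
    """Number of held-out tasks the current skill library passes."""
    return sum(judge(task, skills) for task in held_out)


def self_play(
    revise: Callable[[frozenset, int], frozenset],
    rounds: int,
    held_out: list[int],
) -> tuple[frozenset, int]:
    """Run the toy loop; returns the final skill library and curriculum level."""
    skills: frozenset = frozenset()
    level = 1
    for _ in range(rounds):
        task = challenger(level)
        if judge(task, skills):
            level += 1  # the curriculum escalates only after a pass
            continue
        candidate = revise(skills, task)  # learner rewrites its own skills
        # Assumed safeguard: keep the edit only if held-out tasks do not regress.
        if held_out_score(candidate, held_out) >= held_out_score(skills, held_out):
            skills = candidate
    return skills, level
```

With an additive reviser such as `lambda skills, task: skills | {task}` the curriculum climbs steadily. With a reviser that overwrites the library on every failure (`lambda skills, task: frozenset({task})`), the held-out check refuses the edit and the learner stalls instead of forgetting what it already knew.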

The reason experience-derived skills are worth improving at all comes from a separate strand. Agents trained only on expert demonstrations are capped by what their curators imagined: they never touch the environment, never see their own failures, and cannot generalize past the demonstrated cases (see "Can agents learn beyond what their training data shows?"). The proposed escape is to treat the consequences of your own actions as supervision. Across eight environments, agents that learned from the future states their actions produced matched expert-dependent baselines with half the data (see "Can agents learn from their own actions without external rewards?"). So the raw material of self-generated experience is genuinely informative; the open question is what kind of feedback turns it into better skills.
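To make "future states as supervision" concrete, here is a minimal sketch, assuming a made-up one-dimensional environment. It is not the method evaluated in the cited work. The agent rolls out its own actions, records each (state, action, next state) triple, and fits a lookup-table dynamics model using the observed next state as the label. No reward and no expert appears anywhere. The names toy_env_step, collect_transitions, and fit_dynamics are hypothetical.

```python
from collections import Counter, defaultdict


def toy_env_step(state: int, action: int) -> int:
    """Hypothetical environment: positions 0..4, actions -1 or +1, walls clip."""
    return max(0, min(4, state + action))


def collect_transitions(start: int, actions: list[int]) -> list[tuple[int, int, int]]:
    """Roll out the agent's own actions and log what happened next."""
    state, data = start, []
    for action in actions:
        next_state = toy_env_step(state, action)
        data.append((state, action, next_state))  # the future state is the label
        state = next_state
    return data


def fit_dynamics(data: list[tuple[int, int, int]]) -> dict[tuple[int, int], int]:
    """Predict the most frequently observed next state for each (state, action)."""
    counts: dict[tuple[int, int], Counter] = defaultdict(Counter)
    for state, action, next_state in data:
        counts[(state, action)][next_state] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}
```

After a rollout that bumps into the right-hand wall, the fitted table already encodes that pushing right from position 4 goes nowhere, which is the kind of consequence an expert demonstration that never hits the wall would not show.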

Here the corpus gets specific about guardrails. SkillOpt shows that bounded editing beats free self-revision: textual 'learning-rate' budgets, held-out validation gates, and (crucially) a retained buffer of rejected edits stop the agent from drifting into overfit, incoherent skill libraries (see "Does constraining edits help agents improve their own skills?"). SkillOS adds a structural twist: don't let the executor curate its own skills at all. A separately trained curator, decoupled from a frozen executor, shifts repositories away from verbose generic additions toward actionable, cross-task meta-strategies, and that learned curation transfers across different backbones (see "Can a separate trained curator improve skill libraries better than frozen agents?"). Both findings converge on the same point: the improvement comes less from self-play itself than from the constraint discipline wrapped around it.
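The three SkillOpt-style controls can be illustrated with a small gate around a text skill. This is a sketch under assumptions, not SkillOpt's code: the edit budget is approximated as a cap on changed lines measured with Python's difflib, the validation gate requires a strict improvement on a caller-supplied held-out scorer, and rejected proposals are simply logged with a reason. The class name GatedSkillEditor, its fields, and the strict-improvement rule are all hypothetical.

```python
import difflib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GatedSkillEditor:
    skill: str                          # current skill text
    score_fn: Callable[[str], float]    # held-out validation scorer (assumed)
    max_changed_lines: int = 2          # textual "learning-rate" budget (assumed unit)
    rejected: list[tuple[str, str]] = field(default_factory=list)

    def changed_lines(self, proposal: str) -> int:
        """Count lines added or removed relative to the current skill."""
        diff = difflib.ndiff(self.skill.splitlines(), proposal.splitlines())
        return sum(1 for line in diff if line.startswith(("+ ", "- ")))

    def propose(self, proposal: str) -> bool:
        """Apply the proposal only if it fits the budget and beats validation."""
        if self.changed_lines(proposal) > self.max_changed_lines:
            self.rejected.append((proposal, "over budget"))
            return False
        if self.score_fn(proposal) <= self.score_fn(self.skill):
            self.rejected.append((proposal, "failed validation"))
            return False
        self.skill = proposal
        return True
```

A wholesale rewrite is turned away by the budget before it is ever scored, and a small edit that does not help on held-out data is turned away by the gate. Both land in `rejected`, where a later proposer could consult them to avoid repeating the same mistake.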

The deepest answer, though, is a warning. Pure self-improvement is structurally circular: it stalls on the generation-verification gap, diversity collapse, and reward hacking. The methods that actually work only look like 'self'-improvement; in practice they smuggle in external anchors such as past model versions, third-party judges, tool feedback, and user corrections (see "Can models reliably improve themselves without external feedback?"). Read this way, Ctx2Skill's Judge and SkillOS's trained curator aren't really the agent feeding back on itself. They are externalized critics living inside the system. A related line argues that today's self-improvement loops are brittle because their metacognition is hand-designed by humans and breaks under domain shift; genuine self-improvement would require agents that generate their own adaptive learning strategies, which the field treats as still unsolved (see "Can AI systems improve their own learning strategies?").

If you want one non-obvious thread to pull: scalar self-play rewards may be leaving information on the table. Natural feedback decomposes into two orthogonal signals, evaluative ('how well did that go') and directive ('how should it change'). A binary Judge verdict captures only the first and discards the directional specifics that token-level distillation can recover (see "Can scalar rewards capture all the information in agent feedback?"). So the better question may not be whether self-play feedback improves experience-built skills, but whether thumbs-up/thumbs-down self-judgment is even the right shape of feedback to be giving. For models that internalize the critic entirely, learning to compute their own reward in the unused sequence space after their output at zero inference cost, see "Can models learn to evaluate their own work during training?".
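A number-guessing analogy shows how much a directive signal can add on top of a verdict. This is an illustration only, not token-level distillation and not anything from the cited notes: one solver hears only pass or fail after each guess, the other is also told which way to move. The function names are hypothetical.

```python
def solve_with_verdict_only(target: int, lo: int, hi: int) -> int:
    """Evaluative signal only: each guess gets pass/fail, so the solver scans."""
    attempts = 0
    for guess in range(lo, hi + 1):
        attempts += 1
        if guess == target:  # binary verdict
            return attempts
    raise ValueError("target outside search range")


def solve_with_directive(target: int, lo: int, hi: int) -> int:
    """Evaluative plus directive: a failed guess also says 'higher' or 'lower'."""
    attempts = 0
    while lo <= hi:
        guess = (lo + hi) // 2
        attempts += 1
        if guess == target:
            return attempts
        if guess < target:  # directive: move up
            lo = guess + 1
        else:               # directive: move down
            hi = guess - 1
    raise ValueError("target outside search range")
```

For a target of 100 in the range 1 to 128, the verdict-only solver needs 100 attempts and the directed solver needs 5. The verdicts are identical in both runs; the gap comes entirely from the directional information the scalar throws away.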


Sources (9 notes)

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Does constraining edits help agents improve their own skills?

SkillOpt's ablations show that textual learning-rate budgets, held-out validation gates, and retained failed edits outperform uncontrolled self-revision. Control mechanisms prevent drift toward overfitting and incoherence without sacrificing adaptability.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can AI systems improve their own learning strategies?

Current self-improvement methods use extrinsic, fixed metacognitive loops designed by humans that fail under domain shift or capability changes. True self-improvement requires agents to generate their own adaptive metacognitive knowledge, planning, and evaluation—a gap confirmed as a neglected research area across neuro-symbolic AI.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
