Can language models learn skills without human supervision?
Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?
Ctx2Skill closes the skill-construction loop without human annotation or an external reward signal by running three roles in self-play. A Challenger generates probing tasks and rubrics against a context; a Reasoner attempts them guided by its current skill set; a neutral Judge issues binary pass/fail feedback. The signal is internal: easily solved tasks are routed back to strengthen the Challenger, while failed cases are routed to Proposer and Generator agents that synthesize targeted skill updates for the Reasoner. Both sides evolve through accumulated natural-language skills rather than parameter updates.
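The routing described above can be sketched as a single round of the loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Task`, `self_play_round`, and the `toy_*` stubs are hypothetical, the roles would be LLM calls in practice, and the Proposer/Generator step is collapsed into one line that appends a skill string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    prompt: str
    rubric: str      # toy rubric: a string the answer must contain
    difficulty: int  # toy stand-in for how hard the task is


def self_play_round(
    context: str,
    challenger_skills: List[str],
    reasoner_skills: List[str],
    generate_task: Callable[[str, List[str]], Task],
    attempt: Callable[[str, Task, List[str]], str],
    judge: Callable[[Task, str], bool],
) -> bool:
    """Run one Challenger -> Reasoner -> Judge round and route the verdict."""
    task = generate_task(context, challenger_skills)
    answer = attempt(context, task, reasoner_skills)
    passed = judge(task, answer)
    if passed:
        # Easily solved: route back to the Challenger so it probes harder.
        challenger_skills.append(f"ask harder than: {task.prompt}")
    else:
        # Failed: stand-in for the Proposer/Generator step that writes a
        # targeted natural-language skill for the Reasoner.
        reasoner_skills.append(f"how to handle: {task.prompt}")
    return passed


# --- Toy stubs (assumptions for illustration; real roles would be LLMs) ---
def toy_generate_task(context: str, challenger_skills: List[str]) -> Task:
    level = len(challenger_skills)  # more Challenger skills -> harder task
    return Task(f"level-{level} question about {context}", "solved", level)


def toy_attempt(context: str, task: Task, reasoner_skills: List[str]) -> str:
    # The toy Reasoner succeeds once it holds enough skills for the level.
    return "solved" if len(reasoner_skills) >= task.difficulty else "unsure"


def toy_judge(task: Task, answer: str) -> bool:
    return task.rubric in answer  # binary pass/fail against the rubric


def run_demo(rounds: int = 4) -> Tuple[List[bool], List[str], List[str]]:
    challenger_skills: List[str] = []
    reasoner_skills: List[str] = []
    verdicts = [
        self_play_round("the manual", challenger_skills, reasoner_skills,
                        toy_generate_task, toy_attempt, toy_judge)
        for _ in range(rounds)
    ]
    return verdicts, challenger_skills, reasoner_skills
```

In this toy setup the two skill lists grow in alternation: a pass makes the next task harder, a fail gives the Reasoner a new skill. No model weights appear anywhere; the only state that changes is the two lists of natural-language strings.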
This matters because it dissolves the two bottlenecks that block automated skill construction: the prohibitive cost of manually annotating skills for long, dense contexts, and the absence of external feedback to tell automated construction what to improve. Self-play manufactures the missing feedback — the Challenger's escalating difficulty is the curriculum, and the Judge's binary verdict is the reward — so the system bootstraps a skill set for an arbitrary context from nothing but the context itself.
The counterpoint, which the paper takes seriously, is adversarial collapse: a Challenger free to maximize difficulty drifts toward extreme tasks, and a Reasoner chasing them accumulates over-specialized skills that no longer generalize. Self-play that only ratchets up pressure destroys itself, which is why Ctx2Skill needs a separate replay mechanism to anchor generality. That tension is the one worth tracking. The insight is therefore real but conditional: unsupervised co-evolution of language skills works only when adversarial pressure is balanced against a generalization safeguard.
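This note does not say how the replay mechanism works. One plausible reading, sketched here purely as an assumption, is a gate that accepts a new skill only if the Reasoner's pass count on a set of earlier, replayed tasks does not drop. The names `accept_skill_update` and `toy_passes`, and the `narrow:`/`general:` skill prefixes, are hypothetical and exist only to make the idea concrete.

```python
from typing import Callable, List, Sequence


def accept_skill_update(
    current_skills: List[str],
    candidate_skill: str,
    replay_tasks: Sequence[str],
    passes: Callable[[str, Sequence[str]], bool],
) -> List[str]:
    """Return the extended skill list only if replay performance holds."""
    def score(skills: Sequence[str]) -> int:
        return sum(1 for task in replay_tasks if passes(task, skills))

    proposed = current_skills + [candidate_skill]
    if score(proposed) >= score(current_skills):
        return proposed           # no regression on earlier tasks: keep it
    return list(current_skills)   # over-specialized: reject the update


def toy_passes(task: str, skills: Sequence[str]) -> bool:
    # Toy assumption: a "narrow:<topic>" skill makes the Reasoner fail
    # every task except <topic>; other skills never hurt.
    return all(
        not s.startswith("narrow:") or s == f"narrow:{task}" for s in skills
    )


replay = ["dates", "units"]
base = ["general:read the table first"]
rejected = accept_skill_update(base, "narrow:edge-case", replay, toy_passes)
accepted = accept_skill_update(base, "general:check units", replay, toy_passes)
```

Here `rejected` equals the original skill list, because the narrow skill breaks both replayed tasks, while `accepted` contains the new general skill. Whatever form the actual mechanism takes, the design point is the same: skill updates driven by the hardest tasks are checked against earlier ones before they are kept.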
— "From Context to Skills: Can Language Models Learn from Context Skillfully?", https://arxiv.org/abs/2604.27660
Related concepts in this collection
- Can language models improve themselves without any external training data?
Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
same proposer-vs-solver self-play structure; Ctx2Skill adds a third neutral Judge role and evolves natural-language skills rather than model weights
- Does creating skills inside the agent loop eliminate mismatches?
Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
both ground skill creation in the agent's own experience; Ctx2Skill manufactures the missing feedback via self-play where MUSE manufactures it via in-loop invocation
- Can skill documents be optimized like neural network weights?
Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
shares the generalization-collapse risk: Ctx2Skill needs a replay safeguard against adversarial drift much as SkillOpt needs a held-out gate to prevent overfitting self-edits
Original note title
challenger-reasoner-judge self-play can co-evolve natural-language skills with no human supervision