Can language models learn skills without human supervision?

Can a three-role self-play system—Challenger, Reasoner, Judge—bootstrap natural-language skills from raw context alone, without human labels or external reward signals?

Note · 2026-05-28 · sourced from Context Engineering

Ctx2Skill constructs skills without human annotation or an external reward signal by running a three-role self-play loop. A Challenger generates probing tasks and rubrics against a context; a Reasoner attempts them, guided by its current skill set; a neutral Judge issues binary pass/fail feedback. The signal is internal: easily solved tasks are routed back to strengthen the Challenger, while failed cases are routed to Proposer and Generator agents that synthesize targeted skill updates for the Reasoner. Both sides evolve through accumulated natural-language skills rather than parameter updates.
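
The routing described above can be illustrated with a minimal toy sketch. It is not the paper's implementation: every name here (`Task`, `Role`, `challenger_generate`, `reasoner_attempt`, `judge`, `self_play`) is hypothetical, the LLM calls are replaced by deterministic stubs, and task difficulty is reduced to a single integer so that the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Task:
    prompt: str
    rubric: str
    difficulty: int  # toy stand-in for how hard the Challenger's task is


@dataclass
class Role:
    # Skills are plain natural-language strings; no parameters are updated.
    skills: List[str] = field(default_factory=list)


def challenger_generate(context: str, challenger: Role) -> Task:
    # Hypothetical stub: a real Challenger would prompt a model with the
    # context and its skills. Here difficulty just grows with skill count.
    level = len(challenger.skills)
    return Task(
        prompt=f"Level-{level} question about: {context}",
        rubric=f"Answer must demonstrate at least {level} skills",
        difficulty=level,
    )


def reasoner_attempt(task: Task, reasoner: Role) -> int:
    # Hypothetical stub: the "answer" is how many skills the Reasoner applies.
    return len(reasoner.skills)


def judge(task: Task, answer: int) -> bool:
    # Binary pass/fail verdict against the rubric.
    return answer >= task.difficulty


def self_play(context: str, rounds: int) -> Tuple[List[bool], Role, Role]:
    challenger, reasoner = Role(), Role()
    verdicts: List[bool] = []
    for _ in range(rounds):
        task = challenger_generate(context, challenger)
        answer = reasoner_attempt(task, reasoner)
        passed = judge(task, answer)
        verdicts.append(passed)
        if passed:
            # Solved task: route it back to strengthen the Challenger.
            challenger.skills.append(f"Go beyond: {task.prompt}")
        else:
            # Failed task: stand-in for the Proposer/Generator step that
            # synthesizes a targeted skill update for the Reasoner.
            reasoner.skills.append(f"Handle: {task.rubric}")
    return verdicts, challenger, reasoner
```

Calling `self_play("a dense manual", 4)` in this toy alternates passes and failures, so both skill lists grow in step. That mirrors the co-evolution the note describes, though real verdicts would come from a model-based Judge rather than an integer comparison.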

This matters because it dissolves the two bottlenecks that block automated skill construction: the prohibitive cost of manually annotating skills for long, dense contexts, and the absence of external feedback to tell automated construction what to improve. Self-play manufactures the missing feedback — the Challenger's escalating difficulty is the curriculum, and the Judge's binary verdict is the reward — so the system bootstraps a skill set for an arbitrary context from nothing but the context itself.

The counterpoint, which the paper takes seriously, is adversarial collapse: a Challenger free to maximize difficulty drifts toward extreme tasks, and a Reasoner chasing them accumulates over-specialized skills that no longer generalize. Self-play that only ratchets up pressure undermines itself, which is why Ctx2Skill needs a separate replay mechanism to anchor generality. That tension is the one worth tracking. The insight is therefore real but conditional: unsupervised co-evolution of language skills works only when adversarial pressure is balanced against a generalization safeguard.
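
The note does not say how the replay mechanism works. One plausible reading, assumed here purely for illustration, is a regression gate: a candidate skill set is accepted only if it does no worse than the current one on a buffer of earlier tasks. The names `pass_rate`, `accept_skill_update`, and `toy_solves` are hypothetical, and `toy_solves` is a keyword-matching stand-in for a real Reasoner and Judge.

```python
from typing import Callable, List, Sequence

# Assumed signature: (skills, task) -> did the Reasoner pass this task?
Solver = Callable[[Sequence[str], str], bool]


def pass_rate(skills: Sequence[str], tasks: Sequence[str], solves: Solver) -> float:
    # Fraction of replayed tasks that still pass with the given skills.
    if not tasks:
        return 1.0  # nothing to regress against yet
    return sum(solves(skills, t) for t in tasks) / len(tasks)


def accept_skill_update(
    current: Sequence[str],
    candidate: Sequence[str],
    replay_buffer: Sequence[str],
    solves: Solver,
) -> List[str]:
    # Assumed safeguard: keep the candidate only if it does not lower the
    # pass rate on earlier tasks; otherwise fall back to the current skills.
    old = pass_rate(current, replay_buffer, solves)
    new = pass_rate(candidate, replay_buffer, solves)
    return list(candidate) if new >= old else list(current)


def toy_solves(skills: Sequence[str], task: str) -> bool:
    # Toy stand-in: a task passes if some skill mentions its topic.
    return any(task in skill for skill in skills)


replay = ["dates", "units"]
current = ["check dates and units"]
narrow = ["edge case: leap seconds"]      # over-specialized replacement
broader = current + narrow                # adds the edge case, keeps the rest

kept = accept_skill_update(current, narrow, replay, toy_solves)
grown = accept_skill_update(current, broader, replay, toy_solves)
```

In this toy, the narrow replacement is rejected because it fails both replayed tasks, while the broader set is accepted because it preserves them. The point is only that some check against earlier tasks limits how far adversarial pressure can pull the skill set. The actual mechanism in Ctx2Skill may differ.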


— "From Context to Skills: Can Language Models Learn from Context Skillfully?", https://arxiv.org/abs/2604.27660

Original note title

challenger-reasoner-judge self-play can co-evolve natural-language skills with no human supervision