Can language models improve themselves without any external training data?
Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
Self-Questioning Language Models (SQLM) adapts asymmetric self-play from robotic manipulation (OpenAI, 2021) to language domains. Two RL agents: a proposer and a solver. Given only a topic specification (e.g., "algebra word problems"), the proposer generates questions and the solver attempts answers.
The reward structure creates natural difficulty calibration: the proposer is rewarded when problems are neither too easy nor too hard — punished for trivially solvable questions and for impossible ones. The solver is rewarded based on majority voting (sampling multiple solutions and checking consensus), serving as a proxy for correctness without ground-truth labels. For coding tasks, the proposer can generate unit tests, providing direct verifiability.
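The two rewards can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the function names and the difficulty-band edges (0.3, 0.8) are assumptions chosen for clarity.

```python
from collections import Counter

def solver_reward(answers):
    """Majority-vote proxy reward, computed without ground-truth labels.
    `answers` is a list of final answers sampled from the solver for one
    question. Returns the consensus rate and the majority answer."""
    majority_answer, majority_count = Counter(answers).most_common(1)[0]
    consensus = majority_count / len(answers)
    # Each sampled solution would be rewarded 1 if it matches the majority,
    # 0 otherwise; the consensus rate is the expected per-sample reward.
    return consensus, majority_answer

def proposer_reward(consensus, low=0.3, high=0.8):
    """Difficulty-band reward: the proposer scores only when the solver's
    consensus falls in an intermediate band -- neither trivially easy
    (consensus near 1.0) nor impossible (consensus near chance).
    The band edges here are illustrative, not from the paper."""
    return 1.0 if low <= consensus <= high else 0.0
```

A question on which 6 of 8 sampled solutions agree yields a consensus of 0.75, inside the band, so the proposer is rewarded; a question solved identically by all samples, or one producing no agreement, earns the proposer nothing.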
This creates an automatically calibrated curriculum. The proposer explores the space of possible problems at the frontier of the solver's capability — hard enough to be informative, not so hard as to produce only noise. As the solver improves, the proposer must generate harder problems to maintain its own reward, creating escalating difficulty without human intervention.
The mechanism addresses two fundamental limitations of self-improvement: (a) the need for external training data (the proposer generates all training problems) and (b) the need for external verification (majority voting provides approximate correctness). Both solutions are intrinsic — no human labels, no external reward models, no ground-truth answers.
The key risk is inherited from the related question "Does self-consistency reliably reward correct answers during training?": the solver's majority-voting reward is the same proxy signal, vulnerable to the same reward hacking. But the proposer provides a natural counterforce: it actively searches for the solver's weaknesses, potentially surfacing problems where majority voting is miscalibrated.
The connection to intrinsic motivation research is direct — curiosity-driven exploration (prediction error, state entropy, Go-Explore) provides the theoretical foundation for why generating novel challenges produces better learning than rehearsing known solutions.
Source: Self Refinement Self Consistency Feedback — Self-Questioning Language Models (arXiv:2508.03682)
Related concepts in this collection
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  Relation: SQLM inherits the proxy reward risk; the proposer partially mitigates it by adversarially targeting weaknesses.
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  Relation: SQLM creates problems in the gap region by design (neither trivial nor impossible).
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  Relation: SQLM creates a natural curriculum, but from self-play rather than from budget scheduling.
- Can tree search replace human feedback in LLM training?
  Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
  Relation: a parallel unsupervised self-improvement mechanism. MCTS derives quality signals from tree-search outcomes, while asymmetric self-play derives training data from proposer-solver dynamics; both solve the annotation bottleneck, but MCTS explores within a fixed problem space while self-play generates new problems at the solver's frontier.
Original note title: asymmetric self-play enables self-improvement without external data by training a proposer to generate challenging questions for a solver