Reinforcement Learning for LLMs

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

Self-Questioning Language Models (SQLM) adapts asymmetric self-play from robotic manipulation (OpenAI, 2021) to language domains. Two RL agents: a proposer and a solver. Given only a topic specification (e.g., "algebra word problems"), the proposer generates questions and the solver attempts answers.
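
A minimal sketch of the two roles, assuming a generic `llm(prompt) -> str` completion function; the prompt wording and function names are illustrative, not taken from the paper:

```python
TOPIC = "algebra word problems"  # the only human-provided input: a topic specification

def propose_question(llm, topic: str) -> str:
    """Proposer: turn the topic specification into a fresh training question."""
    prompt = (
        f"You are writing practice problems about {topic}. "
        "Write one new, self-contained problem.\nProblem:"
    )
    return llm(prompt)

def solve_question(llm, question: str, k: int = 8) -> list[str]:
    """Solver: sample k candidate answers for later majority voting."""
    prompt = f"Solve the problem step by step.\n{question}\nFinal answer:"
    return [llm(prompt) for _ in range(k)]
```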

The reward structure creates natural difficulty calibration: the proposer is rewarded when problems are neither too easy nor too hard — punished for trivially solvable questions and for impossible ones. The solver is rewarded based on majority voting (sampling multiple solutions and checking consensus), serving as a proxy for correctness without ground-truth labels. For coding tasks, the proposer can generate unit tests, providing direct verifiability.
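
Roughly, the two reward signals could look like the sketch below; the band thresholds and the use of consensus rate as a difficulty proxy are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter

def solver_rewards(answers: list[str]) -> list[float]:
    """Proxy reward with no ground truth: each sampled answer scores 1.0 if it
    matches the most common (majority-vote) answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def proposer_reward(solve_rate: float, lo: float = 0.25, hi: float = 0.75) -> float:
    """Band-shaped reward: trivially easy questions (solve_rate near 1) and
    apparently impossible ones (solve_rate near 0) earn nothing.
    The band edges here are placeholders, not the paper's values."""
    return 1.0 if lo <= solve_rate <= hi else 0.0
```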

This creates an automatically calibrated curriculum. The proposer explores the space of possible problems at the frontier of the solver's capability — hard enough to be informative, not so hard as to produce only noise. As the solver improves, the proposer must generate harder problems to maintain its own reward, creating escalating difficulty without human intervention.
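
Putting the pieces together, the outer loop might look like this, reusing the helpers sketched above; `update` stands in for a generic policy-gradient step (e.g., PPO-style) and is a placeholder, not an API from the paper:

```python
def self_play_round(proposer, solver, update, topic: str,
                    n_questions: int = 64, k: int = 8) -> None:
    for _ in range(n_questions):
        question = propose_question(proposer, topic)
        answers = solve_question(solver, question, k=k)

        # Solver update: rewarded for agreeing with its own sampled majority.
        r_solver = solver_rewards(answers)
        update(solver, samples=answers, rewards=r_solver)

        # Proposer update: the consensus rate doubles as a difficulty estimate,
        # so as the solver improves, only harder questions stay inside the band.
        solve_rate = sum(r_solver) / k
        update(proposer, samples=[question], rewards=[proposer_reward(solve_rate)])
```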

The mechanism addresses two fundamental limitations of self-improvement: (a) the need for external training data (the proposer generates all training problems) and (b) the need for external verification (majority voting provides approximate correctness). Both solutions are intrinsic — no human labels, no external reward models, no ground-truth answers.

The key risk is inherited from "Does self-consistency reliably reward correct answers during training?": the solver's majority-voting reward is the same proxy signal, vulnerable to the same reward hacking. But the proposer provides a natural counterforce: it actively searches for the solver's weaknesses, potentially surfacing problems where majority voting is miscalibrated.
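
A toy illustration of that miscalibration, with made-up numbers: a solver that converges on a wrong answer still earns a high consensus-based proxy reward.

```python
from collections import Counter

# Toy failure case: high agreement on the wrong answer. A ground-truth check
# would score every sample zero, but the proxy reward sees a "solved" problem.
sampled_answers = ["42"] * 7 + ["17"]
majority, votes = Counter(sampled_answers).most_common(1)[0]
consensus = votes / len(sampled_answers)   # 0.875 -> strong consensus
true_answer = "17"                         # never visible to the training loop
print(consensus, majority == true_answer)  # 0.875 False
```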

The connection to intrinsic motivation research is direct — curiosity-driven exploration (prediction error, state entropy, Go-Explore) provides the theoretical foundation for why generating novel challenges produces better learning than rehearsing known solutions.


Source: Self Refinement Self Consistency Feedback — Self-Questioning Language Models (arXiv:2508.03682)

asymmetric self-play enables self-improvement without external data by training a proposer to generate challenging questions for a solver