SPICE: Self-Play In Corpus Environments Improves Reasoning

Paper · arXiv 2510.24684 · Published October 28, 2025
Role Play · Argumentation · Novel Architectures

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework in which a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner’s capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods, whose benefits are more limited, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals that document grounding is the key ingredient that enables SPICE to continuously generate its own increasingly challenging goals and achieve them, sustaining self-improvement.

However, most existing self-play methods for language models achieve initial improvements but quickly face fundamental barriers. Without external grounding, models inevitably plateau or collapse (Huang et al., 2025; Chen et al., 2025b; Kuba et al., 2025) due to two critical issues: (1) hallucination amplification, where factual errors in both generated questions and answers compound as models train on their own unverifiable synthetic data, and (2) information symmetry, where the problem generator and solver share the same knowledge base, preventing genuine challenge and leading to simpler, more repetitive patterns. Even approaches that maintain diversity through variational synthesis (Liang et al., 2025) ultimately remain bounded by their initial coverage, which is merely a compressed representation of the original pretraining data (Morris et al., 2025). These systematic empirical failures indicate that self-improvement requires interaction with an external source of diverse, verifiable feedback, rather than pure closed-loop introspection.

In SPICE, a single model acts in two roles: a Challenger that constructs a curriculum of challenging document-grounded tasks, and a Reasoner that develops robust reasoning capabilities by solving those tasks without access to the documents. A key component is information asymmetry: the Challenger grounds questions and gold answers in retrieved documents unseen by the Reasoner, creating genuine challenge. The vast diversity of documents ensures continual novelty beyond the model’s internalized knowledge. Simultaneously, corpus grounding prevents hallucination by anchoring both questions and gold answers in real-world content rather than model-generated fabrications, ensuring factual accuracy throughout the self-play loop.
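To make the two roles and the information asymmetry concrete, here is a minimal Python sketch of one self-play step. Everything in it (the `Task` dataclass, the toy `CORPUS`, and the template-based `challenger` and `reasoner` stubs) is illustrative scaffolding rather than the paper's implementation; in SPICE both roles are played by the same LLM under different prompts.

```python
from dataclasses import dataclass
import random

@dataclass
class Task:
    question: str     # shown to the Reasoner
    gold_answer: str  # held out for verification; grounded in the document
    source_doc: str   # the document the Reasoner never sees

# Toy corpus stand-in; SPICE mines real documents from a large corpus.
CORPUS = [
    "The Amazon River discharges roughly 209,000 cubic metres of water per second.",
    "Euler's identity states that e^(i*pi) + 1 = 0.",
]

def challenger(document: str) -> Task:
    # Challenger role: reads the sampled document and grounds both the
    # question and the gold answer in it (a trivial template here; in
    # SPICE this is the shared LLM prompted in its Challenger role).
    return Task(
        question=f"What quantity or fact does this source describe: '{document[:35]}...'?",
        gold_answer=document,
        source_doc=document,
    )

def reasoner(question: str) -> str:
    # Reasoner role: answers from the question alone. The document is
    # withheld, which is the information asymmetry described above.
    return f"(model answer to: {question})"

def self_play_step() -> tuple[Task, str]:
    doc = random.choice(CORPUS)
    task = challenger(doc)
    prediction = reasoner(task.question)  # no access to task.source_doc
    return task, prediction
```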

During self-play, the Challenger is rewarded for generating problems at the frontier of the Reasoner’s capability (maximizing the variance of its success rates), while the Reasoner is rewarded for correct answers. This adversarial yet symbiotic interaction, combined with corpus grounding, enables the system to continuously discover new challenges grounded in real knowledge and overcome them. Importantly, our approach relies on raw documents without predefined questions or labels. Tasks are generated in diverse formats (multiple-choice questions and free-form questions with integer, expression, or string answers) that serve as universal verifiers, enabling self-play across any language domain without requiring specialized executors or rule-based validators. This breaks the verification bottleneck that has confined prior work to narrow domains like mathematics and code, while document-grounded answers ensure that verification remains factually anchored.
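As a sketch of how these two rewards could be computed, the following assumes a Bernoulli-variance form p(1 - p) for the Challenger (which peaks when the Reasoner succeeds half the time) and normalized exact-match verification against the document-grounded gold answer; the paper's precise reward shaping and normalization may differ, and the names `verify` and `self_play_rewards` are ours.

```python
from statistics import mean

def verify(prediction: str, gold_answer: str) -> bool:
    # Both multiple-choice and short free-form answers (integers,
    # expressions, strings) reduce to comparison against the
    # document-grounded gold answer after normalization, which is what
    # lets these formats act as universal verifiers.
    return prediction.strip().lower() == gold_answer.strip().lower()

def self_play_rewards(predictions: list[str], gold_answer: str) -> tuple[list[float], float]:
    # Reasoner reward: 1 for a correct answer, 0 otherwise.
    reasoner_rewards = [float(verify(p, gold_answer)) for p in predictions]

    # Challenger reward (illustrative variance form): with empirical
    # pass rate p over several Reasoner attempts, p * (1 - p) peaks at
    # p = 0.5, so the Challenger gains most from tasks the Reasoner
    # solves about half the time: neither trivial nor impossible.
    p = mean(reasoner_rewards)
    challenger_reward = p * (1.0 - p)
    return reasoner_rewards, challenger_reward
```

For instance, `self_play_rewards(["42", "41", "42", "7"], "42")` returns Reasoner rewards `[1.0, 0.0, 1.0, 0.0]` and a Challenger reward of 0.25, the maximum of p(1 - p), marking a task that sits exactly at the Reasoner's frontier.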