Escaping the Verifier: Learning to Reason via Demonstrations

Paper · arXiv 2511.21667 · Published November 26, 2025
RLVR · Reward Models · Reinforcement Learning

Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization), a method that learns strong reasoning capabilities from expert demonstrations alone via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the expert answer within (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks (Countdown, DeepMath, and Poetry Writing) and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

Without preference data, the typical approach to improving LLM performance in these domains is to conduct Supervised Fine-Tuning (SFT) on expert demonstration data via the next-token prediction objective. However, such methods, even when the data are further annotated with reasoning traces, do not encourage the same reasoning behaviors elicited by large-scale RL training on verifiable tasks (Chu et al., 2025). Additionally, the naive next-token prediction objective induces a training-inference distribution mismatch: during training, the model conditions only on dataset contexts, whereas at inference, it conditions on self-sampled contexts. Training on self-sampled contexts, as occurs during RL, reduces this mismatch and leads to better performance at test time (Ross et al., 2011). Thus, we hypothesize that leveraging expert demonstrations in conjunction with RL could cultivate robust reasoning abilities, leading to improved performance on downstream tasks and offering a new pathway for developing reasoning capabilities in non-verifiable domains.
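
To make this contrast concrete, the two objectives can be written side by side (standard notation, not taken from the paper): SFT maximizes the teacher-forced likelihood of dataset answers, while RL maximizes expected reward over the model's own samples.

```latex
% SFT: the prefix y_{<t} always comes from the dataset D (teacher forcing)
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\sum_{t}\log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)

% RL: the completion \hat{y} is sampled from the current policy itself
\mathcal{J}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\;\hat{y}\sim\pi_\theta(\cdot\mid x)}\!\left[R(x,\hat{y})\right]
```

Under the SFT objective the model never conditions on its own (possibly erroneous) prefixes, which is precisely the compounding-error setting analyzed by Ross et al. (2011); the RL objective trains on exactly the distribution seen at inference, at the cost of requiring a reward signal R.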

While RLVR is effective for training LLMs to reason on readily verifiable tasks, it does not directly extend to the broader setting of learning to reason in real-world domains with no verifiers, even though many such tasks could still benefit from explicit reasoning (Zhou et al., 2025). Although, to our knowledge, no consensus method exists for general reasoning learning, several recent efforts make early progress. Zhou et al. (2025) and Gurung & Lapata (2025) propose to train LLMs to reason with a reward derived from the model's own logits on expert answers rather than from an external verifier. Jia et al. (2025) propose a pairwise generative reward model with a PPO-style objective for non-verifiable writing tasks, achieving gains without external training signals. Gunjal et al. (2025) propose using an LLM-as-judge (Gu et al., 2025) together with rubrics pre-generated by a strong LLM to provide rewards for non-verifiable tasks. Ma et al. (2025) distill a model-based verifier from a strong teacher to train general reasoners without rule-based verifiers. Li et al. (2025) investigate large-scale multi-task RLVR, hypothesizing that breadth across many tasks induces stronger general reasoning. We build on this line of work while adopting a demonstration-only setting and a complementary perspective based on Inverse Reinforcement Learning.

2.3 INVERSE REINFORCEMENT LEARNING

Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000) studies the task of recovering a reward function for which an observed expert policy is near-optimal. A seminal application is robust imitation learning, most notably Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), which casts imitation as an adversarial game between a policy and a discriminator.
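
For reference, GAIL's minimax objective (Ho & Ermon, 2016) has the following form, where D(s, a) is the discriminator's probability that a state-action pair was produced by the policy π rather than the expert πE, and H(π) is a causal-entropy regularizer weighted by λ:

```latex
\min_{\pi}\;\max_{D}\;\;
\mathbb{E}_{(s,a)\sim\pi}\big[\log D(s,a)\big]
+ \mathbb{E}_{(s,a)\sim\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
\;-\; \lambda H(\pi)
```

The policy is rewarded for producing behavior the discriminator cannot distinguish from the expert's, the same adversarial structure that the relativistic critic below instantiates at the level of whole answers.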

Sun & van der Schaar (2025) recently investigated the application of IRL for aligning LLMs with expert demonstrations. They show that a classifier trained in the IRL paradigm can serve as an effective reward model for Best-of-N sampling. However, their work stops short of exploring stable joint adversarial training or reasoning-intensive tasks, where the model must learn to navigate complex solution spaces rather than align with surface-level preferences.
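
To illustrate how such a classifier-based reward model is used at inference time, here is a minimal Best-of-N sketch. `generate` and `score` are hypothetical callables standing in for an LLM sampler and the IRL-trained reward model; they are not APIs from Sun & van der Schaar (2025).

```python
from typing import Callable

def best_of_n(
    question: str,
    generate: Callable[[str], str],      # hypothetical: samples one answer from the policy LLM
    score: Callable[[str, str], float],  # hypothetical: IRL-trained reward model score
    n: int = 8,
) -> str:
    """Sample n candidate answers and return the one the reward model prefers."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda ans: score(question, ans))

# Toy usage with stub callables (real use would wrap an LLM and a trained classifier).
if __name__ == "__main__":
    import random
    answers = ["short answer", "a longer, more detailed answer", "???"]
    pick = best_of_n(
        "Why is the sky blue?",
        generate=lambda q: random.choice(answers),
        score=lambda q, a: float(len(a)),  # stub: prefer longer answers
        n=4,
    )
    print(pick)
```

Note that here the IRL-trained classifier only reranks candidates at test time; RARO instead feeds a critic's judgment back into training.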

We study the general setting where we are given an expert Question–Answer (QA) dataset and aim to train a reasoning LLM policy to produce expert-level answers via explicit chain-of-thought (CoT) reasoning. We adopt this setting because verifiable tasks are relatively scarce, whereas expert demonstration data are abundant in many non-verifiable domains (e.g., highly upvoted Stack Exchange answers). To approach this task, we propose a novel inverse reinforcement learning framework that sets up an adversarial interaction between a reasoning policy and a relativistic critic: the policy learns to output expert-like answers, while the critic learns to discriminate between policy and expert answers via pairwise comparison. By jointly training both the policy and the critic to reason via RL, we enable the emergence of strong reasoning capabilities from demonstrations alone, without requiring task-specific verifiers.
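
The sketch below illustrates one plausible shape of a single adversarial round under the setup described above; every function name is a hypothetical stand-in, and the paper's actual objectives, reward shaping, and stabilization techniques are not reproduced here. The critic sees a position-shuffled (expert, policy) answer pair and estimates which one is the expert; the policy is rewarded when the critic mistakes its answer for the expert's.

```python
import random
from typing import Callable, List, Tuple

def raro_round(
    questions: List[str],
    expert_answer: Callable[[str], str],   # hypothetical: look up the dataset answer
    policy_sample: Callable[[str], str],   # hypothetical: sample a CoT answer from the policy
    critic_prob_first_is_expert: Callable[[str, str, str], float],  # hypothetical pairwise critic
) -> List[Tuple[str, float, int]]:
    """One adversarial round: returns (question, policy_reward, expert_pos) tuples.

    policy_reward rewards the policy for fooling the critic; expert_pos is the
    ground-truth position of the expert answer, used as the critic's training target.
    """
    out = []
    for q in questions:
        exp, pol = expert_answer(q), policy_sample(q)
        # Shuffle the pair so the critic cannot exploit answer position.
        if random.random() < 0.5:
            first, second, expert_pos = exp, pol, 0
        else:
            first, second, expert_pos = pol, exp, 1
        p_first = critic_prob_first_is_expert(q, first, second)
        # Probability the critic assigns to the *policy* answer being the expert.
        p_policy_judged_expert = p_first if expert_pos == 1 else 1.0 - p_first
        out.append((q, p_policy_judged_expert, expert_pos))
    return out
```

Both players would then be updated with policy-gradient steps: the policy on its fooling reward, and the critic on whether it pointed at the true `expert_pos`.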