Reinforcement Learning for LLMs

Can adversarial training replace task-specific verifiers for reasoning?

Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.

Note · 2026-02-22 · sourced from Reinforcement Learning

A fundamental limitation of RL for reasoning: RLVR requires task-specific verifiers (math checkers, code test suites) that don't exist for many reasoning-intensive domains. Expert demonstrations are abundant (Stack Exchange answers, domain-expert explanations), but SFT on demonstrations doesn't produce the reasoning behaviors that large-scale RL training elicits. RARO bridges this gap using inverse reinforcement learning: it recovers a reward signal from the demonstrations themselves rather than from a verifier.

The mechanism is an adversarial game. A policy learns to produce expert-like answers via explicit CoT reasoning. A relativistic critic learns to discriminate between expert and policy answers via pairwise comparison. Both are trained jointly and continuously via RL, requiring careful stabilization techniques. The critic's discrimination signal serves as the reward for the policy — when the critic can't distinguish policy from expert, the policy has learned expert-level reasoning.
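
A minimal sketch of how this reward wiring could look, assuming the critic emits a log-odds score for which of two answers came from the expert; critic_logit is a hypothetical model call, and symmetrizing over both answer orderings is one common way to remove position bias, not necessarily the paper's exact recipe:

```python
import math
from typing import Callable

def relativistic_reward(
    critic_logit: Callable[[str, str, str], float],
    question: str,
    policy_answer: str,
    expert_answer: str,
) -> float:
    """Pairwise (relativistic) reward for the policy.

    critic_logit(question, answer_a, answer_b) is assumed to return the
    critic's log-odds that answer_a is the expert answer. The policy is
    rewarded to the extent the critic mistakes its answer for the expert's.
    """
    # Score both orderings and average to cancel position bias in the comparison.
    logit_policy_first = critic_logit(question, policy_answer, expert_answer)
    logit_policy_second = -critic_logit(question, expert_answer, policy_answer)
    avg_logit = 0.5 * (logit_policy_first + logit_policy_second)
    return 1.0 / (1.0 + math.exp(-avg_logit))  # sigmoid -> reward in [0, 1]

def critic_objective(policy_reward: float) -> float:
    """Adversarial critic loss (to minimize): cross-entropy for labeling the
    policy answer as 'not expert'. The critic pushes the policy's reward down
    while the policy pushes it up."""
    return -math.log(max(1e-8, 1.0 - policy_reward))
```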

The results are significant: RARO outperforms strong verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and enjoys the same robust scaling trends as RL with verifiers. This means the scaling properties of RLVR are not specific to verifiable rewards — they emerge from the RL training dynamics themselves, with the adversarial critic providing a sufficient substitute for ground-truth verification.

This extends the frontier of RL-for-reasoning to any domain with expert demonstrations. As in Does critiquing errors teach deeper understanding than imitating correct answers?, RARO leverages a similar mechanism: adversarial training forces the model to develop genuine reasoning rather than surface-level imitation, because the critic can distinguish superficial pattern matching from actual expert-like problem solving.

VeriFree as a second verifier-free approach: VeriFree takes a different route to the same goal of extending R1-Zero-style RL training to domains without rule-based verifiers. Instead of training an adversarial critic, VeriFree has the policy generate only the reasoning trace, concatenates that trace with the reference answer, and evaluates the likelihood of the reference answer conditioned on the question and the generated reasoning. This likelihood serves both as a reward signal for policy gradients on the reasoning trace and as a weighting term for supervised training. VeriFree is architecturally simpler than RARO (no adversarial game) and eliminates the need for even a model-based verifier, reducing compute overhead. See Can reasoning RL work without verifying generated answers?. The two approaches bracket the design space: RARO uses adversarial dynamics for a richer signal, VeriFree uses reference-conditioned likelihood for simplicity.
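
A minimal sketch of the reference-conditioned reward, assuming a hypothetical answer_logprob helper that returns the policy's log-likelihood of the reference answer tokens given the question and one sampled reasoning trace; whether that log-likelihood is summed or length-normalized over tokens is an implementation choice left open here:

```python
import math
from typing import Callable, List

def verifree_rewards(
    answer_logprob: Callable[[str, str, str], float],
    question: str,
    reasoning_traces: List[str],
    reference_answer: str,
) -> List[float]:
    """Score each sampled reasoning trace z by the likelihood the model
    assigns to the reference answer when conditioned on (question, z).

    That likelihood doubles as the RL reward for the trace and as a
    weighting term for supervised training on the reference answer,
    so no external or model-based verifier is needed.
    """
    rewards = []
    for trace in reasoning_traces:
        logp = answer_logprob(question, trace, reference_answer)
        rewards.append(math.exp(logp))  # likelihood in [0, 1]
    return rewards
```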


Source: Reinforcement Learning, Reward Models

Original note title: Inverse RL from demonstrations enables reasoning training without task-specific verifiers