Can reasoning emerge from expert demonstrations alone?
Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.
RLVR (Reinforcement Learning with Verifiable Rewards) requires a programmatic verifier to compute rewards. Many real-world reasoning tasks lack verifiers but offer abundant expert demonstrations (Stack Exchange answers, medical case notes, legal analyses). RARO (Relativistic Adversarial Reasoning Optimization) bridges this gap through inverse reinforcement learning: instead of hand-defining a reward function, it recovers one from expert behavior.
The framework sets up an adversarial game between two co-trained components:
- A reasoning policy that learns to produce expert-level answers via explicit Chain-of-Thought reasoning
- A relativistic critic that learns to discriminate between expert and policy answers via pairwise comparison
Both are trained jointly and continuously via RL. The policy improves at producing expert-like outputs; the critic improves at distinguishing them. The adversarial dynamic creates an implicit reward function grounded in expert demonstrations rather than explicit rules.
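The co-training dynamic can be illustrated with a deliberately tiny 1-D toy: "answers" are real numbers, the expert emits values near a target, a critic scores answers by proximity to its running estimate of expert behavior, and the policy chases the pairwise reward of being mistaken for the expert. All names and update rules here are ours, a minimal sketch of the dynamic rather than the paper's algorithm:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
TARGET = 10.0  # what the (simulated) expert knows and the policy does not

def expert_answer():
    # Expert demonstrations cluster around the target.
    return TARGET + random.gauss(0.0, 0.1)

def critic_score(c, x):
    # Higher score = "more expert-like" under the critic's current estimate c.
    return -(x - c) ** 2

mu, c = 0.0, 0.0                  # policy parameter, critic parameter
lr_policy, lr_critic = 0.05, 0.1

for step in range(2000):
    e = expert_answer()
    p = mu                        # the toy policy's (deterministic) answer
    # Critic update: pull c toward the expert so expert answers score higher.
    c += lr_critic * (e - c)
    # Policy update: reward is the critic's probability, from a pairwise
    # comparison, that the policy answer is the expert one ...
    reward = sigmoid(critic_score(c, p) - critic_score(c, e))
    # ... ascend the critic's score gradient (2 * (c - mu)), scaled by how
    # far the policy still is from fooling the critic.
    mu += lr_policy * (1.0 - reward) * 2.0 * (c - mu)

print(round(mu, 2))  # close to the expert target 10.0
```

Neither player has access to the target directly: the critic only sees expert samples, and the policy only sees the critic's comparisons, yet the policy converges toward expert behavior. This is the implicit-reward mechanism in miniature.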
RARO significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and exhibits the same robust scaling trends as RL with verifiers. This demonstrates that strong reasoning can emerge from demonstrations alone: a verifier is not a prerequisite for RL-trained reasoning, just the most convenient reward source.
Stabilization is essential: naive adversarial training is notoriously unstable. Two ingredients make learning robust here: the "relativistic" critic, which performs pairwise comparison rather than absolute scoring, and careful choreography of the two players' training.
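The "relativistic" idea can be made concrete in a few lines. In this sketch (function names are ours, not the paper's API), the critic never scores an answer in isolation: given raw scores for an expert answer and a policy answer to the same prompt, it is trained to predict which of the pair is the expert, and the policy's implicit reward is the critic's probability of being fooled:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relativistic_critic_loss(s_expert, s_policy):
    # Binary cross-entropy on the pairwise margin: minimized when the
    # critic ranks the expert above the policy by a wide margin.
    return -math.log(sigmoid(s_expert - s_policy))

def policy_reward(s_expert, s_policy):
    # The policy's reward is the critic's probability of mistaking the
    # policy answer for the expert's.
    return sigmoid(s_policy - s_expert)

# A tied pair yields an uninformative 0.5 reward; a confident critic
# drives the reward toward 0 until the policy actually improves.
print(policy_reward(2.0, 2.0))  # 0.5
```

Because only the score *difference* enters the loss, the reward signal stays centered even as both players' absolute scores drift during training, which is one reason pairwise comparison is easier to stabilize than absolute scoring.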
Relative to "Can adversarial training replace task-specific verifiers for reasoning?", RARO provides the full implementation and stability analysis. Relative to "What limits how much models can improve themselves?", RARO partially circumvents that bound: the critic co-evolves with the policy rather than remaining static, though the expert demonstrations still set an ultimate quality ceiling.
The practical implication: domains rich in expert examples but lacking automated verification (medical reasoning, legal analysis, scientific writing) can now benefit from RL-trained reasoning — previously exclusive to math and code.
Source: RLVR
Related concepts in this collection
- **Can adversarial training replace task-specific verifiers for reasoning?** Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations. (RARO is the full adversarial implementation.)
- **Why do self-improvement loops eventually stop improving?** Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners? (RARO's co-trained critic operationalizes this principle.)
- **What limits how much models can improve themselves?** Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types. (Here, expert demonstrations set the ceiling rather than the generation-verification gap.)
- **Does critiquing errors teach deeper understanding than imitating correct answers?** Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching. (The critic component learns evaluation through adversarial training.)
Original note title: inverse rl from expert demonstrations enables reasoning in non-verifiable domains through adversarial policy-critic co-training