Reinforcement Learning for LLMs

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Note · 2026-02-22 · sourced from RLVR
How do domain training techniques actually reshape model behavior? How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

RLVR requires verifiable rewards. Many real-world reasoning tasks lack verifiers but have abundant expert demonstrations (Stack Exchange answers, medical case notes, legal analyses). RARO (Relativistic Adversarial Reasoning Optimization) bridges this gap through Inverse Reinforcement Learning: instead of defining a reward function, it recovers one from expert behavior.
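
A small illustrative contrast between the two reward sources (the function names here are hypothetical, not from the RARO paper):

```python
def rlvr_reward(problem, answer, verifier):
    # RLVR: reward comes from an explicit, programmatic verifier
    # (a math checker, unit tests, an exact-match grader).
    return 1.0 if verifier(problem, answer) else 0.0

def recovered_reward(problem, answer, critic):
    # IRL-style (as in RARO): reward comes from a critic learned from expert
    # demonstrations, so no hand-written verifier is needed for the task.
    return critic(problem, answer)
```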

The framework sets up an adversarial game between two co-trained components:

- a policy that generates reasoning and answers for each prompt, and
- a critic that compares the policy's output against an expert demonstration for the same prompt.

Both are trained jointly and continuously via RL: the policy improves at producing expert-like outputs, while the critic improves at distinguishing them. This adversarial dynamic creates an implicit reward function grounded in expert demonstrations rather than explicit rules. A minimal sketch of the loop follows.
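
One joint update might look like the sketch below, assuming a pairwise critic and a REINFORCE-style policy surrogate; every interface here (`policy.sample`, the three-argument `critic` call) is illustrative rather than the paper's actual API:

```python
import torch
import torch.nn.functional as F

def co_training_step(policy, critic, prompts, expert_answers, policy_opt, critic_opt):
    # 1. Policy rollout: sample answers and their token log-probabilities.
    policy_answers, logprobs = policy.sample(prompts)  # hypothetical interface

    # 2. Critic update: a pairwise ("relativistic") margin that should be
    #    positive when the expert answer reads better than the policy answer.
    margin = critic(prompts, expert_answers, policy_answers)
    critic_loss = -F.logsigmoid(margin).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 3. Policy update: the (detached) critic judgment serves as the reward,
    #    pushing the policy toward outputs the critic cannot tell from expert ones.
    with torch.no_grad():
        reward = torch.sigmoid(-critic(prompts, expert_answers, policy_answers))
    policy_loss = -(reward * logprobs.sum(dim=-1)).mean()  # REINFORCE-style surrogate
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```

The essential point is that the reward the policy optimizes is produced by a model that is itself still learning, which is what makes the setup a game rather than ordinary reward maximization.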

RARO significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and enjoys the same robust scaling trends as RL with verifiers. This demonstrates that strong reasoning can emerge from demonstrations alone — the verifier is not a prerequisite for RL-trained reasoning, just the most convenient reward source.

Stabilization is the crux: naive adversarial training is notoriously unstable. The "relativistic" critic, which scores by pairwise comparison rather than absolute judgment, together with careful training choreography, is required for robust learning.
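
A sketch of what "relativistic" means here, assuming a scalar scoring head `score_fn`; the contrast with an absolute real-vs-fake objective is included for illustration only:

```python
import torch
import torch.nn.functional as F

def relativistic_critic_loss(score_fn, prompts, expert_answers, policy_answers):
    # Pairwise comparison: the loss depends only on the score gap
    # s(expert) - s(policy) for the same prompt, never on absolute scores.
    s_expert = score_fn(prompts, expert_answers)   # shape: [batch]
    s_policy = score_fn(prompts, policy_answers)   # shape: [batch]
    return -F.logsigmoid(s_expert - s_policy).mean()

def absolute_critic_loss(score_fn, prompts, expert_answers, policy_answers):
    # Non-relativistic baseline for contrast: independent real/fake targets,
    # which can saturate once the critic is confidently ahead of the policy.
    s_expert = score_fn(prompts, expert_answers)
    s_policy = score_fn(prompts, policy_answers)
    return (F.binary_cross_entropy_with_logits(s_expert, torch.ones_like(s_expert))
            + F.binary_cross_entropy_with_logits(s_policy, torch.zeros_like(s_policy)))
```

The pairwise form keeps a meaningful gradient even when the critic is far ahead of the policy, which is the usual intuition for why relativistic objectives stabilize adversarial training.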

Relative to the open question "Can adversarial training replace task-specific verifiers for reasoning?", RARO provides the full implementation and stability analysis. Relative to "What limits how much models can improve themselves?", RARO partially circumvents that bound: the critic co-evolves with the policy rather than remaining static, though the expert demonstrations still set the ultimate quality ceiling.

The practical implication: domains rich in expert examples but lacking automated verification (medical reasoning, legal analysis, scientific writing) can now benefit from RL-trained reasoning — previously exclusive to math and code.


Source: RLVR
