Can reasoning RL work without verifying generated answers?
Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?
DeepSeek-R1-Zero-style RL training has produced remarkable gains in math and code, but only because those domains have rule-based verifiers (answer checking, test cases). Extending the paradigm to chemistry, healthcare, law, biology, and economics has been blocked by that answer-verification requirement. Model-based verifiers (using an LLM to check answers) are the standard workaround, but they are vulnerable to reward hacking, depend on a strong verifier LLM, and add significant compute and memory overhead from keeping the verifier model loaded alongside the policy.
VeriFree (2025) offers a structurally different solution: skip verification entirely. Given a question, the model generates only the reasoning trace, which is then concatenated with the reference answer from the dataset. The likelihood of the reference answer, conditioned on the question and the generated reasoning trace, serves two purposes: (1) the reward signal for policy gradients on the reasoning trace, and (2) a weighting term for supervised training on the reference answer.
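Written out, this is roughly the following objective (a sketch of the description above; the notation is mine, and the paper's variance-reduction details are omitted):

```latex
% Notation (illustrative): q = question, z = sampled reasoning trace,
% a^* = reference answer from the dataset, \pi_\theta = the policy being trained.

% Reward for a sampled trace: likelihood of the reference answer.
r(z) \;=\; \pi_\theta(a^* \mid q, z)

% Objective: expected reference-answer likelihood over sampled traces.
J(\theta) \;=\; \mathbb{E}_{z \sim \pi_\theta(\cdot \mid q)}\!\left[ \pi_\theta(a^* \mid q, z) \right]

% Its gradient splits into exactly the two roles described above:
% a policy-gradient term on the trace and an r(z)-weighted SFT term on a^*.
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{z}\!\left[ r(z)\, \nabla_\theta \log \pi_\theta(z \mid q)
  \;+\; r(z)\, \nabla_\theta \log \pi_\theta(a^* \mid q, z) \right]
```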
The intuition: a good reasoning trace will make the reference answer more likely. If the model reasons correctly about why a molecule has certain properties, the probability of generating the correct molecular description increases. The reasoning trace's quality is measured by how well it "leads to" the known answer — without ever needing to verify whether the model's own generated answer matches.
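A minimal sketch of how one training step could be computed, assuming a Hugging Face causal LM; the model name ("gpt2"), prompt formatting, sampling settings, and the absence of a reward baseline are illustrative assumptions, not details from the paper:

```python
# VeriFree-style step (sketch): reward = likelihood of the reference answer
# given the sampled reasoning trace; no verification of a generated answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def token_logprobs(input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log-probabilities for every position after the first."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq-1]

def verifree_loss(question: str, reference_answer: str) -> torch.Tensor:
    # 1) Sample a reasoning trace z ~ pi(.|q) from the current policy.
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        qz_ids = model.generate(q_ids, do_sample=True, max_new_tokens=64,
                                pad_token_id=tokenizer.eos_token_id)

    # 2) Concatenate the reference answer a* and score pi(a* | q, z).
    a_ids = tokenizer(reference_answer, return_tensors="pt").input_ids
    full_ids = torch.cat([qz_ids, a_ids], dim=1)
    logp = token_logprobs(full_ids)
    logp_trace = logp[:, q_ids.shape[1] - 1: qz_ids.shape[1] - 1].sum()  # log pi(z | q)
    logp_answer = logp[:, qz_ids.shape[1] - 1:].sum()                    # log pi(a* | q, z)

    # 3) Reward = likelihood of the reference answer given the sampled trace.
    reward = logp_answer.exp().detach()

    # 4) Combined loss: policy gradient on the trace + reward-weighted SFT on a*.
    #    (In practice a baseline / group-normalized reward would be subtracted.)
    return -(reward * logp_trace) - (reward * logp_answer)

loss = verifree_loss("Q: Which organ produces insulin? Think step by step.",
                     " The pancreas.")
loss.backward()
```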
This connects to two existing verifier-free threads. RARO ("Can adversarial training replace task-specific verifiers for reasoning?") uses adversarial IRL to learn rewards from demonstrations; VeriFree takes a simpler path, with no learned reward model at all, just the reference answer's conditional probability. And if, as "Does RL teach reasoning or just when to use it?" argues, the reasoning capability is already latent, then VeriFree supplies the reward signal that activates it in domains where verification was previously impossible.
The practical consequence: R1-Zero-style training is no longer limited to math and code. Any domain with reference answers (even approximate or noisy ones) can now use RL for reasoning improvement.
Source: Reinforcing General Reasoning without Verifiers (arXiv:2505.21493)
Related concepts in this collection
- Can adversarial training replace task-specific verifiers for reasoning?
  Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.
  Relation: alternative verifier-free approach via IRL; VeriFree uses reference-answer likelihood instead
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Relation: VeriFree confirms reasoning is latent; it just needs an appropriate reward signal
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Relation: VeriFree provides a minimal signal (reference-answer likelihood) that unlocks reasoning
- Why doesn't mathematical reasoning transfer to medicine?
  Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
  Relation: VeriFree provides an RL path for domain-specific reasoning where SFT fails
- Can model confidence alone replace external answer verification?
  Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
  Relation: RLPR and INTUITOR extend the verifier-free progression further: VeriFree conditions on reference-answer likelihood, RLPR uses intrinsic token probabilities, INTUITOR uses pure self-certainty — progressively weaker assumptions about the required external signal
Original note title
verifier-free rl extends reasoning reinforcement to general domains by conditioning on reference answer likelihood rather than verifying generated answers