Reinforcement Learning for LLMs

Can reasoning RL work without verifying generated answers?

Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?

Note · 2026-02-22 · sourced from Reward Models
Related questions: How should we allocate compute budget at inference time? How do you build domain expertise into general AI models?

DeepSeek-R1-Zero-style RL training has produced remarkable gains in math and code — but only because those domains have rule-based verifiers (answer checking, test cases). Extending this paradigm to chemistry, healthcare, law, biology, and economics has been blocked by the answer-verification requirement. Model-based verifiers (using an LLM to check answers) are the standard workaround, but they are vulnerable to reward hacking, depend on a strong verifier LLM, and add significant compute overhead from keeping the verifier in memory.

VeriFree (2025) offers a structurally different solution: skip verification entirely. Given a question, the model generates only the reasoning trace, which is then concatenated with the reference answer from the dataset. The likelihood of the reference answer, conditioned on the question and the generated reasoning trace, serves two purposes: (1) the reward signal for policy gradients on the reasoning trace, and (2) a weighting term for supervised training on the reference-answer tokens.
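In code, the mechanism is compact. Below is a minimal single-example sketch, assuming a Hugging-Face-style causal LM whose forward pass returns `.logits`; the function name `verifree_step` and the tensor layout are illustrative choices, not the paper's, and the actual method adds multi-trace sampling and variance reduction that this omits:

```python
import torch
import torch.nn.functional as F

def verifree_step(model, question_ids, trace_ids, answer_ids):
    """One VeriFree-style update on a (question, trace, answer) triple.

    question_ids: question tokens, shape (B, Q)
    trace_ids:    reasoning-trace tokens sampled from the current policy, (B, T)
    answer_ids:   reference answer tokens from the dataset, (B, A)
    """
    # Score question + generated trace + reference answer in one forward pass.
    input_ids = torch.cat([question_ids, trace_ids, answer_ids], dim=-1)
    logits = model(input_ids).logits  # (B, Q+T+A, vocab)

    q_len, t_len = question_ids.size(-1), trace_ids.size(-1)
    ans_start = q_len + t_len

    # Log-prob of each reference-answer token (logits at position i predict token i+1).
    ans_logits = logits[:, ans_start - 1 : -1, :]
    ans_logp = torch.gather(
        F.log_softmax(ans_logits, dim=-1), -1, answer_ids.unsqueeze(-1)
    ).squeeze(-1)                                   # (B, A)

    # Reward: likelihood of the reference answer given question + trace.
    # Detached -- it scales both losses but receives no gradient itself.
    reward = ans_logp.sum(-1).exp().detach()        # (B,)

    # (1) Policy gradient on the trace tokens, weighted by the reward.
    trace_logits = logits[:, q_len - 1 : ans_start - 1, :]
    trace_logp = torch.gather(
        F.log_softmax(trace_logits, dim=-1), -1, trace_ids.unsqueeze(-1)
    ).squeeze(-1)                                   # (B, T)
    pg_loss = -(reward * trace_logp.sum(-1))

    # (2) Reward-weighted supervised loss on the reference-answer tokens.
    sft_loss = -(reward * ans_logp.sum(-1))

    return (pg_loss + sft_loss).mean()
```

The `detach()` is the crux: the scalar reward multiplies both terms, but gradients flow only through the trace log-probs (the policy-gradient role) and the answer log-probs (the weighted supervised role), matching the two purposes above.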

The intuition: a good reasoning trace will make the reference answer more likely. If the model reasons correctly about why a molecule has certain properties, the probability of generating the correct molecular description increases. The reasoning trace's quality is measured by how well it "leads to" the known answer — without ever needing to verify whether the model's own generated answer matches.
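A plausible outer loop (a sketch under my assumptions, not the paper's exact recipe): sample several traces per question so that their relative rewards separate good reasoning from bad. Here `dataloader`, `num_traces`, and `sample_trace` are hypothetical stand-ins:

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

for question_ids, answer_ids in dataloader:            # hypothetical data source
    losses = []
    for _ in range(num_traces):                        # e.g. a handful per question
        # Hypothetical helper: autoregressively decode a reasoning trace
        # from the current policy, stopping before any final answer.
        trace_ids = sample_trace(model, question_ids)
        losses.append(verifree_step(model, question_ids, trace_ids, answer_ids))
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

No verifier ever runs: the only supervision is the reference answer's conditional probability under the model itself.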

This connects to two existing verifier-free approaches. RARO (see "Can adversarial training replace task-specific verifiers for reasoning?") uses adversarial IRL to learn rewards from demonstrations; VeriFree takes a simpler path: no learned reward model at all, just the reference answer's conditional probability. And if, as "Does RL teach reasoning or just when to use it?" argues, the reasoning capability is already latent, then VeriFree supplies the reward signal that activates it in domains where verification was previously impossible.

The practical consequence: R1-Zero-style training is no longer limited to math and code. Any domain with reference answers (even approximate or noisy ones) can now use RL for reasoning improvement.


Source: Reward Models — Reinforcing General Reasoning without Verifiers (arXiv 2505.21493)

verifier-free rl extends reasoning reinforcement to general domains by conditioning on reference answer likelihood rather than verifying generated answers