Can approximate or noisy reference answers work for RL-based reasoning training?

This explores whether RL reasoning training still works when the 'ground truth' is fuzzy — soft, approximate, or even wrong reference signals — instead of a clean verifier checking exact answers.

The short version the corpus suggests: yes, and surprisingly far. The model's output doesn't need to be checked against the reference for exact correctness; the reference just needs to nudge the model's probability mass in the right direction. VeriFree throws out verification entirely and instead rewards reasoning traces by how likely they make the reference answer, treating that conditional probability as both the reward and the training weight, and it matches or beats verifier-based methods on hard benchmarks like GPQA (see "Can reasoning improvement work without answer verification?"). That reframes the whole question: a soft, probabilistic match to an answer can carry the same signal as a hard right/wrong check.

Once you accept that the supervision can be soft, a cluster of papers shows it doesn't even need to come from an answer at all. RLSF uses the model's own answer-span confidence to rank traces, building synthetic preferences with no human labels or verifier (see "Can model confidence work as a reward signal for reasoning?"). DRO reuses a single self-supervised statistic, how much rollouts vary, as both a token-level reward and a query filter, working on tasks where no verifier exists (see "Can one statistical measure serve dual purposes in RL training?"). L2T derives dense per-step rewards from information theory with zero annotation (see "Can we reward reasoning steps without human annotation?"), and RARO replaces task-specific verifiers with an adversarial critic that just tries to tell expert answers from the policy's, scaling like verifier-based RL across domains as different as math and poetry (see "Can adversarial critics replace task-specific verifiers for reasoning?"). The reference signal, in other words, can be approximate, intrinsic, or learned, and still train reasoning.

The most provocative evidence pushes past 'noisy' into 'wrong.' Models trained on deliberately corrupted, semantically irrelevant reasoning traces perform comparably to those trained on correct ones, and sometimes generalize *better* out of distribution (see "Do reasoning traces need to be semantically correct?"). That hints at why noisy references can work: the traces and rewards may function as computational scaffolding that triggers reasoning the model already has, rather than as factual content it learns from.

That connects to a deeper thread running through the corpus: RL post-training mostly *elicits* latent capability rather than installing it. Base models already carry reasoning ability that minimal training unlocks through several independent mechanisms (see "Do base models already contain hidden reasoning ability?"), and RL seems to teach a model *when* to reason rather than *how*, with hybrid models recovering most of the gains by routing alone (see "Does RL post-training create reasoning or just deploy it?"). If the reward's job is to select and time pre-existing behavior, a coarse or approximate signal is enough to point the way: you don't need a precise teacher to flip a switch that's already wired.

But 'works' deserves a caveat the corpus insists on. Approximate signals optimize what you actually reward, and crude rewards corrupt quietly. Binary correctness rewards provably wreck calibration by encouraging confident guessing, which is fixable by adding a proper scoring rule like the Brier score (see "Does binary reward training hurt model calibration?"). Worse, the gains can be illusory: RL-tuned models still crater on out-of-distribution variants, suggesting they sharpen memorized templates rather than learn procedures (see "Do fine-tuned language models actually learn optimization procedures?"), and standard accuracy metrics can rise even as genuine reasoning-step quality drops (see "Does supervised fine-tuning improve reasoning or just answers?"). So noisy references can train reasoning, but whether they train *reasoning* or just better answer-matching depends entirely on what your approximate signal is secretly rewarding.

Sources 11 notes

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
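
To make the idea of a likelihood-based reward concrete, here is a minimal sketch, not VeriFree's actual implementation. It assumes you already have the policy's per-token log-probabilities for the reference answer, conditioned on the question and a sampled reasoning trace. The function names and the mean-centering step are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def reference_likelihood(ref_token_logprobs: List[float]) -> float:
    """P(reference answer | question, trace), from per-token log-probs.

    The log-probs are assumed to come from scoring the reference answer
    tokens after the sampled reasoning trace (hypothetical upstream step).
    """
    return math.exp(sum(ref_token_logprobs))


def likelihood_rewards(batch_ref_logprobs: List[List[float]]) -> List[float]:
    """One soft reward per sampled trace; no right/wrong check is involved."""
    return [reference_likelihood(lp) for lp in batch_ref_logprobs]


def centered(rewards: List[float]) -> List[float]:
    """Subtract the group mean (an assumed, generic variance-reduction baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

A trace that makes the reference answer more likely gets a larger reward than one that makes it less likely, even if neither is ever checked by a verifier.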

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
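
As a rough illustration of ranking traces by answer-span confidence, the sketch below builds synthetic preference pairs from sampled traces. It is a sketch under stated assumptions, not RLSF's procedure: the `Trace` type, the geometric-mean confidence measure, and the `min_gap` threshold are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str                     # full reasoning trace plus final answer
    answer_logprobs: List[float]  # log-probs of the tokens in the answer span


def span_confidence(trace: Trace) -> float:
    """Geometric mean of answer-token probabilities (an assumed confidence measure)."""
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: List[Trace], min_gap: float = 0.1) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) text pairs ordered by answer-span confidence.

    Pairs whose confidence gap is below min_gap are skipped, since they
    carry little ranking signal (hypothetical filter).
    """
    ranked = sorted(traces, key=span_confidence, reverse=True)
    pairs = []
    for higher, lower in combinations(ranked, 2):
        if span_confidence(higher) - span_confidence(lower) >= min_gap:
            pairs.append((higher.text, lower.text))
    return pairs
```

The resulting pairs could then feed any standard preference-training step; no human label or external verifier appears anywhere in the loop.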

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches the quality of dense feedback while avoiding the cost of outcome-only methods, which produce 2x excess tokens.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
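
The difference between the two rewards is easy to see numerically. The sketch below assumes the model reports a confidence in [0, 1] alongside its answer and combines correctness with a Brier penalty as correctness minus squared error. The function names and this exact combination are illustrative assumptions, not necessarily the formulation used in the source paper.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores the stated confidence entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


# A wrong answer earns the same binary reward whether the model was
# certain or unsure, so confident guessing costs nothing.
# With the Brier term, a confident wrong answer is penalized more heavily.
confident_wrong = brier_augmented_reward(False, 0.95)
hedged_wrong = brier_augmented_reward(False, 0.30)
```

Under the binary reward both wrong answers score 0.0; with the Brier term the confident wrong answer scores lower than the hedged one, which is the pressure toward calibration the note describes.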

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
