Can three-way rewards fix the accuracy versus abstention problem?
Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?
Standard RL for language models uses a binary reward: correct or incorrect. This creates a forced trade-off. Optimizing for accuracy pushes the model to always answer, amplifying hallucinations. Optimizing for caution encourages blanket abstention, sacrificing correct answers. Both extremes compromise truthfulness.
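One way to see the forced trade-off: under a binary reward (1 for correct, 0 for everything else), abstaining earns the same 0 as a wrong answer, so a guess with any nonzero chance of being right weakly dominates saying "I don't know". A minimal worked example in Python:

```python
def expected_binary_reward(p_correct: float, abstains: bool) -> float:
    """Binary reward: 1 if correct, 0 otherwise (abstention included).

    E[guess]   = p_correct * 1 + (1 - p_correct) * 0 = p_correct
    E[abstain] = 0
    For any p_correct > 0, guessing is weakly better, so the optimizer
    learns to always answer, even when almost certainly wrong.
    """
    return 0.0 if abstains else p_correct

# Even a 1% chance of being right beats abstaining under binary reward.
assert expected_binary_reward(0.01, abstains=False) > expected_binary_reward(0.01, abstains=True)
```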
TruthRL introduces a ternary reward that treats correct answers, hallucinations, and abstentions as three distinct outcomes with different reward values. The key insight is that abstention should receive an intermediate reward — not as good as a correct answer, but better than a hallucination. This makes "I don't know" a learnable response that the model can select when genuinely uncertain.
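A minimal sketch of what such a ternary reward could look like. The +1 / 0 / -1 values and the string-match abstention detector are illustrative assumptions, not necessarily the paper's exact choices; the essential property is only the ordering correct > abstain > hallucinate.

```python
ABSTAIN_MARKERS = ("i don't know", "i do not know", "cannot answer")

def classify_response(response: str, is_correct: bool) -> str:
    """Map a sampled response to one of the three reward outcomes.

    The string match is a stand-in for whatever abstention detector
    (e.g., an LLM judge) the real training pipeline uses.
    """
    if any(marker in response.lower() for marker in ABSTAIN_MARKERS):
        return "abstain"
    return "correct" if is_correct else "hallucinate"

# Illustrative values (+1 / 0 / -1): the expected reward of guessing becomes
# 2 * p_correct - 1, so abstaining (reward 0) wins whenever p_correct < 0.5.
TERNARY_REWARD = {"correct": 1.0, "abstain": 0.0, "hallucinate": -1.0}

def ternary_reward(response: str, is_correct: bool) -> float:
    return TERNARY_REWARD[classify_response(response, is_correct)]
```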
The approach includes knowledge boundary probing: for each training question, 256 responses are sampled. If none is correct, the question is marked as out-of-knowledge (OOK) and relabeled with "I don't know" as the ground truth. This gives the model explicit examples of when abstention is appropriate, based on its own capability boundaries.
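A sketch of the probing step under stated assumptions: `sample_fn` and `grade_fn` are hypothetical hooks for drawing one model response and grading it against the gold answer, and the early exit is an efficiency shortcut that yields the same OOK decision as grading all 256 samples.

```python
def probe_knowledge_boundary(question: str, gold_answer: str,
                             sample_fn, grade_fn, n_samples: int = 256):
    """Label a training question out-of-knowledge (OOK) if the model
    fails to produce a single correct answer in n_samples attempts.

    sample_fn(question) -> str        one model response (hypothetical hook)
    grade_fn(response, gold) -> bool  correctness check (hypothetical hook)
    """
    for _ in range(n_samples):
        if grade_fn(sample_fn(question), gold_answer):
            # In-knowledge: at least one correct sample, keep the label.
            return question, gold_answer, False
    # OOK: never correct, so "I don't know" becomes the ground truth.
    return question, "I don't know", True
```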
Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared with vanilla RL, with consistent gains across Qwen and Llama backbones under both retrieval and non-retrieval setups.
This directly addresses the problem identified in "Does reasoning fine-tuning make models worse at declining to answer?": standard reasoning training degrades abstention because the binary reward doesn't value it, and the ternary reward restores the abstention signal. It also complements "Does binary reward training hurt model calibration?": both papers target the inadequacy of binary rewards, but from different angles, calibration via proper scoring rules versus truthfulness via ternary outcomes.
Source: Alignment
Related concepts in this collection
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  Relation: the problem TruthRL solves; standard training destroys abstention capacity.
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  Relation: complementary approach; proper scoring rules target calibration, while the ternary reward targets truthfulness.
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  Relation: a third approach to the same binary-reward inadequacy.
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  Relation: the ternary reward directly addresses this bidirectional problem; by making abstention a learnable intermediate-reward option, it provides a mechanism to correct both under-abstention (reasoning-trained models) and over-abstention (safety-trained models) toward calibrated abstention.
Original note title
ternary reward that distinguishes correct answers, hallucinations, and abstentions solves the accuracy-abstention trade-off in RL for truthfulness