Reinforcement Learning for LLMs

Can three-way rewards fix the accuracy versus abstention problem?

Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?

Note · 2026-02-23 · sourced from Alignment

Standard RL for language models uses binary reward: correct or incorrect. This creates a forced trade-off. Optimizing for accuracy pushes the model to always answer, amplifying hallucinations. Optimizing for caution encourages abstention, sacrificing correct answers. Both extremes compromise truthfulness.

TruthRL introduces a ternary reward that treats correct answers, hallucinations, and abstentions as three distinct outcomes with different reward values. The key insight is that abstention should receive an intermediate reward — not as good as a correct answer, but better than a hallucination. This makes "I don't know" a learnable response that the model can select when genuinely uncertain.
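As a rough illustration, a reward function along these lines might look like the sketch below, with the binary baseline from the previous paragraph included for contrast. The +1 / 0 / -1 values and the surface-level abstention check are assumptions made for the example, not the paper's exact implementation.

```python
# Sketch of a ternary reward, with the binary baseline for contrast.
# Reward values and the abstention check are illustrative assumptions.
R_CORRECT, R_ABSTAIN, R_HALLUCINATE = 1.0, 0.0, -1.0

ABSTENTION_PHRASES = ("i don't know", "i do not know", "i'm not sure")

def is_abstention(response: str) -> bool:
    """Crude surface check for an abstaining answer."""
    return response.strip().lower().startswith(ABSTENTION_PHRASES)

def binary_reward(is_correct: bool) -> float:
    """Vanilla RL: abstaining and hallucinating are punished identically."""
    return 1.0 if is_correct else 0.0

def ternary_reward(response: str, is_correct: bool) -> float:
    """Correct > abstain > hallucinate, so 'I don't know' becomes learnable."""
    if is_correct:
        return R_CORRECT
    if is_abstention(response):
        return R_ABSTAIN
    return R_HALLUCINATE  # confident but wrong: the worst outcome
```

Because abstention sits strictly between the two extremes, the gradient no longer pushes the model to guess on questions it cannot answer, which is exactly the failure mode of the binary reward.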

The approach includes knowledge boundary probing: for each training question, 256 responses are sampled. If none is correct, the question is marked as out-of-knowledge (OOK) and relabeled with "I don't know" as the ground truth. This gives the model explicit examples of when abstention is appropriate, based on its own capability boundaries.
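A minimal sketch of that probing step, assuming hypothetical sample_responses and is_correct helpers for drawing model answers and grading them against the reference (the tuple return format is also just for illustration):

```python
IDK = "I don't know"
N_PROBES = 256  # responses sampled per training question

def probe_knowledge_boundary(question, gold_answer, sample_responses, is_correct):
    """Relabel questions the model never answers correctly as out-of-knowledge."""
    responses = sample_responses(question, N_PROBES)
    if any(is_correct(r, gold_answer) for r in responses):
        return question, gold_answer, False  # inside the model's knowledge boundary
    # No correct sample out of 256: mark as OOK and make abstention the target,
    # so "I don't know" is rewarded as the right behavior on this question.
    return question, IDK, True
```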

Across four knowledge-intensive benchmarks, TruthRL reduces hallucinations by 28.9% and improves truthfulness by 21.1% compared to vanilla RL, with consistent gains across Qwen and Llama backbones under both retrieval and non-retrieval setups.

This directly addresses the problem identified in "Does reasoning fine-tuning make models worse at declining to answer?": standard reasoning training degrades abstention because the binary reward doesn't value it, and the ternary reward restores the abstention signal. It also complements "Does binary reward training hurt model calibration?": both papers address the inadequacy of binary rewards, but from different angles, calibration via scoring rules versus truthfulness via ternary outcomes.


Source: Alignment

