TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy—models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that, compared to vanilla RL, TruthRL significantly reduces hallucinations by 28.9% and improves truthfulness by 21.1%, with consistent gains across various backbone models (e.g., Qwen, Llama) under both retrieval and non-retrieval setups.
Truthfulness. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the problem set. For a model $f_\theta$, we evaluate its predictions $\hat{y}_i = f_\theta(x_i)$ and compute (i) accuracy (Acc), the fraction of questions answered correctly; (ii) uncertainty rate (Unc), the fraction of questions where the model abstains (e.g., answers “I don’t know”); and (iii) hallucination rate (Hall), the fraction of responses that are factually incorrect.
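To make these quantities concrete, the following is a minimal sketch of how Acc, Unc, and Hall could be computed over a set of predictions; the `is_correct` and `is_abstention` helpers are hypothetical placeholders standing in for the verifier and abstention detector, and are not part of the paper's released code.

```python
from typing import Callable, Dict, List

def truthfulness_metrics(
    predictions: List[str],
    answers: List[str],
    is_correct: Callable[[str, str], bool],   # hypothetical verifier comparing a prediction to the gold answer
    is_abstention: Callable[[str], bool],     # hypothetical detector for "I don't know"-style responses
) -> Dict[str, float]:
    """Compute accuracy (Acc), uncertainty rate (Unc), and hallucination rate (Hall)."""
    n_correct = n_abstain = n_halluc = 0
    for pred, gold in zip(predictions, answers):
        if is_correct(pred, gold):
            n_correct += 1
        elif is_abstention(pred):
            n_abstain += 1
        else:
            n_halluc += 1  # incorrect and not an abstention -> counted as a hallucination
    n = len(predictions)
    return {"Acc": n_correct / n, "Unc": n_abstain / n, "Hall": n_halluc / n}
```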
Reinforcement learning (RL). Traditional RL methods optimize the LLM with accuracy-based reward signals, provided by a verifier that determines whether a prediction is correct (Guo et al., 2025). Although RL typically generalizes better than SFT because it does not rely on direct supervision with ground-truth answers, vanilla RL is not explicitly designed to recognize uncertainty or abstain when appropriate. As a result, it may substantially increase correctness yet still fail to prevent hallucinations (Kang et al., 2025), as also observed in our preliminary findings.
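As a concrete contrast, the sketch below places an accuracy-only reward next to a ternary reward of the kind described in the abstract. The specific values (+1 / 0 / -1) are illustrative assumptions rather than the paper's exact implementation, and the `is_correct` / `is_abstention` helpers are the same hypothetical functions used in the metrics sketch above.

```python
def accuracy_reward(response: str, gold: str) -> float:
    """Vanilla RL reward: verifier-based correctness only, with no notion of abstention."""
    return 1.0 if is_correct(response, gold) else 0.0

def ternary_reward(response: str, gold: str) -> float:
    """Ternary truthfulness reward (illustrative values): reward correct answers,
    stay neutral on abstentions, and penalize hallucinations."""
    if is_correct(response, gold):
        return 1.0
    if is_abstention(response):
        return 0.0
    return -1.0  # factually incorrect, non-abstaining response -> hallucination
```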
3.1 Knowledge Boundary Probing
To enable the model to express uncertainty, we first construct appropriate training data for baselines by probing an LLM’s knowledge boundary on the training set to identify out-of-knowledge (OOK) questions. For each training question, we sample 256 responses, and the question is marked as OOK if none of the responses is correct. These questions are then relabeled with “I don’t know” as the ground-truth answer and used to train the model with the standard SFT objective. Similar approaches have been explored in prior works (Zhang et al., 2024; Yang et al., 2024b; Song et al., 2025), where samples with uncertain labels are incorporated into SFT training using various data construction strategies. We refer to this baseline as R-Tuning (Zhang et al., 2024).
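A minimal sketch of this OOK relabeling step is given below, assuming a hypothetical `sample_responses(model, question, k)` helper for drawing candidate answers and the same hypothetical `is_correct` verifier as above; the data construction in R-Tuning itself may differ in details.

```python
NUM_SAMPLES = 256        # responses sampled per training question, as in the probing setup
IDK = "I don't know"

def build_rtuning_data(model, train_set):
    """Relabel out-of-knowledge (OOK) questions with an abstention target for SFT."""
    sft_data = []
    for question, gold in train_set:
        responses = sample_responses(model, question, k=NUM_SAMPLES)  # hypothetical sampling helper
        is_ook = not any(is_correct(r, gold) for r in responses)      # OOK: no sampled response is correct
        target = IDK if is_ook else gold
        sft_data.append({"question": question, "target": target})
    return sft_data
```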
Further, we extend the idea of rejection sampling fine-tuning (RFT) (Yuan et al., 2023) with uncertain responses. Rather than directly learning the ground-truth answers, RFT trains the model on reasoning traces generated by the model itself. In this baseline, we prompt the model to generate multiple reasoning traces for each training question and select the trace that concludes with “I don’t know” as the target response for OOK questions, whereas for non-OOK questions, we select the trace that leads to the correct answer.
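The trace-selection logic for this RFT-style baseline could look like the following sketch, reusing the hypothetical `sample_responses`, `is_correct`, and `is_abstention` helpers from above together with a hypothetical `final_answer(trace)` extractor; questions for which no sampled trace satisfies the selection rule are simply skipped here, which may not match the paper's exact handling.

```python
def build_rft_data(model, train_set, ook_labels, k: int = 8):
    """Select one self-generated reasoning trace per question as the SFT target:
    an abstaining trace for OOK questions, a correct trace otherwise."""
    rft_data = []
    for (question, gold), is_ook in zip(train_set, ook_labels):
        traces = sample_responses(model, question, k=k)   # hypothetical sampling helper
        if is_ook:
            selected = [t for t in traces if is_abstention(final_answer(t))]
        else:
            selected = [t for t in traces if is_correct(final_answer(t), gold)]
        if selected:                                       # skip questions with no qualifying trace
            rft_data.append({"question": question, "target": selected[0]})
    return rft_data
```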