Learning to Reason for Factuality
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or less relevant responses. We propose a novel reward function that simultaneously considers factual precision, response detail level, and answer relevance, and apply online RL to learn high-quality factual reasoning.
This leads us to the following research question:
RQ: Can we learn reasoning strategies that improve the factuality of an (R-)LLM?
Traditionally, RL alignment optimizes for verifiable rewards (RLVR, Lambert et al., 2025) in domains such as mathematics and programming, or human preferences (RLHF, Ouyang et al., 2022) for general instruction following. In contrast, factuality, especially in long-form generations, does not lend itself well to either approach. There is no reliable method to deterministically and accurately verify the factuality of a long-form response, and human verification requires significant manual effort, making it both expensive and time-consuming (Min et al., 2023). Although there are automatic evaluation frameworks for long-form factuality, such as FActScore (Min et al., 2023), which have been employed in previous factuality alignment work (Tian et al., 2023; Lin et al., 2024), these methods are limited to offline RL, where they create pairwise preference data for Direct Preference Optimization (DPO, Rafailov et al., 2023). In contrast, online RL offers notable advantages: it is integral to recent advances in R-LLMs (DeepSeek-AI, 2025b), and prior work consistently demonstrates the benefits of training on on-policy data (e.g., self-generated responses) for improving factual accuracy (Lin et al., 2024; Zhang et al., 2024, inter alia). However, applying online RL to learn factual reasoning in long-form responses remains an open problem with several outstanding challenges.
The first challenge lies in the reward design. In our experiments, we find that optimizing solely towards a factuality reward may result in unintended outcomes. The model learns to produce much shorter and less detailed responses as a shortcut to achieve higher factual precision, because it is significantly easier for an LLM to generate a single correct fact than to produce a detailed answer containing, for example, 50 facts without any hallucinations, even though both would have perfect factual precision. Furthermore, even if the reward manages to consider both factuality and the level of detail in the answer, it remains possible to falsely inflate (hack) the reward by producing less pertinent, or in extreme cases, irrelevant answers. Consider the following extreme example: a model recites the same Wikipedia article, which is both factual and detailed, in response to every question it is asked. Such a model would be utterly useless, yet it would achieve very high scores in both factuality and detail level. Finally, existing automatic long-form factuality evaluation methods, which typically involve LLM-based atomic claim extraction and verification along with web searches to find relevant evidence documents, are very time-consuming. This makes them unsuitable for real-time reward calculation in online RL. For instance, VeriScore (Song et al., 2024), a recent long-form factuality evaluation method, can take several minutes to verify a single response.
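To make the shortcut concrete, the following is a minimal sketch of a precision-only reward; the helper name and the claim counts are hypothetical, not the paper's implementation:

```python
# Minimal sketch (hypothetical, not the paper's implementation): factual
# precision alone rewards terse answers.

def factual_precision(supported_claims: int, total_claims: int) -> float:
    """Fraction of extracted atomic claims that a verifier marks as supported."""
    if total_claims == 0:
        return 0.0
    return supported_claims / total_claims

# A one-fact answer scores perfectly; a detailed 50-fact answer with two
# unsupported claims scores lower, despite being far more informative.
print(factual_precision(1, 1))    # 1.00
print(factual_precision(48, 50))  # 0.96
```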
In this work, we propose the first online RL recipe for long-form factuality, with a novel reward function that addresses these challenges. In particular, our factual reasoning reward has three components to mitigate the various ways of hacking the reward described above: it simultaneously considers (1) factual precision, (2) response detail level, and (3) answer relevance. For computing (1) and (2), we implement an optimized and scalable version of VeriScore, achieving up to a 30x speedup, which makes it suitable for real-time reward calculation in online RL rollouts. For (3), we combine these rewards with the overall quality of the response measured using LLM-as-a-Judge. We evaluate our method on six long-form factuality benchmarks, including LongFact (Wei et al., 2024), FAVA (Mishra et al., 2024), AlpacaFact (Dubois et al., 2024; Lin et al., 2024), Biography (Min et al., 2023), FactBench (Bayat et al., 2024), and FACTORY (Chen et al., 2025a), showing that our factual reasoning model trained with online RL using GRPO (Shao et al., 2024) achieves an average of 23.1 points higher factual precision while producing 23% more factual statements in the responses, without degradation in the overall response helpfulness (LLM-as-a-judge win rate >50% over the base model).
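A minimal sketch of how such a three-component reward could be combined is shown below; the weights, the capped detail term, and all helper names are our illustrative assumptions rather than the paper's exact formulation:

```python
# Hypothetical sketch of a three-component factual reasoning reward.
# The linear combination, the capping of the detail term, and the target
# detail level are illustrative assumptions, not the paper's exact recipe.

def factual_reasoning_reward(
    supported: int,        # atomic claims verified as supported
    total: int,            # all atomic claims extracted from the response
    target_detail: int,    # reference number of facts expected for this prompt
    relevance: float,      # LLM-as-a-judge quality score, assumed in [0, 1]
    w_prec: float = 1.0,
    w_detail: float = 1.0,
    w_rel: float = 1.0,
) -> float:
    precision = supported / total if total > 0 else 0.0
    detail = min(supported / target_detail, 1.0) if target_detail > 0 else 0.0
    return w_prec * precision + w_detail * detail + w_rel * relevance
```

Under this shape, a terse one-fact response maxes out the precision term but earns almost no detail reward, while a detailed but off-topic recitation is penalized through the relevance term, blocking both hacking routes described above.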
2.1 Training Prompt Curation
We adopt a new approach to curating the training prompt set so that the prompts i) are likely to appear in real-world scenarios, and ii) incentivize factuality as a major factor of a high-quality response. In particular, we adopt Llama 4 to generate synthetic prompts by providing two sets of grounding prompts as demonstrations: one set of diverse real-world prompts from WildChat (Zhao et al., 2024) and another set of fact-seeking prompts from the non-test split of LongFact (Wei et al., 2024). The goal is for the model to generate prompts that are diverse and likely to be asked by real humans, similar to the examples in the first group of grounding prompts, while also requiring factual knowledge to provide a good answer, as seen in the second group of grounding prompts.
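A sketch of this two-group few-shot construction is given below; the instruction wording and the function name are hypothetical, and the actual prompt used in the paper may differ:

```python
# Hypothetical sketch of the two-group few-shot prompt used to elicit
# synthetic training prompts; the wording is illustrative, not the paper's.

def build_generation_prompt(wildchat_demos: list[str],
                            longfact_demos: list[str],
                            n_new: int = 5) -> str:
    lines = ["Examples of diverse real-user prompts:"]
    lines += [f"- {p}" for p in wildchat_demos]
    lines += ["", "Examples of fact-seeking prompts:"]
    lines += [f"- {p}" for p in longfact_demos]
    lines += [
        "",
        f"Write {n_new} new prompts that are as diverse and natural as the "
        "first group, and that require detailed factual knowledge to answer "
        "well, like the second group. Output one prompt per line.",
    ]
    return "\n".join(lines)
```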
2.2 Supervised Finetuning (SFT)
In preliminary experiments, we find it beneficial to first perform supervised finetuning (SFT) before applying RL algorithms. In particular, in offline DPO experiments, the model struggles to consistently follow the Long CoT reasoning format without SFT, even after additional preference data on format-following was added. On the other hand, SFT effectively teaches the model to follow the Long CoT format to produce a reasoning chain (wrapped in