Reinforcing General Reasoning without Verifiers

Paper · arXiv 2505.21493 · Published May 27, 2025
Reward Models · RLVR · Reinforcement Learning

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address these issues and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.

The simplicity of this approach, coupled with impressive performance improvements in mathematical reasoning tasks, has sparked a wave of follow-up work on RL with rule-based verifiable rewards [24, 26, 45], which we refer to as R1-Zero-style training in what follows. However, these methods remain limited to domains such as mathematics and code, where rule-based verification is feasible. Reasoning matters far beyond math and coding, yet the difficulty of answer verification in general reasoning tasks is a major obstacle to applying this training paradigm more broadly. To address this limitation, we investigate how to extend R1-Zero-style training to tasks where rule-based answer verification is not possible.

A natural extension, explored in recent general reasoning works [38, 27], is to introduce a specialized LLM as a verifier, analogous to the reward model used in RL from human feedback (RLHF) [51, 30]. In these methods, the model-based verifier is queried to decide whether a generated answer is equivalent to the reference answer. Although this bypasses the need for rule-based evaluation, it inherits several drawbacks of standard RLHF: it depends on the availability of a strong verifier LLM; it turns the R1-Zero-style paradigm into optimization of a model-based reward, making it vulnerable to reward hacking [7]; and it adds significant computational overhead, since an additional model must be held in memory and queried throughout training.
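For contrast, the model-based-verifier reward that VeriFree removes can be sketched in a few lines. The judge prompt and the `verifier_generate` callable below are illustrative assumptions rather than any particular system's API; the point is that every single reward computation requires prompting a second LLM that must be kept in memory alongside the policy.

```python
# Sketch of a model-based verifier reward (the setup VeriFree avoids).
# `verifier_generate` stands in for a call to a separate judge LLM; the
# prompt format is an illustrative assumption, not a published template.

VERIFIER_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Are the two answers equivalent? Reply YES or NO."
)

def verifier_reward(verifier_generate, question, reference, prediction):
    """Binary reward from a judge LLM; this model-based signal is what
    makes the pipeline vulnerable to reward hacking."""
    verdict = verifier_generate(VERIFIER_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```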

In this work, we propose an alternative: a verifier-free approach that preserves the benefits of the RL paradigm while eliminating the reliance on explicit verification, whether performed by rules or by models. Our method proceeds as follows. Given a question, we generate only the reasoning trace and concatenate it with the reference answer from the dataset. We then evaluate the likelihood of the reference answer conditioned on the question and the generated reasoning trace. This likelihood serves both as a reward signal for policy gradients on the reasoning trace and as a weighting term for supervised training on the reference answer. We term our method VeriFree, since it relies on neither rule- nor model-based verifiers, and illustrate it in Fig. 2.
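To make the objective concrete, the following is a minimal PyTorch sketch of the VeriFree loss for a single (question, sampled trace, reference answer) triple, assuming a Hugging Face-style causal LM that returns `.logits`. The function name and interface are our own illustrative choices, not the paper's released code; the sketch implements the REINFORCE-style gradient of E_z[π(a*|q, z)], in which the reference-answer likelihood rewards the trace and simultaneously weights a supervised term on the answer.

```python
import torch
import torch.nn.functional as F

def verifree_loss(model, question_ids, trace_ids, answer_ids):
    """VeriFree loss for one question q, sampled trace z, and reference
    answer a* (illustrative sketch; ids are 1-D LongTensors)."""
    # Score the full sequence q ++ z ++ a* with a single forward pass.
    seq = torch.cat([question_ids, trace_ids, answer_ids]).unsqueeze(0)
    logits = model(seq).logits                      # (1, T, vocab)

    # Log-probability of each token given its prefix.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)

    q_len, z_len = question_ids.numel(), trace_ids.numel()
    trace_lp = token_lp[:, q_len - 1 : q_len + z_len - 1].sum()   # log pi(z | q)
    answer_lp = token_lp[:, q_len + z_len - 1 :].sum()            # log pi(a* | q, z)

    # The reference-answer likelihood pi(a* | q, z) is the reward; it is
    # detached so it acts as a scalar weight, not a gradient path.
    reward = answer_lp.exp().detach()

    # Policy-gradient term on the trace plus a reward-weighted supervised
    # term on the answer: together, the REINFORCE gradient of E_z[pi(a*|q,z)].
    return -(reward * trace_lp + reward * answer_lp)
```

Because the reward is the policy's own probability of the reference answer, no verifier is ever queried or kept in memory: a single forward pass over the concatenated sequence yields both the policy-gradient and the weighted supervised term.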