Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains

Paper · arXiv 2503.23829 · Published March 31, 2025
RLVR · Reward Models · Domain Specialization

Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing the mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are available for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs, provided that expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome the limitations posed by binary verification, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs, without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains across these diverse domains.

RLVR typically leverages reference-based signals, assuming the availability of objective ground-truth answers to determine whether model responses align with reference outcomes. In prior studies, RLVR has mainly demonstrated success on tasks with precisely structured solutions, such as mathematical reasoning or code generation, where binary verification signals (correct or incorrect) can be reliably computed with simple rule-based verifiers (Team et al., 2025; Gandhi et al., 2024; Zhang et al., 2024b). Nonetheless, the extension of RLVR to broader, more nuanced domains remains largely unexplored, due primarily to the challenges associated with verifying complex, frequently unstructured reference answers.

In this paper, we aim to extend the applicability of RLVR to domains beyond structured mathematics and coding, by investigating its performance in a diverse set of complex, reasoning-intensive areas such as medicine, chemistry, psychology, economics, and education. Central to this exploration is the observation that binary correctness judgments, even on broad-domain tasks, tend to exhibit remarkable agreement across varied large language models (LLMs), provided that expert-written reference answers are available.

While binary rewards have been the prevalent standard across RLVR applications (Gandhi et al., 2024; Lambert et al., 2024; Guo et al., 2025; Ma et al., 2025a), they pose clear limitations—especially for unstructured tasks. Notably, our data analysis on real-world exam questions reveals that only 60.3% of mathematical problems possess single-term numerical answers verifiable by rule-based methods, with the ratio dropping further to 45.4% for complex multi-domain queries. This presents inherent challenges for binary reward schemes and demonstrates the need for richer and more granular verification mechanisms. To address these limitations, we propose incorporating soft scores obtained from generative, model-based verifiers directly into RLVR. Specifically, we compute a soft reward from the probability of a single indicative token produced by a generative verifier summarizing its assessment. Crucially, we demonstrate that it is feasible to distill effective multi-domain generative verifier models based on relatively compact models (sizes as small as 7B) without conducting extensive domain-specific annotation. Instead, we employ data composed of response samples and their corresponding judgments collected during RL exploration under the supervision of a larger cross-domain generative teacher model. These noisy yet more realistic datasets promote the robustness of the subsequently distilled model-based rewards.
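To make the soft-reward computation concrete, the sketch below shows one way to read the probability of a single indicative token off a generative verifier. The checkpoint name, prompt template, and the choice of " Yes" as the indicative token are illustrative assumptions, not details taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical verifier checkpoint; any generative judge model works here.
MODEL_NAME = "my-org/generative-verifier-7b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

# Hypothetical judge template ending right before the judgment token.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Is the model response consistent with the reference answer? "
    "Answer Yes or No.\nJudgment:"
)

@torch.no_grad()
def soft_reward(question: str, reference: str, response: str) -> float:
    """Return P(' Yes') at the judgment position as a soft reward in [0, 1]."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, response=response
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # next-token logits at the judgment slot
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    return probs[yes_id].item()
```

Because the reward is a probability rather than a hard 0/1 label, partially correct free-form answers receive graded credit instead of being rounded down to failure.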

We focus on a setting where each prompt x is accompanied by an expert-written reference answer a. Reference answers have been shown to play a crucial role in providing accurate rewards for reinforcement learning in reasoning-intensive tasks such as coding and mathematics (Shao et al., 2024). Ideally, in these domains, a response y can be objectively verified against the given reference answer a. However, in practice, this verification process may be influenced by factors such as imperfect answer extraction and matching when pattern-based verifiers are used, as well as noise introduced by automated evaluation systems, such as a reward model rϕ(x, a, y).
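For contrast, the kind of pattern-based binary verifier referenced above can be sketched as follows. The extraction patterns are hypothetical; the brittleness of this extraction step is precisely the noise source described in the text.

```python
import re

def rule_based_verify(response: str, reference: str) -> int:
    """Binary verifier: extract a final answer from the response and compare.

    A minimal sketch; real pattern-based verifiers are more elaborate, and
    failures of this extraction step are one source of reward noise.
    """
    # Try an explicit "Answer: ..." pattern first, then fall back to the
    # last number that appears in the response.
    m = re.search(r"[Aa]nswer\s*[:=]\s*([^\n]+)", response)
    extracted = m.group(1).strip() if m else None
    if extracted is None:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        extracted = numbers[-1] if numbers else ""
    return int(extracted.strip().rstrip(".") == reference.strip())
```

Any free-form answer that does not match these patterns is scored 0 regardless of its actual correctness, which is why such verifiers cover only a fraction of real-world exam questions.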

Since there are no ground-truth reward labels, for each (x, a, y) triple, we prompt a fixed LLM to obtain a binary judgment c ∈ {0, 1}, indicating whether y matches the reference answer a. During the RL phase, we collect the data {(x, a, y, c)} from the exploration stages and use it to fine-tune our reward models with supervised learning on c. Unlike responses generated by a fixed LLM, those produced by the improving actor policy vary in performance and may carry formatting noise, which may enhance the robustness of the trained reward models.
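A minimal sketch of how the collected {(x, a, y, c)} tuples can be turned into supervised fine-tuning data for the verifier. The prompt template and one-token " Yes"/" No" target follow the soft-reward setup sketched earlier, and the -100 label value is the Hugging Face convention for masking prompt tokens out of the loss; all names are illustrative assumptions.

```python
import torch
from torch.utils.data import Dataset

class JudgmentDataset(Dataset):
    """Wraps (x, a, y, c) tuples collected during RL exploration.

    Each example becomes a judge prompt plus a one-token target
    (' Yes' if c == 1 else ' No'), so distillation reduces to standard
    next-token cross-entropy on the judgment token.
    """

    def __init__(self, records, tokenizer, template):
        self.records = records      # list of (x, a, y, c) tuples
        self.tokenizer = tokenizer
        self.template = template    # e.g. the JUDGE_TEMPLATE sketched above

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        x, a, y, c = self.records[i]
        prompt = self.template.format(question=x, reference=a, response=y)
        target = " Yes" if c == 1 else " No"
        prompt_ids = self.tokenizer.encode(prompt)
        target_ids = self.tokenizer.encode(target, add_special_tokens=False)
        input_ids = prompt_ids + target_ids
        # Mask the prompt so loss is computed only on the judgment token(s).
        labels = [-100] * len(prompt_ids) + target_ids
        return {
            "input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels),
        }
```

Training on exploration-stage responses, rather than outputs of a single fixed generator, exposes the verifier to the shifting quality and formatting of the policy's samples, which is the robustness effect noted above.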