UR2: Unify RAG and Reasoning through Reinforcement Learning

Paper · arXiv 2508.06165 · Published August 8, 2025
RAG · Deep Research · Reinforcement Learning

Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope—typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: difficulty-aware curriculum training that invokes retrieval only for challenging problems, and a hybrid knowledge access strategy that combines domain-specific offline corpora with LLM-generated summaries. Together, these components enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks.

Large Language Models (LLMs) have achieved remarkable performance across diverse tasks by incorporating external knowledge (Retrieval-Augmented Generation, RAG) [19, 1, 14] and by optimizing reasoning through reinforcement learning with verifiable rewards (RLVR) [7]. RAG methods enable LLMs to access external knowledge, while RLVR has shown strong gains on mathematical and logical reasoning [43, 2]. Motivated by these successes, recent work has begun to integrate retrieval and reasoning: for example, Search-o1 [22] embeds an agentic RAG workflow into the LLM's chain-of-thought, and RAG-Gym [39] proposes a unified RL-based training framework for RAG agents. Similarly, RAG-RL methods such as R1-Searcher [31] and Search-R1 [15] learn to invoke retrieval through RL, using RLVR to train models on when and what to retrieve during reasoning and improving answer accuracy in open-domain QA.

Despite this progress, RAG-RL frameworks remain limited in scope. Most focus narrowly on open-domain QA, with retrieval tied to fixed reasoning steps or static knowledge sources like Wikipedia. R1-Searcher and Search-R1, for instance, assume access to broad static knowledge bases, making them ill-suited for tasks that require specialized or real-time information. While methods like DeepResearcher attempt training in real web environments, they face inefficiencies from the noisy and unstructured nature of online data [45]. Others, like ZeroSearch [34], use LLM-generated corpora to simulate retrieval, avoiding API costs but risking hallucination and losing real-world complexity. These limitations highlight the need for a more versatile framework that supports broad-domain reasoning and flexibly integrates dynamic knowledge into the reasoning process.

To address the limitations of existing RAG-RL approaches—such as static retrieval, limited domain generalization, and poor robustness in noisy environments—we propose a general and adaptive framework, UR2 (Unified RAG and Reasoning), which uses RL to dynamically coordinate retrieval and reasoning. Unlike prior methods that rely solely on static corpora (e.g., Wikipedia) or simulate retrieval with synthetic content, UR2 combines both: it leverages task-specific offline corpora for accurate grounding, augmented with LLM-generated summaries for efficiency and generalization. To address the imbalance between retrieval and reasoning in prior methods, we design a difficulty-aware curriculum that adaptively controls when to trigger retrieval during training. Specifically, retrieval is used only for harder instances, encouraging the model to rely on internal reasoning when possible and to learn retrieval strategies only when necessary. This reduces retrieval overhead, improves query quality on challenging questions, and preserves reasoning capability across tasks.
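To make the hybrid knowledge access strategy concrete, below is a minimal sketch of how offline retrieval might be combined with an LLM-generated summary. The interfaces here (`retrieve_offline`, `summarize_with_llm`, the word-overlap scorer) are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of UR2-style hybrid knowledge access.
# retrieve_offline() and summarize_with_llm() are assumed helpers,
# not APIs from the paper's released code.

from dataclasses import dataclass

@dataclass
class Evidence:
    source: str  # "offline_corpus" or "llm_summary"
    text: str

def retrieve_offline(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    """Toy lexical retriever over a domain-specific offline corpus.

    A real system would use BM25 or a dense index; word overlap is
    used here only to keep the sketch self-contained.
    """
    scored = [(len(set(query.lower().split()) & set(doc.lower().split())), doc)
              for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def summarize_with_llm(query: str, docs: list[str]) -> str:
    """Placeholder for the LLM-generated summary step.

    In practice this would call an LLM to distill the retrieved
    passages into a compact, query-focused summary.
    """
    joined = " ".join(docs)
    return f"Summary for '{query}': {joined[:200]}"

def hybrid_knowledge_access(query: str, corpus: list[str]) -> list[Evidence]:
    """Combine grounded offline retrieval with an LLM summary,
    mirroring the hybrid strategy described above."""
    docs = retrieve_offline(query, corpus)
    evidence = [Evidence("offline_corpus", d) for d in docs]
    if docs:
        evidence.append(Evidence("llm_summary", summarize_with_llm(query, docs)))
    return evidence
```

The design intent, as described above, is that the offline corpus provides accurate grounding while the LLM summary adds efficiency and generalization; the sketch simply returns both as labeled evidence for downstream reasoning.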

2.1 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external information to reduce hallucinations [6]. Early RAG methods concatenate retrieved documents with input prompts [19, 14, 1]. Subsequent approaches have evolved in multiple directions: advanced RAG methods incorporate sophisticated retrieval and re-ranking strategies [6, 25]; post-hoc verification methods address hallucinations by retrieving documents based on generated responses [21, 33]; and graph-based RAG methods integrate knowledge graphs for multi-hop reasoning [5, 13, 25]. Recent RL-RAG frameworks have explored retrieval integration during training via real-time or synthetic retrieval [45, 34]. However, these approaches remain constrained by static retrieval strategies, limited domain generalization, and an inability to dynamically coordinate retrieval with reasoning across diverse task types.

3.1.2 Difficulty-Aware Curriculum Design

We organize this section into two components: (1) training data selection based on difficulty levels; and (2) a task mixing strategy that balances retrieval and reasoning exposure.

Training Data Selection. To promote fine-grained reasoning and retrieval behaviors, we categorize training samples by difficulty. For each question, we perform multiple rollouts with a baseline model and compute the average performance score s. Based on s, questions are assigned to three difficulty levels: Easy (0.8 ≤ s ≤ 1.0), Medium (0.5 ≤ s < 0.8), and Hard (0.2 ≤ s < 0.5).
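The bucketing rule above can be written as a small routine. In the sketch below, the rollout count, the `score_fn(question, answer)` interface, and the baseline model's `generate()` method are assumptions for illustration; questions scoring outside the listed ranges (e.g., s < 0.2) are left unlabeled, since the text does not specify how they are handled.

```python
# Minimal sketch of difficulty-aware data selection. score_fn and
# baseline_model.generate() are hypothetical interfaces, assumed to
# return a score in [0, 1] and a model answer, respectively.

from statistics import mean

def difficulty_level(avg_score: float) -> str | None:
    """Map an average rollout score s to the levels given in the text."""
    if 0.8 <= avg_score <= 1.0:
        return "easy"
    if 0.5 <= avg_score < 0.8:
        return "medium"
    if 0.2 <= avg_score < 0.5:
        return "hard"
    return None  # outside the listed ranges; handling unspecified in the text

def label_by_difficulty(questions, baseline_model, score_fn, n_rollouts=8):
    """Roll out the baseline model n_rollouts times per question and
    bucket each question by its average performance score."""
    labeled = {"easy": [], "medium": [], "hard": []}
    for q in questions:
        scores = [score_fn(q, baseline_model.generate(q)) for _ in range(n_rollouts)]
        level = difficulty_level(mean(scores))
        if level is not None:
            labeled[level].append(q)
    return labeled
```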