DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Paper · arXiv 2511.19399 · Published November 24, 2025
Evolution · Self Refinement · Self Consistency · Feedback · Reinforcement Learning

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query.

In this paper, we introduce Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research tasks. To address the challenge of verification in long-form tasks, DR Tulu is first finetuned on high-quality, naturally occurring user data, and then trained via a new method we call Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training. As Figure 2 illustrates, at each training step, we sample several responses and search traces from the model and generate new rubrics that capture and contrast the good and bad points of these responses. This allows us to dynamically incorporate newly explored information into the rubrics and to ensure that they provide on-policy feedback that can discriminate among model responses.
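To make the training loop concrete, the following is a minimal Python sketch of one RLER step as described above. All helper names (`sample_with_search`, `propose_contrastive_rubrics`, `score_against_rubrics`, `policy_gradient_update`) are hypothetical stand-ins for the corresponding stages, not the authors' actual API.

```python
# Minimal sketch of one RLER training step; helper functions are hypothetical
# stand-ins for the stages described in the text, not the paper's code.

def rler_step(policy, question, rubric_pool, num_rollouts=8):
    # Sample several responses together with their search traces (on-policy rollouts).
    rollouts = [policy.sample_with_search(question) for _ in range(num_rollouts)]

    # Generate new rubrics that capture and contrast the good and bad points
    # of these rollouts, grounded in the retrieved search context.
    new_rubrics = propose_contrastive_rubrics(
        question,
        responses=[r.response for r in rollouts],
        search_context=[r.search_trace for r in rollouts],
    )
    rubric_pool.extend(new_rubrics)  # rubrics co-evolve with the policy

    # Score each rollout against the current rubric pool to get a scalar reward.
    rewards = [score_against_rubrics(r.response, rubric_pool) for r in rollouts]

    # Standard RL update (e.g., a PPO/GRPO-style step) on the rubric-based rewards.
    policy_gradient_update(policy, rollouts, rewards)
```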

3.2 Evolving Rubrics

Designing rubrics for long-form deep research tasks is particularly challenging. First, long-form questions are often under-specified and admit many plausible ways a response could be good or bad, so a small set of fixed criteria cannot capture all relevant dimensions of quality. Second, DR tasks are highly knowledge-intensive: reliable evaluation requires checking claims against a broad, evolving corpus of world knowledge, rather than relying solely on an LM’s parametric knowledge. As a result, closed-book rubrics generated directly by an LM can miss critical evidence, fail to distinguish subtle errors, and are vulnerable to reward hacking by models that exploit judge biases.

We address these challenges by constructing rubrics that co-evolve with the policy model and are grounded in knowledge retrieved from the internet. Specifically, instead of trying to exhaustively enumerate all possible desiderata, our method generates rubrics tailored to the current policy model's behaviors, offering on-policy feedback the model can effectively learn from. Furthermore, the rubrics are generated with retrieval, ensuring they cover the knowledge needed to assess the generation. Figure 3 provides a detailed illustration of the core RLER training process.
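A rough sketch of how retrieval-grounded rubric generation could look, assuming rubrics are produced by prompting a judge LM on web search results; `web_search` and `judge_lm` are hypothetical interfaces, and the prompt wording is illustrative rather than the paper's.

```python
# Sketch of retrieval-grounded rubric generation. `web_search` and `judge_lm`
# are assumed interfaces, not the authors' implementation.

def generate_search_based_rubrics(question, responses, judge_lm, web_search, k=10):
    # Retrieve evidence so rubrics can assert concrete facts instead of
    # relying only on the judge LM's parametric knowledge.
    evidence = web_search(question, top_k=k)

    prompt = (
        "Question:\n" + question + "\n\n"
        "Retrieved evidence:\n" + "\n".join(evidence) + "\n\n"
        "Candidate responses:\n" + "\n---\n".join(responses) + "\n\n"
        "Write rubrics that distinguish stronger from weaker responses. "
        "Each rubric should be assertive: name the specific facts, sources, "
        "or claims a good response must contain."
    )
    # One rubric per line of the judge's output.
    return judge_lm(prompt).splitlines()
```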

Search-based and evolving rubrics make verification criteria more concrete and factual. Table 1 compares the specificity of four rubric types. We define a rubric as assertive if it is specific and concrete about what the response should contain (e.g., “The response should mention benchmarks A and B”), and descriptive otherwise (e.g., “The response should discuss benchmarks”). Descriptive rubrics are easier to generate since they do not require factual knowledge, but they often fail to assess response quality accurately: a model may score well by superficially mentioning a point or even hallucinating facts. We measure the fraction of assertive rubrics, and their factuality, using an LM; experimental details are in Appendix B. As Table 1 shows, general rubrics lack specific evaluation criteria, and instance-wise rubrics generated by a closed-book LM are relatively vague (only 22% are assertive). In contrast, initial search-based rubrics and evolving search-based rubrics are more concrete, with over 50% being assertive. These advantages come from grounding: search-based rubrics draw on retrieved information, and evolving rubrics are generated using the search context, which makes them better suited for training.
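The assertive/descriptive measurement could be implemented with a simple LM-judge classifier along the following lines; this is a sketch only, with an assumed `judge_lm` callable and illustrative prompt wording, while the actual protocol is the one in Appendix B.

```python
# Sketch of measuring the fraction of assertive rubrics with an LM judge.
# The prompt and `judge_lm` interface are assumptions, not the paper's setup.

def fraction_assertive(rubrics, judge_lm):
    labels = []
    for rubric in rubrics:
        prompt = (
            "Label the rubric as ASSERTIVE if it names specific, concrete "
            "content the response must contain (e.g., particular benchmarks, "
            "numbers, or sources), or DESCRIPTIVE if it only names a topic "
            "to discuss.\n\nRubric: " + rubric + "\nLabel:"
        )
        labels.append(judge_lm(prompt).strip().upper().startswith("ASSERTIVE"))
    return sum(labels) / len(labels)
```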

Evolving rubrics adjust the evaluation criteria as the policy model evolves. Static rubrics can fail to capture unexpected behaviors or insights that emerge during training. As an illustration, we conducted RL training on a single question, “Write a survey paper about RAG” (details in Appendix B). Unexpectedly, some rollouts contained Python code (e.g., Figure 15 in Appendix B), an artifact of the Qwen model also reported by Shao et al. (2025); this is undesirable but hard for an initial rubric to anticipate. In contrast, evolving rubrics identify such issues and provide negative feedback on the irrelevant code, leading to fewer code-containing responses over the course of training (Figure 4).
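To illustrate the mechanism, the sketch below hard-codes the code-artifact criterion with a regex and appends a negatively weighted rubric; in RLER itself this criterion would be discovered by the rubric-generation model from the rollouts rather than hand-written, so the detector and rubric format here are purely illustrative.

```python
# Illustrative only: in RLER the code-artifact criterion is discovered by the
# rubric-generation LM, not hard-coded. The regex is a stand-in detector.
import re

CODE_PATTERN = re.compile(r"```python|^\s*(def |import )", re.MULTILINE)

def evolve_rubrics_for_code_artifact(rollouts, rubric_pool):
    # If any rollout contains stray Python code, add a negative rubric so the
    # reward penalizes code-containing responses in later training steps.
    if any(CODE_PATTERN.search(response) for response in rollouts):
        rubric_pool.append({
            "criterion": "The response must not include irrelevant code "
                         "snippets; it should be written as prose.",
            "weight": -1.0,  # negative feedback on the undesired behavior
        })
    return rubric_pool
```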