Does supervising retrieval steps outperform final answer rewards?
Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because outcome signals are noisy: a poor retrieval path can accidentally succeed, and a good one can fail under an imperfect metric.
Agentic RAG systems must make sequences of retrieval decisions — which query to issue next, which documents to process, when to stop retrieving. Training these systems on final answer accuracy alone (outcome-only reward) evaluates the end result without supervising the path. Poor intermediate retrieval decisions can accidentally produce correct final answers; good decisions can be penalized by noisy evaluation metrics.
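To make the decision sequence concrete, here is a minimal sketch of such a retrieval loop. The interfaces (`decide`, `search`, `answer`) are hypothetical illustrations, not RAG-Gym's actual API:

```python
# Hypothetical sketch of an agentic retrieval loop; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    retrieved: list[str] = field(default_factory=list)

def run_episode(decide, search, answer, question, max_steps=5):
    """Roll out one retrieval trajectory. At each step the policy either
    issues another query or stops and answers; the (state, action) path
    is exactly what outcome-only reward never inspects."""
    state = State(question)
    trajectory = []
    for _ in range(max_steps):
        action = decide(state)  # e.g. {"type": "search"|"answer", ...}
        trajectory.append((state, action))
        if action["type"] == "answer":
            return action["text"], trajectory
        # Append newly fetched documents and continue the episode.
        state = State(question, state.retrieved + search(action["query"]))
    return answer(state), trajectory

# Toy policy: search once, then answer from whatever was retrieved.
decide = lambda s: ({"type": "search", "query": s.question} if not s.retrieved
                    else {"type": "answer", "text": s.retrieved[0]})
search = lambda q: [f"doc about {q}"]
final, path = run_episode(decide, search, lambda s: "", "who wrote Hamlet?")
```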
RAG-Gym demonstrates that fine-grained process supervision — providing reward signals for individual intermediate retrieval steps, not just the final answer — substantially boosts agentic RAG performance. The improvement comes from two directions: correct retrieval steps are explicitly rewarded, and incorrect steps (retrieving irrelevant documents, issuing redundant queries) are explicitly penalized.
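A sketch of the contrast between the two reward schemes, assuming a hypothetical per-step reward model `step_judge` (not RAG-Gym's actual reward interface):

```python
def process_rewards(trajectory, step_judge):
    """Process supervision: score every intermediate (state, action) pair.
    `step_judge` is a hypothetical per-step reward model; good retrievals
    get positive scores, irrelevant or redundant queries negative ones."""
    return [step_judge(state, action) for state, action in trajectory]

def outcome_reward(final_answer, gold_answer):
    """Outcome-only baseline: a single scalar for the whole trajectory,
    which credits lucky paths and misses good paths a noisy metric fails."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
```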
Three post-training algorithms were compared: PPO, DPO, and online DPO. DPO with both positive and negative feedback significantly outperforms PPO and single-direction training. The mechanism: DPO trains the model to prefer good retrieval chains over bad ones by directly contrasting them. Providing negative examples (what a bad intermediate step looks like) gives the model a gradient direction that outcome-only reward cannot supply.
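The standard DPO objective (Rafailov et al., 2023) makes this contrast concrete when applied per step. The function below is an illustrative sketch; the per-step framing of preference pairs is an assumption about the setup, not RAG-Gym's exact training code:

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss applied to retrieval steps. `logp_good`/`logp_bad`
    are the policy's log-probs of a preferred vs. dispreferred intermediate
    action; `ref_*` are the frozen reference model's. The contrast with the
    bad step supplies the gradient direction outcome-only reward cannot."""
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with dummy log-probabilities for one preference pair.
loss = step_dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                     torch.tensor([-1.5]), torch.tensor([-1.5]))
```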
The parallel to reasoning: the related note "Does failed-step fraction predict reasoning quality better?" shows that in reasoning chains, intermediate step quality predicts final quality better than global features do. RAG-Gym shows the same at the agentic level: retrieval-step quality determines answer quality better than final-answer reward alone can capture.
Source: RAG
Related concepts in this collection
- Does failed-step fraction predict reasoning quality better?
  Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
  Connection: same principle at the reasoning level; intermediate step quality predicts outcome quality; the insight transfers from reasoning chains to retrieval chains.
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  Connection: RL refines the path, not just the endpoint; process-level supervision is a more direct version of this principle.
- Can agents learn to reason better without just chasing rewards?
  Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning (planning, reflection, exploration, monitoring) rather than simply optimizing for task success through any means necessary.
  Connection: parallel agentic process supervision. RLVMR provides programmatic meta-reasoning rewards (planning/exploration/reflection/monitoring) for agentic navigation; RAG-Gym provides step-level retrieval rewards for agentic search; both demonstrate that outcome-only RL reinforces flawed trajectories in agentic settings.
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  Connection: L2T provides the information-theoretic framework explaining why process rewards outperform outcome-only rewards; per-episode information gain quantifies each step's contribution to correctness, which is exactly what outcome-only reward cannot supply. This is the theoretical grounding for RAG-Gym's empirical finding.
- Can document count be learned instead of fixed in RAG?
  Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?
  Connection: complementary RL in RAG. DynamicRAG learns what to include (document selection); RAG-Gym learns how to retrieve (step quality); both use generator output as the reward signal.
- When should retrieval actually help versus hurt reasoning?
  Retrieval augmentation seems universally beneficial, but does it always improve reasoning? This explores whether some reasoning steps benefit from internal knowledge alone, and when external retrieval introduces harmful noise rather than useful information.
  Connection: shared MDP framing. DeepRAG learns per-step retrieve-or-not decisions; RAG-Gym supervises the quality of retrieval steps. DeepRAG optimizes the when; RAG-Gym optimizes the how.
- Why do outcome-based reward models fail at intermediate step evaluation?
  Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.
  Connection: RAG-Gym is a domain-specific validation of the ORM/PRM trade-off. Outcome-only reward in retrieval creates the same false-negative problem (correct intermediate retrieval penalized by later errors) that ORMs exhibit in reasoning; process-level supervision provides the dense step feedback that PRMs enable.
Original note title: Process-level supervision substantially outperforms outcome-only reward for training agentic RAG systems.