RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and has recently been advanced by agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often rely on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re2Search, a novel agent that incorporates reasoning reflection and significantly outperforms standard prompting methods. For actor tuning, we evaluate three popular post-training algorithms under fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps.
Meanwhile, although various LLM post-training algorithms have been developed to enhance downstream performance, they are not directly suited to agentic RAG, where the model must dynamically adjust its token-generation strategy in response to newly retrieved context during the reasoning process. Recent works have adapted reinforcement learning with outcome-based rewards for agentic RAG [69, 33, 8]. However, by overlooking process-level supervision, these approaches risk generating suboptimal intermediate search actions and exhibit limited generalization to unseen data. Because the retrieval steps fundamentally shape the reasoning trajectory and ultimately influence the final answer, providing fine-grained supervision over these intermediate steps is essential for optimizing agentic RAG. Nevertheless, systematic analyses of how to optimize the language agent, and of best practices for enhancing overall agentic RAG performance, are still lacking.
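To make process-level supervision concrete, the sketch below illustrates one way a trained critic (a process reward model) could guide intermediate search actions at inference time: several candidate actions are sampled from the actor at each step, scored by the critic, and the highest-scoring one is executed. All names here (`actor`, `critic`, `retrieve`, `propose_action`) are illustrative placeholders under assumed interfaces, not the RAG-Gym API.

```python
# Minimal sketch of critic-guided step selection in an agentic RAG loop.
# The objects passed in are hypothetical; only the control flow is the point.

def select_search_action(actor, critic, state, num_candidates=4):
    """Sample candidate intermediate actions from the actor and keep the
    one that the critic scores highest (best-of-N over a single step)."""
    candidates = [actor.propose_action(state) for _ in range(num_candidates)]
    scores = [critic.score(state, action) for action in candidates]
    best = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best]


def run_agentic_rag(actor, critic, retrieve, question, max_steps=8):
    """Multi-round retrieval loop with per-step critic supervision."""
    state = {"question": question, "history": []}
    for _ in range(max_steps):
        action = select_search_action(actor, critic, state)
        if action["type"] == "answer":          # the agent decides to stop
            return action["content"]
        docs = retrieve(action["content"])      # issue the search query
        state["history"].append({"query": action["content"], "docs": docs})
    return actor.force_answer(state)            # fall back after max_steps
```

The same per-step critic scores can also be logged offline to label intermediate actions, which is the kind of fine-grained signal that outcome-only rewards do not provide.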
Our comprehensive experiments across three widely used LLM post-training algorithms reveal that fine-grained, process-level supervision substantially boosts performance, particularly when both positive and negative feedback are incorporated.
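For reference, applying direct preference optimization at the step level amounts to the standard DPO objective taken over per-step preference pairs; the formulation below is the generic loss, and the exact construction of pairs in RAG-Gym may differ:
\[
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(s_t,\, a_t^{+},\, a_t^{-})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(a_t^{+}\mid s_t)}{\pi_{\text{ref}}(a_t^{+}\mid s_t)} - \beta \log \frac{\pi_\theta(a_t^{-}\mid s_t)}{\pi_{\text{ref}}(a_t^{-}\mid s_t)}\right)\right],
\]
where $s_t$ is the intermediate state (the question together with the retrieval history), $a_t^{+}$ and $a_t^{-}$ are the positively and negatively rated candidate actions at step $t$, $\pi_{\text{ref}}$ is the frozen reference policy, and $\beta$ controls the strength of the preference margin. The presence of both $a_t^{+}$ and $a_t^{-}$ in each pair is what lets the objective exploit positive and negative feedback jointly.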