Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Paper · Source
RAG · Knowledge Graphs

Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground their responses in structured external knowledge from up-to-date knowledge graphs (KGs), reducing hallucinations. However, graph-based RAG often relies on a weak retriever: (I) because ground truth is lacking, the retriever is typically trained with weak supervision, which can introduce spurious signals to the LLMs; (II) because graph data is abstract, the retrieved knowledge is often presented in unorganized forms. To mitigate these issues, we present Refined graph-based RAG (ReG), which refines the weak supervision for graph-based RAG. Specifically, ReG incorporates LLM feedback to eliminate spurious signals and improve the quality of the supervision, and uses a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG consistently brings significant improvements across different LLM backbones, by up to 10%. The improved supervision quality enables ReG to match state-of-the-art performance with only 5% of the training data and to transfer to out-of-distribution KGs. Notably, when applied to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% while improving performance by up to 4%.

Despite this promise, graph-based RAG relies on weak retrievers that are misaligned with LLMs: (I) weak supervision: unlike text-based RAG, there is no general-purpose structural information retriever (Luo et al., 2025b), and ground truth for graph-based RAG is usually lacking. Hence, graph-based retrievers often require training on specific datasets with heuristic-based weak supervision signals (Zhang et al., 2022; Luo et al., 2023). These weak supervision signals can miss key supporting evidence or include spurious connections unrelated to the reasoning logic. Especially when a query requires multi-hop reasoning over KGs, missing key intermediate steps in the supervision severely limits the performance of the retriever. (II) misorganized representation: the retrieved graph information can be represented in a variety of forms and orders (Mavromatis & Karypis, 2024; Luo et al., 2023; Li et al., 2024a). Yet LLMs are highly sensitive to the ordering of context (Chen et al., 2024c; Guo et al., 2025c). Such misorganized representations add to the complexity of graph-based RAG and raise the question:
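The failure mode of heuristic weak supervision can be illustrated with a toy example. The KG, entities, and query below are invented for illustration and the shortest-path heuristic is a common sketch of such supervision, not the exact procedure of any cited work. The point: a spurious shortcut edge gets labeled as the "relevant" evidence, while the true multi-hop reasoning chain receives no positive signal.

```python
from collections import deque

# Toy KG as adjacency lists of (relation, object) pairs.
# All entities, relations, and the query are illustrative assumptions.
KG = {
    "Obama": [("born_in", "Honolulu"), ("spouse", "Michelle"),
              ("visited", "Chicago")],        # spurious shortcut edge
    "Honolulu": [("located_in", "Hawaii")],
    "Michelle": [("born_in", "Chicago")],
    "Chicago": [("located_in", "Illinois")],
}

def shortest_path_supervision(kg, topic_entity, answer_entity):
    """Heuristic weak supervision: label the shortest relation path
    (found by BFS) from the question's topic entity to the answer as
    'relevant'. Triples off this path get no positive signal, even if
    they carry the key intermediate evidence."""
    queue = deque([(topic_entity, [])])
    visited = {topic_entity}
    while queue:
        node, path = queue.popleft()
        if node == answer_entity:
            return path
        for rel, obj in kg.get(node, []):
            if obj not in visited:
                visited.add(obj)
                queue.append((obj, path + [(node, rel, obj)]))
    return None

# "Where was Obama's spouse born?" -> gold answer: Chicago
print(shortest_path_supervision(KG, "Obama", "Chicago"))
# [('Obama', 'visited', 'Chicago')]
```

Here the 1-hop `visited` edge beats the 2-hop `spouse → born_in` chain, so a retriever trained on these labels learns to fetch evidence unrelated to the reasoning logic.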

How can one align the weak retrievers to LLMs in graph-based RAG?

To address these issues, we present Refined graph-based RAG (ReG), which incorporates the rich knowledge of LLMs to refine and align the weak supervision in graph-based RAG. Essentially, we show that graph-based RAG can be cast as a black-box combinatorial search over the KG G: given a query q, the goal is to identify a minimal sufficient subgraph Ĝ* ⊆ G for an LLM to answer q correctly. Here, the LLM serves as a black-box evaluator that assesses the utility of retrieved subgraphs. Under this formulation, we show that solving the original black-box optimization problem exactly is computationally intractable under realistic LLM usage budgets. Therefore, ReG incorporates the LLMs in a simple yet effective way: it uses LLMs to select better reasoning chains among the candidate chains extracted from the KG. The improved supervision signal aids the identification of the optimal subgraph in a cost-efficient manner. To align the retrieval results with LLMs, ReG reorganizes the retrieved contents into logic-preserving chains, a simple step that significantly mitigates distraction and inefficiency during LLM reasoning.
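The two ingredients above can be sketched in a few lines. This is a minimal illustration under assumed interfaces, not the paper's implementation: `select_chain` treats the LLM as a black-box scorer over candidate chains (here replaced by a hypothetical `mock_llm_score`; a real system would prompt the LLM), and `reorganize` orders retrieved triples into a head-to-tail evidence chain, assuming they form a single path.

```python
def select_chain(query, candidates, llm_score):
    """LLM-feedback refinement (sketch): among candidate reasoning
    chains extracted from the KG, keep the one the black-box LLM
    evaluator rates highest, instead of trusting heuristic labels."""
    return max(candidates, key=lambda chain: llm_score(query, chain))

def reorganize(triples):
    """Structure-aware reorganization (sketch): order retrieved triples
    into a head-to-tail chain so the LLM sees a logically coherent
    sequence. Assumes the triples form a single path."""
    tails = {t for _, _, t in triples}
    start = next(h for h, _, _ in triples if h not in tails)
    by_head = {h: (h, r, t) for h, r, t in triples}
    chain, node = [], start
    while node in by_head:
        h, r, t = by_head.pop(node)
        chain.append((h, r, t))
        node = t
    return chain

# Hypothetical stand-in for the LLM judge: score a chain by how many
# of its relations echo words in the query.
def mock_llm_score(query, chain):
    words = query.lower()
    return sum(1 for _, rel, _ in chain if rel.split("_")[0] in words)

query = "Where was Obama's spouse born?"
candidates = [
    [("Obama", "visited", "Chicago")],                  # spurious shortcut
    [("Obama", "spouse", "Michelle"),
     ("Michelle", "born_in", "Chicago")],               # true reasoning chain
]
best = select_chain(query, candidates, mock_llm_score)
print(reorganize(best))
# [('Obama', 'spouse', 'Michelle'), ('Michelle', 'born_in', 'Chicago')]
```

Because the judge prefers the chain whose relations match the query's reasoning logic, the refined supervision keeps the 2-hop chain and discards the spurious shortcut, and the reorganized output presents it in logical order.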

Extensive experiments demonstrate that ReG achieves state-of-the-art results on prominent multi-hop knowledge graph question answering (KGQA) benchmarks. Notably, it yields retrievers with stronger zero-shot generalizability to out-of-distribution (OOD) KGs, mitigating the lack of foundation models for graph-based RAG.