Can knowledge graphs built at inference time outperform pre-built retrieval augmented generation?
This explores whether building knowledge graphs on the fly — at the moment a question is asked — beats the standard approach of constructing them in advance, and what the corpus says about the tradeoff between the two.
The corpus has a direct answer and a more interesting set of surrounding ideas about *why* query-time construction might win. The most on-the-nose finding is LogicRAG, which constructs a directed acyclic graph from the query itself at inference time rather than pre-building one across the whole corpus (see: Can query-time graph construction replace pre-built knowledge graphs?). The pitch is that pre-built graphs carry three taxes: the cost of constructing them, staleness as the underlying data changes, and inflexibility, because a graph built for the average query isn't shaped for *your* query. Building per-query sidesteps all three while keeping the multi-hop reasoning that graphs are good for.
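To make the per-query idea concrete, here is a minimal sketch of resolving a question through a dependency graph built only for that question. It is not LogicRAG's implementation: the decomposition is hand-written, and `resolution_order`, `answer_query`, and `stub_solve` are hypothetical names standing in for the model-driven steps.

```python
from graphlib import TopologicalSorter


def resolution_order(dependencies):
    """Return subproblems ordered so each one follows everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def answer_query(dependencies, solve):
    """Solve subproblems in dependency order, passing earlier answers forward."""
    answers = {}
    for subproblem in resolution_order(dependencies):
        context = {dep: answers[dep] for dep in dependencies[subproblem]}
        answers[subproblem] = solve(subproblem, context)
    return answers


# Hypothetical decomposition of one two-hop question; in a real system a
# model would produce this mapping from the query text.
dependencies = {
    "who directed the film": set(),
    "where did that director study": {"who directed the film"},
}


def stub_solve(subproblem, context):
    # Stand-in for retrieval plus generation: it only records what was visible.
    return f"answer({subproblem}; given {sorted(context)})"


answers = answer_query(dependencies, stub_solve)
```

The graph here exists only for the lifetime of one query, so there is nothing to keep fresh and nothing to build ahead of time. The cost moves into the decomposition step instead.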
But the corpus reframes the question in a useful way: the real choice may not be "build early vs. build late" so much as "match the structure to the task." StructRAG argues that no single knowledge structure is universally best. It trains a router with DPO to pick among tables, graphs, algorithms, catalogues, and plain chunks depending on what the query demands, grounding this in cognitive load and cognitive-fit theory from cognitive science (see: Can routing queries to task-matched structures improve RAG reasoning?). Seen this way, inference-time construction wins precisely *because* it can be query-specific, not because graphs beat RAG in the abstract. That theme echoes across the collection: retrieval should adapt dynamically and stay tightly coupled to reasoning rather than follow a fixed pipeline (see: How should systems retrieve and reason with external knowledge?), and separating query planning from answer synthesis into distinct stages outperforms flat retrieval on hard multi-hop questions (see: Do hierarchical retrieval architectures outperform flat ones on complex queries?).
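The routing idea can be sketched as a function from a query to one of the five structure types. The version below swaps the trained router for a plain keyword heuristic, so it shows only the interface and not how StructRAG decides. The cue words and the mapping from cues to structures are assumptions made for illustration.

```python
from enum import Enum


class Structure(Enum):
    TABLE = "table"
    GRAPH = "graph"
    ALGORITHM = "algorithm"
    CATALOGUE = "catalogue"
    CHUNK = "chunk"


# Hypothetical cue words; a trained router would learn this mapping instead.
CUES = [
    (Structure.TABLE, ("compare", "versus", "highest", "lowest")),
    (Structure.GRAPH, ("related to", "connected", "caused by", "chain")),
    (Structure.ALGORITHM, ("how do i", "steps", "procedure")),
    (Structure.CATALOGUE, ("list all", "summarize", "overview")),
]


def route(query: str) -> Structure:
    """Pick the first structure whose cue appears in the query, else plain chunks."""
    lowered = query.lower()
    for structure, cues in CUES:
        if any(cue in lowered for cue in cues):
            return structure
    return Structure.CHUNK
```

For example, `route("How is protein X connected to disease Y?")` returns `Structure.GRAPH`, while a simple lookup question falls through to `Structure.CHUNK`. The fallback matters: plain chunks remain the default when nothing suggests a richer structure is worth building.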
The catch is that pre-built graphs aren't just overhead to be eliminated; their explicit structure is doing real work. SymAgent derives symbolic navigational rules from a graph's topology, beating retrieval methods that lean only on semantic similarity, because the graph encodes relationships that embeddings blur together (see: Can symbolic rules from knowledge graphs guide complex reasoning?). And there's a hard limit on what you can skip: on the LOFT benchmark, long-context LLMs can absorb a corpus and match RAG on semantic lookup, but they fail on structured relational queries that require joining across tables. Context length alone can't fake structure (see: Can long-context LLMs replace retrieval-augmented generation systems?). So a query-time graph still has to actually reconstruct the relational scaffolding; it just does so on demand instead of in advance.
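A small example shows what "requires a join" means in practice. The tables and rows below are invented for illustration and are not taken from LOFT. The point is that the answer exists in neither table alone, so it cannot be found by looking up a single similar passage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE papers (id INTEGER PRIMARY KEY, author_id INTEGER, year INTEGER);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO papers VALUES (10, 1, 2021), (11, 1, 2023), (12, 2, 2023);
    """
)

# "How many papers did each author publish in 2023?" needs names from one
# table and years from the other, linked through author_id.
rows = conn.execute(
    """
    SELECT a.name, COUNT(*)
    FROM authors AS a
    JOIN papers AS p ON p.author_id = a.id
    WHERE p.year = 2023
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()
conn.close()
```

Whatever builds the structure, in advance or per query, has to recover the `author_id` link explicitly before a question like this can be answered.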
The more surprising thread is that pre-built knowledge graphs may earn their keep not at retrieval time at all, but at *training* time. One line of work fine-tunes a 32B model on 24,000 reasoning tasks derived from paths through a medical knowledge graph and reaches state-of-the-art across 15 medical domains; the conclusion is that structured composition matters more than raw scale (see: Can knowledge graphs teach models deep domain expertise?). Another uses random walks through a graph, with entities selectively obscured, to mint hard multi-hop questions that train search agents (see: Can knowledge graphs generate training data for search agents?). So the deeper answer is that "inference-time vs. pre-built" may be a false binary: build the graph once to *teach* the model the shape of a domain, then build lightweight graphs per-query to *navigate* it.
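The random-walk recipe can be sketched in a few lines. This is a toy under stated assumptions, not the published pipeline: the graph `KG`, the helper names, and the question template are all hypothetical, and a real generator would paraphrase the path into natural language.

```python
import random

# Hypothetical toy graph: entity -> list of (relation, neighbor) edges.
KG = {
    "aspirin": [("treats", "headache")],
    "headache": [("symptom_of", "migraine")],
    "migraine": [("managed_by", "neurologist")],
    "neurologist": [],
}


def random_walk(graph, start, hops, rng):
    """Follow up to `hops` randomly chosen edges, returning (head, relation, tail) triples."""
    path = []
    node = start
    for _ in range(hops):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, neighbor = rng.choice(edges)
        path.append((node, relation, neighbor))
        node = neighbor
    return path


def make_question(path, start_description):
    """Obscure the entities on the path: only a vague start and the relations stay visible."""
    relations = " -> ".join(relation for _, relation, _ in path)
    question = f"Starting from {start_description}, follow: {relations}. What do you reach?"
    answer = path[-1][2]  # the final entity is known, so the answer is verifiable
    return question, answer


path = random_walk(KG, "aspirin", hops=3, rng=random.Random(0))
question, answer = make_question(path, "a common over-the-counter painkiller")
```

Because the walk records its own endpoint, every generated question comes with a checkable answer, which is what makes this kind of data usable as a training signal.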
What you didn't know you wanted to know: the strongest argument for inference-time construction isn't speed or cost. It's that a fresh per-query graph can be shaped to the exact reasoning the question needs, the same insight (match the structure to the task) that the whole adaptive-retrieval literature keeps rediscovering from different angles.
Sources (8 notes)
LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.
Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.
The LOFT benchmark shows long-context language models (LCLMs) match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
KG-based random walks with selective entity obscuring create verifiable, multi-hop questions that train deep search agents effectively. DeepDive-32B trained on this data achieves 14.8% on BrowseComp, outperforming larger models through end-to-end multi-turn RL.