Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities

Paper · arXiv 2505.20099 · Published May 26, 2025

LLM-based QA struggles with complex QA tasks due to limited reasoning capacity, outdated knowledge, and hallucinations. Several recent works synthesize LLMs and knowledge graphs (KGs) for QA to address these challenges. In this survey, we propose a new structured taxonomy that categorizes methods for synthesizing LLMs and KGs for QA according to the category of QA task and the KG's role in the integration.

Complex QA usually involves knowledge interaction and fusion across modalities and sources, as well as a deep understanding of complex queries and user interactions, and RAG-based QA faces the following technical challenges here. (1) Knowledge conflicts: Inconsistent and overlapping knowledge between LLMs and external sources can conflict during fusion, which in turn yields inconsistent answers. (2) Poor relevance and quality of retrieved context: The accuracy of generated answers in RAG-based QA largely depends on the relevance and quality of the retrieved context; irrelevant context leads to incorrect results. (3) Lack of iterative and multi-hop reasoning: RAG-based QA struggles to generate accurate and explainable answers for questions requiring global, summarized context because it lacks iterative, multi-hop reasoning.

GraphRAG- and KG-RAG-based QA approaches introduce several modules, such as knowledge integration and fusion, reasoning guidelines, and knowledge validation and refinement, to mitigate these challenges.

2 Complex QA

The methodology for synthesizing LLMs and KGs for QA is surveyed as follows.

2.1 Multi-document QA

Multi-document QA refers to QA over contexts drawn from multiple documents; efficiently and effectively retrieving relevant knowledge from multiple contexts is its main technical challenge. To reduce retrieval latency and improve the quality of retrieved context for multi-document QA, KGP (Wang et al., 2024d) introduces an LLM-based graph traversal agent that retrieves relevant knowledge from a KG. Similarly, CuriousLLM (Yang and Zhu, 2025) combines knowledge graph prompting, a reasoning-infused LLM agent, and a graph traversal agent to augment LLMs for multi-document QA. VisDom (Suri et al., 2024) introduces a novel multimodal RAG for multi-document QA that integrates and fuses multi-modal knowledge and leverages Chain-of-Thought (CoT)-based reasoning.
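The budgeted, relevance-guided traversal used by such agents can be sketched as a greedy graph walk. In the sketch below, `score_fn` stands in for the LLM-based relevance scorer, and all names are illustrative assumptions rather than KGP's or CuriousLLM's actual APIs:

```python
def traverse_kg(graph, seeds, score_fn, budget=10):
    """Greedy, budgeted traversal of a passage/entity graph.

    `graph` maps a node to its neighbours; `score_fn` stands in for
    an LLM-based relevance scorer (hypothetical). At each step the
    most promising frontier node is expanded, so retrieval cost is
    bounded by `budget` rather than by graph size."""
    visited = set(seeds)
    frontier = list(seeds)
    retrieved = []
    while frontier and len(retrieved) < budget:
        frontier.sort(key=score_fn, reverse=True)   # best node first
        node = frontier.pop(0)
        retrieved.append(node)
        for nbr in graph.get(node, []):             # expand neighbours
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    return retrieved
```

Because traversal stops after `budget` nodes, latency stays bounded even on large document graphs, which is the efficiency argument these methods make.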

2.2 Multi-modal QA

Multi-modal QA refers to QA over multi-modal data; visual QA (VQA) is a typical instance. To retrieve the most relevant knowledge from an external KG for enhancing VQA, MMJG (Wang et al., 2022) introduces adaptive knowledge selection, jointly selecting visual and textual knowledge via knowledge-aware attention and multi-modal guidance. To effectively retrieve evidence from multi-modal data, RAMQA (Bai et al., 2025) enhances multi-modal retrieval-augmented QA by combining learning-to-rank with generative-model training via multi-task learning. KVQA (Dong et al., 2024b) integrates LLMs with multi-modal knowledge, using two-stage prompting and a pseudo-siamese graph medium fusion to balance intra-modal and inter-modal reasoning.

2.3 Multi-hop QA

Multi-hop QA differs from simple QA in that it usually involves multi-step reasoning to generate the final answer. The basic idea is to decompose a multi-hop question into multiple single-hop questions, generate an answer for each single-hop question, and finally integrate them (Linders and Tomczak, 2025). For instance, GraphLLM (Qiao et al., 2024) leverages LLMs to decompose the multi-hop question into several simple sub-questions, then retrieves sub-graphs via GNNs and LLMs to answer the sub-questions through graph reasoning. HOLMES (Panda et al., 2024) enhances LLMs for multi-hop QA with a context-aware, pruned hyper-relational KG constructed from an entity-document graph. To enable accurate fact retrieval and reasoning for multi-hop QA, GMeLLo (Chen et al., 2024a) integrates the explicit knowledge of KGs with the linguistic knowledge of LLMs through fact triple extraction, relation chain extraction, and query and answer generation.
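The decompose-then-answer pattern can be sketched as a small pipeline. Here `decompose`, `answer_single_hop`, and `compose` are hypothetical stand-ins for the LLM- and KG-backed components, and the `#1`-style placeholders for earlier answers are an illustrative convention, not a notation from the cited papers:

```python
def answer_multi_hop(question, decompose, answer_single_hop, compose):
    """Decompose-then-answer pipeline for multi-hop QA.

    `decompose`, `answer_single_hop`, and `compose` are hypothetical
    stand-ins for LLM/KG-backed components. Sub-questions may refer
    to earlier answers with `#1`, `#2`, ... placeholders."""
    sub_answers = []
    for sq in decompose(question):
        for i, ans in enumerate(sub_answers):       # fill placeholders
            sq = sq.replace(f"#{i + 1}", ans)
        sub_answers.append(answer_single_hop(sq))
    return compose(question, sub_answers)
```

The placeholder substitution is what chains the hops: the answer to hop 1 becomes part of the query for hop 2, and so on.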

2.4 Multi-turn and Conversational QA

The challenge of multi-run and conversational QA lies in how to make the language model (LM) easily understand the questions and intermediate interactions. To make user interactions easily understood by machines, CoRnNetA (Liu et al., 2024b) introduces LLM-based question reformulations, reinforcement learning agents, and soft reward mechanisms to improve the interpretation of multi-turn interactions with KGs. The conversational QA involves several multi-run QA to refine and get accurate answers through multiple rounds of interactions The knowledge aggregation module and graph reasoning are introduced for joint reasoning between the graph and LLMs (Jain and Lapata, 2024) to address the challenges of understanding the question and context for conversational QA. To improve the contextual understanding and the answer quality for conversational QA, SELF-multi- RAG (Roy et al., 2024) leverages LLMs to retrieve from the summarized conversational history and reuse the retrieved knowledge for augmentation.
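The summarize-then-retrieve loop can be sketched minimally as follows. This is not SELF-multi-RAG itself; `summarize` and `retrieve` are hypothetical stand-ins for an LLM summarizer and a retriever, and the cache illustrates reuse of retrieved knowledge across turns:

```python
class ConversationalRAG:
    """Minimal sketch of history-aware retrieval for conversational QA.

    `summarize` and `retrieve` are hypothetical stand-ins for an LLM
    summarizer and a retriever (this is not SELF-multi-RAG itself)."""

    def __init__(self, summarize, retrieve):
        self.summarize = summarize
        self.retrieve = retrieve
        self.history = []
        self.cache = {}          # reuse retrieved knowledge across turns

    def ask(self, question):
        summary = self.summarize(self.history) if self.history else ""
        query = f"{summary} {question}".strip()
        if query not in self.cache:
            self.cache[query] = self.retrieve(query)
        self.history.append(question)
        return self.cache[query]
```

Conditioning the retrieval query on a summary of prior turns, rather than on the raw transcript, is what lets later questions resolve anaphora ("What about its sequel?") against earlier context.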

2.5 Explainable QA

Explainable QA (XQA) aims to provide explanations for generated answers based on reasoning over factual KGs. To effectively integrate multiple sources of knowledge for XQA, RoHT (Zhang et al., 2023) introduces a two-stage method that performs probabilistic reasoning over a Hierarchical Question Decomposition Tree (HQDT) constructed from the aggregated knowledge. To trace provenance and improve the explainability of answers, EXPLAIGNN (Christmann et al., 2023) constructs a heterogeneous graph from retrieved KB knowledge and user explanations, and generates explanatory evidence via a GNN with question-level attention. RID (Feng et al., 2025) directly integrates unsupervised retrieval with LLMs via reinforcement-learning-driven knowledge distillation.

5 Open Challenges and Opportunities

We summarize the open challenges, highlighting opportunities and discussing future directions. Scaling to both Effectiveness and Efficiency. LLM+KG systems must retrieve facts and perform multi-hop reasoning under tight latency and memory budgets. Three bottlenecks are emerging:

(1) Structure-aware retrieval: Vanilla dense or sparse retrieval treats a KG as an unordered set of triples, discarding topological cues that are vital for pruning the search space (Tian et al., 2025). Hierarchical graph partitioning, dynamic neighbourhood expansion, and learned path-prior proposal networks are promising ways to expose structure to the retriever while keeping the index sub-linear.
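Dynamic neighbourhood expansion can be sketched as thresholded frontier growth over the KG. In this sketch, `path_prior` stands in for a learned path-prior network, and the data layout is an illustrative assumption:

```python
def expand_neighbourhood(kg, seeds, path_prior, threshold=0.5, max_hops=2):
    """Dynamic neighbourhood expansion: follow an edge only when its
    path-prior score clears a threshold, keeping the subgraph small.

    `kg` maps an entity to (relation, neighbour) pairs; `path_prior`
    stands in for a learned path-prior network (hypothetical)."""
    subgraph = set()
    frontier = set(seeds)
    for _ in range(max_hops):
        next_frontier = set()
        for ent in frontier:
            for rel, nbr in kg.get(ent, []):
                if path_prior(ent, rel, nbr) >= threshold:
                    subgraph.add((ent, rel, nbr))   # keep this edge
                    next_frontier.add(nbr)
        frontier = next_frontier
    return subgraph
```

Pruning at expansion time, rather than after retrieval, is what keeps the candidate subgraph (and hence the prompt) sub-linear in graph size.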

(2) Amortized reasoning: Current prompting pipelines repeatedly query the KG for every beam or CoT step. Caching subgraphs, reusing intermediate embeddings, and exploiting incremental-compute-friendly hardware can mitigate the quadratic blow-up of iterative reasoning.
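Caching subgraphs across beams or CoT steps amounts to memoizing the KG accessor. A minimal sketch, with `fetch_subgraph` as a hypothetical KG lookup:

```python
def make_cached_fetch(fetch_subgraph):
    """Memoize an expensive KG lookup so repeated queries across
    beams / CoT steps hit a cache instead of the store.

    `fetch_subgraph` is a hypothetical KG accessor; `stats` counts
    how often the backend is actually hit."""
    cache = {}
    stats = {"backend_calls": 0}

    def cached(entity):
        if entity not in cache:
            stats["backend_calls"] += 1
            cache[entity] = fetch_subgraph(entity)
        return cache[entity]

    return cached, stats
```

If k beams each revisit the same handful of entities over n reasoning steps, the backend sees each entity once instead of O(k·n) times, which is the amortization being argued for.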

(3) Lightweight answer validation: Most guardrails rely on large LLMs, whereas lightweight verifiers such as probabilistic logic programs or Bloom-filter sketches could provide on-device verification with O(1) additional parameters. An open opportunity is to co-design the retriever and validator so that uncertainty estimates from the former guide selective execution of the latter.

Knowledge Alignment and Dynamic Integration. Once a KG snapshot is injected into an LLM, it begins to go stale, since real-world KGs continually add new entities, delete relations, and resolve contradictions. Future work should: (1) Quantify alignment: We lack metrics that score not only semantic overlap but also structural compatibility between parametric knowledge in the LLM and symbolic knowledge in the KG. Contrastive probing with synthetic counterfactuals or topology-aware alignment losses may fill this gap. (2) Facilitate real-time updates: Parameter-efficient tuning (e.g., LoRA modules keyed by graph deltas) and retrieval-time patching (streaming KGs with temporal indices) are early steps toward stream-time knowledge alignment. (3) Detect and resolve conflicts: Bayesian trust networks, source-aware knowledge distillation, and multi-agent debate protocols can estimate and reconcile confidence scores across modalities and sources. Incorporating these into the decoding objective is an open challenge with a high pay-off.
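The simplest form of the conflict-resolution idea is a trust-weighted vote over conflicting triples. The claim format and `trust` scores below are illustrative assumptions, not a method from the surveyed papers:

```python
from collections import defaultdict

def resolve_conflicts(claims, trust):
    """Trust-weighted vote over conflicting triples.

    `claims` is a list of (subject, relation, value, source) tuples;
    `trust` maps a source to a confidence score. Both the claim
    format and the scores are illustrative assumptions."""
    votes = defaultdict(float)
    for subj, rel, val, src in claims:
        votes[(subj, rel, val)] += trust.get(src, 0.0)
    best = {}                          # (subject, relation) -> winner
    for (subj, rel, val), weight in votes.items():
        if (subj, rel) not in best or weight > best[(subj, rel)][1]:
            best[(subj, rel)] = (val, weight)
    return {key: val for key, (val, _) in best.items()}
```

Richer schemes (Bayesian trust networks, multi-agent debate) replace the static `trust` table with estimated, per-claim confidences, but the reconciliation step has this same argmax shape.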

Explainable and Fairness-Aware QA. The scale of LLMs poses challenges to explainability and fairness in QA. While integrating KGs offers a path toward interpretable reasoning, it also introduces computational challenges and fairness concerns. Future work may consider the following directions: (1) Reasoning over subgraphs: Retrieving subgraphs from large-scale KGs is computationally expensive and often yields overly complex or incomprehensible explanations. Structure-aware retrieval and reranking methods should be employed to identify subgraphs consistent with the gold paths, and CoT-based prompting can guide LLMs to generate explicit reasoning steps grounded in the retrieved subgraphs. (2) Fairness-aware knowledge retrieval: LLMs can absorb social biases from training data, and KGs may contain incomplete or biased knowledge, so fairness concerns remain in RAG (Wu et al., 2024b). Incorporating fairness-aware techniques into KG retrieval (e.g., reranking based on bias detection) and combining them with counterfactual prompting can mitigate bias. (3) Multi-turn QA: Single-turn QA restricts the exploration of diverse perspectives and of the reasoning process. Developing multi-turn QA with retrieval strategies that dynamically detect and adjust for bias could further improve explainability and fairness through multi-turn interactions.
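One simple instantiation of fairness-aware reranking is to penalize a passage's relevance score by a bias estimate before ordering candidates. Here `relevance` and `bias_score` are hypothetical learned scorers, and `alpha` is an assumed trade-off weight:

```python
def fairness_rerank(passages, relevance, bias_score, alpha=0.5):
    """Rerank retrieved passages by relevance minus a bias penalty.

    `relevance` and `bias_score` stand in for learned scorers
    (hypothetical); `alpha` trades relevance against fairness."""
    scored = [(p, relevance(p) - alpha * bias_score(p)) for p in passages]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in scored]
```

A linear penalty keeps the reranker transparent: the explanation for a demotion is simply the bias score that caused it, which fits the explainability goal of this section.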