Knowledge Retrieval and RAG

Research on integrating external knowledge into LLMs through retrieval-augmented generation, knowledge graphs, and question answering systems. This community studies when and how retrieval helps or hurts reasoning, and how to effectively combine structured and unstructured knowledge sources.

51 notes (primary) · 190 papers · 4 sub-topics
Retrieval-Augmented Generation (RAG)

19 notes

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.
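
A minimal sketch of one such trigger, assuming hypothetical `generate_step` and `retrieve` helpers that stand in for an LM exposing token log-probabilities and for a retriever: retrieval fires only when the weakest token in a drafted sentence falls below a confidence threshold.

```python
import math

def generate_step(prompt):
    """Placeholder LM call: returns (next_sentence, min_token_logprob)."""
    return "Draft sentence about the topic.", math.log(0.35)

def retrieve(query, k=3):
    """Placeholder retriever."""
    return [f"passage relevant to: {query}"]

def generate_with_triggered_retrieval(question, threshold=math.log(0.5), max_steps=4):
    answer, context = "", []
    for _ in range(max_steps):
        sentence, min_logprob = generate_step(" ".join([question] + context) + answer)
        if min_logprob < threshold:       # uncertainty signal fires
            context = retrieve(sentence)  # fetch evidence, then redraft
            sentence, _ = generate_step(" ".join([question] + context) + answer)
        answer += " " + sentence
    return answer.strip()
```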

Can retrieval be scaled like reasoning at test time?

Standard RAG retrieves once, but multi-hop tasks need adaptive retrieval. Can we train models to plan retrieval chains and vary their length at test time to improve accuracy, the way test-time scaling works for reasoning?
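
A sketch of the control loop this question implies, with `plan_next` and `retrieve` as hypothetical placeholders for a trained planner and a retriever. The `budget` parameter plays the role extra compute plays in test-time scaling for reasoning: harder questions get longer retrieval chains.

```python
def plan_next(question, evidence):
    """Placeholder planner: returns ('search', query) or ('answer', text)."""
    if len(evidence) < 2:
        return "search", f"follow-up query about {question}"
    return "answer", "answer synthesized from evidence"

def retrieve(query):
    """Placeholder retriever."""
    return [f"doc for: {query}"]

def retrieval_chain(question, budget=4):
    evidence = []
    for _ in range(budget):  # chain length varies per question, capped by budget
        action, payload = plan_next(question, evidence)
        if action == "answer":
            return payload, evidence
        evidence += retrieve(payload)
    return "best-effort answer", evidence
```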

Can you adapt retrieval models without accessing target data?

Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.

What do enterprise RAG systems need beyond accuracy?

Academic RAG benchmarks focus on question-answering accuracy, but enterprise deployments in regulated industries face five distinct requirements—compliance, security, scalability, integration, and domain expertise—that standard architectures don't address.

Can fine-tuning replace query augmentation for retrieval?

Query augmentation helps retrievers handle ambiguous queries but increases input cost. Does fine-tuning the retrieval model achieve comparable performance without this overhead?

Can long-context models resolve retriever-reader imbalance?

Traditional RAG systems forced retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?

Can query-time graph construction replace pre-built knowledge graphs?

Does building dependency graphs from individual queries at inference time offer a more flexible and cost-effective alternative to constructing knowledge graphs over entire document collections upfront?

Can retrieval learn what actually helps answer questions?

Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.

Can knowledge graphs enable multi-hop reasoning in one retrieval step?

Standard RAG retrieves once but misses chains; iterative RAG follows chains but costs more. Can we encode multi-hop paths in a knowledge graph so one retrieval pass discovers them all?
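
One way to make the single-pass idea concrete is to materialize multi-hop paths at index time, so that a lookup on the head entity returns whole chains. A toy illustration over hand-written triples:

```python
from collections import defaultdict

triples = [("aspirin", "treats", "headache"),
           ("headache", "symptom_of", "migraine")]

adj = defaultdict(list)
for s, r, o in triples:
    adj[s].append((r, o))

# Materialize 2-hop paths up front so one retrieval step surfaces the
# whole chain, instead of needing a second, iterative round.
path_index = defaultdict(list)
for s, r1, o1 in triples:
    path_index[s].append([(s, r1, o1)])
    for r2, o2 in adj[o1]:
        path_index[s].append([(s, r1, o1), (o1, r2, o2)])

print(path_index["aspirin"])  # includes the aspirin -> headache -> migraine chain
```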

Can long-context LLMs replace retrieval-augmented generation systems?

Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.

Can a model's partial response guide what to retrieve next?

Can generation reveal implicit information needs that the original query cannot express? This explores whether using in-progress responses as retrieval signals outperforms upfront query formulation.
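
A minimal sketch of the lookahead idea, with `draft_continuation` and `retrieve` as hypothetical placeholders: the model drafts its next sentence without evidence, and that draft, rather than the user's original question, becomes the search query.

```python
def draft_continuation(prompt):
    """Placeholder LM call: draft the next sentence without evidence."""
    return "Its economy relies heavily on lithium exports."

def retrieve(query):
    """Placeholder retriever."""
    return [f"passage matching: {query}"]

def lookahead_retrieval(question, partial_answer):
    # The draft expresses an information need the question never stated,
    # so it, not the question, drives the next retrieval.
    draft = draft_continuation(question + " " + partial_answer)
    return retrieve(draft), draft
```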

Does question type determine the right retrieval strategy?

Explores whether different non-factoid question types require distinct retrieval and decomposition approaches. Matters because standard RAG fails when applied uniformly to debate, comparison, and experience questions despite being effective for factoid queries.

Does supervising retrieval steps outperform final answer rewards?

Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways—short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.
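
One established mitigation is asymmetric encoding: models in the E5 family, for instance, are trained with role prefixes so that queries and passages are embedded differently. A minimal sketch, assuming the `sentence-transformers` package and the `intfloat/e5-small-v2` checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-small-v2")

# The prefix tells the encoder which side of the query/document
# distribution it is embedding.
q = model.encode("query: why is the sky blue?", normalize_embeddings=True)
d = model.encode("passage: Rayleigh scattering in the atmosphere "
                 "scatters short wavelengths of sunlight more strongly.",
                 normalize_embeddings=True)

print(util.cos_sim(q, d))
```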

Can rationale-driven selection beat similarity re-ranking for evidence?

Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and is brittle under adversarial attacks.

When should retrieval actually help versus hurt reasoning?

Retrieval augmentation seems universally beneficial, but does it always improve reasoning? This explores whether some reasoning steps benefit from internal knowledge alone, and when external retrieval introduces harmful noise rather than useful information.

Can document count be learned instead of fixed in RAG?

Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?
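
A stateless bandit is the simplest way to illustrate the feedback loop; the versions this question points at would condition the choice of k on the query (a contextual bandit or full RL), and `answer_is_correct` below is a placeholder for running the generator and scoring its output.

```python
import random
from collections import defaultdict

K_CHOICES = [1, 3, 5, 10]
value, count = defaultdict(float), defaultdict(int)

def answer_is_correct(question, docs):
    """Placeholder: run the generator on the docs and score the answer."""
    return random.random() < 0.5

def choose_k(eps=0.1):
    if random.random() < eps:                      # explore
        return random.choice(K_CHOICES)
    return max(K_CHOICES, key=lambda k: value[k])  # exploit

def update(k, reward):
    count[k] += 1
    value[k] += (reward - value[k]) / count[k]     # incremental mean

for question in ["q1", "q2", "q3"]:
    k = choose_k()
    docs = [f"doc{i}" for i in range(k)]           # placeholder retrieval
    update(k, 1.0 if answer_is_correct(question, docs) else 0.0)
```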

Why does retrieval-augmented generation fail in production?

RAG systems work in controlled demos but break down in real-world deployment, particularly for high-stakes domains like medicine and finance. Understanding the structural reasons behind these failures matters for building reliable AI systems.

Do vector embeddings actually measure task relevance?

Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?
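
The divergence is easy to probe directly. A sketch, assuming the `sentence-transformers` package; the scores, and whether the king/queen pair actually wins, depend entirely on the embedding model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any small encoder works here

pairs = [("king", "queen"),   # semantically parallel roles
         ("king", "ruler")]   # the topically relevant match for a "rulers" query

for a, b in pairs:
    ea, eb = model.encode([a, b], normalize_embeddings=True)
    print(a, b, float(util.cos_sim(ea, eb)))
# If the first pair scores higher, similarity-based retrieval will prefer
# "queen" passages for a query about rulers: similarity is not relevance.
```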

RAG Variants and Taxonomy

13 notes

Can learned traversal policies beat exhaustive graph reading?

As knowledge graphs grow, can agents learn which nodes to explore rather than ingesting entire subgraphs? This explores whether MCTS and reinforcement learning can address the context-window constraint better than dumping whole graphs into the LLM.
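
A budgeted best-first traversal is the baseline this question starts from; `score` below is a deterministic stand-in for the learned value function (or MCTS rollout estimate) that would guide a real agent.

```python
import heapq

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": []}
text = {n: f"text of node {n}" for n in graph}

def score(query, node):
    """Placeholder relevance model; the learned policy would go here."""
    return -ord(node[0])

def traverse(query, start="A", budget=3):
    # Expand only the most promising frontier nodes rather than dumping
    # the whole subgraph into the LLM context.
    frontier = [(-score(query, start), start)]
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        _, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        collected.append(text[node])
        for nb in graph[node]:
            heapq.heappush(frontier, (-score(query, nb), nb))
    return collected

print(traverse("example query"))
```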

Can we defend RAG systems from corpus poisoning without retraining?

Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
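
One family of retrieval-time defenses treats poisoned passages as outliers among the top-k results. A sketch with a random-projection `embed` placeholder; any real embedder would slot in instead:

```python
import numpy as np

def embed(doc):
    """Placeholder embedder returning a pseudo-random unit vector per doc."""
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def filter_outliers(docs, z_thresh=2.0):
    # A passage crafted to win retrieval often sits far from the consensus
    # of the other top-k results; drop such outliers before generation.
    embs = np.stack([embed(d) for d in docs])
    dists = np.linalg.norm(embs - embs.mean(axis=0), axis=1)
    z = (dists - dists.mean()) / (dists.std() + 1e-8)
    return [d for d, zi in zip(docs, z) if zi < z_thresh]
```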

Can visual similarity alone guide robot object retrieval?

Visual retrieval works for text QA but fails for embodied agents—the most visually similar object may be unreachable or locked. Should retrieval systems for robots rank by what the agent can physically execute instead?

Can RAG systems safely learn from their own generated answers?

Explores whether retrieval-augmented generation can feed its outputs back into the corpus without corrupting knowledge with hallucinations. The core problem: how to prevent feedback loops from compounding errors.

Can building a document map first improve retrieval over long texts?

Does constructing a global summary before retrieval help RAG systems connect scattered evidence in long documents the way human readers do? This tests whether understanding document structure improves what gets retrieved.

Can RAG systems refuse to answer without reliable evidence?

Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical texts, where OCR errors and language drift are common.
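
A minimal abstention wrapper, with `support_score` standing in for an NLI model or other verifier: the system answers only when enough passages clear a support threshold, and otherwise declines.

```python
def support_score(question, passage):
    """Placeholder verifier scoring how well the passage supports an answer."""
    return 0.4

def answer_or_abstain(question, passages, min_support=0.6, min_sources=2):
    supported = [p for p in passages if support_score(question, p) >= min_support]
    if len(supported) < min_sources:
        return "Insufficient reliable evidence to answer."
    return f"answer grounded in {len(supported)} sources"  # placeholder generation

print(answer_or_abstain("Who founded the guild?", ["noisy OCR passage"]))
```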

Can smaller models handle RAG filtering while larger models focus on synthesis?

Does dividing RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question is whether different pipeline stages have different optimal model sizes.
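
A sketch of the two-tier split, with both model calls as placeholders: the cheap model sees every retrieved chunk, while the expensive model sees only what survives filtering.

```python
def small_model_is_relevant(question, chunk):
    """Placeholder: cheap model used as a yes/no relevance filter."""
    return "tax" in chunk

def large_model_synthesize(question, chunks):
    """Placeholder: expensive model called once on the filtered context."""
    return f"answer from {len(chunks)} chunks"

def cascade(question, retrieved):
    # Filtering is high-volume but easy; synthesis is low-volume but hard.
    kept = [c for c in retrieved if small_model_is_relevant(question, c)]
    return large_model_synthesize(question, kept)

print(cascade("How is the tax computed?", ["tax rules ...", "office hours ..."]))
```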

Can hypergraphs capture multi-hop reasoning better than graphs?

Explores whether organizing retrieved facts as hyperedges—connecting multiple entities at once—lets multi-step reasoning preserve higher-order relations that binary edges must break apart, and whether the added complexity pays off.
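
The representational difference shows up even in miniature: a hyperedge is a set of arbitrary arity mapped to a fact, so an n-ary relation never has to be decomposed into pairwise edges. A toy sketch:

```python
# A binary-edge graph must split "X and Y interact in patients with Z"
# into pairwise edges, losing the three-way constraint; a hyperedge
# keeps every participant of the fact together.
hyperedges = {
    frozenset({"drug_x", "drug_y", "condition_z"}):
        "X and Y interact only in patients with condition Z",
}

def facts_mentioning(entities):
    """Return facts whose participant set overlaps the query entities."""
    q = set(entities)
    return [fact for members, fact in hyperedges.items() if q & members]

print(facts_mentioning(["drug_x", "condition_z"]))
```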

How can video retrieval handle multiple modalities at different times?

Video RAG systems struggle because the same content appears across visual, audio, and subtitle tracks at offset timestamps. Can temporal awareness in text ranking and frame sampling solve cross-modal misalignment?

Can multimodal knowledge graphs answer questions that flat retrieval cannot?

Can organizing entities and relations from text and images into hierarchical knowledge graphs enable reasoning across entire long documents in ways that chunk-based retrieval fundamentally cannot? Why does hierarchy matter as much as multimodality?

Can pretraining data statistics detect hallucinations better than model confidence?

This explores whether tracking rare entity co-occurrences in training data provides a more reliable hallucination signal than measuring model confidence. It matters because confidence-based retrieval triggers miss the model's most dangerous mistakes.
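
A toy version of the co-occurrence signal, with a hand-built index standing in for real pretraining-data statistics: the trigger asks how often two entities appeared together in training, not how confident the model feels.

```python
from collections import Counter
from itertools import combinations

# Placeholder corpus; a real system would index the pretraining data.
docs = [{"einstein", "relativity"}, {"einstein", "violin"},
        {"relativity", "spacetime"}]

cooc = Counter()
for entities in docs:
    for pair in combinations(sorted(entities), 2):
        cooc[pair] += 1

def should_retrieve(e1, e2, min_count=2):
    # Rarely co-occurring entities mark claims the model is likely to
    # hallucinate, however confident its token probabilities look.
    return cooc[tuple(sorted((e1, e2)))] < min_count

print(should_retrieve("einstein", "violin"))  # True: rare pair, retrieve
```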

Should retrieval triggers use model confidence or data rarity?

FLARE and QuCo-RAG propose different signals for when to retrieve in RAG systems. Are these competing approaches, or do they each catch distinct failure modes that a combined strategy could address?

Can describing images in text improve zero-shot recognition?

Explores whether converting visual queries to natural-language descriptions before retrieval outperforms direct visual embedding matching. This matters because visual variation in real-world queries often breaks brittle similarity metrics.

Knowledge Graphs

3 notes

Can externalizing reasoning into knowledge graphs help smaller models compete?

Can structuring LLM reasoning as explicit knowledge graph triples enable smaller, cheaper models to solve complex tasks more effectively? This matters because it could make advanced reasoning accessible without scaling model size.

Can community detection enable RAG systems to answer global corpus questions?

Standard RAG struggles with corpus-wide questions that require understanding overall themes rather than retrieving specific passages. Can graph community detection overcome this limitation at scale?
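
A sketch of the map-reduce pattern over detected communities, assuming the `networkx` package; `summarize` is a placeholder for the LLM call that would compress each community into a report used to answer global questions.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([("alice", "acme"), ("acme", "merger"),
                  ("bob", "union"), ("union", "strike")])

def summarize(nodes):
    """Placeholder LLM call: summarize one community's entities and relations."""
    return "summary of: " + ", ".join(sorted(nodes))

# Map: summarize each community. Reduce: answer the corpus-wide question
# from community summaries instead of individually retrieved passages.
communities = greedy_modularity_communities(G)
partials = [summarize(c) for c in communities]
print(partials)
```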

How vulnerable is GraphRAG to tiny text manipulations?

GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
