Reasoning and Knowledge

Research on how language models reason, retrieve, and apply knowledge, including chain-of-thought methods, retrieval-augmented generation, knowledge graphs, and systematic evaluation of reasoning capabilities and failure modes. This community investigates both the architectures and the limits of LLM reasoning.

202 notes (primary) · 923 papers · 13 sub-topics

Retrieval-Augmented Generation (RAG)

16 notes

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Can retrieval be extended into multi-step chains like reasoning?

Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?

Can you adapt retrieval models without accessing target data?

Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.

Can fine-tuning replace query augmentation for retrieval?

Query augmentation helps retrievers handle ambiguous queries but increases input cost. Does fine-tuning the retrieval model achieve comparable performance without this overhead?

Can query-time graph construction replace pre-built knowledge graphs?

Does building dependency graphs from individual queries at inference time offer a more flexible and cost-effective alternative to constructing knowledge graphs over entire document collections upfront?

Can retrieval learn what actually helps answer questions?

Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.

Can knowledge graphs enable multi-hop reasoning in one retrieval step?

Standard RAG retrieves once but misses chains; iterative RAG follows chains but costs more. Can we encode multi-hop paths in a knowledge graph so one retrieval pass discovers them all?

Can long-context LLMs replace retrieval-augmented generation systems?

Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.

Can a model's partial response guide what to retrieve next?

Does using the model's in-progress output as a retrieval signal reveal information needs better than the original query alone? This explores whether generation itself can diagnose what documents are missing.

Does supervising retrieval steps outperform final answer rewards?

Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways—short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.

Can rationale-driven selection beat similarity re-ranking for evidence?

Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.

When should language models retrieve external knowledge versus use internal knowledge?

Can we model retrieval as a per-step decision problem rather than an always-on strategy? This matters because unnecessary retrieval adds noise and latency without improving accuracy.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

Explores whether rewarding coherent reasoning patterns during training helps models internalize domain knowledge better than standard fine-tuning approaches that treat all tokens equally.

Can simple uncertainty estimates beat complex adaptive retrieval?

Does measuring a language model's own confidence on token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.

Do vector embeddings actually measure task relevance?

Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?
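
Several notes in this group ask whether retrieval can be triggered by the model's own confidence instead of running on every query. The sketch below illustrates that idea under stated assumptions: the generator exposes per-token log-probabilities, confidence is summarized as the geometric-mean token probability, and retrieval fires below a cutoff. The function names and the 0.6 threshold are hypothetical and are not taken from any of the notes.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a generated span (hypothetical helper)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.6) -> bool:
    """Trigger retrieval only when the draft looks uncertain.

    The threshold is an illustrative assumption; in practice it would be tuned.
    """
    return mean_token_confidence(token_logprobs) < threshold


# Toy log-probabilities, invented for illustration only.
confident_draft = [math.log(0.9), math.log(0.95), math.log(0.85)]
uncertain_draft = [math.log(0.9), math.log(0.2), math.log(0.3)]

print(should_retrieve(confident_draft))  # high average confidence, no retrieval
print(should_retrieve(uncertain_draft))  # low average confidence, retrieve
```

A single averaged score is the simplest possible signal; whether it can match multi-call adaptive pipelines is exactly what the notes above examine.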

Domain Specialization in LLMs

13 notes

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Does model access level determine which specialization techniques work?

Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.

Why do language models struggle with historical legal cases?

Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

Can organizing knowledge structures beat raw training data volume?

Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Chain-of-Thought and Reasoning Methods

12 notes

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Can models recognize question difficulty before they reason?

Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. This matters because false confidence in invalid traces could mask errors.

Should reasoning benchmarks score final answers or reasoning traces?

Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?

What makes reflection actually work in reasoning models?

Does reflection in language models involve genuine self-correction, or just confident-sounding traces? This question probes whether models can truly backtrack and revise versus merely mimicking reflective language.

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.

Does training data format shape reasoning strategy more than domain?

What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.

Reasoning Critiques

11 notes

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Do chain-of-thought traces actually help users understand model reasoning?

Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Does chain-of-thought reasoning reflect genuine thinking or performance?

When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

What critical thinking skills do reasoning models actually lose?

Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.

Why do more capable reasoning models ignore your instructions?

As AI models develop stronger reasoning abilities, they seem to follow instructions less reliably. What causes this counterintuitive trade-off, and how severe is the problem in practice?

Reasoning by Reflection and Self-Critique

9 notes

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Does voting discard useful reasoning from losing chains?

When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?

Does reflection in reasoning models actually correct errors?

When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.

Can models learn reasoning from predicting any text?

Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.

Is reflection in reasoning models actually fixing mistakes?

Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Why does vanilla RAG produce shallow and redundant results?

Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration.

Deep Research Agents

9 notes

Can schema-free graphs objectively evaluate open-ended search?

Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods trade objectivity for rigidity or richness for subjectivity.

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Do search steps follow the same scaling rules as reasoning tokens?

Explores whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Reasoning Model Architectures

7 notes

Can LLM explanations actually help humans predict model behavior?

Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Can LLMs replace search engines during agent training?

Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.

Do reasoning models actually use the hints they receive?

This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.

Can intermediate reasoning points yield better answers than final ones?

When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?

Can we monitor AI reasoning without destroying what makes it readable?

Explores the tension between using chain-of-thought traces to catch misbehavior and the risk that optimization pressures will make models hide their actual reasoning. Also considers why readable reasoning might be incompatible with safe training.

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Logical Reasoning and Internal Rules

7 notes

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

Why does partial formalization outperform full symbolic logic?

Explores whether injecting some symbolic structure into natural language reasoning works better than completely formalizing problems. This matters because it could reveal the optimal balance between structure and semantics for LLM reasoning.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

LLM Failure Modes

6 notes

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Can any computable LLM truly avoid hallucinating?

Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Why can't language models reverse learned facts?

Language models trained on directional statements like "A is B" often fail to answer the reverse query. This explores why symmetric relations aren't automatically learned during training, despite appearing throughout the data.

Knowledge Graphs

6 notes

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Can externalizing LLM reasoning into structured knowledge graph triples enable smaller, cheaper models to match the performance of much larger ones? This explores whether making reasoning explicit and inspectable improves both capability and transparency.

Can knowledge graphs teach models deep domain expertise?

Explores whether organizing knowledge as structured graph paths, composed from simple to complex, can enable language models to develop genuine domain superintelligence rather than surface-level pattern matching.

Can knowledge graphs generate training data for search agents?

Explores whether synthesizing questions from knowledge graph random walks with entity blurring can create the hard-to-find training data needed to teach deep search agents to reason and search effectively.

How vulnerable is GraphRAG to tiny text manipulations?

GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.

Can language models actually use graph structure information?

After fine-tuning on graph data, do LLMs learn to use actual connectivity patterns, or just recognize that graphs exist? This matters for understanding whether transformers can handle structured reasoning tasks.

Can symbolic rules from knowledge graphs guide complex reasoning?

Can deriving symbolic rules directly from knowledge graph structure help align natural language questions with structured reasoning paths? This explores whether explicit structural patterns outperform semantic similarity for multi-hop inference.

Question Answering and Search

4 notes

How do logic units preserve procedural coherence better than chunks?

Can structured retrieval units with prerequisites, headers, bodies, and linkers maintain step-by-step coherence in how-to answers where fixed-size chunks fail? This matters because procedural questions require sequential logic and conditional branching that chunk-based RAG cannot support.

What makes strategic question-asking succeed or fail?

Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. This matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

How can models select the most informative question to ask?

Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.
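
Scoring candidate questions by information gain, as the last note describes, can be illustrated with a small worked example. The sketch below is a simplification under stated assumptions: a finite set of hypotheses with known prior probabilities, and each hypothesis deterministically fixing the answer to each question. The helper names and the toy troubleshooting data are hypothetical and do not come from the note.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(
    prior: Dict[str, float], answers: Dict[str, Hashable]
) -> float:
    """Prior entropy minus the expected posterior entropy after hearing the answer.

    `answers[h]` is the answer the question would receive if hypothesis `h` were true.
    """
    # Group prior mass by the answer each hypothesis would produce.
    mass_by_answer = defaultdict(list)
    for hypothesis, p in prior.items():
        mass_by_answer[answers[hypothesis]].append(p)

    expected_posterior = 0.0
    for masses in mass_by_answer.values():
        p_answer = sum(masses)
        if p_answer > 0:
            posterior = [p / p_answer for p in masses]
            expected_posterior += p_answer * entropy(posterior)
    return entropy(prior.values()) - expected_posterior


# Toy example (invented): four equally likely causes of a device fault.
prior = {"battery": 0.25, "cable": 0.25, "firmware": 0.25, "screen": 0.25}
questions = {
    "Does it power on?": {"battery": "no", "cable": "no", "firmware": "yes", "screen": "yes"},
    "Is the display cracked?": {"battery": "no", "cable": "no", "firmware": "no", "screen": "yes"},
}

gains = {q: expected_information_gain(prior, a) for q, a in questions.items()}
best_question = max(gains, key=gains.get)
print(best_question, gains)
```

In this toy setup the question that splits the hypotheses evenly wins, because it removes the most uncertainty on average. Real systems would have to estimate the answer distributions, for instance by simulating possible futures as the note suggests.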

LLM Evaluations and Benchmarks

3 notes

Is hallucination detection progress real or just metric artifacts?

Standard evaluation metrics for hallucination detection may systematically overstate how well methods actually work. The question asks whether reported improvements reflect genuine capability or measurement error.

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?