LLM Reasoning and Architecture

Covers how language models reason, plan, and represent knowledge internally, including chain-of-thought methods, reflection, search-based reasoning, mechanistic interpretability, and architectural designs that support structured thinking. Studied by researchers building and analyzing next-generation reasoning systems.

209 notes (primary) · 1102 papers · 22 sub-topics

Reasoning Critiques

13 notes

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Do chain of thought traces actually help humans understand reasoning?

When models show their work through chain of thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding.

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
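
A minimal sketch of the kind of signal this question points at, under illustrative assumptions: treat reasoning steps that open with backtracking cues (for example "wait" or "actually") as abandoned branches and report their fraction, which could then be compared against eventual correctness. The cue list and sentence splitting are stand-ins, not the note's actual method.

```python
import re

# Hypothetical backtracking cues; real traces may mark abandoned branches differently.
BACKTRACK_CUES = ("wait", "actually", "hmm", "let me reconsider", "that's wrong")

def failed_step_fraction(trace: str) -> float:
    """Fraction of reasoning steps that begin with a backtracking cue."""
    steps = [s.strip().lower() for s in re.split(r"(?<=[.!?])\s+|\n+", trace) if s.strip()]
    if not steps:
        return 0.0
    failed = sum(any(s.startswith(c) for c in BACKTRACK_CUES) for s in steps)
    return failed / len(steps)

trace = ("Compute 17 * 23. 17 * 20 = 340. Wait, I also need 17 * 3. "
         "17 * 3 = 51. So the answer is 391.")
print(f"failed-step fraction: {failed_step_fraction(trace):.2f}")
```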

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Does chain-of-thought reasoning reflect genuine thinking or performance?

When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

What critical thinking skills do reasoning models actually lose?

Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.

Why do more capable reasoning models ignore your instructions?

As AI models develop stronger reasoning abilities, they seem to follow instructions less reliably. What causes this counterintuitive trade-off, and how severe is the problem in practice?

Mechanistic Interpretability

13 notes

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Do neural networks naturally break tasks into modular parts?

Can standard neural networks decompose complex tasks into separate subroutines implemented in distinct subnetworks, or do they only memorize input-output patterns? Understanding whether compositionality emerges from gradient-based learning matters for interpretability and generalization.

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
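
A minimal sketch of one way such a constraint could be enforced, assuming sparsity is maintained by re-masking each layer's weights to their top-k magnitudes after every optimizer step; the actual work may use a different sparsification scheme. With most weights pinned at zero, each unit's surviving connections form a small circuit that is easier to read off.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 64, bias=False)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
keep_fraction = 0.05   # illustrative: keep only 5% of weights nonzero

def apply_topk_mask(weight: torch.Tensor, keep: float) -> None:
    """Zero out all but the largest-magnitude `keep` fraction of weights."""
    k = max(1, int(weight.numel() * keep))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    weight.mul_((weight.abs() >= threshold).float())

for step in range(100):
    x = torch.randn(32, 64)
    loss = (layer(x) - x.roll(1, dims=-1)).pow(2).mean()   # toy prediction target
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        apply_topk_mask(layer.weight, keep_fraction)

print(f"nonzero weights: {(layer.weight != 0).float().mean().item():.1%}")
```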

Chain-of-Thought and Reasoning Methods

13 notes

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Can longer reasoning chains eliminate model sensitivity to input noise?

Does adding more chain-of-thought steps eventually make language models robust to perturbations? This matters because it determines whether extended reasoning is a viable defense against adversarial attacks.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
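
For concreteness, a rough sketch of the kind of setup the description refers to: attaching low-rank adapters to a roughly 1.5B-parameter base model so that reasoning training updates only a small fraction of weights. The model name, rank, and target modules below are illustrative assumptions, not the Tina paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model and hyperparameters; the note's exact setup may differ.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

lora = LoraConfig(
    r=16,                     # low-rank update dimension
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

If reasoning gains really come mostly from adapting output format, an update this small should capture most of the benefit.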

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.

Should reasoning benchmarks score final answers or reasoning traces?

Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?

What makes reflection actually work in reasoning models?

Does reflection in language models involve genuine self-correction, or just confident-sounding traces? This question probes whether models can truly backtrack and revise versus merely mimicking reflective language.

When does sequential reasoning beat parallel voting?

Explores whether sequential chain-of-thought reasoning or parallel voting is more effective for different problem types. Understanding this trade-off helps predict which test-time compute strategy will work best.
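
To make the parallel side of the comparison concrete, here is a minimal self-consistency-style sketch: sample several independent chains, keep only their final answers, and take a majority vote. The `sample_answer` stub stands in for a real model call; the sequential alternative would instead spend the same budget extending or revising a single chain.

```python
from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Placeholder for one sampled chain-of-thought ending in a final answer."""
    return ["42", "42", "41", "42", "40"][seed % 5]  # stand-in for model samples

def parallel_vote(question: str, n_samples: int = 5) -> str:
    answers = [sample_answer(question, seed) for seed in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    print(f"votes: {dict(Counter(answers))} -> {winner}")
    return winner

parallel_vote("What is 6 * 7?")
```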

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.

Reasoning Architectures

11 notes

Can modular cognitive tools boost LLM reasoning without training?

Does structuring reasoning as discrete, sandboxed tool calls elicit stronger problem-solving in language models compared to monolithic prompting approaches, and can this approach match specialized reasoning models?

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Can reasoning and tool execution run in parallel?

Standard LLM tool use halts generation to wait for each tool response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?

Can reasoning stay grounded without external feedback loops?

Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation.

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Which tokens in reasoning chains actually matter most?

Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
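
A sketch of the greedy-pruning probe described above, assuming a hypothetical `answer_is_correct` oracle that re-checks the final answer after each deletion: repeatedly drop any token whose removal leaves the answer intact and inspect what survives. The oracle below is a toy stand-in, not a model call.

```python
def answer_is_correct(question: str, cot_tokens: list[str]) -> bool:
    """Toy stand-in: the 'answer' survives as long as the numeric tokens remain."""
    return all(t in cot_tokens for t in ("340", "51", "391"))

def greedy_prune(question: str, cot_tokens: list[str]) -> list[str]:
    pruned = list(cot_tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(pruned)):
            candidate = pruned[:i] + pruned[i + 1:]
            if answer_is_correct(question, candidate):
                pruned = candidate
                changed = True
                break
    return pruned

cot = "First , 17 * 20 = 340 and 17 * 3 = 51 , so the total is 391".split()
print(greedy_prune("17 * 23 ?", cot))  # linguistic scaffolding drops, numbers remain
```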

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Does separating planning from execution improve reasoning accuracy?

Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?

Does chain-of-thought reasoning actually explain model decisions?

Chain-of-thought is deployed to make AI systems transparent and auditable. But does the reasoning chain actually correlate with correct outputs, or does it just create an illusion of explainability?

Novel LLM Architectures

8 notes

Can a coordination layer turn LLM patterns into genuine reasoning?

LLMs excel at pattern retrieval but lack external constraint binding. Can a System 2 coordination layer—anchoring outputs to goals and evidence—transform statistical associations into goal-directed reasoning?

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
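
A toy sketch of the dynamic-grouping idea, assuming a per-position estimate of next-byte prediction difficulty is available: open a new patch whenever predictive entropy spikes, so easy spans get long patches and hard spans get short ones. The sliding-window unigram entropy below is a stand-in for a trained byte-level predictor, and the threshold is arbitrary.

```python
import math
from collections import Counter

def byte_entropies(data: bytes) -> list[float]:
    """Stand-in predictor: entropy of a unigram byte model over a sliding window."""
    ents = []
    for i in range(len(data)):
        window = data[max(0, i - 32): i + 1]
        counts = Counter(window)
        total = len(window)
        ents.append(-sum(c / total * math.log2(c / total) for c in counts.values()))
    return ents

def entropy_patches(data: bytes, threshold: float = 2.5) -> list[bytes]:
    """Close the current patch and open a new one when predicted entropy spikes."""
    patches, start = [], 0
    for i, h in enumerate(byte_entropies(data)):
        if h > threshold and i > start:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"aaaaaaaaaaaaaaaa salted caramel 0x9f3a"
print([p.decode(errors="replace") for p in entropy_patches(text)])
```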

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.

Does long chain of thought reasoning follow molecular bond patterns?

Can we understand extended reasoning as organized like molecular structures with distinct interaction types? This matters because it explains why mixing reasoning traces from different sources often fails despite similar statistics.

Can cognition work by reusing memory instead of recomputing?

Does intelligence emerge from structured navigation of prior inference paths rather than fresh computation? This challenges whether brains and AI systems need to recalculate constantly or can leverage stored trajectories for efficiency.

Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.

Can parallel architectures solve fundamentally sequential problems?

Explores whether pure parallel computation—like Transformers—can tackle problems requiring long chains of dependent reasoning, or if serial depth is theoretically necessary for certain classes of problems.

Logical Reasoning and Internal Rules

8 notes

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? This tests whether logical correctness or structural format is the real driver of CoT's effectiveness.

Does partial formalism work better than full symbolic translation?

Exploring whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Diffusion-Based LLMs

8 notes

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel non-sequential order. What makes likelihood computation intractable in diffusion, and can we work around it?
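
The contrast the description draws can be made concrete in a few lines, assuming per-step logits are available: an autoregressive policy's sequence log-probability is an exact sum of per-token log-probabilities, which plugs directly into a policy-gradient objective. The tensors below are random stand-ins for model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 50, 8
logits = torch.randn(seq_len, vocab)        # stand-in for per-step AR logits
tokens = torch.randint(vocab, (seq_len,))   # the sampled sequence

# Exact autoregressive factorization: log p(x) = sum_t log p(x_t | x_<t)
log_probs = F.log_softmax(logits, dim=-1)
seq_logprob = log_probs[torch.arange(seq_len), tokens].sum()

# REINFORCE-style surrogate loss for one sampled sequence with reward R.
reward = 1.0  # e.g. 1 if the final answer is correct
loss = -reward * seq_logprob
print(seq_logprob.item(), loss.item())
# A parallel, non-sequential diffusion decoder exposes no such exact per-token
# decomposition of the final sequence; RL methods must rely on an ELBO or
# another likelihood proxy instead.
```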

Can diffusion models enable control that autoregressive models cannot reach?

Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?

Can diffusion language models match autoregressive inference speed?

Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?

Can diffusion models commit to answers before full decoding?

Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?
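
A sketch of how such convergence might be detected and exploited, assuming a hypothetical `denoise_step` that returns the current best-guess token sequence at each refinement step: stop once the sequence has been stable for a few consecutive steps. The denoiser below is a toy stand-in that happens to converge early.

```python
def denoise_step(step: int) -> list[str]:
    """Toy stand-in for one refinement step of a diffusion language model."""
    drafts = [["[M]", "[M]", "[M]"], ["the", "[M]", "42"], ["the", "answer", "42"]]
    return drafts[min(step, len(drafts) - 1)]

def decode_with_early_exit(max_steps: int = 16, patience: int = 3) -> tuple[list[str], int]:
    stable, prev = 0, None
    for step in range(max_steps):
        current = denoise_step(step)
        stable = stable + 1 if current == prev else 0
        prev = current
        if stable >= patience:          # answer unchanged for `patience` steps
            return current, step + 1
    return prev, max_steps

tokens, steps_used = decode_with_early_exit()
print(tokens, f"stopped after {steps_used}/16 steps")
```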

Can diffusion models perform evolutionary search in parameter space?

Diffusion models and evolutionary algorithms share equivalent mathematical structures. Can we leverage this equivalence to build evolutionary search methods that preserve solution diversity better than traditional algorithms?

Can reasoning and answers be generated separately in language models?

Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.

Can iterative revision cycles match how humans actually write?

Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?

Does autoregressive generation uniquely enable LLM scaling?

Is the autoregressive factorization truly necessary for LLM scalability, or do other generative principles like diffusion achieve comparable performance? This matters because it shapes which architectural paths deserve investment.

Domain Specialization in LLMs

7 notes

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

LLM Failure Modes

7 notes

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Can any computable LLM truly avoid hallucinating?

Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Why can't language models reverse learned facts?

Language models trained on directional statements like "A is B" often fail to answer the reverse query. This explores why symmetric relations aren't automatically learned during training, despite appearing throughout the data.

LLM Architecture

7 notes

Can LLMs reconstruct censored knowledge from scattered training hints?

When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
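
The comparison rests on a standard identity: paired with an arithmetic coder, a model that assigns probability p to each next symbol can encode it in about -log2 p bits, so compression quality reduces to cumulative predictive log-loss. A toy sketch under that assumption, with an adaptive unigram model standing in for the language model:

```python
import math
from collections import Counter

def compressed_bits(data: bytes) -> float:
    """Ideal code length sum(-log2 p(byte)) under a stand-in adaptive byte model."""
    counts = Counter(range(256))  # Laplace-smoothed adaptive unigram model
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)
        counts[b] += 1
        total += 1
    return bits

data = b"abababababababab" * 8
print(f"{compressed_bits(data):.0f} bits vs {8 * len(data)} raw bits")
# Swapping in an LLM's conditional next-symbol distribution gives the kind of
# compression rates the note's question is about.
```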

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Cognitive Models and Latent Representations

6 notes

How do language models encode syntactic relations geometrically?

Do LLM embeddings represent syntactic relations with distance alone, or with direction as well? This asks whether neural networks can spontaneously develop geometric structures compatible with symbolic representations.

Can communication pressure drive agents to learn shared abstractions?

Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
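
A minimal sketch of the alternative the description raises: rather than committing to one sampled token, feed the next step the probability-weighted mixture of token embeddings so the uncertainty survives into the following forward pass. The shapes, temperature, and random logits below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model = 100, 32
embedding = torch.nn.Embedding(vocab, d_model)

logits = torch.randn(vocab)               # stand-in next-token logits
probs = F.softmax(logits / 0.7, dim=-1)   # keep the full distribution

hard_input = embedding(logits.argmax())   # standard: commit to one token
soft_input = probs @ embedding.weight     # mixture: expected embedding

print(hard_input.shape, soft_input.shape)  # both (32,), but soft_input carries
# mass from every plausible continuation rather than a single sampled choice.
```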

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Reasoning by Reflection and Self-Critique

5 notes

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Does voting discard useful reasoning from losing chains?

When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?

Can models learn reasoning from predicting text alone?

Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

LLM Memory

4 notes

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Deep Research Agents

3 notes

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Context Engineering

3 notes

How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Can longer task training help shorter tasks extrapolate?

When models train on related tasks at different lengths, does solving a longer auxiliary task enable a shorter main task to generalize beyond its training length? This matters for understanding how neural networks transfer learned capabilities across related problems.

Can we steer reasoning toward brevity without retraining?

This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
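
A rough sketch of the activation-steering recipe in question, under several assumptions: estimate a 'concise minus verbose' direction as a difference of mean hidden states, then add a scaled copy of it to a layer's output at generation time via a forward hook. The random activations, layer choice, and scale below are placeholders, not values from the note.

```python
import torch

torch.manual_seed(0)
d_model = 64

# Stand-ins for hidden states collected from concise vs. verbose reasoning traces.
concise_acts = torch.randn(200, d_model) + 0.5
verbose_acts = torch.randn(200, d_model) - 0.5

steer = concise_acts.mean(0) - verbose_acts.mean(0)
steer = steer / steer.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    """Shift every position's activation toward the concise direction."""
    return output + alpha * steer

# With a real model one would register this on a mid-depth transformer block,
# e.g. model.model.layers[15].register_forward_hook(steering_hook) for a
# Hugging Face decoder; here a plain Linear layer stands in.
layer = torch.nn.Linear(d_model, d_model)
handle = layer.register_forward_hook(steering_hook)
out = layer(torch.randn(3, d_model))
handle.remove()
print(out.shape)
```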

Mobile and On-Device LLMs

3 notes

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
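
A back-of-the-envelope sketch of the trade-off, with hypothetical phone-class numbers chosen purely for illustration: compare the time to stream one transformer block's weights from DRAM against the time to run its matrix multiplies for a single token, and note which side dominates when weights would otherwise be fetched twice.

```python
# All numbers below are illustrative assumptions, not measured hardware figures.
d_model = 2048
params_per_block = 12 * d_model**2       # attention + MLP weights, roughly
bytes_per_block = params_per_block * 2   # fp16

dram_bw = 50e9    # hypothetical sustained bandwidth, bytes/s
compute = 2e12    # hypothetical sustained throughput, FLOP/s

fetch_time = bytes_per_block / dram_bw   # move the block's weights once
flops_per_token = 2 * params_per_block   # one forward pass through the block
compute_time = flops_per_token / compute

print(f"fetch one block : {fetch_time * 1e3:.2f} ms")
print(f"compute, 1 token: {compute_time * 1e3:.3f} ms")
# On memory-bound hardware like this, recomputing a block's activations can be
# cheaper than fetching its weights a second time.
```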

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

LLM Evaluations and Benchmarks

2 notes

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? This challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Task Planning

1 note