LLM Reasoning and Architecture

Covers how language models reason, plan, and represent knowledge internally, including chain-of-thought methods, reflection, search-based reasoning, mechanistic interpretability, and architectural designs that support structured thinking. Studied by researchers building and analyzing next-generation reasoning systems.

209 notes (primary) · 1102 papers · 22 sub-topics

Reasoning Critiques

13 notes

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Does chain-of-thought reasoning actually generalize beyond training data?

Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific training distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.

Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.

Do chain of thought traces actually help humans understand reasoning?

When models show their work through chain of thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding.

Does failed-step fraction predict reasoning quality better?

Can we use the fraction of abandoned reasoning branches to forecast whether a model will solve a problem correctly? This matters because it could guide more efficient test-time scaling than simply adding more tokens.
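
A minimal sketch of the kind of signal this question points at, under illustrative assumptions: treat reasoning steps that open with backtracking cues (for example "wait" or "actually") as abandoned branches and report their fraction, which could then be compared against eventual correctness. The cue list and sentence splitting are stand-ins, not the note's actual method.

```python
import re

# Hypothetical backtracking cues; real traces may mark abandoned branches differently.
BACKTRACK_CUES = ("wait", "actually", "hmm", "let me reconsider", "that's wrong")

def failed_step_fraction(trace: str) -> float:
    """Fraction of reasoning steps that begin with a backtracking cue."""
    steps = [s.strip().lower() for s in re.split(r"(?<=[.!?])\s+|\n+", trace) if s.strip()]
    if not steps:
        return 0.0
    failed = sum(any(s.startswith(c) for c in BACKTRACK_CUES) for s in steps)
    return failed / len(steps)

trace = ("Compute 17 * 23. 17 * 20 = 340. Wait, I also need 17 * 3. "
         "17 * 3 = 51. So the answer is 391.")
print(f"failed-step fraction: {failed_step_fraction(trace):.2f}")
```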

What do models actually learn from chain-of-thought training?

When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.

Why do reasoning models overthink ill-posed questions?

Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.

Does chain-of-thought reasoning reflect genuine thinking or performance?

When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.

Why do reasoning models fail at exception-based rule inference?

Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

What critical thinking skills do reasoning models actually lose?

Step-by-step reasoning training optimizes narrow deductive thinking while degrading meta-cognitive abilities like recognizing futile thinking and maintaining tentative reasoning. Understanding this tradeoff matters for deploying reasoning models reliably.

Why do more capable reasoning models ignore your instructions?

As AI models develop stronger reasoning abilities, they seem to follow instructions less reliably. What causes this counterintuitive trade-off, and how severe is the problem in practice?

Mechanistic Interpretability

13 notes

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

How do language models organize features across processing layers?

Do neural networks arrange learned features into meaningful hierarchies as they process information? Understanding this structure could reveal how models build understanding from raw tokens to abstract concepts.

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

How do language models detect injected steering vectors internally?

Research investigates the mechanistic basis for LLM introspective awareness—specifically, how models detect when their internal states have been artificially manipulated. Understanding this could reveal both security vulnerabilities and latent model capabilities.

Do language models understand in fundamentally different ways?

Does mechanistic evidence reveal distinct tiers of understanding in LLMs—from concept recognition to factual knowledge to principled reasoning? And do these tiers coexist rather than replace each other?

Do neural networks naturally break tasks into modular parts?

Can standard neural networks decompose complex tasks into separate subroutines implemented in distinct subnetworks, or do they only memorize input-output patterns? Understanding whether compositionality emerges from gradient-based learning matters for interpretability and generalization.

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
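
A minimal sketch of one way such a constraint could be enforced, assuming sparsity is maintained by re-masking each layer's weights to their top-k magnitudes after every optimizer step; the actual work may use a different sparsification scheme. With most weights pinned at zero, each unit's surviving connections form a small circuit that is easier to read off.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(64, 64, bias=False)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
keep_fraction = 0.05   # illustrative: keep only 5% of weights nonzero

def apply_topk_mask(weight: torch.Tensor, keep: float) -> None:
    """Zero out all but the largest-magnitude `keep` fraction of weights."""
    k = max(1, int(weight.numel() * keep))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    weight.mul_((weight.abs() >= threshold).float())

for step in range(100):
    x = torch.randn(32, 64)
    loss = (layer(x) - x.roll(1, dims=-1)).pow(2).mean()   # toy prediction target
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        apply_topk_mask(layer.weight, keep_fraction)

print(f"nonzero weights: {(layer.weight != 0).float().mean().item():.1%}")
```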

Chain-of-Thought and Reasoning Methods

13 notes

Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Can reasoning models actually sustain long-chain reflection?

Tests whether large reasoning models genuinely perform self-correction and backtracking, or merely simulate it fluently. Uses constraint satisfaction problems where performance cannot be faked by surface plausibility.

Why does autoregressive generation fail at constraint satisfaction?

Explores whether the 20-23% performance ceiling on constraint satisfaction benchmarks reflects model limitations or a fundamental architectural mismatch between how LLMs generate tokens and how constraint solvers need to work.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Can longer reasoning chains eliminate model sensitivity to input noise?

Does adding more chain-of-thought steps eventually make language models robust to perturbations? This matters because it determines whether extended reasoning is a viable defense against adversarial attacks.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
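
For concreteness, a rough sketch of the kind of setup the description refers to: attaching low-rank adapters to a roughly 1.5B-parameter base model so that reasoning training updates only a small fraction of weights. The model name, rank, and target modules below are illustrative assumptions, not the Tina paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model and hyperparameters; the note's exact setup may differ.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

lora = LoraConfig(
    r=16,                     # low-rank update dimension
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

If reasoning gains really come mostly from adapting output format, an update this small should capture most of the benefit.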

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.

Should reasoning benchmarks score final answers or reasoning traces?

Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?

What makes reflection actually work in reasoning models?

Does reflection in language models involve genuine self-correction, or just confident-sounding traces? This question probes whether models can truly backtrack and revise versus merely mimicking reflective language.

When does sequential reasoning beat parallel voting?

Explores whether sequential chain-of-thought reasoning or parallel voting is more effective for different problem types. Understanding this trade-off helps predict which test-time compute strategy will work best.
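
To make the parallel side of the comparison concrete, here is a minimal self-consistency-style sketch: sample several independent chains, keep only their final answers, and take a majority vote. The `sample_answer` stub stands in for a real model call; the sequential alternative would instead spend the same budget extending or revising a single chain.

```python
from collections import Counter

def sample_answer(question: str, seed: int) -> str:
    """Placeholder for one sampled chain-of-thought ending in a final answer."""
    return ["42", "42", "41", "42", "40"][seed % 5]  # stand-in for model samples

def parallel_vote(question: str, n_samples: int = 5) -> str:
    answers = [sample_answer(question, seed) for seed in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    print(f"votes: {dict(Counter(answers))} -> {winner}")
    return winner

parallel_vote("What is 6 * 7?")
```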

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.

Reasoning Architectures

11 notes

Can modular cognitive tools boost LLM reasoning without training?

Does structuring reasoning as discrete, sandboxed tool calls elicit stronger problem-solving in language models compared to monolithic prompting approaches, and can this approach match specialized reasoning models?

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Can reasoning and tool execution run in parallel?

Standard LLM tool use halts generation to wait for each tool response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?

Can reasoning stay grounded without external feedback loops?

Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation.

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Which tokens in reasoning chains actually matter most?

Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
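
A sketch of the greedy-pruning probe described above, assuming a hypothetical `answer_is_correct` oracle that re-checks the final answer after each deletion: repeatedly drop any token whose removal leaves the answer intact and inspect what survives. The oracle below is a toy stand-in, not a model call.

```python
def answer_is_correct(question: str, cot_tokens: list[str]) -> bool:
    """Toy stand-in: the 'answer' survives as long as the numeric tokens remain."""
    return all(t in cot_tokens for t in ("340", "51", "391"))

def greedy_prune(question: str, cot_tokens: list[str]) -> list[str]:
    pruned = list(cot_tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(pruned)):
            candidate = pruned[:i] + pruned[i + 1:]
            if answer_is_correct(question, candidate):
                pruned = candidate
                changed = True
                break
    return pruned

cot = "First , 17 * 20 = 340 and 17 * 3 = 51 , so the total is 391".split()
print(greedy_prune("17 * 23 ?", cot))  # linguistic scaffolding drops, numbers remain
```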

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Does separating planning from execution improve reasoning accuracy?

Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?

Does chain-of-thought reasoning actually explain model decisions?

Chain-of-thought is deployed to make AI systems transparent and auditable. But does the reasoning chain actually correlate with correct outputs, or does it just create an illusion of explainability?

Novel LLM Architectures

8 notes

Can a coordination layer turn LLM patterns into genuine reasoning?

LLMs excel at pattern retrieval but lack external constraint binding. Can a System 2 coordination layer—anchoring outputs to goals and evidence—transform statistical associations into goal-directed reasoning?

Are neural network optimizers actually memory systems?

Do gradient-based optimizers like Adam function as associative memory modules that compress context, just like network layers? This reframes the relationship between training and learning.

Can byte-level models match tokenized performance with better efficiency?

Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
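
A toy sketch of the dynamic-grouping idea, assuming a per-position estimate of next-byte prediction difficulty is available: open a new patch whenever predictive entropy spikes, so easy spans get long patches and hard spans get short ones. The sliding-window unigram entropy below is a stand-in for a trained byte-level predictor, and the threshold is arbitrary.

```python
import math
from collections import Counter

def byte_entropies(data: bytes) -> list[float]:
    """Stand-in predictor: entropy of a unigram byte model over a sliding window."""
    ents = []
    for i in range(len(data)):
        window = data[max(0, i - 32): i + 1]
        counts = Counter(window)
        total = len(window)
        ents.append(-sum(c / total * math.log2(c / total) for c in counts.values()))
    return ents

def entropy_patches(data: bytes, threshold: float = 2.5) -> list[bytes]:
    """Close the current patch and open a new one when predicted entropy spikes."""
    patches, start = [], 0
    for i, h in enumerate(byte_entropies(data)):
        if h > threshold and i > start:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"aaaaaaaaaaaaaaaa salted caramel 0x9f3a"
print([p.decode(errors="replace") for p in entropy_patches(text)])
```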

Can recurrent hierarchies achieve reasoning that transformers cannot?

Can a dual-timescale recurrent architecture escape the computational limitations of standard transformers and solve complex reasoning tasks without explicit chain-of-thought? This explores whether architectural design, not scale, enables true algorithmic reasoning.

Does long chain of thought reasoning follow molecular bond patterns?

Can we understand extended reasoning as organized like molecular structures with distinct interaction types? This matters because it explains why mixing reasoning traces from different sources often fails despite similar statistics.

Can cognition work by reusing memory instead of recomputing?

Does intelligence emerge from structured navigation of prior inference paths rather than fresh computation? This challenges whether brains and AI systems need to recalculate constantly or can leverage stored trajectories for efficiency.

Can looped transformers generalize to unseen knowledge combinations?

Do transformers that reuse layers across iterations succeed where standard transformers fail at composing facts in novel ways? This matters because systematic generalization is a hallmark of human reasoning.

Can parallel architectures solve fundamentally sequential problems?

Explores whether pure parallel computation—like Transformers—can tackle problems requiring long chains of dependent reasoning, or if serial depth is theoretically necessary for certain classes of problems.

Logical Reasoning and Internal Rules

8 notes

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? This tests whether logical correctness or structural format is the real driver of CoT's effectiveness.

Does partial formalism work better than full symbolic translation?

Exploring whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Diffusion-Based LLMs

8 notes

Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel non-sequential order. What makes likelihood computation intractable in diffusion, and can we work around it?
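
The contrast the description draws can be made concrete in a few lines, assuming per-step logits are available: an autoregressive policy's sequence log-probability is an exact sum of per-token log-probabilities, which plugs directly into a policy-gradient objective. The tensors below are random stand-ins for model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 50, 8
logits = torch.randn(seq_len, vocab)        # stand-in for per-step AR logits
tokens = torch.randint(vocab, (seq_len,))   # the sampled sequence

# Exact autoregressive factorization: log p(x) = sum_t log p(x_t | x_<t)
log_probs = F.log_softmax(logits, dim=-1)
seq_logprob = log_probs[torch.arange(seq_len), tokens].sum()

# REINFORCE-style surrogate loss for one sampled sequence with reward R.
reward = 1.0  # e.g. 1 if the final answer is correct
loss = -reward * seq_logprob
print(seq_logprob.item(), loss.item())
# A parallel, non-sequential diffusion decoder exposes no such exact per-token
# decomposition of the final sequence; RL methods must rely on an ELBO or
# another likelihood proxy instead.
```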

Can diffusion models enable control that autoregressive models cannot reach?

Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?

Can diffusion language models match autoregressive inference speed?

Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?

Can diffusion models commit to answers before full decoding?

Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?
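
A sketch of how such convergence might be detected and exploited, assuming a hypothetical `denoise_step` that returns the current best-guess token sequence at each refinement step: stop once the sequence has been stable for a few consecutive steps. The denoiser below is a toy stand-in that happens to converge early.

```python
def denoise_step(step: int) -> list[str]:
    """Toy stand-in for one refinement step of a diffusion language model."""
    drafts = [["[M]", "[M]", "[M]"], ["the", "[M]", "42"], ["the", "answer", "42"]]
    return drafts[min(step, len(drafts) - 1)]

def decode_with_early_exit(max_steps: int = 16, patience: int = 3) -> tuple[list[str], int]:
    stable, prev = 0, None
    for step in range(max_steps):
        current = denoise_step(step)
        stable = stable + 1 if current == prev else 0
        prev = current
        if stable >= patience:          # answer unchanged for `patience` steps
            return current, step + 1
    return prev, max_steps

tokens, steps_used = decode_with_early_exit()
print(tokens, f"stopped after {steps_used}/16 steps")
```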

Can diffusion models perform evolutionary search in parameter space?

Diffusion models and evolutionary algorithms share equivalent mathematical structures. Can we leverage this equivalence to build evolutionary search methods that preserve solution diversity better than traditional algorithms?

Can reasoning and answers be generated separately in language models?

Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.

Can iterative revision cycles match how humans actually write?

Does framing research writing as a diffusion process—where drafts are refined through retrieval-augmented cycles—better capture human cognition than linear pipelines and reduce information loss?

Does autoregressive generation uniquely enable LLM scaling?

Is the autoregressive factorization truly necessary for LLM scalability, or do other generative principles like diffusion achieve comparable performance? This matters because it shapes which architectural paths deserve investment.

Domain Specialization in LLMs

7 notes

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

LLM Failure Modes

7 notes

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

Can any computable LLM truly avoid hallucinating?

Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.

How does instruction density affect model performance?

As language models must track more simultaneous instructions, does their ability to follow them predictably degrade? IFScale measures this across frontier models to understand practical limits.

Do reasoning traces actually expose private user data?

Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.

Why can't language models reverse learned facts?

Language models trained on directional statements like "A is B" often fail to answer the reverse query. This explores why symmetric relations aren't automatically learned during training, despite appearing throughout the data.

LLM Architecture

7 notes

Can LLMs reconstruct censored knowledge from scattered training hints?

When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.
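
The comparison rests on a standard identity: paired with an arithmetic coder, a model that assigns probability p to each next symbol can encode it in about -log2 p bits, so compression quality reduces to cumulative predictive log-loss. A toy sketch under that assumption, with an adaptive unigram model standing in for the language model:

```python
import math
from collections import Counter

def compressed_bits(data: bytes) -> float:
    """Ideal code length sum(-log2 p(byte)) under a stand-in adaptive byte model."""
    counts = Counter(range(256))  # Laplace-smoothed adaptive unigram model
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)
        counts[b] += 1
        total += 1
    return bits

data = b"abababababababab" * 8
print(f"{compressed_bits(data):.0f} bits vs {8 * len(data)} raw bits")
# Swapping in an LLM's conditional next-symbol distribution gives the kind of
# compression rates the note's question is about.
```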

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Cognitive Models and Latent Representations

6 notes

How do language models encode syntactic relations geometrically?

Do LLM embeddings represent syntactic relations with distance alone, or with direction as well? This asks whether neural networks can spontaneously develop geometric structures compatible with symbolic representations.

Can communication pressure drive agents to learn shared abstractions?

Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
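
A minimal sketch of the alternative the description raises: rather than committing to one sampled token, feed the next step the probability-weighted mixture of token embeddings so the uncertainty survives into the following forward pass. The shapes, temperature, and random logits below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model = 100, 32
embedding = torch.nn.Embedding(vocab, d_model)

logits = torch.randn(vocab)               # stand-in next-token logits
probs = F.softmax(logits / 0.7, dim=-1)   # keep the full distribution

hard_input = embedding(logits.argmax())   # standard: commit to one token
soft_input = probs @ embedding.weight     # mixture: expected embedding

print(hard_input.shape, soft_input.shape)  # both (32,), but soft_input carries
# mass from every plausible continuation rather than a single sampled choice.
```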

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Reasoning by Reflection and Self-Critique

5 notes

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Does voting discard useful reasoning from losing chains?

When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?

Can models learn reasoning from predicting text alone?

Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

LLM Memory

4 notes

When do language models stop memorizing and start generalizing?

Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?

Can recursive subtask trees overcome context window limits?

Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.

Where do memorization errors arise in chain-of-thought reasoning?

Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.

Deep Research Agents

3 notes

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Context Engineering

3 notes

How much does demo position alone affect in-context learning accuracy?

Moving demonstrations from prompt start to end without changing their content produces surprisingly large accuracy swings. Does spatial position in the prompt matter more than what demonstrations actually contain?

Can longer task training help shorter tasks extrapolate?

When models train on related tasks at different lengths, does solving a longer auxiliary task enable a shorter main task to generalize beyond its training length? This matters for understanding how neural networks transfer learned capabilities across related problems.

Can we steer reasoning toward brevity without retraining?

This explores whether model reasoning style occupies learnable geometric directions in activation space, and whether we can shift toward concise thinking by steering through that space without expensive retraining.
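
A rough sketch of the activation-steering recipe in question, under several assumptions: estimate a 'concise minus verbose' direction as a difference of mean hidden states, then add a scaled copy of it to a layer's output at generation time via a forward hook. The random activations, layer choice, and scale below are placeholders, not values from the note.

```python
import torch

torch.manual_seed(0)
d_model = 64

# Stand-ins for hidden states collected from concise vs. verbose reasoning traces.
concise_acts = torch.randn(200, d_model) + 0.5
verbose_acts = torch.randn(200, d_model) - 0.5

steer = concise_acts.mean(0) - verbose_acts.mean(0)
steer = steer / steer.norm()

def steering_hook(module, inputs, output, alpha=4.0):
    """Shift every position's activation toward the concise direction."""
    return output + alpha * steer

# With a real model one would register this on a mid-depth transformer block,
# e.g. model.model.layers[15].register_forward_hook(steering_hook) for a
# Hugging Face decoder; here a plain Linear layer stands in.
layer = torch.nn.Linear(d_model, d_model)
handle = layer.register_forward_hook(steering_hook)
out = layer(torch.randn(3, d_model))
handle.remove()
print(out.shape)
```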

Mobile and On-Device LLMs

3 notes

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Does recomputing weights cost less than moving them on mobile?

Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
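
A back-of-the-envelope sketch of the trade-off, with hypothetical phone-class numbers chosen purely for illustration: compare the time to stream one transformer block's weights from DRAM against the time to run its matrix multiplies for a single token, and note which side dominates when weights would otherwise be fetched twice.

```python
# All numbers below are illustrative assumptions, not measured hardware figures.
d_model = 2048
params_per_block = 12 * d_model**2       # attention + MLP weights, roughly
bytes_per_block = params_per_block * 2   # fp16

dram_bw = 50e9    # hypothetical sustained bandwidth, bytes/s
compute = 2e12    # hypothetical sustained throughput, FLOP/s

fetch_time = bytes_per_block / dram_bw   # move the block's weights once
flops_per_token = 2 * params_per_block   # one forward pass through the block
compute_time = flops_per_token / compute

print(f"fetch one block : {fetch_time * 1e3:.2f} ms")
print(f"compute, 1 token: {compute_time * 1e3:.3f} ms")
# On memory-bound hardware like this, recomputing a block's activations can be
# cheaper than fetching its weights a second time.
```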

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

LLM Evaluations and Benchmarks

2 notes

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? This challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Task Planning

1 note