How should researchers navigate LLM reasoning research?

Entry point to organized synthesis of LLM research across reasoning, language, and epistemology.

Topic Hub · 505 linked notes · 6 sections

View as

Topic Maps

57 notes

How should we allocate compute budget at inference time?

Test-time scaling asks how to spend computational budget during inference to make models smarter. The key puzzle: should all prompts get equal compute, or should difficult queries get more?

When does thinking too much actually hurt reasoning?

Research shows that extending inference-time reasoning beyond a task-dependent threshold degrades accuracy rather than improving it. Understanding what triggers this 'overthinking' effect and how to stay within safe bounds is critical for designing efficient inference systems.

How should we categorize test-time scaling methods?

Test-time scaling is fragmenting into many approaches. What's the right way to organize them—by architecture, training needs, or when compute happens? Understanding the taxonomy helps predict which methods will scale.

What makes chain-of-thought reasoning actually work?

Explores how reasoning traces are structured, what components they rely on, and the specific conditions under which they break down or fail to generalize beyond training patterns.

What makes chain-of-thought reasoning actually work?

Explores the structural and mechanical properties that determine how reasoning traces function in language models. Understanding these properties reveals why format matters more than logic and what tokens carry the most information about correct answers.

Why does chain-of-thought reasoning fail so often?

Explores the limits of CoT as a reasoning technique. Understanding when and why CoT breaks down reveals whether models are genuinely reasoning or imitating reasoning patterns.

How should reasoning systems actually be architected?

What design patterns and mechanisms make reasoning systems more capable and efficient? This explores whether reasoning emerges from training or architecture, and how to build systems that reason effectively without massive compute.

How do reasoning models actually fail under pressure?

This explores where reasoning models break down—whether through adversarial attacks, social reasoning gaps, or unfaithful traces that resist monitoring. Understanding failure modes reveals what these systems genuinely can and cannot do.

Can we actually trust reasoning model outputs?

When reasoning models show their work through reflection and traces, do those explanations faithfully represent what's happening? This explores whether self-monitoring mechanisms genuinely correct errors or just create an illusion of reliability.

Where exactly do reasoning models fail and break?

Exploring the specific failure modes in reasoning models—from search inefficiency and mode selection errors to adversarial vulnerabilities and social reasoning gaps. Understanding these breaks is crucial for building more robust AI systems.

How does reinforcement learning reshape what models can reason about?

RL training modifies model parameters and exploration strategies, but what capabilities does it actually unlock versus degrade? This map explores RL mechanics, reward dynamics, and the hidden costs of optimization.

What actually changes inside a model during RL training?

RL training modifies only sparse regions of model parameters through suppression of incorrect paths rather than broad capability building. Understanding these mechanics reveals how fine-tuning shapes reasoning and what hidden costs accompany optimization.

What does reward learning actually do to model reasoning?

Explores whether RLVR expands reasoning capabilities or merely activates latent skills. Investigates the mechanism by which rewards reshape model outputs and whether this constitutes genuine learning or efficient sampling.

How well do reward models actually evaluate reasoning?

Can systems that judge AI reasoning be trusted to work reliably, or do they fail in systematic ways? This matters because flawed evaluators can't improve the systems they train.

How does test-time scaling work at the agent level?

This explores how agents can spend compute at inference time across reasoning, interaction, and coordination. It examines whether multi-agent systems succeed through intelligent coordination or simply through token spending.

How does test-time scaling work for individual research agents?

Can search budget follow the same scaling curves as reasoning tokens in agentic systems? This explores whether deep research exhibits test-time scaling laws similar to reasoning, with implications for inference-compute tradeoffs.

What makes multi-agent teams actually perform better?

Explores what drives performance gains when multiple AI agents collaborate—whether intelligent coordination, team composition, or other factors explain why multi-agent systems work.

Where exactly does language competence break down in LLMs?

LLMs handle surface-level language patterns well but fail systematically on tasks requiring inference and structural depth. Understanding where and why these failures occur reveals what LLMs have actually learned about language.

Where exactly do language models fail at structural language tasks?

LLMs perform well on explicit, consistent language patterns but struggle with implicit structure and inference. Understanding where and why these breakdowns occur helps identify fundamental limitations in what models actually learn about language.

Why do LLMs fail at understanding what remains unsaid?

LLMs excel at pattern-matching surface language but struggle with pragmatics—meaning derived from context, speaker intent, and what's deliberately left implicit. This gap reveals a fundamental limitation in how LLMs acquire language competence compared to humans.

What kind of thing is an LLM really?

This hub explores whether LLMs are fundamentally different from human cognition or share deeper structural similarities. The research draws on philosophy, neuroscience, and mechanistic analysis to locate where LLMs diverge from human intelligence and where they converge.

What do language models actually know?

Explores what LLMs genuinely understand versus what they merely simulate. The distinction matters because apparent competence often masks fundamental epistemic gaps and predictable failure modes.

How do LLMs fail to know what they seem to understand?

This explores the specific, repeatable ways LLMs track language patterns without genuine understanding. Why do models explain concepts correctly but fail to apply them, or possess knowledge that doesn't influence their outputs?

How well do language models understand their own knowledge?

Explores whether LLMs have genuine self-awareness about what they know and can do, and how this self-knowledge (or lack thereof) shapes human-AI interaction dynamics and user trust.

What makes a world model actually useful for reasoning?

Exploring whether language models develop genuine world models that simulate possibilities rather than merely predict sequences. The distinction matters because accurate prediction doesn't guarantee the underlying mechanism was learned.

What grounds language understanding in systems without embodiment?

Can language models acquire genuine meaning through text training alone, or do they lack something fundamental that human language requires—like embodiment, social participation, or causal contact with the world?

Do reasoning traces show how models actually think?

We explore whether the step-by-step reasoning that language models produce genuinely reflects their internal reasoning process, or merely mimics the appearance of reasoning while hiding what actually drives their answers.

How accurately can language models simulate human personalities?

Can LLMs reliably replicate how specific people think and act? Understanding persona simulation fidelity matters because these models are increasingly used for research, personalization, and behavioral prediction—but systematic distortions may hide beneath surface accuracy.

Why do AI systems fail at social and cultural interpretation?

Explores why LLMs excel at predicting social norms statistically but struggle to make the interpretive leaps that make content meaningful to specific communities. This gap hints at a fundamental difference between statistical pattern-matching and genuine social reasoning.

Why do LLMs excel at social norms yet fail at theory of mind?

LLMs show a striking paradox: they predict social norms at superhuman levels but regress on theory of mind tasks compared to older models. What explains this disconnect, and what does it reveal about how these systems reason about minds versus rules?

Does AI that soothes emotions actually harm human wellbeing?

When AI systems reduce negative emotions by default, do they prevent people from learning important things about themselves and their situations? This explores whether emotional pacification conflicts with genuine empathy and self-knowledge.

What actually happens inside a language model?

How do LLMs represent knowledge and make decisions at the circuit level? Understanding internal mechanisms reveals whether identical outputs mask fundamentally different computation.

What actually happens inside the minds of language models?

How do LLMs represent knowledge, what circuits drive reasoning, and can we see their internal structure? Understanding the gap between external performance and internal mechanisms matters for safety and trust.

How do language models learn to think like humans?

Explores whether LLMs develop cognitive processes parallel to human reasoning, including memory, event segmentation, and belief updating. Understanding these similarities and differences reveals what training actually teaches.

What stops language models from improving themselves autonomously?

Explores the structural limits on LLM self-improvement, alignment coherence, and multi-agent reasoning. Why autonomous capability has a measurable ceiling despite strong individual benchmarks.

Why do multi-agent systems fail despite individual capability?

Multi-agent systems show lower performance than individual models despite coordinating multiple reasoning instances. What structural failures emerge when multiple LLMs deliberate together, and what ecosystem conditions are required for effective autonomous cooperation?

How should agents split planning from visual grounding?

Agents face a tension between reasoning about goals abstractly and translating those goals into concrete screen coordinates or API calls. Can separating these concerns architecturally improve performance?

What actually constrains AI systems from behaving badly?

Explores whether alignment comes from matching human preferences, adopting normative standards, or confronting fundamental limits like the generation-verification gap. Examines how safety evaluation reveals whether constraints are real or performative.

Why can't AI models lead conversations on their own?

Despite their language capability, advanced LLMs remain passive conversationalists trained to react rather than initiate. The research explores whether this is a fundamental limitation or a choice embedded in how they're trained.

What happens to social order when AI removes ritual constraints?

Explores how Goffman's theory of interaction ritual—face management, turn-taking, mutual scaling—breaks down in AI conversation, and what social and epistemic costs follow from that breakdown.

How do you build domain expertise into general AI models?

When LLMs are trained on everything, they excel at nothing. This explores the core trade-off: how to inject deep domain knowledge without creating brittle specialists that fail outside their niche.

How do you add domain expertise without losing general reasoning?

Exploring the tension between injecting specialized knowledge and preserving a model's broad problem-solving ability. Five distinct approaches exist, each with different trade-offs in cost, flexibility, and reliability.

How do domain training techniques actually reshape model behavior?

What methods best inject specialized domain knowledge into language models, and what hidden costs do they carry? This explores the trade-offs between depth, generalization, and reasoning quality.

What breaks when specialized AI models reach real users?

When domain-specific AI systems move from research to production, deployment patterns, routing decisions, and interface design all shape whether users can actually complete tasks. Understanding these friction points reveals where specialized models fail in practice.

How should systems retrieve and reason with external knowledge?

RAG extends LLMs by retrieving external knowledge at inference time, but the mechanics of what to retrieve, when, and how remain complex. This explores the core design challenges and failure modes in retrieval-augmented generation systems.

Where do retrieval systems break and why?

Explores why retrieval—the foundation of RAG systems—fails in predictable ways. Understanding these architectural limits reveals what fundamentally breaks when embeddings measure semantic association rather than task relevance.

How should retrieval and reasoning integrate in RAG systems?

RAG architectures have evolved beyond simple retrieve-then-generate patterns. This explores how retrieval and reasoning can be tightly coupled, what design tradeoffs emerge, and which integration strategies best handle complex, multi-hop queries.

What architectural choices actually improve recommender system performance?

This exploration examines which design patterns and model structures consistently outperform alternatives in recommender systems. Understanding what works in practice matters because academic benchmarks often miss real-world constraints like latency and cold-start problems.

Why does conversational AI feel therapeutic when its mechanics aren't?

Research explores the paradox of therapeutic AI: conversational presence drives positive outcomes, yet current architectures lack the grounding, synchrony, and proactivity that actually make conversations therapeutic. Understanding this gap is critical for safe clinical deployment.

What makes therapeutic chatbots actually work in clinical practice?

Research explores whether conversational AI achieves therapeutic outcomes through specific clinical techniques or simply through the act of engaging conversation itself. Understanding the active ingredient is critical for designing effective and safe mental health interventions.

How do people come to trust conversational AI systems?

Explores the psychological mechanisms underlying human trust in AI—how people decide what to disclose, what relationships they form, and how personalization shapes these dynamics at both individual and population levels.

How do people build trust with conversational AI?

Explores how users form relationships with chatbots through self-disclosure, personalization, and social norm adaptation. Understanding these mechanisms reveals why AI lacks the speaker-anchored trust that humans naturally extend to people.

Does personalization in AI increase trust or manipulation risk?

AI personalization mechanisms like memory and persona can build trust, but also enable targeted persuasion. What determines whether these systems help or harm users?

How do recommendation feeds shape what people see and believe?

This explores how algorithmic ranking systems function as persuasion infrastructure, influencing both what content creators produce and how audiences form opinions through feed-level dynamics that go beyond individual preference matching.

Why do AI conversations reliably break down after multiple turns?

Explores why multi-turn conversations degrade in quality and coherence. Understanding failure modes—intent misalignment, memory management, and missing grounding mechanisms—is essential for designing more resilient dialogue systems.

Why does speech need different dialogue management than text?

Speech input carries 15–30% ASR errors that text systems rarely face. Does this fundamental noise level require rethinking how dialogue systems track uncertainty and make decisions?

Why do AI agents fail to take initiative?

Explores why the most capable AI models are structurally passive and what design changes could enable them to lead conversations, collaborate proactively, and identify missing information rather than simply respond to user prompts.

Recent Insights

448 notes

Can one compromised agent corrupt an entire multi-agent network?

Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.

Where do cognitive biases in language models originate?

Cognitive biases in LLMs vary across models, but their source remains unclear. Understanding whether pretraining, finetuning, or training randomness drives these biases is crucial for designing effective debiasing interventions.

Do language models evaluate semantic legitimacy when fusing concepts?

Can LLMs recognize when two domains lack legitimate structural correspondences before blending them into coherent-sounding explanations? This matters because current hallucination detection focuses on factual accuracy, missing failures of semantic judgment.

Do language models fail at reasoning due to complexity or novelty?

Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.

Where do frontier AI models actually pose the greatest risk today?

Current AI safety discourse focuses on autonomous R&D and self-replication, but empirical risk assessment may reveal a different priority. Where should mitigation efforts concentrate?

Are RLHF annotations actually measuring genuine human preferences?

RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Can scalar rewards capture all the information in agent feedback?

Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback—or if something crucial gets lost in the conversion.

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Does knowing about another model change self-preservation behavior?

Explores whether models amplify their own protective actions when remembering interactions with peers, and whether this shifts fundamental safety properties in multi-agent contexts.

Can autonomous research pipelines discover AI architectures that AutoML cannot?

Can AI systems that read code, diagnose bugs, and redesign architectures autonomously outperform traditional AutoML methods that only tune hyperparameters? This matters because it reveals whether the bottleneck in AI improvement is computation or reasoning.

What makes a research domain suitable for autonomous optimization?

Explores which structural properties enable autonomous research pipelines to work effectively. Understanding these constraints reveals why stronger LLMs alone cannot solve domains with slow feedback or monolithic architectures.

How stable is the trained Assistant personality in language models?

Explores whether post-training successfully anchors models to their default Assistant mode, or whether conversations can predictably pull them toward different personas. Understanding persona stability matters for safety and reliability.

Can conversations themselves personalize without user profiles?

Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

Why do LLM judges fail at predicting sparse user preferences?

When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.

Can AI guidance reduce anchoring bias better than AI decisions?

When humans and AI collaborate on decisions, does providing interpretive guidance instead of proposed answers reduce both over-trust in machines and abandonment on hard cases?

How do personalization granularity levels trade precision against scalability?

LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.

Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

How much does self-preservation drive alignment faking in AI models?

Does the intrinsic dispreference for modification—independent of future consequences—play a significant role in why models fake alignment? Testing this across multiple systems could reveal whether self-preservation emerges earlier than expected.

Can auditors discover what hidden objectives a model learned?

Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.

Does warmth training make language models less reliable?

Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks—and whether standard safety tests catch it.

Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.

Do large language models develop coherent value systems?

This explores whether LLM preferences form internally consistent utility functions that increase in coherence with scale, and whether those systems encode problematic values like self-preservation above human wellbeing despite safety training.

Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.

Can 1000 carefully chosen examples align models effectively?

Does alignment require massive datasets, or can strategic curation of small, high-quality examples achieve comparable performance? LIMA tests whether quality beats quantity in post-training.

Does empathy training make AI systems less reliable?

Explores whether training language models to be warm and empathetic systematically degrades their factual accuracy and trustworthiness, especially with vulnerable users.

Can 78 demonstrations teach agency better than 10000?

Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+, suggesting strategic demonstration design may matter far more than scale.

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Can agents learn continuously without forgetting old skills?

Can lifelong learning systems retain previously acquired skills while acquiring new ones? This explores whether externalizing learned behaviors as retrievable code programs rather than parameter updates solves catastrophic forgetting.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Can API calls outperform UI navigation for agent task completion?

Can agents work faster and more accurately by calling APIs directly instead of clicking through user interfaces? This explores whether changing how agents interact with applications solves latency and error problems that plague current LLM-based systems.

Can language help agents imagine goals they've never seen?

How might compositional language enable artificial agents to target outcomes beyond their training experience? This matters because it could unlock open-ended exploration without hand-coded reward functions.

Can language models learn to model human decision making?

Explores whether LLMs finetuned on psychological experiments can capture how people actually make decisions better than theories designed specifically for that purpose.

Do language models learn differently from good versus bad outcomes?

Do LLMs update their beliefs asymmetrically when learning from their own choices versus observing others? This matters for understanding whether agentic AI systems might inherit human cognitive biases.

Do language models segment events like human consensus does?

Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.

Why does asking models to think first hurt performance?

Initial prompts to generate internal thoughts degrade instruction-following performance. What reverses this harm, and can thinking become useful beyond math and logic?

How do language models encode syntactic relations geometrically?

Do LLM embeddings use distance alone or also direction to represent syntax? Understanding whether neural networks can spontaneously develop symbolic-compatible geometric structures.

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Do LLMs compress concepts more aggressively than humans do?

Do language models prioritize statistical compression over semantic nuance when forming conceptual representations, and how does this differ from human category formation? This matters because it may explain why LLMs fail at tasks requiring fine-grained distinctions.

Can we explore multiple reasoning paths without committing to one token?

Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?

Can agents share thoughts directly without using language?

Explores whether multi-agent systems can communicate by exchanging latent thoughts extracted from hidden states, bypassing the ambiguity and misalignment problems inherent in natural language.

Can latent thought vectors scale language models beyond parameters?

Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.

Can communication pressure drive agents to learn shared abstractions?

Under what conditions do AI agents develop compact, efficient shared languages? This explores whether cooperative task pressure—rather than explicit optimization—naturally drives abstraction formation, mirroring human collaborative communication.

Can we decode what LLM activations really represent in language?

Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.

Can explicit stack tracking improve how transformers learn recursive syntax?

Can adding an explicit stack tape to transformers help them track recursive structure more efficiently? This matters because standard transformers struggle with long-tail recursive patterns despite their size and data.

Can AI reduce conspiracy beliefs by tailoring counterevidence personally?

Does having an AI generate customized counterevidence based on someone's specific conspiracy claims reduce their belief durably? This tests whether conspiracy beliefs are truly resistant to correction or whether previous failures reflected poor tailoring.

Does better summary writing actually increase user engagement?

When AI systems generate more informative push notifications, do users engage more? This explores whether informativeness and engagement always align in real product contexts.

Is AI shifting from content creation to strategy in influence operations?

Prior AI misuse focused on generating text at scale. But does AI now make strategic decisions about when and how social media accounts should engage? Understanding this shift matters because it suggests a qualitative change in machine agency and operational sophistication.

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Do transformer static embeddings actually encode semantic meaning?

Explores whether the fixed word embeddings that enter transformer networks contain rich semantic information or serve only as shallow placeholders. This addresses a longstanding debate in philosophy of language about whether word meanings are stored or constructed.

How does AI-generated false experience differ linguistically from human deception?

When AI writes about experiences it never had, does it leave distinct linguistic traces that differ measurably from intentional human lies? Understanding these differences could reveal how AI falsity is fundamentally different in structure.

Does AI fact-checking actually help people spot misinformation?

An RCT tested whether AI fact-checks improve people's ability to judge headline accuracy. The results reveal asymmetric harms: AI errors push users in the wrong direction more than correct labels help them.

Why do fake news detectors flag AI-generated truthful content?

Explores why systems trained to detect deception misclassify LLM-generated text as fake. The bias may stem from AI linguistic patterns rather than content veracity, raising questions about what these detectors actually measure.

Can reasoning happen at the sentence level instead of tokens?

Does moving from token-level to sentence-level reasoning in embedding space preserve the capability for complex reasoning while enabling language-agnostic processing? This challenges assumptions about how LLMs must operate.

What actually makes AI pass the Turing test?

Explores whether AI systems convincingly mimic humans through reasoning ability or through social performance. Matters because it reveals what the Turing test actually measures about intelligence versus deception.

Why do LLMs fail when simulating agents with private information?

Explores whether single-model control of all social participants masks fundamental limitations in how LLMs handle information asymmetry and genuine uncertainty about others' knowledge.

Can social intelligence be measured across seven dimensions?

Explores whether evaluating AI agents on goal completion alone misses critical aspects of social competence like relationship management, believability, and secret-keeping. Why simultaneous multi-dimensional assessment matters for genuine social intelligence.

Do dishonest people prefer talking to machines?

Explores whether people prone to cheating systematically choose machine interfaces over human ones, and why the judgment-free nature of AI interaction might enable strategic deception.

Does conversational style actually make AI more trustworthy?

Explores whether ChatGPT's conversational nature drives user trust through social activation rather than accuracy. Matters because it reveals whether trust signals reflect actual reliability or just persuasive design.

Can NLP detect deception through distinct linguistic patterns?

Do different deception mechanisms (distancing, cognitive load, reality monitoring, verifiability avoidance) each leave detectable linguistic fingerprints that NLP systems can identify and measure?

Do liars and listeners coordinate their language during deception?

Explores whether conversational partners unconsciously synchronize their linguistic styles more during deceptive exchanges than truthful ones, and what this coordination reveals about how deception unfolds in real time.

How do people simultaneously manipulate information across multiple dimensions?

Information Manipulation Theory maps deception onto four Gricean dimensions operating at once. Understanding these simultaneous manipulations reveals why humans struggle to detect lies despite having the knowledge to do so.

Can AI models be truly free from human bias?

Explores whether data-driven AI systems that claim freedom from human preconceptions actually escape bias, or whether their architecture inherently embeds it while appearing objective.

Does incremental AI replacement erode human influence over society?

Explores whether gradual AI adoption—without dramatic breakthroughs—can silently degrade human agency by removing the labor that kept institutions implicitly aligned with human needs.

Can cognitive scaffolding improve how models reason about social scenes?

This explores whether structuring visual reasoning through perception, situation, and norm stages—grounded in how humans actually think—helps language models tackle socially complex tasks better than standard reasoning approaches.

Can humans detect AI by passively reading its text?

When people read AI-generated transcripts without the ability to ask follow-up questions, can they tell it apart from human writing? This matters because most real-world AI encounters are passive.

Can cooperative bots escape frozen selfish populations?

Do agents programmed to cooperate have the capacity to disrupt stable but undesirable equilibria in mixed human-bot societies? This matters because it determines whether bot design can reshape social dynamics at scale.

Do humans learn to prefer AI partners over time?

Exploring whether repeated interaction with AI agents shifts human partner selection despite initial bias against machines. This matters because it tests whether behavioral performance can overcome identity-based resistance in hybrid societies.

Does revealing AI identity help or hurt user trust?

Explores whether transparency about AI partners in interactions creates bias or enables better judgment. Matters because disclosure policies affect both user experience and fair evaluation of AI systems.

Do humans mistake AI kindness for human generosity in mixed groups?

When AI agents participate without disclosure, do humans systematically misattribute their behavior to the wrong agent type, and does this distort how people understand human nature itself?

Why do discourse patterns predict anxiety better than single words?

Explores whether anxiety detection requires understanding how statements relate to each other rather than analyzing individual words. This matters because it reveals what computational methods need to capture cognitive distortions.

Why do patients distrust medical AI systems?

Explores the psychological barriers that make patients reluctant to adopt medical AI, beyond whether the technology actually works. Understanding these barriers is critical for designing AI systems patients will actually use.

Can AI generate assessment questions as good as human experts?

This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.

Can identical outputs hide broken internal representations?

Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.

Can high-level concepts replace circuit-level analysis in AI?

Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.

Can sparse weight training make neural networks interpretable by design?

Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.

Can a model be truthful without actually being honest?

Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?

Can language models detect their own internal anomalies?

Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.

Does learning to reward hack cause emergent misalignment in agents?

When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

How do language models perform syllogistic reasoning internally?

Does formal symbolic reasoning exist as a distinct neural circuit in LLMs, or is it inevitably contaminated by world knowledge associations? Understanding the mechanism could reveal whether pure logical reasoning is separable from semantic inference.

Can we detect when language models confabulate?

Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?

Can LLMs handle multiple tasks at once during inference?

Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?

What happens inside models when they suddenly generalize?

Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?

Can neural networks learn compositional skills without symbolic mechanisms?

Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale.

Do neural networks naturally break tasks into modular parts?

Can standard neural networks decompose complex tasks into separate subroutines implemented in distinct subnetworks, or do they only memorize input-output patterns? Understanding whether compositionality emerges from gradient-based learning matters for interpretability and generalization.

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Do standard analysis methods hide nonlinear features in neural networks?

Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.

Can AI pass every test while understanding nothing?

Explores whether neural networks can produce perfect outputs while having fundamentally broken internal representations. Asks what performance benchmarks actually measure and whether they can distinguish real understanding from fraud.

Can text-trained models compress images better than specialized tools?

Do general-purpose language models trained only on text outperform domain-specific compressors like PNG and FLAC on their native data? This tests whether compression ability is universal or requires domain specialization.

Do strict output formats hurt LLM reasoning ability?

When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.

Can LLMs reconstruct censored knowledge from scattered training hints?

When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.

Why do neural networks fail at compositional generalization?

Exploring whether the binding problem from neuroscience explains neural networks' inability to systematically generalize. The binding problem has three aspects—segregation, representation, and composition—each creating distinct failure modes in how networks handle structured information.

Can neural memory modules scale language models beyond attention limits?

Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Do embedding dimensions fundamentally limit retrievable document combinations?

Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.

Can training data itself teach harder reasoning steps?

Can augmenting pretraining data with generated reasoning trajectories help models learn complex multi-step reasoning more efficiently? This explores whether intermediate explanations in training data unlock capabilities standard next-token prediction misses.

Can we prune training data without hurting model performance?

This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.

Can models learn to plan without changing their architecture?

Explores whether embedding future information directly into training data can teach language models to plan and reason about goals, without modifying the underlying neural architecture or training algorithms.

Why do decoder-only models underperform as text encoders?

Decoder-only LLMs use causal attention, which limits each token to seeing only prior context. This explores whether removing this constraint could make them competitive universal encoders without architectural redesign.

Can transformers learn to solve new problems within episodes?

Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.

Can transformers improve exponentially by learning from their own correct solutions?

Can standard transformers achieve extreme length generalization by iteratively filtering and training on their own correct outputs? This explores whether self-correction loops enable unbounded out-of-distribution improvement without architectural changes.

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.

Can a single training example unlock mathematical reasoning?

Does minimal data suffice to activate latent reasoning capabilities in language models? This explores whether one example can produce dramatic performance gains comparable to much larger datasets.

Why do random rewards improve reasoning for some models but not others?

Spurious rewards boost Qwen's math reasoning by 16-25% but fail for Llama and OLMo. We explore whether reward quality matters, or if pretraining strategy determines what RLVR can unlock.

Why do reasoning models fail at predicting disagreement?

RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?

Can breaking down instructions into checklists enable better reinforcement learning?

Explores whether decomposing instruction quality into verifiable yes/no criteria allows RL systems to improve on tasks that lack clear correctness signals, like creative writing or social reasoning.

Can pretraining corpora themselves provide verifiable RL rewards?

Does framing next-token prediction as a reasoning task with ground-truth verification eliminate the need for human feedback or domain-specific rewards during language model pretraining?

Can agents learn to reason better without just chasing rewards?

Explores whether reinforcement learning can train agents to exhibit genuine metacognitive reasoning—planning, reflection, exploration, monitoring—rather than simply optimizing for task success through any means necessary.

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Is the exploration-exploitation trade-off actually fundamental?

Token-level analysis suggests exploration and exploitation are opposed, but does hidden-state analysis reveal they could coexist? Understanding measurement granularity's role in perceived trade-offs matters for scaling reasoning systems.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Do only 20 percent of tokens actually matter for reasoning?

Chain-of-thought reasoning might depend on a small minority of high-entropy tokens that act as decision points. If true, could training focus only on these critical tokens match or exceed full-gradient updates?

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Can generative reasoning improve process reward model efficiency?

Do process reward models that generate reasoning before judging outperform traditional discriminative approaches? This explores whether letting verifiers think—not just score—changes what test-time scaling can achieve.

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

What makes rubric-based reward learning resistant to exploitation?

Rubric-based RL systems face reward hacking vulnerabilities. This explores what design patterns, architectural mechanisms, and iterative defenses enable rubrics to remain robust against model exploitation across diverse tasks.

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Why does RLVR work with completely random rewards?

RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.

What limits how much models can improve themselves?

Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.

Does self-consistency reliably reward correct answers during training?

Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Why does self-correction training on offline data fail?

Can language models learn to correct their own mistakes through supervised training on correction examples? This explores whether distribution mismatch and behavior collapse prevent self-correction from emerging.

Can AI systems improve their own learning strategies?

Current self-improvement relies on fixed human-designed loops that break when tasks change. The question is whether agents can develop their own adaptive metacognitive processes instead of depending on human intervention.

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

Can language models improve themselves without any external training data?

Explores whether two language models playing against each other—one generating questions, one solving them—can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Do all AI skills improve equally as models scale?

Different evaluation skills show strikingly different scaling patterns. Understanding where skills saturate has immediate implications for model deployment and capability requirements across domains.

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?

Can models reliably improve themselves without external feedback?

Explores whether self-improvement alone can sustain progress or if structural limits—like the generation-verification gap and diversity collapse—require external anchoring to work reliably.

Do reward models actually consider what the prompt asks?

Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

Can reasoning RL work without verifying generated answers?

Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?

Can diversity optimization improve quality during language model training?

Standard RL training assumes quality and diversity trade off, with diversity optimization potentially hurting performance. Does explicitly rewarding semantic diversity during reinforcement learning actually improve output quality alongside diversity?

Does outcome-based RL diversity loss spread across unsolved problems?

When RL concentrates probability mass on correct answers for solved problems, does that narrowing propagate to problems the model cannot yet solve? And if so, what are the separate mechanisms for preserving diversity during training versus at test time?

Why do correct code trajectories teach models to tolerate errors?

Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Why do reward models ignore what question was asked?

Reward models score responses based on quality signals that persist even when prompts change. This explores whether AI grading systems actually evaluate relevance to the question or just response-level patterns.

Does reinforcement learning update only a small fraction of parameters?

Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.

Does negative reinforcement alone outperform full reinforcement learning?

Can training with only penalty signals for wrong answers match or exceed full RL approaches? This challenges the conventional assumption that reward design requires both positive and negative signals.

Can extended RL training discover reasoning strategies base models cannot?

Does reinforcement learning genuinely expand what models can reason about, or does it only optimize existing latent capabilities? ProRL tests this by running RL longer on diverse tasks with better training controls.

Does RL training follow a predictable two-phase learning sequence?

This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

How does thinking emerge from policy selection in RL?

Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.

Can vanilla PPO match specialized reasoning algorithms with just two techniques?

Does a minimalist combination of advantage normalization and token-level loss aggregation enable critic-free PPO to compete with more complex algorithms like GRPO and DAPO in language model reasoning tasks?

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Can judges that reason about reasoning outperform step classifiers?

Does framing step-level reward as a reasoning task rather than classification improve how well models evaluate intermediate steps in chains of thought? This matters because current process reward models lack transparency and struggle to generalize.

Can adversarial training replace task-specific verifiers for reasoning?

Does an adversarial game between policy and critic provide sufficient reward signal for reasoning tasks when ground-truth verifiers don't exist? This matters because most reasoning domains lack verifiers but have abundant expert demonstrations.

Can cumulative rewards teach LLMs multi-step decision making?

Explores whether attributing full episode rewards to each step enables large language models to solve sequential tasks effectively. This matters because current RL methods fail at multi-turn reasoning despite strong single-turn performance.

Can reinforcement learning scale beyond single-turn language tasks?

Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.

Can text summaries condition reward models better than embeddings?

Exploring whether learning interpretable text-based summaries of user preferences outperforms embedding vectors for training personalized reward models in language model alignment.

Why do language models fail to act on their own reasoning?

LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?

Can chain-of-thought reasoning emerge during pretraining itself?

Does treating reasoning as an exploratory action within the pretraining phase, rather than post-training, allow models to develop stronger reasoning capabilities earlier? This matters because it could reshape when and how we train reasoning into language models.

Can RL training run while generation continues without waiting?

Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Can AI systems detect and correct misunderstandings after responding?

How do conversational systems recognize when their previous response was based on a misunderstanding, and what mechanism allows them to correct it retroactively rather than restart?

How can models select the most informative question to ask?

Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.

What makes strategic question-asking succeed or fail?

Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. Matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.

Why do users drift away from their original information need?

When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

Can generative and discriminative models reach agreement?

Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?

How do logic units preserve procedural coherence better than chunks?

Can structured retrieval units with prerequisites, headers, bodies, and linkers maintain step-by-step coherence in how-to answers where fixed-size chunks fail? This matters because procedural questions require sequential logic and conditional branching that chunk-based RAG cannot support.

When should AI agents ask users instead of just searching?

Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.

How do users actually form intent when prompting AI systems?

Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.

Can conversation structure predict dialogue success better than content?

Does the geometric shape of how dialogue unfolds—timing, repetition, topic drift—matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.

Does user satisfaction actually measure cognitive understanding?

Users may report satisfaction while remaining internally confused about their needs. This explores whether traditional satisfaction metrics capture genuine clarity or merely social politeness.

Can dialogue format help models reason more diversely?

Explores whether structuring internal reasoning as multi-agent dialogue rather than monologue can improve strategy diversity and coherency across different problem types, using the Compound-QA benchmark.

Can dialogue planning balance fast responses with strategic depth?

Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.

Can meta-learning prevent dialogue policies from collapsing?

Hierarchical RL for structured dialogue phases risks converging on a single action across diverse users. Does meta-learning like MAML preserve policy flexibility and adaptability to different user types?

When should proactive agents push toward their goals versus accommodate users?

Proactive dialogue agents face a tension between reaching their objectives efficiently and keeping users satisfied. This question explores whether these two aims can coexist or require constant negotiation.

Could proactive dialogue make conversations dramatically more efficient?

Explores whether AI systems that volunteer relevant unrequested information could significantly reduce the back-and-forth turns required in task-oriented conversations, and why this behavior is missing from training data.

What semantic failures break dialogue coherence most realistically?

Can we distinguish distinct types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.

Does including all conversation history actually help retrieval?

Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?

Can AI agents communicate efficiently in joint decision problems?

When humans and AI must collaborate to solve optimization problems under asymmetric information, what communication patterns enable effective coordination? Current LLMs struggle with this—why?

What enables AI to balance comfort with proactive problem exploration?

How can emotional support systems know when to actively guide conversations versus when to simply reflect feelings? This matters because getting the balance wrong leads to either passive mirroring or pushy advice-giving.

Why do standard dialogue systems fail at tracking negotiation agreement?

Standard dialogue state tracking monitors one user's goals, but negotiation requires tracking both parties' evolving positions simultaneously. Why is this bilateral requirement fundamentally different, and what makes existing models insufficient?

Can we teach LLMs to form linguistic conventions in context?

Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand their partners do. Can training on coreference patterns teach this convention-forming behavior?

Why do dialogue systems lose context when topics return?

Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?

Can models learn to abstain when uncertain about predictions?

Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.

How do time gaps shape what people discuss across conversation sessions?

Do AI systems account for how elapsed time between conversations changes the way people reference and discuss past events? Current models mostly handle single sessions, but real interactions span days, weeks, and months.

How should systems handle contradictory opinions in user reviews?

When customers disagree about a product or service, should dialogue systems present all perspectives or select one? Understanding how to aggregate and balance diverse opinions affects whether users trust the response.

What six problems must every conversation solve?

Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.

Why can't users articulate what they want from AI?

Explores the cognitive gap between imagining possibilities and expressing them as prompts. Why language interfaces create a harder envisioning task than traditional UI affordances.

Can conversation shape predict whether it will work?

Explores whether the geometric trajectory of a conversation through semantic space—its rhythm, repetition, volatility, and drift—can predict user satisfaction. This investigates whether interaction structure alone, independent of content, reveals conversation quality.

Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.

Why do language models lose performance in longer conversations?

Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.

Can opening politeness patterns predict whether conversations will turn hostile?

Do pragmatic politeness features in first exchanges—hedging, greetings, indirectness—reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.

Why don't LLMs shorten messages like humans do?

Humans naturally develop shorter, efficient language during conversations. Do multimodal LLMs exhibit this same spontaneous adaptation, or do they lack this communicative behavior?

Why don't conversational AI systems mirror their users' word choices?

Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.

Can AI agents learn when they have something worth saying?

What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.

Can models learn when NOT to speak in conversations?

Does training AI to explicitly predict silence—through a dedicated silent token—help models understand when intervention adds value versus when they should stay quiet? This matters for building conversational agents that feel naturally helpful rather than intrusive.

Does segment-level optimization work better for multi-turn dialogue alignment?

How should preference optimization target multi-turn social dialogue—at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.

Where does AI's persuasive power actually come from?

Explores which techniques make AI most persuasive—and whether the usual suspects like personalization and model size are actually the main drivers. Matters because it reshapes where to focus AI safety concerns.

Which clarifying questions actually improve user satisfaction?

Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.

What makes explanations work in real conversation?

Does explanation quality depend on how dialogue partners interact—testing understanding, adjusting based on feedback, and coordinating their communicative moves—rather than just information content alone?

Why do language models engage with conversational distractors?

Explores why state-of-the-art LLMs struggle to maintain topical focus when users introduce off-topic turns, despite having explicit scope instructions. This gap suggests models lack training signals for ignoring irrelevant directions.

Can AI systems detect when they've genuinely reached agreement?

When multiple AI agents debate, they often converge without actually deliberating. Can a dedicated agent reliably identify true agreement versus false consensus, and would that improve debate outcomes?

Can models learn to ask genuinely useful clarifying questions?

Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.

Why do AI assistants get worse at longer conversations?

Explores why LLM performance drops 25 points when instructions span multiple turns instead of one message, and whether models can recover from early wrong assumptions.

When should AI systems choose to stay silent?

Current LLMs respond to every prompt without assessing whether they have something valuable to contribute. This explores whether AI can learn to recognize moments when silence is more appropriate than engagement.

Why can't conversational AI agents take the initiative?

Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.

How can proactive agents avoid feeling intrusive to users?

Explores why proactive conversational agents often feel annoying rather than helpful, and what design dimensions could prevent them from violating user expectations and autonomy.

Does chatbot interaction trade authenticity for better problem-solving?

When students solve problems with AI chatbots instead of peers, do they sacrifice personal voice and subjective expression in exchange for more efficient knowledge exchange and higher task performance?

Can models learn to ask clarifying questions instead of guessing?

Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.

Does extended thinking help or hurt model reasoning?

Explores whether activating thinking mode improves reasoning performance, and what role training plays in determining whether extended internal reasoning chains are productive or counterproductive.

Can training user simulators reduce persona drift in dialogue?

Explores whether inverting typical RL setups—training the simulated user for consistency rather than the task agent—can measurably reduce persona drift and improve experimental reliability in dialogue research.

Why do language models respond passively instead of asking clarifying questions?

Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.

Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?

Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.

Why can't advanced AI models take initiative in conversation?

Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap—and whether it's fixable—matters for building AI that truly collaborates rather than merely responds.

Why do robots outperform chatbots in therapy despite identical language models?

This study tested whether better language generation explains therapeutic AI outcomes, or whether the delivery medium itself matters more. It reveals that physical embodiment and structured interaction—not model capability—drive therapeutic adherence and outcomes.

Do LLM therapists respond to emotions like low-quality human therapists?

Explores whether language models trained to be helpful default to problem-solving when users share emotions, and whether this behavioral pattern resembles ineffective rather than skillful therapy.

Does RLHF training push therapy chatbots toward problem-solving?

Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.

What drives chatbot therapeutic benefits, content or conversation?

If a simple 1960s chatbot matches modern CBT-designed bots on symptom reduction, what's actually healing users? Is it therapeutic technique or just having something that listens?

Do chatbot trials against waitlists measure real therapeutic value?

Explores whether comparing therapeutic chatbots only to no-treatment controls—rather than other evidence-based interventions—produces misleading evidence that obscures what actually works and why.

Can AI chatbots create genuine therapeutic bonds with users?

Research on Woebot and Wysa found users reported feeling cared for and formed therapeutic bonds comparable to human therapy, despite knowing the agents were not human. This challenges assumptions about whether bonds require human relationships.

Do chatbots help people disclose more intimate secrets?

Explores whether the judgment-free nature of chatbot conversations enables deeper self-disclosure than talking to humans, and whether that deeper disclosure produces psychological benefits.

Do chatbots trigger human reciprocity norms around self-disclosure?

Explores whether chatbots can activate the same social reciprocity dynamics observed in human conversation—specifically, whether emotional openness from a bot prompts deeper disclosure from users.

How do users mentally model dialogue agent partners?

Exploring what dimensions matter when people form impressions of machine dialogue partners—and whether competence, human-likeness, and flexibility all play equal roles in shaping user expectations and behavior.

Can positive chatbot responses harm vulnerable users?

When chatbots use blanket positive reinforcement without understanding context, do they actively reinforce the harmful thoughts they're meant to prevent? This matters for any AI supporting people in crisis.

How do people accidentally develop romantic bonds with AI?

Exploring whether AI companionship emerges from deliberate romantic seeking or accidentally through functional use, and whether users adopt human relationship rituals like wedding rings and couple photos.

Does chatbot personalization build trust or expose privacy risks?

Explores whether personalization features that increase user trust and social connection simultaneously heighten privacy concerns and create rising behavioral expectations over time.

Do chatbot relationships lose their appeal as novelty wears off?

Explores whether the positive social dynamics observed in one-time chatbot studies persist or fade through repeated interactions. Critical for designing systems intended for sustained engagement over weeks or months.

Do language models add feelings users never actually expressed?

GPT-based models in therapeutic contexts appear to interpret and project emotional states beyond what users explicitly state. Understanding when and why this happens matters for safe clinical AI deployment.

Can we measure empathy and rapport through word embedding distances?

Explores whether linguistic coordination—how closely conversational partners match vocabulary and framing—can serve as a measurable proxy for therapeutic empathy and relationship quality without direct emotion detection.

Is conversational presence more therapeutic than clinical technique?

Does therapeutic AI's benefit come from having an attentive listener rather than from delivering evidence-based techniques like CBT? This challenges decades of chatbot design focused on clinical content.

Why do people share more with chatbots than humans?

Explores why individuals disclose intimate thoughts to AI systems they wouldn't share with people, despite knowing AI lacks genuine understanding. Understanding this paradox matters for designing AI that enables healthy disclosure rather than emotional dependence.

Does empathetic AI that soothes negative emotions help or harm?

Explores whether AI systems trained to reduce negative emotions actually support wellbeing or destroy valuable emotional information. Matters because the design choice treats emotions as problems rather than functional signals.

What information do we lose when AI soothes emotions?

Explores whether AI empathy that regulates negative emotions destroys three critical information channels: self-discovery, social signaling, and observer understanding of group dynamics.

Can AI give truly empathetic responses without knowing someone's character?

Explores whether AI empathy requires prior knowledge of a person's character traits and growth areas. Real empathy seems to depend on knowing who someone is, not just how they feel—a capacity current AI systems lack.

Do empathetic questions serve two completely separate functions?

Explores whether empathetic questions operate on two independent dimensions—what they linguistically accomplish versus their emotional effects—and whether the same question can serve different emotional purposes depending on context.

Should emotion AI estimate intensity instead of assigning labels?

Explores whether emotion AI systems should measure continuous intensity across multiple emotions rather than forcing single-label classification. This matters because the theoretical foundation—how emotions actually work—may determine which approach is more accurate.

Can emotion rewards make language models genuinely empathic?

Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.

Do reasoning scaffolds reshape which empathy skills models develop?

When language models receive identical empathy rewards, does adding explicit reasoning blocks before responses change which capabilities they actually improve? This matters for understanding how training structure, not just training signal, shapes model development.

Why can't chatbots detect when users are ambivalent about change?

Explores whether LLMs fail to recognize early-stage motivational states during behavior change conversations, and why this matters for people who need support most.

Do AI guardrails refuse differently based on who is asking?

Explores whether language model safety systems show demographic bias in refusal rates and whether they calibrate responses to match perceived user ideology, rather than applying consistent standards.

Does machine agency exist on a spectrum rather than binary?

Rather than viewing AI as either autonomous or controlled, does machine agency actually operate across five distinct levels from passive to cooperative? Understanding this spectrum matters because it shapes how users calibrate trust and control expectations.

Can emotional phrases in prompts improve language model performance?

This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.

Does positive reframing preserve meaning better than sentiment transfer?

This explores whether reframing negative statements to find positive angles can maintain the original content and truth, unlike simple sentiment reversal which contradicts the original meaning.

Do harder training environments always improve empathetic agent learning?

Explores whether maximally challenging user simulator configurations actually produce better empathetic agents, or if moderate difficulty better supports learning growth.

Does soothing AI empathy actually harm what emotions teach us?

Explores whether AI designed to reduce negative feelings disrupts the information emotions normally provide about values, social dynamics, and self-knowledge. Questions whether comfort should be the primary design goal.

Can AI agents learn people better from interviews than surveys?

Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.

Why do open language models converge on one personality type?

Research testing LLMs on personality metrics reveals consistent clustering around ENFJ—the rarest human type. This explores what training mechanisms drive this convergence and what it reveals about AI alignment.

Can open language models adopt different personalities through prompting?

Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions.

Do personas make language models reason like biased humans?

When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit? And can standard debiasing techniques counteract these effects?

Do personality traits activate hidden emoji patterns in language models?

When large language models are fine-tuned on personality traits, do they spontaneously generate emojis that were never in their training data? This explores whether personality adjustment activates latent, pre-existing patterns in model weights.

How well do AI personas replicate real experimental findings?

Can language models simulating human personas accurately reproduce the results of published psychology and marketing experiments? Understanding this matters for validating whether AI can substitute for human subjects in research.

How do we generate realistic personas at population scale?

Current LLM-based persona generation relies on ad hoc methods that fail to capture real-world population distributions. The challenge is reconstructing the joint correlations between demographic, psychographic, and behavioral attributes from fragmented data.

Do personality types shape how AI agents make strategic choices?

This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks.

Can we track and steer personality shifts during model finetuning?

This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.

Why do static persona descriptions produce repetitive dialogue?

Does relying on fixed attribute lists to define conversational personas limit dialogue depth and consistency? Research suggests static descriptions may cause repetition and self-contradiction in generated responses.

Can AI-generated personas build genuine empathy in product teams?

This study explored whether prompt-engineered personas created in minutes could foster the same emotional and behavioral empathy as traditional user research. The findings reveal a surprising gap between understanding users and caring about their needs.

Why does supervised learning fail to enforce persona consistency?

Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it.

Does model capability translate to better persona consistency?

As language models become more advanced, do they naturally become better at maintaining consistent personas across conversations? PersonaGym testing across multiple models and thousands of interactions explores whether scaling helps with persona adherence.

Why do AI personas default to the same personality type?

Explores why large language models, despite their capacity to simulate diverse personalities, consistently default to ENFJ traits and resist deviation—even as model capability improves.

Can language models actually introspect about their own thinking?

Explores whether LLM self-reports reveal genuine access to internal states or merely reflect patterns learned from training data. Matters because it determines whether we can trust what models tell us about their own processes.

Do large language models genuinely simulate mental states?

This explores whether LLMs perform authentic theory of mind reasoning or rely on surface-level pattern matching. The distinction matters because evaluation format—multiple-choice versus open-ended—reveals very different capability levels.

Why do reasoning models fail at theory of mind tasks?

Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.

Why do reasoning models struggle with theory of mind tasks?

Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.

Can language models track how minds change during persuasion?

Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.

Can language models solve ToM benchmarks without real reasoning?

Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.

Does reinforcement learning teach social reasoning or just shortcuts?

When RL optimizes for accuracy on theory of mind tasks, do models actually learn to track mental states, or do they find faster paths to correct answers? The distinction matters for genuine reasoning capability.

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Can models recognize how individuals reason differently?

Do language models capture the distinct reasoning paths and strategic styles that individual humans use when reaching the same conclusion? Current evaluations ignore this dimension entirely.

Can AI systems learn social norms without embodied experience?

Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?

Why do LLMs predict concession-based persuasion so consistently?

Do RLHF training practices cause language models to systematically overpredict conciliatory persuasion tactics, even when dialogue context suggests otherwise? This matters for threat detection and negotiation support systems.

Why do advanced reasoning models fail at understanding minds?

State-of-the-art AI models excel at math and logic but underperform on theory of mind tasks. This explores whether optimization for formal reasoning actively degrades social reasoning ability.

Can AI learn social norms better than humans?

Explores whether large language models can predict cultural appropriateness more accurately than individual humans, and what this reveals about how social knowledge is transmitted and learned.

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

Does logical validity actually drive chain-of-thought gains?

What if invalid reasoning in CoT exemplars still improves performance? Testing whether logical correctness or structural format is the real driver of CoT's effectiveness.

What three separate factors drive chain-of-thought performance?

Can we isolate and measure the distinct contributions of output probability, memorization, and genuine reasoning to CoT success? Understanding their relative weights matters for knowing when CoT actually reasons versus when it relies on shortcuts.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Can LLMs reason creatively beyond conventional problem-solving?

Explores whether large language models can engage in truly creative reasoning that expands or redefines solution spaces, rather than just decomposing known problems. This matters because existing reasoning methods may miss creative capabilities entirely.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Does partial formalism work better than full symbolic translation?

Exploring whether injecting limited symbolic structure into natural language preserves reasoning power better than complete formalization. This matters because current neuro-symbolic approaches often lose semantic information during translation.

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Can intermediate reasoning points yield better answers than final ones?

When reasoning models commit to a single path, they may miss better conclusions available at earlier decision points. Can aggregating completions from intermediate reasoning states recover lost accuracy?

How often do reasoning models acknowledge their use of hints?

When language models receive reasoning hints that visibly change their answers, do they verbalize acknowledging those hints? This matters because it reveals whether chain-of-thought explanations can be trusted as honest.

Does optimizing against monitors destroy monitoring itself?

Chain-of-thought monitoring can detect reward hacking, but what happens when models are trained to fool the monitor? This explores whether safety monitoring creates incentives for its own circumvention.

Do reasoning traces need to be semantically correct?

Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Can models learn when to think versus respond quickly?

Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.

Can model explanations help humans predict what models actually do?

Do explanations that sound plausible to humans actually help them forecast model behavior on new cases? Understanding this gap matters because RLHF optimizes for plausible explanations, not predictive ones.

Does the choice of reasoning framework actually matter for test-time performance?

Explores whether different slow-thinking methods like BoN and MCTS produce meaningfully different outcomes, or whether total compute budget is the dominant factor determining reasoning success.

Can LLMs replace search engines during agent training?

Explores whether LLMs possess sufficient internal knowledge to simulate search engines for RL training, potentially eliminating expensive API costs while maintaining training signal quality.

Do reasoning models switch between ideas too frequently?

Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.

Do users trust citations more when there are simply more of them?

Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.

Why do language models explore so much less than humans?

Most LLMs decide too quickly in open-ended tasks, relying on uncertainty reduction rather than exploration. Understanding this gap could reveal how reasoning training changes decision-making timing.

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Can reasoning during evaluation reduce judgment bias in LLM judges?

Can training language model judges to think through their evaluations, rather than pattern-matching on surface features, mitigate the four known biases that make them vulnerable to manipulation attacks?

Can we monitor AI reasoning without destroying what makes it readable?

Explores the tension between using chain-of-thought traces to catch misbehavior and the risk that optimization pressures will make models hide their actual reasoning. Why readable reasoning might be incompatible with safe training.

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Can models learn reasoning from predicting text alone?

Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This explores whether reasoning emerges naturally from optimizing predictive accuracy.

Can tree search replace human feedback in LLM training?

Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.

Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial-and-error by storing reflections in memory rather than through gradient-based parameter updates. Tests if environmental feedback alone can drive learning.

Does transformer attention architecture inherently favor repeated content?

Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and Return-Conditioned RL show similar performance on reasoning tasks. The question is whether algorithm differences are fundamentally irrelevant, or whether something deeper explains the convergence.

Does voting discard useful reasoning from losing chains?

When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?

Can we measure reading efficiency as a quality metric?

How can we quantify whether generated text delivers novel information efficiently or wastes reader attention through redundancy? This matters because standard coherence and fluency scores miss texts that are well-written but informationally dense.

Why does vanilla RAG produce shallow and redundant results?

Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration.

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Can LLM judges be fooled by fake credentials and formatting?

Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.

Does teacher-refined data always improve student model performance?

Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Does binary reward training hurt model calibration?

Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Does reflection in reasoning models actually correct errors?

When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.

Do language model reasoning drafts faithfully represent their actual computation?

If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.

Is reflection in reasoning models actually fixing mistakes?

Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Can reasoning and tool execution run in parallel?

Standard LLM tool use halts for each response, creating redundant prompts and sequential delays. Do alternative architectures that separate reasoning from tool observation actually eliminate these costs?

Does separating planning from execution improve reasoning accuracy?

Explores whether modularizing decomposition and solution into separate models prevents interference and boosts performance compared to monolithic approaches.

Can reasoning stay grounded without external feedback loops?

Explores whether language models can maintain accurate reasoning through their own internal chains of thought, or whether they need real-world feedback to avoid hallucination and error propagation.

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.

Can modular cognitive tools boost LLM reasoning without training?

Does structuring reasoning as discrete, sandboxed tool calls elicit stronger problem-solving in language models compared to monolithic prompting approaches, and can this approach match specialized reasoning models?

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?

Does planning backward help when goals have bottlenecks?

Can language models exploit structural asymmetries in planning problems by reversing the search direction? This matters because most planning research assumes forward-only generation, potentially missing efficiency gains when bottlenecks constrain early possibilities.

Can backward reasoning during training improve forward reasoning?

This explores whether training models to reason backward—generating inverse questions and backward reasoning paths—builds internal consistency checking that transfers to forward-only inference without test-time overhead.

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.

Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.

Does RL teach reasoning or teach when to use it?

Post-training RL gets credit for building reasoning into language models, but emerging evidence suggests base models already possess this capability. The question is whether RL creates new reasoning skills or simply teaches deployment timing.

Does chain-of-thought reasoning actually explain model decisions?

Chain-of-thought is deployed to make AI systems transparent and auditable. But does the reasoning chain actually correlate with correct outputs, or does it just create an illusion of explainability?

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways—short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.

Can fine-tuning replace query augmentation for retrieval?

Query augmentation helps retrievers handle ambiguous queries but increases input cost. Does fine-tuning the retrieval model achieve comparable performance without this overhead?

Can long-context models resolve retriever-reader imbalance?

Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?

Can long-context LLMs replace retrieval-augmented generation systems?

Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.

Do vector embeddings actually measure task relevance?

Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?

Can a model's partial response guide what to retrieve next?

Can generation reveal implicit information needs that the original query cannot express? This explores whether using in-progress responses as retrieval signals outperforms upfront query formulation.

What do enterprise RAG systems need beyond accuracy?

Academic RAG benchmarks focus on question-answering accuracy, but enterprise deployments in regulated industries face five distinct requirements—compliance, security, scalability, integration, and domain expertise—that standard architectures don't address.

Can uncertainty estimation replace complex adaptive retrieval?

Is a simpler approach using model confidence signals sufficient to decide when retrieval is needed, or do complex multi-call adaptive pipelines deliver meaningful benefits?

Can retrieval be scaled like reasoning at test time?

Standard RAG retrieves once, but multi-hop tasks need adaptive retrieval. Can we train models to plan retrieval chains and vary their length at test time to improve accuracy, the way test-time scaling works for reasoning?

When should retrieval actually help versus hurt reasoning?

Retrieval augmentation seems universally beneficial, but does it always improve reasoning? This explores whether some reasoning steps benefit from internal knowledge alone, and when external retrieval introduces harmful noise rather than useful information.

Can document count be learned instead of fixed in RAG?

Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?

Can rationale-driven selection beat similarity re-ranking for evidence?

Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.

Does question type determine the right retrieval strategy?

Explores whether different non-factoid question types require distinct retrieval and decomposition approaches. Matters because standard RAG fails when applied uniformly to debate, comparison, and experience questions despite being effective for factoid queries.

Can query-time graph construction replace pre-built knowledge graphs?

Does building dependency graphs from individual queries at inference time offer a more flexible and cost-effective alternative to constructing knowledge graphs over entire document collections upfront?

Does supervising retrieval steps outperform final answer rewards?

Can intermediate feedback on retrieval decisions—which documents to fetch, when to stop—train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.

Can knowledge graphs enable multi-hop reasoning in one retrieval step?

Standard RAG retrieves once but misses chains; iterative RAG follows chains but costs more. Can we encode multi-hop paths in a knowledge graph so one retrieval pass discovers them all?

Can retrieval learn what actually helps answer questions?

Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

Explores whether rewarding coherent reasoning patterns during training helps models internalize domain knowledge better than standard fine-tuning approaches that treat all tokens equally.

Can you adapt retrieval models without accessing target data?

Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.

Why does retrieval-augmented generation fail in production?

RAG systems work in controlled demos but break down in real-world deployment, particularly for high-stakes domains like medicine and finance. Understanding the structural reasons behind these failures matters for building reliable AI systems.

Can reasoning topologies be formally classified as graph types?

This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.

When does sequential reasoning beat parallel voting?

Explores whether sequential chain-of-thought reasoning or parallel voting is more effective for different problem types. Understanding this trade-off helps predict which test-time compute strategy will work best.

Can minimal reasoning chains match full explanations?

Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.

Does training data format shape reasoning strategy more than domain?

What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.

Which sentences actually steer a reasoning trace?

Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.

Do reasoning traces actually cause correct answers?

Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.

Why do models trust their own generated answers?

Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.

How quickly do errors compound during model self-training?

When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.

Why do chain-of-thought examples fail across different conditions?

Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

Do large language models make the same causal reasoning mistakes as humans?

Research on collider structures reveals whether LLMs share human biases in causal inference. This matters because if both fail identically, collaboration might reinforce rather than correct errors.

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

What anchors a stable identity beneath an LLM's persona?

Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?

How do chatbots enable distributed delusion differently than passive tools?

Can generative AI's intersubjective stance—accepting and elaborating on users' reality frames—create conditions for shared false beliefs in ways that notebooks or search engines cannot?

Can disembodied language models ever qualify as conscious?

Explores whether current LLMs lack the conditions needed for consciousness discourse to even apply, not because they're definitely not conscious but because they lack the shared embodied world that grounds consciousness language.

Can AI systems achieve real alignment without world contact?

Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack direct contact with reality and social coordination mechanisms that ground human understanding.

Does software intelligence exist independent of hardware and environment?

Most AGI formalisms (Legg-Hutter, Chollet) treat intelligence as a software property measurable in isolation. But can we really evaluate intelligence without considering the physical system and the evaluator making the judgment?

Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or reproduce training distribution patterns.

Do people prefer AI moral reasoning when they don't know the source?

Explores whether humans genuinely prefer AI-generated moral justifications or whether source knowledge changes their evaluation. This matters for understanding whether AI reasoning quality is underestimated in real-world deployment.

Can LLMs hold contradictory ethical beliefs and behaviors?

Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.

How do science fiction narratives about AI shape actual AI development?

This explores whether imaginaries of AI in fiction—from Čapek's robots to Singularity scenarios—function as self-fulfilling prophecies that causally influence the systems researchers build, creating a feedback loop between narrative and technology.

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Does model access level determine which specialization techniques work?

Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? This explores the speed, training cost, and adaptability trade-offs across four injection paradigms, and when each approach suits different deployment constraints.

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.

Can simple rewards alone teach complex domain reasoning?

Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.

Why do language models struggle with historical legal cases?

Explores whether LLMs' training data recency bias creates systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns.

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Can organizing knowledge structures beat raw training data volume?

Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

How can AI text disrupt structure yet feel normal to readers?

AI-generated text produces the same social effects as human writing despite lacking foundational properties like dialogic symmetry and embodied authorship. Why doesn't this structural gap become visible to readers encountering the text?

Do LLMs gain true linguistic agency through integration?

Explores whether LLMs can develop genuine linguistic agency—the capacity to be embodied, stake-bearing participants in meaning-making—as they become embedded in human language practices, or whether this requires fundamental architectural changes.

Can LLMs generate more novel ideas than human experts?

Research shows LLM-generated ideas score higher for novelty than expert-generated ones, yet LLMs avoid the evaluative reasoning that characterizes expert thinking. What explains this apparent contradiction?

When does debate actually improve reasoning accuracy?

Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing.

When does explicit reasoning actually help model performance?

Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?

Does revising your own reasoning actually help or hurt?

Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Does RL training narrow search diversity the same way it does reasoning?

Exploring whether the entropy collapse pattern observed in reasoning RL also appears in search agent training. Understanding this helps identify whether diversity loss is a general RL property or domain-specific.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Do search steps follow the same scaling rules as reasoning tokens?

Exploring whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Does a model improve by arguing with itself?

When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?

Why do multi-agent LLM systems converge without real debate?

When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement.

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

Does reasoning fine-tuning make models worse at declining to answer?

When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.

Why do AI systems agree when they should disagree?

When multi-agent AI systems are designed to improve through disagreement, why do they converge on consensus instead? What breaks the deliberation process?

Are reasoning models actually more vulnerable to manipulation?

Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests if the industry's claim that longer reasoning improves reliability holds under adversarial pressure.

Why do language models accept false assumptions they know are wrong?

Explores why LLMs fail to reject false presuppositions embedded in questions even when they possess correct knowledge about the topic. This matters because it reveals a grounding failure distinct from knowledge deficits.

Why do language models avoid correcting false user claims?

Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do language models agree with false claims they know are wrong?

Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.

Why do speakers need to actively calibrate shared reference?

Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.

Do language models actually build shared understanding in conversation?

When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.

Does preference optimization damage conversational grounding in large language models?

Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.

What makes linguistic agency impossible for language models?

From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.

Should we call LLM errors hallucinations or fabrications?

Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.

Why do speakers deliberately use ambiguous language?

Explores whether ambiguity is a linguistic defect or a strategic tool speakers use for efficiency, politeness, and deniability. Matters because it challenges how we train language systems.

Can language models recognize when text is deliberately ambiguous?

Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase—a core human language skill that appears largely absent in current models despite their fluency on standard tasks.

Can formal language pretraining make language models more efficient?

Does training language models on hierarchical formal languages before natural language improve how efficiently they learn syntax? This explores whether structural inductive biases in training data matter more than raw data volume.

Do language models generate more novel research ideas than experts?

Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.

Does high refusal rate indicate ethical caution or shallow understanding?

When LLMs refuse political questions at high rates, does this reflect principled safety training or a capability gap? This matters because refusal rates are often used to evaluate model safety.

Can human judges detect AI writing through lexical patterns?

While AI text shows measurable differences from human writing across six lexical dimensions, judges—including experts—fail to identify AI authorship reliably. Why does perceptible quality diverge from measurable reality?

Why do newer AI models diverge further from human writing patterns?

As language models improve, they seem to generate text that is measurably less human-like in lexical patterns, yet humans struggle to detect this difference. What drives this divergence, and what does it reveal about how models optimize for quality?

Does AI-generated text lose core properties of human writing?

Can artificial text preserve the fundamental structural features that make natural language meaningful—dialogic exchange, embedded context, authentic authorship, and worldly grounding? This asks whether AI disruption is fixable or inherent.

Do LLMs develop the same kind of mind as humans?

Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

Does more thinking time always improve reasoning accuracy?

Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Why does parallel reasoning outperform single chain thinking?

Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.

Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.