Why does speech need different dialogue management than text?
Speech input arrives with ASR word error rates of roughly 15–30%, a noise level that text systems rarely face. Does this fundamental noise floor require rethinking how dialogue systems track uncertainty and make decisions?
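As a concrete illustration, here is a minimal sketch of confidence-weighted belief tracking over an ASR n-best list. The intents, hypotheses, and probabilities are invented for the example, and `toy_intent_model` is a keyword stand-in for a real classifier.

```python
# Minimal sketch: belief update over user intents from an ASR n-best list.
# All intents, hypotheses, and probabilities here are illustrative.

from collections import defaultdict

def update_belief(belief, n_best, intent_model):
    """Weight each ASR hypothesis by its confidence, then renormalize.

    belief:       dict intent -> prior probability
    n_best:       list of (transcript, asr_confidence) pairs
    intent_model: callable transcript -> dict intent -> P(intent | transcript)
    """
    posterior = defaultdict(float)
    for transcript, conf in n_best:
        for intent, p in intent_model(transcript).items():
            posterior[intent] += conf * p * belief.get(intent, 1e-6)
    total = sum(posterior.values()) or 1.0
    return {i: p / total for i, p in posterior.items()}

# Toy intent model keyed on a keyword, standing in for a real classifier.
def toy_intent_model(transcript):
    if "book" in transcript:
        return {"book_flight": 0.9, "check_status": 0.1}
    return {"book_flight": 0.2, "check_status": 0.8}

belief = {"book_flight": 0.5, "check_status": 0.5}
n_best = [("book a flight", 0.6), ("look at light", 0.4)]
belief = update_belief(belief, n_best, toy_intent_model)
# A system can now threshold on max belief before committing to an action,
# instead of trusting a single, possibly misrecognized transcript.
```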
How text-based conversations fail across multiple turns and what architectural patterns prevent degradation.
UserBench examines how often AI models fully uncover user intent across multi-turn interactions. The study finds that human communication is underspecified, incremental, and indirect, traits that demand active clarification of goals, which current models rarely attempt.
Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
Do pragmatic politeness features in first exchanges—hedging, greetings, indirectness—reliably signal whether a conversation will later derail into personal attacks? Understanding early linguistic markers could help identify and prevent online hostility.
Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?
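One way to make the question concrete: filter history turns by embedding similarity to the current query before rewriting it, so a topic switch does not drag stale context into retrieval. A rough sketch, where `embed` is a hash-based placeholder for any sentence-embedding model and the threshold is arbitrary:

```python
# Sketch: keep only history turns semantically close to the current query.
# `embed` is a placeholder; a real system would use a sentence encoder.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder pseudo-embedding so the sketch runs end to end;
    # it does NOT capture real semantics.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def select_context(history, query, threshold=0.3):
    q = embed(query)
    return [turn for turn in history if float(embed(turn) @ q) >= threshold]

history = ["any flights to Oslo in May?", "aisle seat please",
           "actually, what's the baggage fee policy?"]
context = select_context(history, "and for oversized bags?")
# `context` feeds the query rewriter instead of the full, possibly noisy history.
```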
Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?
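A minimal sketch of the structural problem: once a stack-based manager pops a resolved topic, later references have nothing to attach to. The class below is illustrative, not any system's actual implementation.

```python
# Sketch of why a stack-based topic manager loses resolved topics:
# once popped, a topic can no longer be referenced without re-opening it.

class TopicStack:
    def __init__(self):
        self._stack = []

    def push(self, topic: str):
        self._stack.append(topic)

    def resolve(self) -> str:
        # Popping on resolution is the structural rigidity in question:
        # the topic is discarded, so later references to it find nothing.
        return self._stack.pop()

    def current(self):
        return self._stack[-1] if self._stack else None

s = TopicStack()
s.push("flight booking")
s.push("seat preference")
s.resolve()                 # "seat preference" is gone
s.current()                 # back to "flight booking"
# A later "about that seat..." has no stack entry to resolve against,
# unlike a graph- or memory-based structure that keeps resolved topics addressable.
```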
When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
How should preference optimization target multi-turn social dialogue—at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.
Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.
Does explanation quality depend on how dialogue partners interact—testing understanding, adjusting based on feedback, and coordinating their communicative moves—rather than just information content alone?
Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.
When humans and AI must collaborate to solve optimization problems under asymmetric information, what communication patterns enable effective coordination? Current LLMs struggle with this—why?
Standard dialogue state tracking monitors one user's goals, but negotiation requires tracking both parties' evolving positions simultaneously. Why is this bilateral requirement fundamentally different, and what makes existing models insufficient?
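To make the bilateral requirement concrete, here is a sketch of a negotiation state that tracks both parties' offers, concessions, and inferred walk-away points; the schema and field names are illustrative assumptions.

```python
# Sketch: bilateral dialogue state for negotiation, tracking both parties'
# offers and inferred reservation values rather than a single user goal.

from dataclasses import dataclass, field

@dataclass
class PartyState:
    last_offer: float | None = None
    estimated_reservation: float | None = None  # belief about their walk-away point
    concession_history: list[float] = field(default_factory=list)

@dataclass
class NegotiationState:
    buyer: PartyState = field(default_factory=PartyState)
    seller: PartyState = field(default_factory=PartyState)

    def record_offer(self, party: str, amount: float):
        state = getattr(self, party)
        if state.last_offer is not None:
            state.concession_history.append(amount - state.last_offer)
        state.last_offer = amount

ns = NegotiationState()
ns.record_offer("seller", 120.0)
ns.record_offer("buyer", 80.0)
ns.record_offer("seller", 110.0)   # a concession of -10, now visible to the tracker
# A unilateral slot-filling tracker has nowhere to represent the seller's side at all.
```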
Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand their partners do. Can training on coreference patterns teach this convention-forming behavior?
Standard RLHF and DPO optimize the quality of each response in isolation but may structurally prevent agents from meaningfully incorporating partner input. This explores whether the training objective itself blocks collaborative reasoning.
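For reference, the standard DPO objective scores whole responses $y_w, y_l$ against a fixed prompt $x$; nothing in it can credit a grounding move, such as a clarifying question, whose payoff only appears in later turns:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
$$

The expectation runs over isolated (prompt, preferred response, dispreferred response) triples, so partner contributions beyond what is frozen into $x$ never enter the loss.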
Does the geometric shape of how dialogue unfolds, its timing, repetition, and topic drift, matter as much as what people actually say? This explores whether interaction patterns carry signals that word choice alone misses.
Users may report satisfaction while remaining internally confused about their needs. This explores whether traditional satisfaction metrics capture genuine clarity or merely social politeness.
Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.
Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.
Do AI systems account for how elapsed time between conversations changes the way people reference and discuss past events? Current models mostly handle single sessions, but real interactions span days, weeks, and months.
When customers disagree about a product or service, should dialogue systems present all perspectives or select one? Understanding how to aggregate and balance diverse opinions affects whether users trust the response.
Can we isolate distinct types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.
Conversational memory systems struggle with questions that reference when something was discussed rather than what was said. Standard vector databases lack temporal indexing to retrieve by metadata like date, speaker, or session order.
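A sketch of what temporal indexing adds, with an illustrative schema; the field names are assumptions, not any particular database's API.

```python
# Sketch: memory entries carry temporal metadata so "when did we discuss X?"
# can filter before (or instead of) vector similarity.

from dataclasses import dataclass
from datetime import date

@dataclass
class MemoryEntry:
    text: str
    when: date
    speaker: str
    session: int

def retrieve(entries, after=None, before=None, speaker=None):
    hits = entries
    if after:
        hits = [e for e in hits if e.when >= after]
    if before:
        hits = [e for e in hits if e.when <= before]
    if speaker:
        hits = [e for e in hits if e.speaker == speaker]
    return sorted(hits, key=lambda e: (e.session, e.when))

log = [
    MemoryEntry("prefers window seats", date(2024, 3, 2), "user", 1),
    MemoryEntry("trip moved to June", date(2024, 4, 9), "user", 3),
]
retrieve(log, after=date(2024, 4, 1))   # answers "what changed since early April?"
# A similarity-only index cannot express this query at all.
```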
Instead of storing and retrieving discrete memories, can a single LLM compress all past conversations into event recaps, user portraits, and relationship dynamics? This explores whether compression-based memory avoids the bottleneck of traditional retrieval systems.
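A rough sketch of that compression loop under the stated assumptions; `summarize` is a hypothetical stand-in for an LLM call, shown here as string concatenation so the sketch runs.

```python
# Sketch: compression-based memory keeps three rolling summaries
# instead of a retrieval index.

from dataclasses import dataclass

@dataclass
class CompressedMemory:
    event_recap: str = ""
    user_portrait: str = ""
    relationship: str = ""

def summarize(instruction: str, old: str, session: str) -> str:
    # Placeholder for an LLM call along the lines of
    # llm(f"{instruction}\nCurrent summary: {old}\nNew session: {session}")
    return (old + " | " + session).strip(" |")

def update(mem: CompressedMemory, session_text: str) -> CompressedMemory:
    return CompressedMemory(
        event_recap=summarize("Update the event recap.", mem.event_recap, session_text),
        user_portrait=summarize("Revise the user portrait.", mem.user_portrait, session_text),
        relationship=summarize("Revise relationship dynamics.", mem.relationship, session_text),
    )

mem = update(CompressedMemory(), "User mentioned a new job and asked about moving.")
# Reads are O(1): the three summaries go straight into the prompt,
# trading retrieval precision for a fixed-size context.
```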
When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?
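A minimal sketch of the idea: reason over the history once per question and serve the stored thought thereafter. `reason_over` is a hypothetical stand-in for an expensive LLM reasoning pass.

```python
# Sketch: cache conclusions reasoned once from history, keyed by question,
# so repeated queries reuse the same derivation instead of re-reasoning.

def reason_over(history: list[str], question: str) -> str:
    # Placeholder for an expensive, potentially inconsistent LLM reasoning pass.
    return f"derived answer to {question!r} from {len(history)} turns"

class ThoughtCache:
    def __init__(self, history: list[str]):
        self.history = history
        self._thoughts: dict[str, str] = {}

    def ask(self, question: str) -> str:
        # Reason once, then serve the stored thought verbatim thereafter,
        # which makes repeated answers consistent by construction.
        if question not in self._thoughts:
            self._thoughts[question] = reason_over(self.history, question)
        return self._thoughts[question]

cache = ThoughtCache(["...200 turns of conversation..."])
a1 = cache.ask("what does the user prefer?")
a2 = cache.ask("what does the user prefer?")
assert a1 == a2   # identical by construction, unlike two fresh LLM passes
```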
Emotion recognition systems assume that detecting emotional moments will identify what people remember. But does observed emotion in group settings actually predict individual memorability, or does the proxy fail?
Agent memory management splits between agents autonomously recognizing important information versus programmatic triggers. Understanding this choice reveals why different memory architectures prioritize different information types.
Can organizing agent memory around entities and separating episodic events from semantic knowledge enable more natural, preference-aware assistance without constant clarification?
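A sketch of one possible entity-centric layout, separating time-stamped episodes from distilled facts; the schema is an illustrative assumption.

```python
# Sketch: entity-centric memory that files episodic events separately from
# distilled semantic facts about the same entity.

from dataclasses import dataclass, field

@dataclass
class EntityMemory:
    episodic: list[str] = field(default_factory=list)        # time-stamped events
    semantic: dict[str, str] = field(default_factory=dict)   # stable facts/preferences

store: dict[str, EntityMemory] = {}

def observe(entity: str, event: str, fact: tuple[str, str] | None = None):
    mem = store.setdefault(entity, EntityMemory())
    mem.episodic.append(event)
    if fact:                       # distilled knowledge, updated in place
        key, value = fact
        mem.semantic[key] = value

observe("Alice", "2024-05-01: ordered oat-milk latte", fact=("milk", "oat"))
observe("Alice", "2024-05-08: ordered oat-milk latte again")
store["Alice"].semantic["milk"]    # 'oat': no need to ask at the next order
# Episodic entries keep provenance; semantic entries answer preference
# queries directly, without a clarification turn.
```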
Current LLM summarization treats all meeting participants the same, but organizational contexts require personalized recaps. What barriers prevent systems from learning what matters to each person?
Explores whether prompts fundamentally change how context gets established between humans and LLMs, compared to how people negotiate shared understanding in ordinary dialogue.
Explores whether large language models can participate symmetrically in Stalnaker's picture of communication, where speakers mutually revise shared assumptions. The question matters because it reveals whether human-LLM dialogue is genuinely interactive or structurally asymmetrical.
LLMs face a structural tension: retaining too much context causes different threads to blur together, while retaining too little causes the model to lose track of earlier commitments. This explores whether this dilemma is fundamental to how transformers work.
When users fail to specify contextual details in prompts, do LLMs collapse multiple training contexts into a single generic response? Understanding this failure mode could improve how we scaffold user-model interaction.
Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
Explores whether safety-aligned language models might fail at genuine conversation despite passing ethical benchmarks. This matters because pragmatic incompetence can erode trust and cause real harms in high-stakes domains.
Explores why LLM performance drops 25 points when instructions span multiple turns instead of one message, and whether models can recover from early wrong assumptions.
Explores the cognitive gap between imagining possibilities and expressing them as prompts. Why language interfaces create a harder envisioning task than traditional UI affordances.
Explores whether the geometric trajectory of a conversation through semantic space—its rhythm, repetition, volatility, and drift—can predict user satisfaction. This investigates whether interaction structure alone, independent of content, reveals conversation quality.
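A sketch of what content-free trajectory features might look like, computed purely from turn embeddings; the three features below are illustrative choices, not a published feature set.

```python
# Sketch: content-free trajectory features from a sequence of turn embeddings.

import numpy as np

def trajectory_features(embeddings: np.ndarray) -> dict[str, float]:
    """embeddings: (n_turns, dim) array of unit-normalized turn vectors."""
    steps = embeddings[1:] - embeddings[:-1]
    step_sizes = np.linalg.norm(steps, axis=1)
    return {
        # net displacement of the conversation through semantic space
        "drift": float(np.linalg.norm(embeddings[-1] - embeddings[0])),
        # uneven jumps between turns, a proxy for topical volatility
        "volatility": float(step_sizes.std()),
        # how tightly turns cluster around the conversation centroid
        "repetition": float(np.mean(embeddings @ embeddings.mean(axis=0))),
    }

rng = np.random.default_rng(0)
emb = rng.standard_normal((12, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
feats = trajectory_features(emb)
# These scalars are computed with no access to the words themselves;
# a satisfaction probe trained on them tests whether structure alone predicts quality.
```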