
Conversational AI Systems

Research on dialogue systems, conversational agents, and multi-turn interaction architectures. Covers topics including conversation structure, personalization, synthetic dialogue generation, and conversational recommenders, studied by NLP and human-computer interaction researchers.

57 notes (primary) · 271 papers · 8 sub-topics

Conversation Architecture and Structure

13 notes

Why do standard dialogue systems fail at tracking negotiation agreement?

Standard dialogue state tracking monitors one user's goals, but negotiation requires tracking both parties' evolving positions simultaneously. Why is this bilateral requirement fundamentally different, and what makes existing models insufficient?


Can models learn to abstain when uncertain about predictions?

Explores whether language models can be trained to recognize when they lack sufficient information to forecast conversation outcomes, rather than forcing uncertain predictions into confident-sounding responses.


Can conversation structure predict dialogue success better than content?

Does the geometric shape of how dialogue unfolds—timing, repetition, topic drift—matter as much as what people actually say? This explores whether interaction patterns carry signals that word choice alone misses.


Can AI agents communicate efficiently in joint decision problems?

When humans and AI must collaborate to solve optimization problems under asymmetric information, what communication patterns enable effective coordination? Current LLMs struggle with this—why?


Why do dialogue systems lose context when topics return?

Stack-based dialogue management removes topics after they're resolved, making it hard for systems to reference them later. Does this structural rigidity explain why conversational AI struggles with topic revisitation?


When should AI agents ask users instead of just searching?

Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.


Can we teach LLMs to form linguistic conventions in context?

Humans naturally shorten references as conversations progress, but LLMs don't adapt their language for efficiency even when they understand their partners do. Can training on coreference patterns teach this convention-forming behavior?


Could proactive dialogue make conversations dramatically more efficient?

Explores whether AI systems that volunteer relevant unrequested information could significantly reduce the back-and-forth turns required in task-oriented conversations, and why this behavior is missing from training data.


Does including all conversation history actually help retrieval?

Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?


What six problems must every conversation solve?

Schegloff's Conversation Analysis identifies six universal organizational challenges that speakers navigate in all talk-in-interaction. Understanding these helps explain why current AI dialogue systems fall short of human fluency.


Why can't users articulate what they want from AI?

Explores the cognitive gap between imagining possibilities and expressing them as prompts, and why language interfaces create a harder envisioning task than traditional UI affordances do.


How do time gaps shape what people discuss across conversation sessions?

Do AI systems account for how elapsed time between conversations changes the way people reference and discuss past events? Current models mostly handle single sessions, but real interactions span days, weeks, and months.


Can conversation shape predict whether it will work?

Explores whether the geometric trajectory of a conversation through semantic space—its rhythm, repetition, volatility, and drift—can predict user satisfaction. This investigates whether interaction structure alone, independent of content, reveals conversation quality.


Dialog Topics and Modeling

8 notes

Which clarifying questions actually improve user satisfaction?

Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.


Does linguistic alignment shape how users perceive AI relationships?

Can conversational AI build relational trust and partnership through real-time linguistic accommodation, or is warmth only surface-level styling? This explores whether alignment is foundational to how users categorize AI as tool versus partner.


Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.


Why do language models lose performance in longer conversations?

Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.


Does segment-level optimization work better for multi-turn dialogue alignment?

How should preference optimization target multi-turn social dialogue—at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.


Why do AI assistants get worse at longer conversations?

Explores why LLM performance drops 25 points when instructions span multiple turns instead of one message, and whether models can recover from early wrong assumptions.


Why do language models engage with conversational distractors?

Explores why state-of-the-art LLMs struggle to maintain topical focus when users introduce off-topic turns, despite having explicit scope instructions. This gap suggests models lack training signals for ignoring irrelevant directions.


Can models learn to ask genuinely useful clarifying questions?

Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.


Conversational Recommenders

8 notes

Does conversation order matter for recommending items in dialogue?

Conversational recommendation systems typically ignore the sequence in which items are mentioned, treating dialogue as a bag of entities. But does the order itself carry predictive signal about what to recommend next?


Do simulated training interactions transfer to real conversations?

Most conversational recommender systems train on simulated entity-level exchanges, not natural dialogue. The question is whether models built this way actually work when deployed with real users who speak naturally and deviate from expected patterns.


Can unified policy learning improve conversational recommender systems?

This explores whether formulating attribute-asking, item-recommending, and timing decisions as a single reinforcement learning policy outperforms treating them as separate components. The question matters because joint optimization could improve conversation quality and system scalability.


Can controlled latent variables make LLM user simulators realistic?

Can session-level and turn-level latent variables steer LLM-based user simulators toward realistic dialogue while maintaining measurable diversity and ground truth labels for training conversational systems?


What makes conversational recommenders hard to build well?

Most assume the challenge is language fluency, but what if the real problem is managing mixed-initiative dialogue—where both users and systems take turns driving the conversation?


Can language models bridge the gap between critique and preference?

When users express what they dislike rather than what they want, can LLMs reliably transform those critiques into positive preferences that retrieval systems can actually use?


Do conversational recommender benchmarks actually measure recommendation skill?

Conversational recommender systems are evaluated against ground-truth items mentioned later in conversations. But does this metric distinguish between genuinely recommending new items versus simply repeating items users already discussed?


Do recommendation strategies beyond preference questions work better?

What role do sociable conversational moves—opinion sharing, encouragement, credibility signals—play in successful human recommendations, compared to simply asking what someone likes?


Personalized Assistants

3 notes

Why do LLM judges fail at predicting sparse user preferences?

When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.


Can conversations themselves personalize without user profiles?

Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?


How do personalization granularity levels trade precision against scalability?

LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.


Speech and Voice

3 notes

Why do dialogue systems need probabilistic reasoning?

Explores whether deterministic flowchart-based dialogue systems can handle realistic speech recognition error rates of 15-30 percent, and what alternative approaches might be necessary.


Can skipping transcription make voice assistants faster?

Voice assistants traditionally convert speech to text before responding. Does eliminating that middle step reduce latency enough to matter for real-time conversation?


Do speech models learn language-specific sounds or universal physics?

Explores whether self-supervised speech models encode phonetic categories tied to specific languages or instead capture the underlying vocal-tract physics common to all humans. This matters for understanding why these models transfer across languages without retraining.


Personalization (General)

3 notes

Can personas evolve in real time to match what users actually want?

Explores whether a persona that bridges memory and action can adapt during conversations by simulating interactions and optimizing against user feedback, without retraining the underlying model.


Do persona consistency metrics actually measure dialogue quality?

Personalized dialogue systems can achieve high persona consistency scores by simply restating character descriptions, ignoring conversational relevance. Does optimizing for persona fidelity necessarily harm the coherence users actually care about?


Does abstract preference knowledge outperform specific interaction recall?

Explores whether summarized user preferences are more effective for LLM personalization than retrieving individual past interactions. Tests a cognitive dual-memory model against real personalization performance across model scales.


Synthetic Dialogue Generation

1 note

Conversational Agents

1 note