How can dialogue structure and trajectory predict social agent performance?
This note explores whether the *shape* of a conversation (how turns unfold, who takes initiative, whether the two sides converge) can predict how well a social agent will perform, rather than scoring only its final reply. The corpus suggests dialogue trajectory is genuinely diagnostic, and it points to several distinct signals worth watching.
The most direct signal is drift. As conversations get longer, agents lose the thread, both of their assigned persona and of the user's original intent. Training user simulators with multi-turn RL cuts persona drift by over 55% by rewarding three separate consistency signals (prompt-to-line, line-to-line, and Q&A consistency). In other words, local turn-by-turn coherence, global cross-conversation coherence, and factual stability are *different* failure modes that can be measured independently (see: Can training user simulators reduce persona drift in dialogue?). Intent drift has its own structural cause: tool-enabled agents chain actions silently and wander from what the user wanted. Conversation analysis offers "insert-expansions", clarifying probes mid-dialogue, as a formal marker of when a healthy trajectory should pause to check rather than barrel ahead (see: When should AI agents ask users instead of just searching?).
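As a rough illustration of why the three consistency signals can be scored separately, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the `judge` interface, and especially `token_overlap`, which is a toy stand-in for whatever consistency judge a real setup would use. It is not the method from the cited note.

```python
from typing import Callable, Dict, Iterable, Sequence

# Hypothetical judge interface: (reference, candidate) -> consistency in [0, 1].
Judge = Callable[[str, str], float]


def token_overlap(reference: str, candidate: str) -> float:
    """Toy judge (assumption): share of candidate tokens found in the reference."""
    ref = set(reference.lower().split())
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    return sum(tok in ref for tok in cand) / len(cand)


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def consistency_signals(
    persona: str,
    lines: Sequence[str],
    qa_answers: Sequence[str],
    judge: Judge = token_overlap,
) -> Dict[str, float]:
    """Score one speaker's lines on three independent consistency signals."""
    return {
        # Each line checked against the persona prompt.
        "prompt_to_line": mean(judge(persona, line) for line in lines),
        # Each line checked against the line that preceded it.
        "line_to_line": mean(judge(prev, cur) for prev, cur in zip(lines, lines[1:])),
        # Answers to probe questions checked against the persona prompt.
        "qa": mean(judge(persona, answer) for answer in qa_answers),
    }


scores = consistency_signals(
    persona="i am a vegan chef from lisbon",
    lines=["i am a vegan chef", "i love steak"],
    qa_answers=["lisbon"],
)
```

Because the three scores are computed from different comparisons, one can fall while the others hold, which is the sense in which they are distinct failure modes.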
A second family of signals concerns initiative and efficiency. Proactive dialogue, meaning offering relevant information unasked, cuts conversation length by up to 60% in medium-complexity tasks, so the *rate of progress per turn* is itself a performance predictor. Yet this behavior is almost absent from AI benchmarks (see: Could proactive dialogue make conversations dramatically more efficient?). That absence isn't accidental: LLMs are structurally passive, optimized to respond rather than to lead, so a flat, purely reactive trajectory is a predictable symptom of how they were trained (see: Why can't conversational AI agents take the initiative?). If you're reading a transcript to forecast outcome, a conversation where the agent never takes the wheel is a warning sign.
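One way to make "rate of progress per turn" and "never takes the wheel" concrete is sketched below. It assumes a transcript that has already been annotated by hand or by a separate model; the `Turn` fields, `progress_per_turn`, and `initiative_rate` are hypothetical names invented for this example, not measures defined in the cited notes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Turn:
    speaker: str              # "agent" or "user"
    new_items: int            # task-relevant items newly resolved in this turn (assumed annotation)
    unprompted: bool = False  # True if the agent offered information without being asked


def progress_per_turn(turns: Sequence[Turn]) -> float:
    """Average number of task items resolved per turn."""
    if not turns:
        return 0.0
    return sum(t.new_items for t in turns) / len(turns)


def initiative_rate(turns: Sequence[Turn]) -> float:
    """Share of agent turns in which the agent volunteered something."""
    agent_turns = [t for t in turns if t.speaker == "agent"]
    if not agent_turns:
        return 0.0
    return sum(t.unprompted for t in agent_turns) / len(agent_turns)


# A purely reactive exchange: the user has to ask for each item.
reactive = [Turn("user", 0), Turn("agent", 1), Turn("user", 0), Turn("agent", 1)]
# A proactive exchange: the agent volunteers the second item in the same turn.
proactive = [Turn("user", 0), Turn("agent", 2, unprompted=True)]
```

In this toy case both transcripts resolve the same two items, but the proactive one does so in half the turns, and an `initiative_rate` of zero flags the reactive transcript as the flat trajectory described above.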
A third, subtler family is convergence: whether the two parties are actually building shared understanding over time. Collaborative Rational Speech Acts (CRSA) models dialogue as bidirectional belief tracking, capturing the progression from partial to shared understanding that token-level systems can't see; the *trajectory toward mutual belief* becomes the thing you measure (see: Can dialogue systems track both speakers' beliefs across turns?). Lexical entrainment is the linguistic fingerprint of this. Humans drift toward each other's word choices as rapport builds, and the absence of that drift in current AI is both a quality gap and a measurable feature of a degrading trajectory (see: Why don't conversational AI systems mirror their users' word choices?). How users *perceive* that trajectory also decomposes cleanly into competence (49% of impression variance), human-likeness (32%), and communicative flexibility (19%), so even subjective performance has predictable structure (see: How do users mentally model dialogue agent partners?).
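Lexical entrainment is the easiest of these convergence signals to approximate from a raw transcript. The sketch below is a deliberately crude proxy, assuming whitespace tokenization and no stop-word filtering; `entrainment_curve` and `entrainment_trend` are hypothetical helpers written for this note, not an established metric.

```python
from typing import List, Sequence, Tuple


def entrainment_curve(turns: Sequence[Tuple[str, str]]) -> List[float]:
    """For each agent turn, the share of its tokens the user has already used.

    `turns` is a list of (speaker, text) pairs with speaker "user" or "agent".
    """
    user_vocab = set()
    curve = []
    for speaker, text in turns:
        tokens = text.lower().split()
        if speaker == "user":
            user_vocab.update(tokens)
        elif tokens:
            curve.append(sum(tok in user_vocab for tok in tokens) / len(tokens))
    return curve


def entrainment_trend(curve: Sequence[float]) -> float:
    """Change from the first to the last agent turn; positive means convergence."""
    if len(curve) < 2:
        return 0.0
    return curve[-1] - curve[0]


dialogue = [
    ("user", "my laptop is sluggish"),
    ("agent", "your computer is slow"),
    ("user", "yes the laptop is sluggish"),
    ("agent", "the laptop is sluggish"),
]
curve = entrainment_curve(dialogue)
trend = entrainment_trend(curve)
```

A rising curve means the agent is adopting the user's vocabulary ("laptop", "sluggish") rather than holding on to its own ("computer", "slow"). A more careful version would filter function words so that shared words like "is" and "the" do not inflate the score.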
The quietly surprising thread here is that dialogue structure isn't just something to evaluate after the fact: it can be deliberately engineered as a performance lever. Structuring a single model's reasoning as an internal dialogue between agents beats monologue reasoning on diversity and coherence (see: Can dialogue format help models reason more diversely?). Branching, non-linear prompts can replicate full multi-agent dynamics inside one model (see: Can branching prompts replicate what multi-agent systems do?). At the team level, swapping conversational coordination for structured shared artifacts outperforms chat-based exchange entirely (see: Does structured artifact sharing outperform conversational coordination?). So the same structural features that *predict* performance (initiative, convergence, low drift) turn out to be the ones you can build in on purpose. And if you treat the agent as a role-playing character whose consistency is the performance metric, the trajectory of how well it stays in character becomes the most natural yardstick of all (see: Should we treat dialogue agents as role-playing characters?).
Sources (11 notes)
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
DialogueReason, which structures a single model's internal reasoning as dialogue between distinct agents in separate scenes, overcomes monologue reasoning's fixed-strategy and fragmented-attention weaknesses, especially on tasks requiring multiple problem-solving approaches.
Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting validates that structured prompting techniques map directly to multi-agent debate architectures, enabling equivalent outcomes through structural equivalence.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.