WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue

Paper · arXiv 2506.01881 · Published June 2, 2025
Conversation Architecture Structure · Conversation Agents · Synthetic Dialog · Design Frameworks

Task-oriented dialogue systems often face difficulties when user utterances seem semantically complete but lack the structural information necessary for appropriate system action. This arises because users frequently do not fully understand their own needs, while systems require precise intent definitions. Current LLM-based agents cannot effectively distinguish between linguistically complete and contextually triggerable expressions, and they lack frameworks for collaborative intent formation. We present STORM, a framework modeling asymmetric information dynamics through conversations between UserLLM (full internal access) and AgentLLM (observable behavior only). STORM produces annotated corpora capturing expression trajectories and latent cognitive transitions, enabling systematic analysis of how collaborative understanding develops. Our contributions include: (1) formalizing asymmetric information processing in dialogue systems; (2) modeling intent formation by tracking the evolution of collaborative understanding; and (3) evaluation metrics that measure internal cognitive improvement alongside task performance.

The rapid advancement of language models has created a fundamental challenge in human-AI interaction: the "gulf of envisioning"—users’ cognitive difficulty in formulating effective prompts. Unlike conventional interfaces with predictable affordances, language models require users to simultaneously envision what is possible and how to express it, often leading to communication breakdowns. This challenge arises from a misalignment between human cognitive processes and the way systems interpret user intent. Subramonyam et al. [1] illustrate that human intent formation involves a maturation process characterized by progressive constraint resolution, fluctuating stability intervals, and distinct structural signaling patterns. However, current evaluation methods are insufficient: they (1) treat intent as binary rather than continuous, (2) lack frameworks for temporal coherence, and (3) overlook structural signals within expressions. These structural signals, including stylistic choices, implicit assumptions, and cultural markers, reflect what Wittgenstein [2] termed the contextual embeddedness of meaning within particular “forms of life.” Current systems cannot access these embedded contextual cues that users unconsciously include in their expressions. These shortcomings constitute the Intent-Action Alignment Problem: determining precisely when user expressions have reached cognitive readiness for effective system action.

(1) A dialogue generation pipeline using two language models—UserLLM and AgentLLM—to simulate realistic conversations reflecting diverse user profiles and intent progression. UserLLM generates user behavior conditioned on comprehensive profile data and internal states, simulating authentic intent evolution, while AgentLLM responds based solely on the observable dialogue history. This asymmetric setup mirrors the realistic information gaps faced by AI systems, enabling targeted studies of agent adaptability to evolving intent (a minimal sketch of this loop follows the list).

(2) A database-driven memory system that systematically tracks evolving user states (intent, emotion, satisfaction) within session-specific records. These records function as micro-databases documenting real-time intent maturation trajectories and are integrated into a global database for cross-session analysis. This structured memory approach captures the continuous nature of intent development, providing researchers with detailed, fine-grained data for studying patterns across diverse interaction contexts (the sketch after this list shows one way to persist such records).

(3) A web-based dialogue visualization interface with a clarity rating mechanism for intuitive analysis of how user intent evolves. The interface dynamically displays the refinement of user intent over the course of a conversation, letting researchers visually assess the effectiveness of different agent response strategies. By quantifying abstract cognitive progression into a standardized clarity metric, the tool supports rigorous quantitative analysis and comparison.
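As a concrete picture of contributions (1) and (2), here is a minimal sketch of the asymmetric simulation loop and the session-level micro-record it feeds. Everything named here is an illustrative assumption rather than the released STORM implementation: the `chat(system_prompt, messages)` helper is a stand-in for any chat-completion client, and the `TurnState` fields, prompt wording, and SQLite schema are ours.

```python
"""Minimal sketch of a STORM-style asymmetric dialogue simulation (assumptions noted above)."""
import json
import sqlite3
from dataclasses import dataclass


@dataclass
class TurnState:
    """Internal user state captured after each turn; never shown to AgentLLM."""
    turn: int
    inner_thought: str
    intent_clarity: float   # 0.0 (vague) to 1.0 (fully formed)
    emotion: str
    satisfaction: float     # 0.0 to 1.0


def chat(system_prompt: str, messages: list[dict]) -> str:
    """Placeholder for any chat-completion backend (hypothetical helper)."""
    raise NotImplementedError("plug in an LLM client here")


def simulate_session(profile: dict, scenario: str, max_turns: int = 8) -> list[TurnState]:
    """Run one UserLLM/AgentLLM conversation and return the hidden intent trajectory."""
    user_system = (
        "You are a user with this hidden profile and evolving needs:\n"
        f"{json.dumps(profile)}\nScenario: {scenario}\n"
        "Reply as JSON with keys: utterance, inner_thought, "
        "intent_clarity (0-1), emotion, satisfaction (0-1)."
    )
    agent_system = "You are a task-oriented assistant. You see only the dialogue so far."

    history: list[dict] = []      # observable dialogue, the only input AgentLLM receives
    states: list[TurnState] = []  # hidden internal trajectory (this session's micro-record)

    for turn in range(max_turns):
        # UserLLM is conditioned on the full profile plus its own internal state;
        # role mapping is simplified for the sketch.
        user = json.loads(chat(user_system, history))
        history.append({"role": "user", "content": user["utterance"]})
        states.append(TurnState(
            turn=turn,
            inner_thought=user["inner_thought"],
            intent_clarity=float(user["intent_clarity"]),
            emotion=user["emotion"],
            satisfaction=float(user["satisfaction"]),
        ))

        # AgentLLM responds from observable history alone.
        history.append({"role": "assistant", "content": chat(agent_system, history)})

    return states


def persist(session_id: str, states: list[TurnState], db_path: str = "storm.db") -> None:
    """Append a session's micro-record to a global database for cross-session analysis."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS turn_states ("
        "session_id TEXT, turn INT, inner_thought TEXT, "
        "intent_clarity REAL, emotion TEXT, satisfaction REAL)"
    )
    con.executemany(
        "INSERT INTO turn_states VALUES (?, ?, ?, ?, ?, ?)",
        [(session_id, s.turn, s.inner_thought, s.intent_clarity, s.emotion, s.satisfaction)
         for s in states],
    )
    con.commit()
    con.close()
```

Because AgentLLM only ever receives `history`, the information asymmetry is enforced structurally rather than by prompt discipline alone.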

We introduce a novel ‘Clarify’ metric that measures how effectively agents help users internally clarify their own intentions—assessed through analysis of simulated user inner thoughts rather than external expressions. This approach captures whether agent responses genuinely improve users’ understanding of their own needs, a crucial cognitive process often invisible in traditional dialogue evaluations.
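One plausible formalization (our notation, not a definition from the paper): let $c_t$ be the evaluator-judged clarity of the simulated user's inner thought after turn $t$ of a $T$-turn dialogue; the session-level Clarify score can then be taken as the fraction of agent turns that raise that clarity,

$$\mathrm{Clarify} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\big[\,c_t > c_{t-1}\,\big].$$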

We evaluate model performance along three complementary dimensions: (1) satisfaction derived from user inner thoughts, capturing the user’s internal contentment; (2) clarification effectiveness, measured by the Clarify metric, which is computed via prompting an evaluation model to analyze the dialogue turn-by-turn and determine whether each agent response improves the clarity of the user’s intent relative to the previous turn; and (3) Satisfaction-Seeking Actions (SSA), a composite metric that integrates satisfaction and clarification scores weighted by scenario-specific parameters to balance the competing objectives of confident response generation and appropriate clarification seeking.
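The sketch below shows how these scores could be computed from the session records above. It reuses the hypothetical `chat` helper from the simulation sketch as the evaluation model; the YES/NO judging prompt and the weighted-sum form of SSA are illustrative assumptions (the paper states only that SSA integrates satisfaction and clarification scores with scenario-specific weights).

```python
def clarify_score(inner_thoughts: list[str]) -> float:
    """Fraction of turns whose inner thought is judged clearer than the previous one.

    Reuses the hypothetical `chat(system_prompt, messages)` helper from the
    simulation sketch above as the evaluation model.
    """
    improved = 0
    for prev, curr in zip(inner_thoughts, inner_thoughts[1:]):
        verdict = chat(
            "You judge whether a user's understanding of their own need became clearer.",
            [{"role": "user", "content": (
                f"Previous inner thought:\n{prev}\n\n"
                f"Current inner thought:\n{curr}\n\n"
                "Answer YES if the current thought reflects a clearer intent, otherwise NO."
            )}],
        )
        improved += verdict.strip().upper().startswith("YES")
    return improved / max(len(inner_thoughts) - 1, 1)


def ssa(satisfaction: float, clarify: float, w_sat: float, w_cla: float) -> float:
    """Satisfaction-Seeking Actions score; assumed here to be a scenario-weighted sum."""
    return w_sat * satisfaction + w_cla * clarify
```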

Satisfaction metrics demonstrate clear benefits from user profile access

This gap between agents with and without access to user profiles highlights the value of personalization in dialogue systems.

Traditional satisfaction metrics fail to capture a critical divergence: users may express satisfaction with system responses while their inner thoughts indicate continued confusion about their own needs. Evaluation that relies solely on observable user feedback misses this gap entirely.

AI models exhibit fundamentally different architectural approaches to balancing response confidence versus ambiguity recognition, with distinct trade-offs for user outcomes

The clarification-satisfaction trade-off represents a critical design choice, and the two models sit at opposite ends of it. Claude appears optimized for immediate satisfaction, even at the cost of clarification opportunities, while Llama emphasizes identifying and addressing ambiguity, sometimes trading immediate satisfaction for more effective long-term intent disambiguation.

Successful clarification correlates more strongly with users’ internal cognitive improvement than with expressed satisfaction scores, pointing to deeper measures of dialogue effectiveness. Users who achieve better self-understanding through interaction, as reflected in clearer and more confident inner thoughts, show sustained engagement and more effective task completion even when immediate satisfaction scores remain only moderate.