UserBench: An Interactive Gym Environment for User-Centric Agents

Paper · arXiv 2507.22034 · Published July 29, 2025
Design Frameworks · Agents

Large Language Model (LLM)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.

This motivates our central research question: How can we evaluate agents from a user-centric perspective? To answer this, we first examine how users typically communicate goals. Human communication is fundamentally a joint activity, where meaning is co-constructed through interaction (Clark, 1996). Moreover, language is inherently ambiguous, making it difficult for users to fully and clearly convey their intent in a single interaction (Liu et al., 2023). As such, user instructions tend to share three core traits: (i) Underspecification: users often initiate requests before fully formulating their goals; (ii) Incrementality: intent emerges and evolves across interaction; and (iii) Indirectness: users may obscure or soften their true intent for social or strategic reasons.
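To make these three traits concrete, the sketch below encodes a simulated user as a small data structure: an underspecified opening request, hidden preferences that surface only over turns, and indirect phrasings for each. The class and field names are illustrative assumptions, not UserBench's actual schema.

```python
# Hypothetical sketch of the three traits as data; names are illustrative,
# not UserBench's real schema.
from dataclasses import dataclass, field

@dataclass
class Preference:
    aspect: str           # e.g. "budget", "timing"
    constraint: str       # the user's true requirement
    indirect_phrase: str  # how the user actually voices it (indirectness)

@dataclass
class SimulatedUser:
    initial_request: str  # underspecified opening message
    hidden_preferences: list[Preference] = field(default_factory=list)
    revealed: set[int] = field(default_factory=set)  # grows turn by turn (incrementality)

user = SimulatedUser(
    initial_request="I need to plan a trip next month.",  # underspecification
    hidden_preferences=[
        Preference("budget", "total cost under $800",
                   "I'd rather not break the bank on this one."),
        Preference("timing", "no red-eye flights",
                   "I'm really not much of a night person."),
    ],
)
```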

In UserBench, simulated users provide a vague initial task instruction (underspecification), gradually reveal preferences over time (incrementality), and often do so only implicitly (indirectness). To succeed, agents must proactively clarify goals, interpret subtle cues, and reason adaptively through tool use.
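The interaction pattern this implies can be sketched as a gym-style loop. Everything below (the `ToyUserEnv` class, the action dictionary format, the scripted agent) is an assumption for illustration; UserBench's actual interface may differ.

```python
# A self-contained toy environment in the spirit of UserBench; all names and
# the action format are illustrative assumptions, not the benchmark's API.

class ToyUserEnv:
    """Simulated user: vague opening, preferences revealed only when asked."""

    def __init__(self):
        self.preferences = {"budget": "under $800", "timing": "no red-eye flights"}
        self.revealed = set()

    def reset(self) -> str:
        self.revealed.clear()
        return "I need to book a trip to Tokyo next month."  # underspecified

    def step(self, action: dict):
        if action["type"] == "ask":
            # Reveal at most one matching preference per turn (incrementality),
            # voiced softly rather than stated outright (indirectness).
            for aspect, pref in self.preferences.items():
                if aspect in action["text"].lower() and aspect not in self.revealed:
                    self.revealed.add(aspect)
                    return f"Well, ideally {pref}.", 0.0, False, {}
            return "Whatever works, really.", 0.0, False, {}
        # An "answer" action ends the episode; reward reflects how many hidden
        # preferences were uncovered, i.e. alignment rather than completion.
        reward = len(self.revealed) / len(self.preferences)
        return "Thanks!", reward, True, {}

env = ToyUserEnv()
obs = env.reset()
# A scripted "agent" that clarifies each aspect before committing to an answer.
for question in ["What's your budget?", "Any timing constraints?"]:
    obs, reward, done, _ = env.step({"type": "ask", "text": question})
obs, reward, done, _ = env.step(
    {"type": "answer", "text": "Booked a $750 daytime itinerary."})
print(f"alignment reward: {reward:.2f}")  # 1.00: both preferences elicited
```

The design point mirrored here is that reward depends on eliciting and honoring hidden preferences, not on producing any syntactically valid final answer.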

For instance, scores drop by over 40% on average when models are restricted to selecting only one option per travel aspect in UserBench, revealing their difficulty in making optimal decisions. Moreover, models provide answers that fully align with all user intents only 20% of the time on average, and even the best-performing models elicit fewer than 30% of all user preferences through active querying, suggesting limited ability to engage in purposeful, user-driven dialogue.
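For concreteness, these two headline numbers could be computed from per-episode logs roughly as follows. The log schema here is an assumed illustration, not the paper's actual evaluation code.

```python
# Hypothetical episode logs: each records the user's ground-truth preferences,
# those the agent elicited by asking, and those its final answer satisfied.
episodes = [
    {"preferences": {"budget", "timing", "airline"},
     "elicited": {"budget"}, "satisfied": {"budget", "timing"}},
    {"preferences": {"budget", "hotel"},
     "elicited": {"budget", "hotel"}, "satisfied": {"budget", "hotel"}},
]

# Fraction of episodes whose final answer satisfies *all* user intents
# (the ~20% average reported above).
full_alignment = sum(
    ep["satisfied"] == ep["preferences"] for ep in episodes
) / len(episodes)

# Fraction of all ground-truth preferences uncovered through active querying
# (the <30% figure for the best-performing models).
elicitation = sum(len(ep["elicited"]) for ep in episodes) / sum(
    len(ep["preferences"]) for ep in episodes)

print(f"full alignment: {full_alignment:.0%}, elicitation: {elicitation:.0%}")
```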

Complementing this, other benchmarks such as MINT (Wang et al., 2024b), PrefEval (Zhao et al., 2025), τ-Bench (Yao et al., 2024), and τ²-Bench (Barres et al., 2025) focus on dynamic, multi-turn interactions, testing whether agents can incorporate feedback, handle evolving preferences, and maintain user alignment over time. However, several limitations remain: a large portion of interactions in these benchmarks are automatically synthesized, often resulting in overly lengthy and unnatural goal formulations that diverge from how users typically express themselves. In contrast, UserBench explicitly models the three core interactive traits of user communication that are critical for real-world alignment.

User-centric agent designs. Designing agents that collaborate effectively with users requires modeling ambiguity, evolving intent, and user-specific preferences. Standard instruction-tuned models often hallucinate intent or avoid asking clarifying questions. Recent work aims to teach agents to proactively clarify underspecified instructions: for instance, models trained with simulated ambiguity resolution (Zhang et al., 2024c; Chen et al., 2025) better recognize when to ask versus when to answer. Other work tackles user modeling directly: TravelPlanner+ (Singh et al., 2024) and PRELUDE (Gao et al., 2024) build agents that personalize responses using either explicit profiles or latent preferences inferred from user edits. While these designs make progress toward adaptive, user-aware agents, they often target narrow personalization tasks or rely on static user models. In contrast, our work provides a general-purpose evaluation environment that systematically tests an agent’s ability to dynamically uncover and respond to emergent user intent across varied and customizable interaction scenarios.