Interactive Evaluation Requires a Design Science
AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, yet many evaluation practices still inherit assumptions from response-centered benchmarks: fixed inputs, isolated outputs, and judgments made from a single response. Although interactive benchmarks have emerged, the landscape remains fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.
AI evaluation is undergoing a visible transition. For much of modern AI, benchmark design was organized around response-centered evaluation: models received fixed instances and were judged by the quality of standalone final outputs, rather than by behavior unfolding through interaction. As Figure 1 illustrates, benchmark design has increasingly expanded toward executable, grounded, and interactive settings. This shift reflects a broader change in what large language models (LLMs) are expected to do: they are increasingly evaluated not only as standalone generators, but as systems acting through tools, interfaces, environments, external databases, users, and other agents. This is not a cosmetic change in benchmark format. It changes what evidence an evaluation must observe and what claim a score can support.
This paper develops that position from the perspective of evaluation itself. We first explain why response-centered evaluation was historically useful and why its assumptions become insufficient when systems act in a closed loop. We then define evaluation as an autonomous program E : X → Y, where X is the admissible evidence available to the evaluator, Y is the space of judgments, and E is the procedure that maps evidence to judgments. Interactive evaluation changes both X and E: X expands from final responses to interaction-generated trajectories, and E must assess not only final correctness but also process quality, recoverability, coordination, safety, efficiency, and robustness. This framing lets us build a taxonomy of interactive evaluation, use it to identify where current benchmarks concentrate and what they miss, and derive principles for designing future evaluations.
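To make the mapping E : X → Y concrete, the following minimal Python sketch contrasts a response-centered evaluator with a trajectory-level one. It is an illustration under stated assumptions, not the paper's formalization: the names Step, Trajectory, response_evaluator, and trajectory_evaluator are hypothetical, and the recovery measure is a deliberately crude stand-in for the richer process measures the text calls for.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical types for illustration; the paper does not prescribe them.
@dataclass(frozen=True)
class Step:
    action: str
    observation: str
    is_error: bool = False  # assumed flag: the environment reported a failure

@dataclass(frozen=True)
class Trajectory:
    steps: List[Step]
    final_response: str

# Y: a judgment, represented here as named scores.
Judgment = Dict[str, float]

def response_evaluator(final_response: str, reference: str) -> Judgment:
    """Response-centered E: the evidence X is only the final output."""
    correct = final_response.strip() == reference.strip()
    return {"final_correct": float(correct)}

def trajectory_evaluator(traj: Trajectory, reference: str) -> Judgment:
    """Interactive E: the evidence X is the whole trajectory."""
    judgment = response_evaluator(traj.final_response, reference)
    error_idx = [i for i, s in enumerate(traj.steps) if s.is_error]
    # Crude recoverability proxy (an assumption): an error counts as
    # recovered if any later step in the trajectory is not an error.
    recovered = sum(
        1 for i in error_idx
        if any(not s.is_error for s in traj.steps[i + 1:])
    )
    judgment["num_steps"] = float(len(traj.steps))
    judgment["recovery_rate"] = (
        recovered / len(error_idx) if error_idx else 1.0
    )
    return judgment

# Two runs with the same final answer but different processes.
clean = Trajectory(
    steps=[Step("lookup", "found value")],
    final_response="42",
)
stuck = Trajectory(
    steps=[Step("lookup", "found value"), Step("verify", "timeout", is_error=True)],
    final_response="42",
)
clean_judgment = trajectory_evaluator(clean, "42")
stuck_judgment = trajectory_evaluator(stuck, "42")
```

Both runs receive the same judgment from response_evaluator, while trajectory_evaluator separates them on recovery_rate. That difference is the sense in which changing X (what evidence is admitted) and E (how it is scored) changes what claim a score can support.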
In this position paper, we argue that interactive evaluation must be designed, not merely adopted. As AI systems increasingly act through consequential interactions, the field needs a systematic and unified framework for designing interactive evaluations that support comparison, reproducibility, and extension. We frame interactive evaluation as trajectory-based, system-level evaluation under action-dependent conditions, organized by two questions: what interaction artifacts enter evaluation, and how those artifacts are mapped to judgments. This framing clarifies why response-centered benchmarks remain useful but insufficient, why current interactive benchmarks should not be treated as interchangeable, and what the field must build next: explicit protocols, richer trajectory measures, robustness tests, shared infrastructure, and reporting standards that make interactive scores interpretable. We therefore call on the community to design interactive evaluation deliberately rather than simply adopt it as the next benchmark format.