Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally, one new benchmark at a time?
AI evaluation is undergoing a structural change: models are increasingly deployed as systems that act over time through tools, environments, users, and other agents. Yet most evaluation practice still inherits response-centered assumptions: fixed inputs, isolated outputs, and a judgment made from a single response. Interactive benchmarks have proliferated, but the landscape is fragmented: they disagree on which interaction artifacts they admit, how trajectories are scored, and what claims their results support. This paper's position is that interactive evaluation should be treated as a principled paradigm, not as the next family of agent benchmarks to collect.
The argument turns on a definition: evaluation is an autonomous mapping E: X → Y from admissible evidence X to judgments Y. Interactive evaluation changes both sides. The evidence X expands from final responses to interaction-generated trajectories; the procedure E must assess not just final correctness but process quality, recoverability, coordination, safety, efficiency, and robustness. From this the authors build a two-axis taxonomy (what artifacts enter; how they map to judgments), derive design principles and reporting standards, and locate where current benchmarks concentrate and what they miss.
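To make the mapping E: X → Y concrete, here is a minimal Python sketch. The class names (`Step`, `Trajectory`), the three judgment dimensions, and the scoring formulas are illustrative assumptions, not the paper's definitions. The sketch only shows how the evidence X grows from a final response to a whole trajectory, and how E then returns a judgment with more than one dimension.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    action: str          # what the system did (hypothetical label)
    observation: str     # what the tool or environment returned
    error: bool = False  # whether this step failed


@dataclass
class Trajectory:
    steps: List[Step]
    final_response: str


def evaluate_response(final_response: str, reference: str) -> Dict[str, float]:
    """Response-centered E: the evidence X is the final response only."""
    return {"correct": float(final_response.strip() == reference.strip())}


def evaluate_trajectory(traj: Trajectory, reference: str,
                        step_budget: int) -> Dict[str, float]:
    """Interactive E: the evidence X is the whole trajectory.

    The old measure is kept (additive shift); process measures are added.
    Both process formulas are assumptions chosen for illustration.
    """
    judgment = evaluate_response(traj.final_response, reference)
    n = len(traj.steps)
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    # An error counts as recovered if any later step succeeded.
    recovered = sum(
        1 for i in error_idx
        if any(not s.error for s in traj.steps[i + 1:])
    )
    judgment["efficiency"] = min(1.0, step_budget / n) if n else 1.0
    judgment["recoverability"] = recovered / len(error_idx) if error_idx else 1.0
    return judgment


example = Trajectory(
    steps=[
        Step("search", "no results", error=True),
        Step("search", "found 42"),
        Step("answer", "42"),
    ],
    final_response="42",
)
print(evaluate_trajectory(example, reference="42", step_budget=3))
```

Two systems that give the same final answer get the same score from `evaluate_response`, while `evaluate_trajectory` can separate them by how they got there. That gap is what the expanded evidence is meant to expose.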
Why it matters: the distinction between designing and adopting is the whole point. Adopting interactive benchmarks one at a time produces scores that are incomparable, non-reproducible, and non-extensible: the same fragmentation that plagued early benchmark culture, now at the trajectory level. Treating interactive evaluation as a design science forces explicit protocols, richer trajectory measures, shared infrastructure, and reporting standards that make scores interpretable. The counterpoint the paper concedes: response-centered evaluation remains useful. It is insufficient, not wrong, so the paradigm shift is additive, expanding what counts as evidence rather than discarding the old measures.
— "Interactive Evaluation Requires a Design Science", https://arxiv.org/abs/2605.17829
Related concepts in this collection
- How should we evaluate agent behavior beyond final answers?
  Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
  the concrete shift in evidence and scoring that this paradigm formalizes
- Do interactive evaluations actually solve the benchmark comparison problem?
  Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?
  why a new format does not escape old problems, motivating the design-science framing
- Will AI automation eventually formalize designer taste?
  Designers argue taste is the irreducible human element AI cannot replicate. But does the same automation pattern that formalized other skilled work suggest taste itself will become the next layer to be encoded into evaluation systems?
  both treat evaluation design itself as the contested, formalizable layer of the work
Original note title
interactive evaluation must be designed as a paradigm not adopted as the next benchmark format