Should interactive evaluation be designed as a unified paradigm?

As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally, one new benchmark at a time?

Note · 2026-05-28 · sourced from Evaluations

AI evaluation is undergoing a structural change: models are increasingly deployed as systems that act over time through tools, environments, users, and other agents. Yet most evaluation practice still inherits response-centered assumptions — fixed inputs, isolated outputs, a judgment made from a single response. Interactive benchmarks have proliferated, but the landscape is fragmented: they disagree on what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This paper's position is that interactive evaluation should be treated as a principled paradigm, not as the next family of agent benchmarks to collect.

The argument turns on a definition: evaluation is a mapping E: X → Y from admissible evidence X to judgments Y. Interactive evaluation changes both sides. The evidence X expands from final responses to interaction-generated trajectories; the procedure E must assess not just final correctness but also process quality, recoverability, coordination, safety, efficiency, and robustness. From this definition the authors build a two-axis taxonomy (which artifacts enter as evidence; how they map to judgments), derive design principles and reporting standards, and identify where current benchmarks concentrate and what they miss.
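To make the E: X → Y framing concrete, here is a minimal Python sketch contrasting a response-centered evaluator with an interactive one. It is an illustration only, not the paper's formalism: the names `Step`, `Trajectory`, `response_centered_eval`, `interactive_eval`, and `step_budget`, as well as the specific efficiency and recoverability formulas, are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# All names and scoring formulas below are hypothetical, chosen for illustration.

@dataclass
class Step:
    action: str          # what the system did (tool call, message, ...)
    observation: str     # what the environment returned
    error: bool = False  # whether this step failed


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    final_response: str = ""


Judgment = Dict[str, float]


def response_centered_eval(traj: Trajectory, reference: str) -> Judgment:
    """E with X restricted to final responses: only the last output is evidence."""
    return {"correct": float(traj.final_response.strip() == reference.strip())}


def interactive_eval(traj: Trajectory, reference: str, step_budget: int) -> Judgment:
    """E with X expanded to whole trajectories: process measures join correctness."""
    judgment = response_centered_eval(traj, reference)

    # Efficiency (assumed definition): share of the step budget left unused.
    used = len(traj.steps)
    judgment["efficiency"] = max(0.0, 1.0 - used / step_budget) if step_budget > 0 else 0.0

    # Recoverability (assumed definition): share of failed steps that are
    # followed by at least one later successful step.
    failed = [i for i, s in enumerate(traj.steps) if s.error]
    recovered = [i for i in failed if any(not s.error for s in traj.steps[i + 1:])]
    judgment["recoverability"] = len(recovered) / len(failed) if failed else 1.0
    return judgment


example = Trajectory(
    steps=[
        Step("search('answer')", "no results", error=True),
        Step("search('the answer')", "found: 42"),
        Step("verify('42')", "ok"),
    ],
    final_response="42",
)
result = interactive_eval(example, reference="42", step_budget=6)
print(result)  # {'correct': 1.0, 'efficiency': 0.5, 'recoverability': 1.0}
```

Both evaluators agree on final correctness, but only the interactive one can see that the run hit a failure and recovered from it. That difference in admitted evidence (the first axis) and in how evidence maps to judgments (the second axis) is what the taxonomy is meant to make explicit.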

Why it matters: the distinction between designing and adopting is the whole point. Adopting interactive benchmarks one at a time produces incomparable, non-reproducible, non-extensible scores — the same fragmentation that plagued early benchmark culture, now at the trajectory level. Treating interactive evaluation as a design science forces explicit protocols, richer trajectory measures, shared infrastructure, and reporting standards that make scores interpretable. The counterpoint the paper concedes: response-centered evaluation remains useful — it is insufficient, not wrong — so the paradigm shift is additive, expanding what counts as evidence rather than discarding the old measures.


— "Interactive Evaluation Requires a Design Science", https://arxiv.org/abs/2605.17829

Original note title

interactive evaluation must be designed as a paradigm not adopted as the next benchmark format