Do interactive evaluations actually solve the benchmark comparison problem?
Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?
The seductive promise of interactive evaluation is that richer evidence solves the problems of response-centered benchmarks. The position paper resists this. Its analysis shows that longstanding evaluation challenges — comparability, reproducibility, the validity of the evidence-to-judgment mapping, what claims a score actually supports — reappear at the trajectory level rather than disappearing. Scoring a path instead of an endpoint does not escape the core difficulty of evaluation; it relocates it into a higher-dimensional space where it is, if anything, harder to pin down.
This is why the paper frames the situation as a question demanding design, not a solution already in hand. A trajectory admits many scoring choices, and different interactive benchmarks make incompatible ones, so their results are not interchangeable — the same fragmentation that response benchmarks eventually had to standardize away, now recurring with more degrees of freedom. Process quality, recoverability, and coordination are genuinely informative, but each introduces its own version of the old questions: what counts as evidence, how is it aggregated into a judgment, and what does the resulting number license you to claim?
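The claim that incompatible scoring choices make results non-interchangeable can be made concrete with a toy sketch. Everything below is hypothetical and not taken from the paper: the `Trajectory` record, the three scoring functions, and the two made-up agents are assumptions chosen only to show that one set of trajectories can yield different rankings depending on what the benchmark decides to count as evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """Hypothetical episode record: per-step success flags plus the final outcome."""
    step_ok: List[bool]
    solved: bool


def endpoint_score(t: Trajectory) -> float:
    """Response-style scoring: only the final outcome counts."""
    return 1.0 if t.solved else 0.0


def process_score(t: Trajectory) -> float:
    """Process-quality scoring: fraction of steps that succeeded."""
    return sum(t.step_ok) / len(t.step_ok)


def recovery_score(t: Trajectory) -> float:
    """Recoverability scoring: fraction of failed steps immediately followed by a success."""
    failures = [i for i, ok in enumerate(t.step_ok) if not ok]
    if not failures:
        return 1.0
    recovered = sum(
        1 for i in failures if i + 1 < len(t.step_ok) and t.step_ok[i + 1]
    )
    return recovered / len(failures)


def rank(agents: Dict[str, Trajectory],
         scorer: Callable[[Trajectory], float]) -> List[str]:
    """Order agent names from best to worst under the given scorer."""
    return sorted(agents, key=lambda name: scorer(agents[name]), reverse=True)


# Made-up data: agent_a stumbles mid-episode but finishes the task;
# agent_b executes cleanly until a final failed step.
AGENTS = {
    "agent_a": Trajectory(step_ok=[True, False, False, True], solved=True),
    "agent_b": Trajectory(step_ok=[True, True, True, False], solved=False),
}

RANKINGS = {
    scorer.__name__: rank(AGENTS, scorer)
    for scorer in (endpoint_score, process_score, recovery_score)
}

for scorer_name, order in RANKINGS.items():
    print(f"{scorer_name}: {order}")
```

Under these assumptions the endpoint and recovery scorers place `agent_a` first, while the process scorer places `agent_b` first. None of the three is wrong; they answer different questions, which is why a number from one benchmark cannot be read as a number from another without a shared protocol.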
Why it stays open: the honest reading is that interactive evaluation buys richer evidence at the cost of reintroducing every hard problem at a new scale. The field's task is therefore not merely to adopt the format but to build the protocols, robustness tests, shared infrastructure, and reporting standards that make trajectory scores interpretable, and that work is unfinished. Treating the new paradigm as a fix would repeat the mistake of assuming that richer evidence resolves these problems by itself; treating it as a design problem is the corrective the paper argues for.
— "Interactive Evaluation Requires a Design Science", https://arxiv.org/abs/2605.17829
Related concepts in this collection
- Should interactive evaluation be designed as a unified paradigm?
  As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
  the reappearance of old challenges is the central motivation for designing rather than adopting
- How should we evaluate agent behavior beyond final answers?
  Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
  the evidence expansion that creates the higher-dimensional space where old problems recur
- Should we evaluate deployed agents as whole environments instead?
  Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?
  extends: enlarging the unit of evaluation is precisely what reintroduces comparability and reproducibility problems at the new scale
- What should we actually measure in agent evaluation?
  Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
  exemplifies the new dimensions (memory hygiene, verification cost) that each carry their own version of the old evidence-to-judgment questions
Original note title
longstanding evaluation challenges reappear at the trajectory level rather than disappearing