Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?

Note · 2026-05-28 · sourced from Evaluations

The seductive promise of interactive evaluation is that richer evidence will fix the problems of response-centered benchmarks. The position paper cited below resists this. Its analysis shows that longstanding evaluation challenges (comparability, reproducibility, the validity of the evidence-to-judgment mapping, and the question of what claims a score actually supports) reappear at the trajectory level rather than disappearing. Scoring a path instead of an endpoint does not escape the core difficulty of evaluation; it relocates that difficulty into a higher-dimensional space where it is, if anything, harder to pin down.

This is why the paper frames the situation as a question demanding design, not a solution already in hand. A trajectory admits many scoring choices, and different interactive benchmarks make incompatible ones, so their results are not interchangeable. This is the same fragmentation that response benchmarks eventually had to standardize away, now recurring with more degrees of freedom. Process quality, recoverability, and coordination are genuinely informative, but each introduces its own version of the old questions: what counts as evidence, how is it aggregated into a judgment, and what does the resulting number license you to claim?
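To make the incompatibility concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration rather than anything taken from the paper: the `Step` and `Trajectory` types, the per-step `ok` label, the two scorers, and the toy data for agents A and B are all assumptions. It shows only that two reasonable scoring choices can rank the same pair of agents in opposite orders.

```python
from dataclasses import dataclass


@dataclass
class Step:
    ok: bool  # hypothetical per-step label: did this action succeed?


@dataclass
class Trajectory:
    steps: list[Step]
    solved: bool  # final task outcome


def endpoint_score(t: Trajectory) -> float:
    """Response-style scoring: only the final outcome counts."""
    return 1.0 if t.solved else 0.0


def process_score(t: Trajectory) -> float:
    """One possible trajectory-level choice: fraction of clean steps."""
    return sum(s.ok for s in t.steps) / len(t.steps)


def rank(agents: dict[str, list[Trajectory]], scorer) -> list[str]:
    """Order agent names by mean score under the given scorer, best first."""
    def mean_score(name: str) -> float:
        runs = agents[name]
        return sum(scorer(t) for t in runs) / len(runs)
    return sorted(agents, key=mean_score, reverse=True)


# Toy data (assumed): A stumbles but always finishes; B runs cleanly but fails once.
agents = {
    "A": [
        Trajectory([Step(False), Step(False), Step(True)], solved=True),
        Trajectory([Step(False), Step(True), Step(True)], solved=True),
    ],
    "B": [
        Trajectory([Step(True), Step(True), Step(True)], solved=False),
        Trajectory([Step(True), Step(True), Step(True)], solved=True),
    ],
}

print(rank(agents, endpoint_score))  # A ranked above B
print(rank(agents, process_score))   # B ranked above A
```

Under these made-up trajectories, the endpoint scorer prefers A and the step-level scorer prefers B, so a score reported without its scoring rule cannot be compared across benchmarks.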

Why it stays open: the honest reading is that interactive evaluation buys richer evidence at the cost of reintroducing the old hard problems at a new scale. The field's task is therefore not simply to adopt the format but to build the protocols, robustness tests, shared infrastructure, and reporting standards that make trajectory scores interpretable, and that work is unfinished. Treating the new paradigm as a fix would repeat the mistake of assuming that a richer format settles what a score means; treating it as a design problem is the corrective the paper argues for.

— "Interactive Evaluation Requires a Design Science", https://arxiv.org/abs/2605.17829

Related concepts in this collection

Should interactive evaluation be designed as a unified paradigm? As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
the reappearance of old challenges is the central motivation for designing rather than adopting
How should we evaluate agent behavior beyond final answers? Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
the evidence expansion that creates the higher-dimensional space where old problems recur
Should we evaluate deployed agents as whole environments instead? Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?
extends: enlarging the unit of evaluation is precisely what reintroduces comparability and reproducibility problems at the new scale
What should we actually measure in agent evaluation? Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
exemplifies the new dimensions (memory hygiene, verification cost) that each carry their own version of the old evidence-to-judgment questions

Original note title

longstanding evaluation challenges reappear at the trajectory level rather than disappearing
