How should we evaluate agent behavior beyond final answers?
Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
If evaluation is the map E: X → Y from admissible evidence to judgments, then the shift to agentic systems changes both the evidence X and the procedure E, and the same change recurs from one benchmark to the next. On the evidence side (X), the unit expands from a single final response to a full interaction-generated trajectory: the sequence of states, actions, tool calls, and environment responses produced as the system acts in a closed loop. On the procedure side (E), final correctness is no longer sufficient; the evaluator must also score process quality, recoverability (can the agent get back on track after an error?), coordination (across tools, environments, and other agents), robustness, efficiency, and system-level performance.
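To make the E: X → Y framing concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything defined by the cited paper: the `Step` and `Trajectory` types, the `error` flag, and the deliberately crude `recovery_rate` heuristic (an error counts as recovered if any later step succeeds). It only shows how the evidence X grows from a final answer to a trajectory, and how E then returns several judgments instead of one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One closed-loop interaction: what the agent did and what came back."""
    action: str            # hypothetical label, e.g. a tool name
    observation: str       # environment response
    error: bool = False    # assumed flag: the environment reported a failure


@dataclass
class Trajectory:
    """The expanded evidence unit X: the whole path plus the endpoint."""
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""


def final_correct(traj: Trajectory, expected: str) -> float:
    """Endpoint-only scoring: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if traj.final_answer == expected else 0.0


def recovery_rate(traj: Trajectory) -> Optional[float]:
    """Fraction of error steps followed by at least one later non-error step.

    Returns None when no error occurred, since recovery is then undefined.
    """
    error_idx = [i for i, s in enumerate(traj.steps) if s.error]
    if not error_idx:
        return None
    recovered = sum(
        1 for i in error_idx if any(not s.error for s in traj.steps[i + 1:])
    )
    return recovered / len(error_idx)


def evaluate(traj: Trajectory, expected: str) -> Dict[str, Optional[float]]:
    """A toy E: maps trajectory evidence to several judgments at once."""
    return {
        "final_correct": final_correct(traj, expected),
        "recovery": recovery_rate(traj),
        "num_steps": len(traj.steps),  # crude efficiency proxy
    }


traj = Trajectory(
    steps=[
        Step("search", "3 results"),
        Step("open_page", "timeout", error=True),
        Step("open_page", "page text"),
    ],
    final_answer="42",
)
report = evaluate(traj, "42")
```

An endpoint-only evaluator would see just `final_answer`; the timeout and the successful retry are visible only because the trajectory is admitted as evidence.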
This is a pattern, not a single metric, because the same expansion recurs across otherwise unrelated agent benchmarks. T-Eval scores whether each predicted tool call matches the expected one; AgentBoard's Progress Rate compares the actual trajectory against the expected trajectory; multi-agent frameworks score collaborative efficiency and how well agents distribute tasks dynamically. Each is an instance of "stop scoring the endpoint, start scoring the path." The trajectory becomes the evidence, and the qualities that exist only over time (recovery, coordination, partial progress) become the things judged.
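The two path-level ideas named above can be sketched in a few lines of Python. These are simplified stand-ins and not the actual T-Eval or AgentBoard implementations: `tool_call_match_rate` assumes strict positional, exact-match comparison of tool calls, and `progress_rate` assumes the expected trajectory can be summarized as a set of named subgoals. The function names, the subgoal labels, and the example tool calls are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def tool_call_match_rate(
    predicted: List[ToolCall], expected: List[ToolCall]
) -> float:
    """Fraction of expected steps whose predicted call matches exactly.

    Steps are compared position by position; missing predictions count
    as mismatches because the denominator is the expected length.
    """
    if not expected:
        return 0.0
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


def progress_rate(reached: Iterable[str], expected: Iterable[str]) -> float:
    """Fraction of expected subgoals reached at any point in the trajectory."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    return len(set(reached) & expected_set) / len(expected_set)


expected_calls: List[ToolCall] = [
    ("search", {"q": "weather"}),
    ("get_forecast", {"city": "Paris"}),
    ("answer", {}),
]
predicted_calls: List[ToolCall] = [
    ("search", {"q": "weather"}),
    ("get_forecast", {"city": "Paris"}),
    ("search", {"q": "weather again"}),  # diverges at the last step
]

match = tool_call_match_rate(predicted_calls, expected_calls)
progress = progress_rate(
    reached=["found_city", "got_forecast"],
    expected=["found_city", "got_forecast", "reported"],
)
```

In this example the agent never produces a final answer, so endpoint scoring would give it zero, while both path-level scores credit the two of three steps it got right.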
Why it matters: this reframes a scattered set of agent metrics as a coherent move. Once you see the scoring of process, recoverability, and coordination as the trajectory-level analogue of final-answer scoring, you can ask the design-science questions (which artifacts to admit, how to map them to judgments) systematically rather than benchmark by benchmark. The counterpoint: richer evidence is also noisier and harder to standardize, which is precisely why the expansion creates new evaluation challenges rather than dissolving the old ones.
— "Interactive Evaluation Requires a Design Science", https://arxiv.org/abs/2605.17829
Related concepts in this collection
-
Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
the paradigm whose evidence-and-procedure expansion this pattern describes concretely
-
Can trajectory structure replace hand-annotated process rewards?
Recent methods extract step-level supervision directly from how agent trajectories are structured—trees, expert alignments, tool calls—rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
operationalizes trajectory-as-evidence for training, complementing trajectory-as-evidence for evaluation
-
Does agent interaction time scale separately from reasoning depth?
Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.
the capability side: interaction-horizon abilities are exactly what trajectory-level evaluation is needed to measure
Original note title
agent evaluation expands evidence from final responses to interaction trajectories scoring process recoverability and coordination