What makes a trajectory score interpretable across different interactive benchmarks?

This note explores what it takes for a single score over an agent's full run (its trajectory of actions) to mean the same thing when carried from one interactive benchmark to another.

The question, restated: when you reduce a whole sequence of agent actions to a number, what has to be true for that number to be comparable and trustworthy across different interactive setups? The corpus's blunt answer is that interactive evaluation makes this harder, not easier. Moving from one-shot answers to full trajectories doesn't dissolve the old problems of comparability, reproducibility, and mapping evidence to a judgment; it relocates them into a higher-dimensional space where they're harder to see (see "Do interactive evaluations actually solve the benchmark comparison problem?"). Interpretability isn't a property of the format; it has to be engineered through shared design protocols and standards.

The first thing the corpus says you need is to stop pretending one number can carry the load. Agent capability turns out to be a vector (task success, privacy compliance, long-horizon retention, behavior under mode shifts, ecosystem readiness), and models that top one axis often rank lower on another, so a single collapsed score is systematically misleading about real deployment (see "Does a single benchmark score actually predict agent readiness?"). A trajectory score is only interpretable once you know which axis it's scoring.
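To make the vector point concrete, here is a minimal sketch of how a collapsed mean can hide an axis-level reversal. The model names and all numbers are hypothetical, chosen only for illustration; they are not results from any benchmark in the corpus.

```python
# Hypothetical per-axis scores, for illustration only (not measured results).
AXES = ["task_success", "privacy_compliance", "long_horizon_retention"]

scores = {
    "model_a": {"task_success": 0.90, "privacy_compliance": 0.40,
                "long_horizon_retention": 0.60},
    "model_b": {"task_success": 0.70, "privacy_compliance": 0.85,
                "long_horizon_retention": 0.65},
}


def rank_by_axis(scores, axis):
    """Return model names ordered best-first on a single axis."""
    return sorted(scores, key=lambda m: scores[m][axis], reverse=True)


def collapsed(scores):
    """Unweighted mean over all axes: the single number the text warns about."""
    return {m: sum(v[a] for a in AXES) / len(AXES) for m, v in scores.items()}


per_axis_winner = {axis: rank_by_axis(scores, axis)[0] for axis in AXES}
mean_scores = collapsed(scores)

for axis, winner in per_axis_winner.items():
    print(f"{axis}: best = {winner}")
print("collapsed means:", mean_scores)
```

In this toy setup the collapsed mean favors model_b, while model_a is clearly ahead on task success. Which model is "better" depends on the axis being read, and the single number cannot say which one that is.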

The second thing is being clear about what the score is anchored to. There's a sharp split between scoring the final outcome and scoring the reasoning trace itself. Grading only the verifiable end-state against deterministic ground truth keeps the number honest; grading the trace rewards stylistic mimicry and inflates results. On LR²Bench, final-answer scoring reveals a 20% ceiling that trace-based grading would inflate (see "Should reasoning benchmarks score final answers or reasoning traces?"). Anchoring to something verifiable is what lets a score travel between benchmarks without quietly changing meaning.
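Outcome-only anchoring can be sketched as a grader that never reads the trace. This is an illustrative assumption about what such a grader might look like, not the scoring code of any benchmark named in the sources; the function names, the normalization rule, and the run records are all hypothetical.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and case so formatting noise does not affect grading."""
    return " ".join(text.split()).lower()


def grade_outcome(final_answer: str, ground_truth: str) -> bool:
    """Exact match on the normalized final answer; the trace is never consulted."""
    return normalize(final_answer) == normalize(ground_truth)


def outcome_accuracy(runs) -> float:
    """runs: list of dicts with 'trace', 'final_answer', 'ground_truth' keys."""
    if not runs:
        return 0.0
    correct = sum(grade_outcome(r["final_answer"], r["ground_truth"]) for r in runs)
    return correct / len(runs)


# Hypothetical runs: a fluent trace with a wrong answer, a terse one that is right.
runs = [
    {"trace": "First, carefully consider... therefore...", "final_answer": "17",
     "ground_truth": "19"},
    {"trace": "19", "final_answer": " 19 ", "ground_truth": "19"},
]

print(outcome_accuracy(runs))
```

Because the grader ignores the `trace` field entirely, a polished but wrong trajectory earns nothing, which is the property that lets the number keep its meaning across benchmarks with different trace styles.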

But there's a tension worth surfacing: outcome-only anchoring is robust, yet a lot of the signal you'd want lives inside the trajectory. The corpus shows you can extract dense, interpretable structure from the trajectory itself: tree topology, expert-aligned actions, and tool-call positions can each turn sparse outcome signals into legible step-level signals without hand-annotation (see "Can trajectory structure replace hand-annotated process rewards?"). It also shows that local, step-level confidence catches reasoning breakdowns that a single averaged score smooths over entirely (see "Does step-level confidence outperform global averaging for trace filtering?"). So interpretability is partly about granularity: where you aggregate determines what you can see.
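The granularity point can be illustrated with a small sketch that compares a global mean of per-step confidences against the lowest sliding-window mean. The window size, the function names, and the confidence values are assumptions made for illustration; the sources do not specify this exact aggregation.

```python
def global_confidence(step_conf):
    """Mean confidence over the whole trace."""
    if not step_conf:
        raise ValueError("trace has no steps")
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Minimum of sliding-window means; short traces fall back to the global mean."""
    if len(step_conf) <= window:
        return global_confidence(step_conf)
    return min(
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    )


# Hypothetical per-step confidences: similar on average, very different locally.
steady = [0.7] * 8
breakdown = [0.95, 0.95, 0.95, 0.3, 0.3, 0.3, 0.95, 0.95]

for name, trace in [("steady", steady), ("breakdown", breakdown)]:
    print(name, round(global_confidence(trace), 3),
          round(lowest_window_confidence(trace), 3))
```

The two traces have nearly the same global mean, but only the windowed view exposes the three-step collapse in the second one. A score reported at trajectory granularity would treat them as equivalent.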

The quiet warning underneath all of this: a clean-looking number can be measuring the wrong thing. Benchmark gains can reflect memorization rather than capability: a model that reconstructs test items from partial prompts scores well on contaminated sets and collapses on fresh ones (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). Genuine behavioral activation and benchmark improvement are also separable phenomena that can coexist without either implying the other (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). The upshot: a trajectory score is interpretable across benchmarks not when it's precise, but when you can say exactly which axis it measures, what verifiable thing it's anchored to, at what granularity it aggregates, and whether the benchmark itself is clean. Drop any one of those and the number stops meaning the same thing the moment it crosses to a new benchmark.
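One way to read the four conditions is as metadata that has to travel with the number. The sketch below is a hypothetical record type and check, not an existing standard or library; the field names, benchmark names, and values are assumptions chosen to mirror the conditions listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreCard:
    """A trajectory score plus the metadata needed to interpret it (hypothetical)."""
    benchmark: str
    value: float
    axis: str                    # e.g. "task_success"
    anchor: str                  # e.g. "final_outcome" or "reasoning_trace"
    granularity: str             # e.g. "trajectory" or "step"
    contamination_checked: bool


def comparability_issues(a: ScoreCard, b: ScoreCard) -> list:
    """List the reasons two scores should not be compared directly."""
    issues = []
    for name in ("axis", "anchor", "granularity"):
        if getattr(a, name) != getattr(b, name):
            issues.append(f"{name} differs: {getattr(a, name)!r} vs {getattr(b, name)!r}")
    for card in (a, b):
        if not card.contamination_checked:
            issues.append(f"{card.benchmark}: contamination not checked")
    return issues


# Hypothetical score cards for two made-up benchmarks.
card_x = ScoreCard("bench_x", 0.62, "task_success", "final_outcome", "trajectory", True)
card_y = ScoreCard("bench_y", 0.71, "task_success", "reasoning_trace", "trajectory", False)

issues = comparability_issues(card_x, card_y)
for issue in issues:
    print(issue)
```

Here the two scores share an axis and granularity, but one is anchored to traces and comes from a benchmark with no contamination check, so the check reports two issues and the raw values 0.62 and 0.71 should not be ranked against each other.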

Sources (7 notes)

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
