INQUIRING LINE

Do gains from harness-based agents transfer across different search benchmarks?

This explores whether the performance boost a search agent gets from an external 'harness' (offloaded bookkeeping/scaffolding) is a real, portable capability — or just a trick tuned to one benchmark that evaporates on unfamiliar tasks.


The corpus has a direct and encouraging answer. A harness here means scaffolding that holds the agent's working state and bookkeeping outside the model. A 20B model using Harness-1 hit 0.730 average curated recall across eight benchmarks, beating the next open searcher by 11.4 points, and those gains transferred to held-out benchmarks and survived ablation (Can externalizing bookkeeping improve search agent performance?). That last detail matters: gains that survive ablation and show up on tasks outside the training set are evidence that the harness is a learned capability the model genuinely acquired, not a benchmark-specific implementation detail.
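One way to make the transfer claim concrete is to compare the harness's gain on benchmarks used during development with its gain on held-out ones. The sketch below assumes per-benchmark recall scores are available for a baseline and a harness-equipped agent. The function name, benchmark names, and numbers are placeholders for illustration, not results from the note.

```python
from statistics import mean

def transfer_report(baseline, harness, held_out):
    """Compare mean recall gain on seen vs held-out benchmarks.

    baseline, harness: dict mapping benchmark name -> recall in [0, 1]
    held_out: set of benchmark names not used during training or tuning
    """
    gains = {name: harness[name] - baseline[name] for name in baseline}
    seen = mean(g for name, g in gains.items() if name not in held_out)
    unseen = mean(g for name, g in gains.items() if name in held_out)
    return {
        "seen_gain": seen,
        "held_out_gain": unseen,
        # Near 1.0: the gain is portable. Near 0: it was tuned to the seen set.
        "retention": unseen / seen if seen else None,
    }

# Hypothetical scores, for illustration only.
baseline = {"bench_a": 0.50, "bench_b": 0.60, "bench_c": 0.40, "bench_d": 0.50}
harness = {"bench_a": 0.70, "bench_b": 0.80, "bench_c": 0.55, "bench_d": 0.65}
report = transfer_report(baseline, harness, held_out={"bench_c", "bench_d"})
print(report)
```

With these made-up numbers the held-out gain retains about three quarters of the seen gain. A retention figure like this says nothing about whether the benchmarks themselves track real usage.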

But 'transfers across benchmarks' deserves a skeptical second look, because the corpus also shows that search benchmarks themselves are a shaky yardstick. High benchmark scores routinely fail to predict whether real users are satisfied, because benchmarks lean on over-specified queries, single-turn interactions, and fixed schemas. They measure retrieval, not the messier collaborative work of refining what a person actually wants (Why do search agents fail users despite strong benchmark scores?). So a gain can transfer across eight benchmarks and still not transfer to the thing benchmarks are standing in for. Transfer across benchmarks and transfer to the real world are two different claims.

This connects to a broader argument in the corpus that we're measuring the wrong thing entirely. Single-score task success collapses multi-dimensional agent behavior and breeds false confidence in deployment readiness; what we actually want to track is trajectory quality, memory hygiene, context efficiency, and verification cost (agent-evaluation-must-move-beyond-one-shot-task-success-to-trajectory-mea). Read alongside the harness result, this suggests *why* externalizing bookkeeping transfers so well: the harness plausibly improves exactly those underlying dimensions (clean state, low context bloat), which are benchmark-agnostic, rather than gaming any one scoreboard.
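To illustrate how a single success score can hide differences along those dimensions, here is a minimal sketch of a per-run profile. Every field and formula apart from `success` is a hypothetical proxy chosen for illustration; the note names the dimensions but does not define how to measure them.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    """One agent run. All fields except `success` are assumed proxies."""
    success: bool
    steps: int                # trajectory length
    redundant_steps: int      # repeated queries or revisited pages
    stale_memory_reads: int   # reads of outdated stored state
    context_tokens: int       # total tokens fed to the model over the run
    verification_calls: int   # checks spent confirming intermediate results

def profile(run: RunRecord) -> dict:
    steps = max(run.steps, 1)  # avoid division by zero on empty runs
    return {
        "task_success": float(run.success),
        "trajectory_quality": 1.0 - run.redundant_steps / steps,
        "memory_hygiene": 1.0 - run.stale_memory_reads / steps,
        "context_tokens_per_step": run.context_tokens / steps,
        "verification_cost": run.verification_calls / steps,
    }

# Two hypothetical runs that both succeed but behave very differently.
lean = RunRecord(True, steps=10, redundant_steps=1, stale_memory_reads=0,
                 context_tokens=8_000, verification_calls=2)
bloated = RunRecord(True, steps=40, redundant_steps=18, stale_memory_reads=10,
                    context_tokens=96_000, verification_calls=20)
lean_profile = profile(lean)
bloated_profile = profile(bloated)
print(lean_profile)
print(bloated_profile)
```

Both runs score 1.0 on task success, so a single-score leaderboard would treat them as equal, while the other fields separate them clearly.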

There's also a cautionary thread about whether harness gains are stable as systems get more elaborate. An eight-module agentic evaluator cut judge shift on complex tasks from 31% (LLM-as-a-Judge) to 0.27%, roughly a 100x reduction, but its memory module cascaded errors, showing that added scaffolding can both lift performance and introduce new failure surfaces that need error isolation (Can agents evaluate AI outputs more reliably than language models?). The lesson cross-applies to search harnesses: more state to externalize means more state that can corrupt and propagate.

If you want to keep pulling this thread, the corpus also reframes search itself as a test-time compute axis: search budget scales like reasoning tokens, so a harness that buys more efficient search steps is buying the same kind of scaling leverage you'd get from more reasoning (Does search budget scale like reasoning tokens for answer quality?). The thing that doesn't transfer cleanly is exploration diversity. RL training compresses search agents into narrow reward-maximizing strategies through the same entropy collapse seen in reasoning, which is exactly the kind of overfitting that *would* hurt cross-benchmark generalization unless you preserve diversity with SFT on varied demonstrations (Does reinforcement learning squeeze exploration diversity in search agents?). So the honest synthesis: structural harness gains transfer; reward-narrowed behaviors don't.
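Entropy collapse can be made tangible with a small measurement. The sketch below computes the Shannon entropy of an agent's empirical distribution over search actions; the action labels and both logs are invented for illustration, and this is one simple proxy for exploration diversity rather than the metric used in the cited note.

```python
import math
from collections import Counter

def action_entropy(actions):
    """Shannon entropy, in bits, of the empirical distribution over action labels."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical action logs: one agent spreads evenly over four strategies,
# the other has converged almost entirely on a single one.
diverse = ["rewrite", "broaden", "narrow", "follow_link"] * 5
collapsed = ["rewrite"] * 19 + ["narrow"]

diverse_bits = action_entropy(diverse)
collapsed_bits = action_entropy(collapsed)
print(diverse_bits, collapsed_bits)
```

The evenly spread log sits at the maximum of 2 bits for four actions, while the converged log falls well below 1 bit. Tracking a figure like this across RL training steps is one way to notice diversity shrinking before it shows up as poor held-out performance.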


Sources 6 notes

Can externalizing bookkeeping improve search agent performance?

A 20B model using Harness-1 achieved 0.730 average curated recall across eight benchmarks, outperforming the next open searcher by 11.4 points. The gains transfer to held-out benchmarks and survive ablation, showing the harness is not mere implementation but a learned capability.

Why do search agents fail users despite strong benchmark scores?

Search benchmarks use over-specified queries, single-turn interactions, and fixed schemas—none of which match real search. These design choices make benchmarks measure retrieval, not collaborative intent refinement, explaining why high scores don't predict user satisfaction.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
