Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Note · 2026-05-28 · sourced from Deep Research

There is a persistent gap between how well search agents score and how satisfied real users are, and VibeSearchBench locates its cause in the benchmarks themselves rather than the models. Three artifacts of standard benchmark design make the test unlike real search. First, over-specified queries: task constraints are exhaustively packed into one prompt, leaving the agent nothing to elicit, whereas real users cannot fully articulate their needs upfront. Second, single-turn interaction: benchmarks skip the sustained back-and-forth where the hardest and most valuable work happens, namely mining the user's true intent. Third, fixed-schema outputs: results are scored against predetermined items, sets, or tables, but real knowledge relationships are too complex for rigid schemas.

The implication is that high benchmark scores can be an artifact of a test that has pre-solved the parts users actually struggle with. When the query is already complete, the interaction is single-turn, and the output is schema-matched, the agent is doing retrieval, not search; real search is collaborative refinement of vague intent. The counterpoint is that over-specified single-turn benchmarks are cheap, reproducible, and objective: they trade realism for measurability. But that trade is exactly what produces the evaluation-experience gap. The practical upshot is a warning against trusting search-agent leaderboards as deployment signals, and a pointer to what realistic evaluation must restore: vagueness, multi-turn dialogue, and open-ended structure.
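
To make the contrast concrete, here is a minimal Python sketch of the two evaluation styles. It is not VibeSearchBench's actual method. `SimulatedUser`, `fixed_schema_score`, `elicitation_coverage`, and the laptop scenario are all hypothetical names and data invented for illustration. The first function scores a single-turn answer against a fixed gold set; the second measures how much hidden intent an agent surfaced through dialogue, which a fixed-schema score never observes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimulatedUser:
    """Hypothetical user who starts vague and reveals constraints only when asked."""
    vague_query: str
    hidden_constraints: dict  # topic -> constraint, absent from the first prompt
    revealed: dict = field(default_factory=dict)

    def answer(self, topic: str) -> Optional[str]:
        """Reveal a constraint if the agent asks about a topic the user cares about."""
        constraint = self.hidden_constraints.get(topic)
        if constraint is not None:
            self.revealed[topic] = constraint
        return constraint


def fixed_schema_score(predicted: set, gold: set) -> float:
    """Single-turn, fixed-schema scoring: F1 of predicted items against a gold set."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def elicitation_coverage(user: SimulatedUser) -> float:
    """Fraction of the user's hidden constraints the agent managed to surface."""
    if not user.hidden_constraints:
        return 1.0
    return len(user.revealed) / len(user.hidden_constraints)


# Illustrative scenario (made-up data): the user's first message is vague.
user = SimulatedUser(
    vague_query="find me a good laptop",
    hidden_constraints={
        "budget": "under $1000",
        "use": "video editing",
        "weight": "under 1.5 kg",
    },
)

# A hypothetical agent asks three clarifying questions; one misses entirely.
for topic in ["budget", "use", "color"]:
    user.answer(topic)

f1 = fixed_schema_score({"A", "B", "C"}, {"A", "B", "D"})
coverage = elicitation_coverage(user)
print(f"fixed-schema F1: {f1:.2f}, elicitation coverage: {coverage:.2f}")
```

The two numbers answer different questions. An over-specified benchmark hands the agent every constraint up front, so only the first metric can vary, and the agent's ability to ask the right questions goes unmeasured.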


— "VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild", https://arxiv.org/abs/2605.27882

Original note title

search agents score well on benchmarks yet users find results unsatisfying because benchmarks use over-specified queries single turns and fixed schemas