Can schema-free graphs objectively evaluate open-ended search?
Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods either gain objectivity at the cost of rigidity or gain richness at the cost of subjectivity.
Open-ended search evaluation faces a dilemma. Fixed-schema scoring — against items, sets, or tables — is objective and stable but cannot represent the complex, irregular knowledge structures real search produces. Free-text evaluation captures that richness but requires rubric design that is subjective and unstable. VibeSearchBench's resolution is a schema-free ground-truth knowledge graph: a directed graph carries no preset structure, so it can model arbitrary relationships relevant to the search intent, yet because it is a graph it still supports fine-grained, objectively verifiable matching. Each task pairs a user persona with such a graph and is scored through a graph-matching framework, escaping both horns of the dilemma.
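To make the idea of objectively verifiable graph matching concrete, here is a minimal sketch that scores a predicted graph against a ground-truth graph by exact overlap of directed edges. It is an illustration under stated assumptions, not VibeSearchBench's actual matching framework: the `Edge` type, the `normalize` and `edge_f1` helpers, the exact-match rule, and the travel-planning data are all hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """A directed, labeled edge: source --relation--> target (hypothetical type)."""
    source: str
    relation: str
    target: str


def normalize(edge: Edge) -> Edge:
    """Lowercase and strip labels so trivially different spellings still match."""
    return Edge(*(part.strip().lower() for part in edge))


def edge_f1(predicted: set, truth: set) -> tuple:
    """Return (precision, recall, F1) over exactly matching directed edges.

    Assumption: an edge counts only if source, relation, and target all match.
    A real framework may align nodes and relations more leniently.
    """
    pred = {normalize(e) for e in predicted}
    gold = {normalize(e) for e in truth}
    matched = len(pred & gold)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical ground truth with two levels of structure.
truth = {
    Edge("Kyoto trip", "includes", "Fushimi Inari"),
    Edge("Fushimi Inari", "best visited", "early morning"),
    Edge("Kyoto trip", "includes", "Nishiki Market"),
    Edge("Nishiki Market", "known for", "street food"),
}

# A structurally flat prediction: every fact hangs directly off the root node.
flat_prediction = {
    Edge("Kyoto trip", "includes", "Fushimi Inari"),
    Edge("Kyoto trip", "includes", "Nishiki Market"),
    Edge("Kyoto trip", "includes", "street food"),
}

precision, recall, f1 = edge_f1(flat_prediction, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the flat prediction recovers only the two top-level edges. It misses both second-level relations and attaches "street food" to the wrong node, so the score drops even though every entity it mentions appears somewhere in the ground truth. No rubric is involved: the score follows mechanically from the two edge sets.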
The pattern generalizes beyond search: whenever the target output is structured but its structure cannot be fixed in advance, a graph ground truth plus graph-matching evaluation offers objectivity without rigidity. The cost is that constructing high-quality ground-truth graphs is labor-intensive (VibeSearchBench's 200 tasks were manually curated), and graph matching introduces scoring choices of its own. Even so, the method is demanding: the best model reaches only 30.30 F1, partly because models produce structurally flat graphs, and the evaluation is hard precisely because it is faithful. The result is a reusable template for evaluating any open-ended generation task whose correct answer is a web of relations rather than a list.
— "VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild", https://arxiv.org/abs/2605.27882
Related concepts in this collection
- Can knowledge graphs generate training data for search agents?
Exploring whether synthesizing questions from knowledge graph random walks with entity blurring can create the hard-to-find training data needed to teach deep search agents to reason and search effectively.
both use knowledge graphs as the substrate for evaluating or training open-ended search
- Can agents evaluate AI outputs more reliably than language models?
Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.
an alternative route to objective evaluation of complex open-ended outputs
- Why do search agents fail users despite strong benchmark scores?
Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?
grounds: names the fixed-schema failure that the schema-free graph is designed to escape, diagnosing the problem this pattern solves
- How should we evaluate agent behavior beyond final answers?
Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
synthesizes: both are instances of expanding evaluation evidence beyond a flat final answer; here the evidence is a relational graph rather than a trajectory
Original note title
a schema-free ground-truth knowledge graph enables objective evaluation of open-ended search