VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Paper · arXiv 2605.27882
Deep Research Agents · Question Answering and Search · LLM Agents · LLM Evaluations and Benchmarks

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation–experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

A fundamental reason for this gap is the mismatch between how benchmarks frame search tasks and how users actually search. In practice, most users do not, and indeed cannot, fully articulate their information needs upfront. A realistic search session unfolds as an iterative user-agent interaction: (USER) a vague query → (AGENT) partial results and clarification → (USER) emerging preferences and needs → (AGENT) an adjusted search direction → (further user-agent interaction) ... → the information need gradually converges into a concrete solution. We term this class of tasks VibeSearch. Existing mainstream search benchmarks fail to capture the VibeSearch paradigm in three critical ways. (1) Over-specified queries. Task constraints are exhaustively and explicitly packed into a single prompt, leaving no room for the agent to actively elicit user intent. (2) Single-turn interaction. Current benchmarks do not support sustained user-agent interaction, thereby skipping the most challenging and valuable step in VibeSearch: proactively and continuously mining the user's true search intent. (3) Fixed-schema outputs and evaluation. Outputs are evaluated against predetermined structures such as items, sets, or tables. However, real-world knowledge relationships are inherently complex, and user search intent is difficult to model with rigid schemas.
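The interaction loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the benchmark's simulator: the `SimulatedUser` class, the keyword-triggered disclosure rule, and the scripted agent messages are all hypothetical and chosen to show how a vague query can converge over several turns.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Hypothetical user who starts vague and reveals needs only when prompted."""
    initial_query: str
    hidden_needs: dict  # topic keyword -> preference text (assumed structure)
    revealed: set = field(default_factory=set)

    def reply(self, agent_message: str) -> str:
        # Assumption: a need is disclosed when the agent's message mentions its topic.
        text = agent_message.lower()
        for topic, preference in self.hidden_needs.items():
            if topic in text and topic not in self.revealed:
                self.revealed.add(topic)
                return preference  # disclose at most one need per turn
        return "That looks fine so far. Please continue."

    @property
    def converged(self) -> bool:
        return len(self.revealed) == len(self.hidden_needs)


def run_session(user: SimulatedUser, agent_turns: list) -> list:
    """Alternate agent and user turns until every hidden need is revealed."""
    transcript = [("user", user.initial_query)]
    for message in agent_turns:
        transcript.append(("agent", message))
        transcript.append(("user", user.reply(message)))
        if user.converged:
            break
    return transcript


user = SimulatedUser(
    initial_query="I need a laptop for travel.",
    hidden_needs={
        "budget": "Keep it under 1000 dollars.",
        "battery": "It should last a full flight.",
    },
)
agent_turns = [
    "Here are three light laptops. Do you have a budget in mind?",
    "Narrowed to two. How important is battery life?",
    "Final pick attached.",
]
transcript = run_session(user, agent_turns)
for speaker, message in transcript:
    print(f"{speaker}: {message}")
```

In this toy run the agent interleaves partial results with questions, and the session stops once both hidden needs have surfaced. A real simulator would presumably use an LLM rather than keyword matching to decide what to disclose.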

We argue that effective VibeSearch systems should adhere to two principles. First, search should be a process of bidirectional convergence, not unidirectional answering. Users often cannot articulate their preferences until they have seen some relevant information; the agent should therefore interleave returning partial results with asking follow-up questions, co-evolving vague needs into concrete solutions with the user, rather than following a "clarify first, search later" two-stage pipeline. Second, outputs and evaluation should be grounded in schema-free structured information. Fixed-schema evaluation, while objective and stable, is misaligned with the complex knowledge structures found in the real world; free-text evaluation requires rubric design that is inherently subjective and unstable. We observe that a directed graph without any preset schema can model arbitrary target information relevant to the search intent, while still enabling fine-grained, objectively verifiable evaluation.
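To make the idea of objectively verifiable evaluation over a schema-free graph concrete, here is a minimal sketch. It assumes the graph is stored as (source, relation, target) triples and that edges match by exact comparison after lowercasing and trimming. Both choices are simplifications made for illustration; the benchmark's actual graph-matching procedure is not specified in this text, and the `Edge` type and `graph_f1` function are hypothetical names.

```python
from typing import Iterable, NamedTuple, Tuple


class Edge(NamedTuple):
    """One directed edge; node and relation labels are free text, with no preset schema."""
    source: str
    relation: str
    target: str


def normalize(edge: Edge) -> Edge:
    # Assumption: case and surrounding whitespace do not matter when matching.
    return Edge(*(part.strip().lower() for part in edge))


def graph_f1(predicted: Iterable[Edge], gold: Iterable[Edge]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching edges."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    matched = len(pred & ref)
    precision = matched / len(pred)
    recall = matched / len(ref)
    if matched == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [
    Edge("Laptop A", "has price", "899 USD"),
    Edge("Laptop A", "has battery life", "14 hours"),
    Edge("Laptop A", "sold by", "Store X"),
    Edge("Store X", "ships to", "Germany"),
]
predicted = [
    Edge("laptop a", "has price", "899 USD"),
    Edge("Laptop A", "sold by", "Store X "),
    Edge("Laptop A", "has weight", "1.2 kg"),
]
precision, recall, f1 = graph_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted edges match two of the four gold edges, so precision is 2/3, recall is 1/2, and F1 is 4/7. Exact matching is deliberately strict; a practical evaluator would likely need softer matching of node and relation labels.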

We introduced VibeSearchBench, a benchmark for evaluating LLM agents on long-horizon proactive search, where agents must collaboratively refine vague user intent through multi-turn interaction and produce schema-free information graphs. Evaluation of seven frontier models under both ReAct and OpenClaw shows that even the best model achieves only 30.30 F1, with context overflow, inefficient intent elicitation, and structurally flat knowledge graph outputs identified as key bottlenecks. Ablation further confirms that architectural enhancements (sub-agents, local memory, life-long memory) yield no meaningful gains. Moreover, the inconsistent framework effects across models (e.g., OpenClaw improves Claude but leaves Kimi unchanged) underscore that optimizing for widely adopted agent harnesses is critical for real-world deployment.