VCBench: Benchmarking LLMs in Venture Capital

Paper · arXiv 2509.14448 · Published September 17, 2025

Abstract. Benchmarks such as SWE-bench and ARC-AGI demonstrate how shared datasets accelerate progress toward artificial general intelligence (AGI). We introduce VCBench, the first benchmark for predicting founder success in venture capital (VC), a domain where signals are sparse, outcomes are uncertain, and even top investors perform modestly. At inception, the market index achieves a precision of 1.9%. Y Combinator outperforms the index by a factor of 1.7×, while tier-1 firms are 2.9× better. VCBench provides 9,000 anonymized founder profiles, standardized to preserve predictive features while resisting identity leakage, with adversarial tests showing more than 90% reduction in re-identification risk. We evaluate nine state-of-the-art large language models (LLMs). DeepSeek-V3 delivers over six times the baseline precision, GPT-4o achieves the highest F0.5, and most models surpass human benchmarks. Designed as a public and evolving resource available at vcbench.com, VCBench establishes a community-driven standard for reproducible and privacy-preserving evaluation of AGI in early-stage venture forecasting.

Introduction. Benchmark datasets have played a defining role in the progress of machine learning (ML). By turning open-ended challenges into standardized and measurable tasks, they have enabled reproducible comparisons and driven entire fields forward. As models advance, there is growing demand for benchmarks that not only test raw accuracy but also allow systematic comparisons between machine and human performance. Venture capital (VC) is a compelling testbed for evaluating expert forecasting, offering a real-world setting to measure whether models can match or surpass human reasoning under uncertainty. Decisions rely on sparse and uncertain signals from founder backgrounds and early company data, while the financial stakes are high. Even leading investors perform modestly. The market index achieves 1.9% precision at inception, while Y Combinator reaches 3.2% (1.7× the index) and tier-1 VC firms are at 5.6% (2.9×). Recent models (Mu et al., 2025; Griffin et al., 2025) show that founder profiles alone can yield strong predictive signals, but the field lacks a standardized benchmark.
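The multiples quoted above are each baseline's precision divided by the market-index precision. A minimal sketch of that arithmetic, using only the percentages stated in the text (the function and variable names are illustrative and do not come from the paper):

```python
def lift(precision: float, baseline: float) -> float:
    """Return how many times better `precision` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline precision must be positive")
    return precision / baseline


# Precision values as reported in the text above.
MARKET_INDEX = 0.019   # 1.9% precision at inception
Y_COMBINATOR = 0.032   # 3.2%
TIER1_FIRMS = 0.056    # 5.6%

yc_lift = lift(Y_COMBINATOR, MARKET_INDEX)
tier1_lift = lift(TIER1_FIRMS, MARKET_INDEX)

# Rounded to one decimal place, these match the multiples quoted in the text.
print(f"Y Combinator: {yc_lift:.1f}x the index")
print(f"Tier-1 firms: {tier1_lift:.1f}x the index")
```

Rounded to one decimal place, the two ratios reproduce the 1.7× and 2.9× figures given in the text.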

Discussion / Conclusion. We introduced VCBench, the first standardized and anonymized benchmark for founder-success prediction in venture capital. The dataset was constructed using a multistage anonymization pipeline, validated by adversarial tests, which reduced re-identification risk by over 90% while preserving predictive signal. Using this benchmark, we evaluated nine state-of-the-art LLMs and found that several outperform not only the market index but also the leading VC firms, with GPT-4o achieving the highest F0.5 score. These results show that anonymized founder profiles are sufficient to surpass human-expert baselines in early-stage venture forecasting. By releasing both the dataset and a public leaderboard, we provide a foundation for reproducible research in this high-stakes domain. VCBench is designed as a community-driven benchmark that will evolve with feedback, richer features, and new evaluation modes, including simulation and human–AI competitions, offering a path toward more realistic tests of decision-making under uncertainty.
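The F0.5 score mentioned above is the standard F-beta measure with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the general formula follows; the precision and recall values in the example are hypothetical and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    With beta < 1, precision counts for more than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values for illustration only (not taken from the paper).
precise_model = f_beta(precision=0.20, recall=0.10)
broad_model = f_beta(precision=0.10, recall=0.20)

# With beta = 0.5, the model with higher precision scores higher,
# even though both have the same precision-recall product.
print(f"precise model F0.5: {precise_model:.3f}")
print(f"broad model   F0.5: {broad_model:.3f}")
```

Under these assumed inputs the precision-heavy model receives the higher F0.5, which illustrates why the metric rewards avoiding false positives.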

