Can language models beat human venture capital experts?
Explores whether LLMs can outperform top investors at founder success prediction in a domain where even experts show only modest accuracy. Tests whether AI forecasting is competitive in sparse-signal, high-uncertainty settings.
Venture capital is a clean testbed for expert forecasting under uncertainty: signals are sparse, outcomes are uncertain, and even top investors perform modestly in absolute terms. At inception the market index achieves only 1.9% precision; Y Combinator reaches ~3.2% (1.7× the index) and tier-1 firms ~5.6% (2.9×). VCBench standardizes 9,000 anonymized founder profiles (adversarial tests cut re-identification risk by more than 90% while preserving predictive signal) and evaluates nine LLMs. Most of the models beat the human benchmarks: DeepSeek-V3 delivers over six times the index precision, and GPT-4o achieves the highest F0.5.
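The multiples quoted above are precision divided by the index precision, and F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal Python sketch of both calculations follows. The function names `lift` and `f_beta` are illustrative, and the precision/recall pair passed to `f_beta` is a made-up example rather than a VCBench result.

```python
def lift(precision: float, base_rate: float) -> float:
    """Precision expressed as a multiple of the market-index precision."""
    return precision / base_rate


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


INDEX_PRECISION = 0.019  # market-index precision quoted in the note

# Approximate human baselines quoted in the note.
baselines = {"Y Combinator": 0.032, "tier-1 firms": 0.056}

# Reproduce the multiples over the index (1.7x and 2.9x).
lifts = {name: round(lift(p, INDEX_PRECISION), 1) for name, p in baselines.items()}
print(lifts)

# "Over six times the index" implies a precision above roughly 11.4%.
print(f"6x index precision = {6 * INDEX_PRECISION:.1%}")

# Hypothetical precision/recall pair, for illustration only (not from VCBench).
print(f"F0.5 = {f_beta(precision=0.10, recall=0.20):.3f}")
```

Because F0.5 rewards precision, a model that flags few founders but is right more often can score higher than one with broader but noisier picks, which suits a setting where each false positive is a costly investment.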
The key takeaway is a structural point about where LLMs win: in low-base-rate, sparse-signal forecasting, modest absolute accuracy can still beat expert humans because the human bar is itself modest. This reframes the question "can AI match experts?": in domains where expertise yields only a small edge over chance, the bar to exceed experts is correspondingly low, and anonymized profile features alone are enough to clear it.
This complements the vault's forecasting thread in two ways. Alongside "Can LLMs actually forecast time series better than we think?", VCBench supplies a domain where even raw model capability clears a low human bar. It also extends "Can retrieval-augmented language models forecast like human experts?": that result only nears the competitive-crowd bar, whereas VCBench shows that where the human bar is modest, LLMs clear it outright.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do newer language model generations improve forecasting ability without additional training?
- Why are post-cutoff test sets essential for evaluating genuine forecasting ability?
- What role does retrieval mechanism design play in forecast accuracy?
- Can language models match competitive crowd forecasters on real future events?
- How much does domain expertise actually improve human forecasting under uncertainty?
- What privacy-preserving evaluation methods best capture real-world forecasting ability?
Related concepts in this collection (2)
- Can LLMs actually forecast time series better than we think?
  Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
  VCBench is a domain where the modest human bar makes LLM forecasting competitive.
- Do automated benchmarks hide what frontier AI systems can really do?
  Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing, or wrongly inflating, by relying on benchmark scores alone?
  VCBench is a real-stakes, privacy-preserving benchmark in the open-world spirit.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- VCBench: Benchmarking LLMs in Venture Capital
- Approaching Human-Level Forecasting with Language Models
- Large language models surpass human experts in predicting neuroscience results
- AI-Powered (Finance) Scholarship
- Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
- Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
- The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
- Linguistic Calibration of Long-Form Generations
Original note title
in domains where expert humans perform only modestly LLMs can surpass human-expert baselines — sparse-signal forecasting rewards modest absolute accuracy