Can retrieval-augmented language models forecast like human experts?
Can language models augmented with search and reasoning match or exceed the forecasting accuracy of competitive human crowd forecasters on events beyond their training data? This tests whether AI can handle genuine predictive judgment.
Judgmental forecasting means assigning probabilities to future events from judgment, domain knowledge, and reasoning under distributional shift. It is an area where humans have historically beaten statistical models and where competitive forecasters set a high bar. This work builds a retrieval-augmented LM system that searches for relevant information, generates forecasts, and aggregates predictions. The system is evaluated on a large dataset of questions from real forecasting competitions and tested only on questions published after the models' knowledge cutoffs, so the answers cannot have been memorized. The result: on average the system comes close to the crowd aggregate of competitive forecasters, and in some settings surpasses it, making it the first ML system to forecast at near-human levels. Two design pieces matter: a novel LM-driven retrieval mechanism that decides what to source and how to evaluate relevance, and a self-supervised fine-tuning method that trains the model to generate reasoning that leads to accurate predictions.
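The aggregate-then-score step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's actual implementation: the note does not say how predictions are aggregated or which accuracy metric is used, so the trimmed mean and the Brier score here are assumed choices, and the function names `trimmed_mean` and `brier_score` are hypothetical.

```python
from statistics import mean


def trimmed_mean(probs, trim=0.25):
    """Aggregate several sampled forecasts for one question.

    Drops the lowest and highest `trim` fraction of forecasts, then averages
    the rest. The trimmed mean is an assumed aggregation rule, chosen here
    because it damps outlier samples.
    """
    if not probs:
        raise ValueError("need at least one forecast")
    ordered = sorted(probs)
    k = int(len(ordered) * trim)
    # Fall back to the full list if trimming would remove everything.
    kept = ordered[k:len(ordered) - k] or ordered
    return mean(kept)


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes.

    Lower is better; always answering 0.5 scores 0.25. Using the Brier
    score as the accuracy metric is an assumption of this sketch.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))
```

To compare a system against a crowd, one would aggregate the system's samples per question with `trimmed_mean`, then compute `brier_score` for both the system's and the crowd's probabilities on the same resolved questions.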
The takeaway is twofold: scalable near-human forecasting is now feasible, and newer model generations forecast better on their own, so capability rises with the base model without forecasting-specific tricks.
This is the foundational human-level-forecasting result the vault's forecasting cluster builds on. It underpins "Can language models beat human venture capital experts?" (VCBench) and "Can LLMs actually forecast time series better than we think?", and pairs with the contamination defense in "Can live benchmarks prevent contamination in prediction tasks?" (post-cutoff test sets apply the same leak-proofing principle).
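The leak-proofing principle amounts to a date filter on the test set. The sketch below is illustrative only: the `Question` record, its field names, and the `post_cutoff` helper are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one forecasting question."""
    text: str
    published: date


def post_cutoff(questions, knowledge_cutoff):
    """Keep only questions published strictly after the model's knowledge cutoff.

    Anything published on or before the cutoff could have appeared in the
    training data, so it is excluded from the test set.
    """
    return [q for q in questions if q.published > knowledge_cutoff]
```

For example, with a cutoff of `date(2023, 4, 30)`, a question published on `date(2023, 6, 1)` is kept and one published on `date(2023, 4, 30)` is dropped.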
Inquiring lines that use this note as a source (8)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do newer language model generations improve forecasting ability without additional training?
- Why are post-cutoff test sets essential for evaluating genuine forecasting ability?
- What role does retrieval mechanism design play in forecast accuracy?
- How do AI researcher forecasts compare across different timeline question phrasings?
- How do search and reasoning workflows improve forecasting performance over base models?
- Can language models match competitive crowd forecasters on real future events?
- How much does domain expertise actually improve human forecasting under uncertainty?
- What privacy-preserving evaluation methods best capture real-world forecasting ability?
Related concepts in this collection (3)
- Can live benchmarks prevent contamination in prediction tasks?
  Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.
  post-cutoff testing here is the same contamination-defense FutureX operationalizes live
- Can language models beat human venture capital experts?
  Explores whether LLMs can outperform top investors at founder success prediction in a domain where even experts show only modest accuracy. Tests whether AI forecasting is competitive in sparse-signal, high-uncertainty settings.
  VCBench extends the near/surpass-human-forecasting finding to a low-base-rate domain
- Can LLMs actually forecast time series better than we think?
  Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
  both attribute forecasting gains to the search-and-aggregate workflow
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Approaching Human-Level Forecasting with Language Models
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Active Retrieval Augmented Generation
- Provable Benefits of In-Tool Learning for Large Language Models
- Nexus: An Agentic Framework for Time Series Forecasting
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- Retrieval-augmented reasoning with lean language models
Original note title
a retrieval-augmented LM system forecasts future events near the level of competitive human crowd forecasters