Can retrieval-augmented language models forecast like human experts?

Can language models augmented with search and reasoning match or exceed the forecasting accuracy of competitive human crowd forecasters on events beyond their training data? This tests whether AI can handle genuine predictive judgment.

Synthesis note · 2026-06-03 · sourced from Reasoning Logic Internal Rules

Judgmental forecasting means assigning probabilities to future events from judgment, domain knowledge, and reasoning under distributional shift. It is where humans have historically beaten statistical models, and where competitive forecasters set a high bar. This work builds a retrieval-augmented LM system that searches for relevant information, generates forecasts, and aggregates predictions. The system is evaluated on a large dataset of questions from real forecasting competitions and tested only on questions published after the models' knowledge cutoffs, so the answers cannot have been memorized. The result: on average the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it, which the work describes as the first ML system to forecast at near-human levels. Two design pieces matter: a novel LM-driven retrieval mechanism that decides what to source and how to evaluate relevance, and a self-supervised fine-tuning method that trains the model to generate reasoning that ends in accurate predictions.
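To make "aggregates predictions" and "nears the crowd aggregate" concrete, here is a minimal sketch of combining several sampled forecasts per question and scoring them. The note does not say which aggregation rule or metric the system uses; the median/mean choice and the Brier score are assumptions picked because they are common in forecasting evaluation, and the function names are hypothetical.

```python
from statistics import mean, median


def aggregate(probabilities, method="median"):
    """Combine several probability forecasts for one binary question.

    The aggregation rule is an assumption for illustration, not the
    method used by the system described in this note.
    """
    if not probabilities:
        raise ValueError("need at least one forecast")
    if any(p < 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")
    if method == "median":
        return median(probabilities)
    if method == "mean":
        return mean(probabilities)
    raise ValueError(f"unknown method: {method}")


def brier_score(forecasts, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return mean([(f - o) ** 2 for f, o in zip(forecasts, outcomes)])


if __name__ == "__main__":
    # Made-up numbers, purely illustrative: three sampled forecasts per question.
    sampled = [[0.6, 0.7, 0.8], [0.2, 0.1, 0.3]]
    outcomes = [1, 0]
    system = [aggregate(samples) for samples in sampled]
    print("system Brier score:", brier_score(system, outcomes))
```

Scoring a crowd aggregate with the same `brier_score` call on the same questions gives the head-to-head comparison the note refers to.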

The takeaway is twofold: scalable near-human forecasting is now feasible, and newer model generations forecast better on their own, since capability rises with the base model rather than through forecasting-specific tricks.

This is the foundational human-level-forecasting result the vault's forecasting cluster builds on. It underpins Can language models beat human venture capital experts? (VCBench) and Can LLMs actually forecast time series better than we think?, and pairs with the contamination-defense of Can live benchmarks prevent contamination in prediction tasks? (post-cutoff test sets are the same leak-proofing principle).
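The post-cutoff principle mentioned above can be sketched as a simple date filter: only questions published after a model's knowledge cutoff enter the test set. The `Question` record, its fields, and the dates below are hypothetical stand-ins, not the dataset schema from the underlying work.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one forecasting question."""
    text: str
    published: date


def post_cutoff(questions, knowledge_cutoff):
    """Keep only questions published strictly after the knowledge cutoff."""
    return [q for q in questions if q.published > knowledge_cutoff]


if __name__ == "__main__":
    # Illustrative dates only.
    pool = [
        Question("Resolved before the cutoff?", date(2023, 1, 15)),
        Question("Published after the cutoff?", date(2023, 9, 1)),
    ]
    test_set = post_cutoff(pool, knowledge_cutoff=date(2023, 4, 30))
    print([q.text for q in test_set])
```

A question published exactly on the cutoff date is excluded here; that strictness is a conservative choice made for this sketch.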

Related concepts in this collection 3

Can live benchmarks prevent contamination in prediction tasks? Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.
post-cutoff testing here is the same contamination-defense FutureX operationalizes live
Can language models beat human venture capital experts? Explores whether LLMs can outperform top investors at founder success prediction in a domain where even experts show only modest accuracy. Tests whether AI forecasting is competitive in sparse-signal, high-uncertainty settings.
VCBench extends the near/surpass-human-forecasting finding to a low-base-rate domain
Can LLMs actually forecast time series better than we think? Explores whether language models possess stronger forecasting ability than current benchmarks suggest, and what role workflow design plays in revealing or hiding that capability.
both attribute forecasting gains to the search-and-aggregate workflow