Approaching Human-Level Forecasting with Language Models

Paper · arXiv 2402.18563 · Published February 28, 2024

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. On a test set of questions published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

Introduction. Forecasting events is important in the modern world. Governments rely on economic and geopolitical forecasts for decision-making. Companies hire and invest based on forecasts of market conditions (Armstrong, 2001). In 2020, epidemiological forecasts for COVID-19 prompted national lockdowns across the globe (Adam, 2020). There are two main approaches to forecasting. Statistical forecasting primarily uses tools from time-series modeling. This methodology typically excels when data are abundant and under minimal distributional shift. By contrast, in judgmental forecasting, human forecasters assign probabilities to future events based on their own judgments, making use of historical data, domain knowledge, Fermi estimates, and intuition. They draw information from diverse sources and reason based on detailed contexts of the task. This enables accurate forecasts even with scarce past observations or under significant distributional shift (Tetlock and Gardner, 2015). We will refer to judgmental forecasting simply as “forecasting”.

Discussion / Conclusion. Our work presents the first ML system that can forecast at near human levels. We develop a novel retrieval mechanism that uses an LM to determine which information to source and how to evaluate its relevance. We also give a self-supervised fine-tuning method to generate reasonings with accurate predictions. To facilitate further research, we release our dataset: the largest and most recent forecasting dataset compiled from 5 real-world forecasting competitions. We discuss a few opportunities to improve these systems further.

LMs get better at forecasting naturally. We observe that as LMs improve, they also become better at forecasting. In particular, in Section 3.4, we see that newer generations of models forecast better than older ones. For example, GPT-4-1106, released in November 2023, outperforms GPT-4-0613, released in June 2023, by 0.02 with respect to the Brier score (lower is better). Had we fine-tuned the more recent model, we would expect better performance still.

At a high level, our results suggest that in the near future, LM-based systems may be able to generate accurate forecasts at the level of competitive human forecasters. We hope that our work paves the way for automated, scalable forecasting that can help to inform institutional decision making.
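
To make the Brier score comparison above concrete, the following is a minimal sketch of how the metric can be computed, assuming binary questions whose outcomes are coded as 0 or 1. The function name `brier_score` and all example numbers are hypothetical illustrations, not values or code from the paper. The score is the mean squared difference between forecast probabilities and outcomes, so lower is better, and always answering 0.5 scores 0.25.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes.

    Hypothetical helper for illustration: each forecast is a probability in
    [0, 1] and each outcome is 0 (did not happen) or 1 (happened).
    """
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    for p in forecasts:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"forecast {p} is not a probability")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up resolutions for four questions (1 = resolved "yes").
outcomes = [1, 0, 1, 0]

# An uninformed forecaster who always answers 50%.
uninformed = [0.5, 0.5, 0.5, 0.5]

# A forecaster whose probabilities lean toward the correct answers.
informed = [0.8, 0.3, 0.6, 0.1]

print(f"uninformed: {brier_score(uninformed, outcomes):.3f}")
print(f"informed:   {brier_score(informed, outcomes):.3f}")
```

On this scale, where the uninformed baseline sits at 0.25, a gap of 0.02 between two models is a modest but meaningful difference.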