FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Paper · arXiv 2508.11987 · Published August 16, 2025
LLM Evaluations and Benchmarks

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including models with reasoning, search capabilities, and external tool integration, such as the open-source Deep Research Agent and closed-source Deep Research models.

Introduction. The rapid evolution of Large Language Models (LLMs) has catalyzed a fundamental shift in the landscape of artificial intelligence, moving from the generation of coherent text to the creation of autonomous agents capable of complex, goal-oriented behavior [1, 2, 24, 26, 29, 40]. This transition from passive text generators to active problem-solvers necessitates a corresponding evolution in evaluation methodologies. While foundational benchmarks like MMLU [10] and SuperGLUE [31] are instrumental in assessing the static knowledge of LLMs, they are insufficient for measuring what a model can do when deployed as part of an interactive, goal-seeking system. An agent’s performance is defined not just by its underlying model, but by its ability to plan, use external tools, and adapt to a dynamic environment. In response, a new generation of agent-centric benchmarks has emerged, primarily focused on evaluating search, tool usage, and coding skills in controlled or simulated settings.

Discussion / Conclusion. FutureX is the first live benchmark that tests LLM agents on real-world future prediction tasks by continuously collecting questions from 195 trusted sites, gathering model predictions at each event’s start date, and then automatically checking the actual outcomes. In our study of 25 different models, ranging from base LLMs to search-and-reasoning agents and deep-research models, we find that strong base models like DouBao-Seed1.6 handle straightforward questions well, but tackling more complex, open-ended predictions requires models with built-in search and reasoning. In particular, Grok-4 and GPT-o4-mini (Think&Search) stand out on the hardest tasks, balancing speed and accuracy. Going forward, FutureX offers a flexible platform for improving LLM agents. We are actively working on adding new domains and data sources to FutureX. By keeping the benchmark live and diverse, we aim to push agents closer to the level of human experts in making timely, strategic predictions across a wide range of fields.
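
The collect, predict, and check loop described above can be illustrated with a minimal Python sketch. It is not the FutureX implementation: the `Question` class, its field names, the helper functions, and the exact-match scoring rule are all hypothetical and chosen only for illustration. The sketch also assumes that a prediction submitted after the event's start date is rejected, which is one simple way to keep outcomes from leaking into predictions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class Question:
    """Hypothetical record for one future-prediction question."""
    qid: str
    text: str
    start_date: date        # predictions are gathered up to this date
    resolution_date: date   # the outcome is checked from this date on
    outcome: Optional[str] = None
    predictions: Dict[str, str] = field(default_factory=dict)


def collect_prediction(q: Question, model: str, answer: str, today: date) -> bool:
    """Store a prediction; reject it if submitted after the start date (assumption)."""
    if today > q.start_date:
        return False
    q.predictions[model] = answer
    return True


def resolve(q: Question, outcome: str, today: date) -> bool:
    """Record the real outcome, but only once the resolution date is reached."""
    if today < q.resolution_date:
        return False
    q.outcome = outcome
    return True


def score(q: Question) -> Dict[str, float]:
    """Case-insensitive exact-match scoring (an assumed, simplified metric)."""
    if q.outcome is None:
        raise ValueError("question is not resolved yet")
    truth = q.outcome.strip().lower()
    return {m: float(a.strip().lower() == truth) for m, a in q.predictions.items()}


if __name__ == "__main__":
    q = Question("q1", "Which team wins the final?", date(2025, 8, 1), date(2025, 8, 3))
    collect_prediction(q, "model_a", "Team X", date(2025, 8, 1))  # accepted
    collect_prediction(q, "model_b", "Team Y", date(2025, 8, 2))  # rejected, too late
    resolve(q, "team x", date(2025, 8, 3))
    print(score(q))
```

Separating collection from resolution by date is what makes such a benchmark "live": the answer does not exist when a model predicts, so it cannot already appear in training data or search results.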