Are LLM and agent benchmarks really measuring different things?
Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.
The benchmark literature has bifurcated. There are "LLM benchmarks" (MMLU, GPQA, math, code) that test the model in pure-completion mode, and "agent benchmarks" (SWE-bench, WebArena, OSWorld) that test the model in tool-using, multi-step, environment-coupled mode. The two families are now maintained and studied by nearly disjoint research communities.
The DELEGATE-52 authors argue this is a category error. The two benchmark families do not measure two different artifacts; they measure two different operating modes of the same artifact. A model is not a "good LLM" or a "good agent" in isolation. It is a model whose behavior is conditioned on whether it is asked to produce one answer or to operate through a tool loop. The same underlying weights respond differently in the two modes, and a model can be strong in one and weak in the other because of how well it is calibrated to each mode, not because it has more or less intelligence in one of them.
The methodological consequence: characterizing a model honestly requires evaluation across modes. A model that scores 90 on MMLU and 30 on a long-horizon agent task is not "a 90 model with an agent problem to solve" — it is a model whose capability has two numbers, and the deployment context decides which one matters.
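One way to make the "two numbers" framing concrete is to store a score per operating mode and let the deployment context pick which one is read. The sketch below is only an illustration: the `Mode` and `ModeProfile` names are hypothetical, and the 90 and 30 values are the example figures from the paragraph above, not measured results.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Mode(Enum):
    COMPLETION = "completion"  # single-answer evaluation, e.g. MMLU-style
    AGENTIC = "agentic"        # multi-step, tool-loop evaluation


@dataclass
class ModeProfile:
    """Per-mode scores for one model (hypothetical structure)."""
    model: str
    scores: Dict[Mode, float]

    def score_for(self, deployment: Mode) -> float:
        # The deployment context selects which number matters.
        return self.scores[deployment]

    def mode_gap(self) -> float:
        # Completion score minus agentic score; a large gap means the
        # completion number says little about agentic behavior.
        return self.scores[Mode.COMPLETION] - self.scores[Mode.AGENTIC]


# Illustrative values taken from the example in the text.
profile = ModeProfile(
    model="example-model",
    scores={Mode.COMPLETION: 90.0, Mode.AGENTIC: 30.0},
)
```

Reporting the whole profile, instead of collapsing it into one headline number, keeps the mode-specific weakness visible to whoever chooses the deployment.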
For builders, this argues against treating "agent capability" as a separate research target to be optimized after general capability. The two modes interact. Agentic deployment surfaces failures that completion-mode benchmarks cannot see, and completion-mode strengths do not transfer cleanly to agentic settings.
Related concepts in this collection
- Do short benchmarks predict how models perform over long workflows?
  Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
  same paper, the relay-length specific case
- Do frontier LLMs silently corrupt documents in long workflows?
  Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
  same paper, the empirical mode-divergence
- When do multi-agent systems actually outperform single agents?
  As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes MAS counterproductive.
  adjacent methodology: single vs multi-agent comparison
Original note title
LLM and agent benchmarks are two modes of the same model not separate fields