Agentic Systems and Planning

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? The answer shapes how we evaluate and develop AI systems.

Note · 2026-05-18 · sourced from Flaws

The benchmark literature has bifurcated. "LLM benchmarks" (MMLU, GPQA, math, code) test the model in pure-completion mode, while "agent benchmarks" (SWE-bench, WebArena, OSWorld) test it in tool-using, multi-step, environment-coupled mode. The two families are now studied by nearly disjoint research communities.

The DELEGATE-52 authors argue this is a category error. The two benchmark families do not measure two different artifacts; they measure two different operating modes of the same artifact. A model is not a "good LLM" or a "good agent" in isolation. It is a model whose behavior is conditioned on whether it is asked to produce one answer or to operate through a tool loop. The same underlying weights respond differently in the two modes, and a model can be strong in one and weak in the other because of mode-specific calibration rather than mode-specific intelligence.

The methodological consequence: characterizing a model honestly requires evaluation across modes. A model that scores 90 on MMLU and 30 on a long-horizon agent task is not "a 90 model with an agent problem to solve". It is a model whose capability is described by two numbers, and the deployment context decides which one matters.
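One way to make the "two numbers" framing concrete is to record both scores against the same set of weights and look up the one that matches the deployment context. The sketch below is illustrative only: `ModeProfile`, `score_for`, `mode_gap`, and the mode labels are hypothetical names, not anything defined by the DELEGATE-52 authors, and the 90 and 30 are the example figures from the paragraph above rather than measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeProfile:
    """Per-mode scores for one set of weights (hypothetical structure)."""
    model: str
    completion: float  # score on a completion-mode benchmark, 0-100
    agentic: float     # score on a long-horizon agent task, 0-100

    def score_for(self, deployment: str) -> float:
        """Return the score that matters for the given deployment context."""
        if deployment == "completion":
            return self.completion
        if deployment == "agentic":
            return self.agentic
        raise ValueError(f"unknown deployment mode: {deployment!r}")

    def mode_gap(self) -> float:
        """Completion score minus agentic score (a rough flag, not a metric)."""
        return self.completion - self.agentic


# Example figures taken from the text; "example-model" is a placeholder name.
profile = ModeProfile(model="example-model", completion=90.0, agentic=30.0)
print(profile.score_for("agentic"))
print(profile.mode_gap())
```

Keeping both fields on one record, instead of collapsing them into a single headline score, mirrors the argument that neither number is the "real" capability. The gap is only a coarse signal, since the two scores come from benchmarks with different scales and difficulty.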

For builders, this argues against treating "agent capability" as a separate research target to be optimized after general capability. The two modes interact. Agentic deployment surfaces failures that completion-mode benchmarks cannot see, and completion-mode strengths do not transfer cleanly to agentic settings.

Original note title

LLM and agent benchmarks are two modes of the same model not separate fields