How should evaluation frameworks account for the computational cost of frontier AI capability?

This explores whether measuring what frontier AI can do should also measure what that capability costs to produce — and how a few corners of the corpus quietly treat cost as part of the score rather than a footnote.

The short version from the corpus: most benchmarks report a capability number and stay silent on the compute behind it, and that silence distorts the picture in both directions. The clearest counter-practice is open-world evaluation, where long, messy, real-world tasks are graded by reading the logs and cost is reported explicitly alongside the result (see "Do automated benchmarks hide what frontier AI systems can really do?"). Once cost sits on the same line as capability, the standard leaderboard view looks incomplete: a model that 'passes' by burning enormous amounts of inference compute is not the same achievement as one that passes cheaply, and a benchmark that hides the difference overstates what is actually deployable.
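To make 'cost on the same line as capability' concrete, here is a minimal Python sketch. The `EvalResult` record, its field names, and the two example rows are illustrative assumptions, not a schema or numbers taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One leaderboard row that carries its own cost (hypothetical schema)."""
    model: str
    tasks_passed: int
    tasks_total: int
    inference_cost_usd: float  # assumed: total spend across all tasks

    @property
    def pass_rate(self) -> float:
        return self.tasks_passed / self.tasks_total

    @property
    def cost_per_pass(self) -> float:
        # A model that never passes has no finite cost per pass.
        if self.tasks_passed == 0:
            return float("inf")
        return self.inference_cost_usd / self.tasks_passed


def report_line(result: EvalResult) -> str:
    """Format capability and cost together, so neither is a footnote."""
    return (f"{result.model}: pass_rate={result.pass_rate:.2f}, "
            f"cost_per_pass=${result.cost_per_pass:.2f}")


# Made-up numbers: identical pass rates, very different bills.
cheap = EvalResult("model_a", tasks_passed=40, tasks_total=50, inference_cost_usd=20.0)
costly = EvalResult("model_b", tasks_passed=40, tasks_total=50, inference_cost_usd=2000.0)

for row in (cheap, costly):
    print(report_line(row))
```

On a capability-only leaderboard these two rows would tie. With cost per pass printed next to the pass rate, they no longer look like the same result.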

The trickier finding is that compute does not convert into capability along a single axis. One study shows non-reasoning models can't close the gap with reasoning models *no matter how much inference compute you throw at them*: the reasoning protocol is installed during training, so extra test-time tokens only pay off if the model was trained to use them (see "Can non-reasoning models catch up with more compute?"). So an evaluation framework that treats compute as a single dial ('give it more budget, get more capability') is measuring the wrong thing. Cost has to be split into training cost and inference cost, because they buy fundamentally different kinds of capability. The same training-versus-inference tradeoff shows up in how knowledge gets into a model at all: RAG adds latency to every query, static embedding is fast at query time but expensive to build and rigid, and adapters split the difference. Each 'method' is really a different cost structure for the same apparent capability (see "How do knowledge injection methods trade off flexibility and cost?").
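One way to see 'different cost structures for the same apparent capability' is to separate a one-time build cost from a recurring per-query cost and ask where the two options break even. The sketch below assumes abstract cost units and entirely hypothetical numbers. The names `CostStructure` and `break_even_queries` are invented for illustration and do not come from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CostStructure:
    name: str
    build_cost: float      # one-time cost (training, fine-tuning, indexing)
    per_query_cost: float  # recurring cost paid on every query

    def total(self, n_queries: int) -> float:
        return self.build_cost + self.per_query_cost * n_queries


def break_even_queries(cheap_to_build: CostStructure,
                       cheap_to_run: CostStructure) -> Optional[float]:
    """Query count at which the expensive-to-build option catches up.

    Returns None when the second option is not cheaper per query,
    because then it never catches up.
    """
    saved_per_query = cheap_to_build.per_query_cost - cheap_to_run.per_query_cost
    if saved_per_query <= 0:
        return None
    extra_build = cheap_to_run.build_cost - cheap_to_build.build_cost
    return extra_build / saved_per_query


# Hypothetical numbers: a retrieval-style method pays at query time,
# a baked-in method pays up front.
retrieval = CostStructure("retrieval-style", build_cost=10.0, per_query_cost=0.5)
baked_in = CostStructure("baked-in", build_cost=1000.0, per_query_cost=0.25)

n_star = break_even_queries(retrieval, baked_in)
print(n_star, retrieval.total(3960), baked_in.total(3960))
```

Below the break-even volume the retrieval-style option is cheaper overall, and above it the baked-in option is. A single 'compute' number per method would hide that crossover.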

What's easy to miss is that *evaluation itself* is now a frontier-cost problem, not just the thing being evaluated. Agent-based judges with evidence collection cut judge shift from 31% to 0.27% on complex tasks, roughly a hundredfold improvement over a plain LLM-as-judge, but they do it by running an eight-module agentic pipeline, which is its own compute bill (see "Can agents evaluate AI outputs more reliably than language models?"). So the framework faces a recursive version of the same question: how much compute is it worth spending to *measure* capability accurately? The same logic applies to human oversight: targeted intervention at a few high-leverage decision points beats both full autonomy and exhaustive step-by-step review, because constant oversight is expensive *and* degrades the work (see "Does targeted human intervention outperform both full autonomy and exhaustive oversight?"). Oversight is a cost line too, and more of it isn't strictly better.
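The 'hundredfold' figure can be checked directly from the judge-shift numbers in the source note (31% versus 0.27%). The sketch below does that arithmetic, then adds one hypothetical step: dividing the error reduction by a cost multiple for the heavier judge. The corpus does not report what the eight-module pipeline costs, so `cost_multiple` is a placeholder input, not a measured value.

```python
# Judge-shift figures as reported in the source note (percent).
LLM_JUDGE_SHIFT_PCT = 31.0
AGENT_JUDGE_SHIFT_PCT = 0.27


def error_reduction_factor(baseline_pct: float, improved_pct: float) -> float:
    """How many times smaller the improved error is than the baseline."""
    return baseline_pct / improved_pct


def reduction_per_cost_multiple(baseline_pct: float, improved_pct: float,
                                cost_multiple: float) -> float:
    """Error-reduction factor per multiple of baseline judging cost.

    cost_multiple is an assumed input: how many times more compute the
    better judge uses than the baseline judge.
    """
    if cost_multiple <= 0:
        raise ValueError("cost_multiple must be positive")
    return error_reduction_factor(baseline_pct, improved_pct) / cost_multiple


factor = error_reduction_factor(LLM_JUDGE_SHIFT_PCT, AGENT_JUDGE_SHIFT_PCT)
print(f"error reduction: {factor:.1f}x")

# Purely hypothetical: if the agentic judge cost 10x the plain judge.
print(f"per cost multiple: {reduction_per_cost_multiple(31.0, 0.27, 10.0):.1f}x")
```

Whether that ratio is worth paying depends on how costly a wrong judgment is downstream, which the framework has to decide for itself.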

The thing you might not have known you wanted to know: cost accounting may be the only defense against capability outrunning your ability to check it. When AI generates knowledge faster than humans can verify it, confidence in the whole system collapses, and the collapse is self-reinforcing because the verification tools are themselves AI-generated (see "Can AI generate knowledge faster than humans can evaluate it?"). An evaluation framework that ignores the cost of *verification* relative to the cost of *generation* is measuring a system that can already produce faster than it can be trusted. Put differently, the corpus suggests the right unit isn't 'capability' but 'capability per unit of compute, training and inference counted separately, with the cost of judging it on the same ledger.'
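As a rough illustration of that unit, the sketch below keeps training, inference, and judging compute as separate ledger lines rather than one total, and exposes the verification-to-generation ratio as its own number. The class name, fields, units (think GPU-hours), and example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLedger:
    """Capability with its three cost lines kept apart (hypothetical schema)."""
    capability_score: float   # e.g. pass rate in [0, 1]
    training_compute: float   # one-time, assumed unit: GPU-hours
    inference_compute: float  # spent producing the evaluated outputs
    judging_compute: float    # spent verifying those outputs

    def capability_per_training(self) -> float:
        return self.capability_score / self.training_compute

    def capability_per_inference(self) -> float:
        return self.capability_score / self.inference_compute

    def verification_to_generation(self) -> float:
        """Above 1.0, checking an output costs more than producing it."""
        return self.judging_compute / self.inference_compute


# Made-up example values.
ledger = CostLedger(capability_score=0.8, training_compute=1000.0,
                    inference_compute=4.0, judging_compute=8.0)

print(ledger.capability_per_training())
print(ledger.capability_per_inference())
print(ledger.verification_to_generation())
```

Keeping the three lines separate is the point: summing them into one 'compute' figure would erase exactly the distinctions the argument above depends on.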

Sources (6 notes)

Do automated benchmarks hide what frontier AI systems can really do?

Automated benchmarks both overstate and understate capability by privileging precisely-specified, auto-gradable tasks. Open-world evaluations of long-horizon messy tasks through qualitative log analysis—with cost explicitly reported—correct these distortions and catch emerging capabilities earlier.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

How do knowledge injection methods trade off flexibility and cost?

Dynamic injection (RAG) maximizes flexibility but adds latency; static embedding is fastest but costly and inflexible; modular adapters balance efficiency with swappability; prompt optimization requires no training but only activates existing knowledge. Combining all three outperforms any single approach.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Does targeted human intervention outperform both full autonomy and exhaustive oversight?

AutoResearchClaw's confidence-routed CoPilot mode achieved 87.5% acceptance, substantially outperforming full autonomy (25%) and step-by-step oversight (50%). The key insight: selective interruption avoids both uncaught critical errors and the coherence degradation caused by constant human interruption.

Can AI generate knowledge faster than humans can evaluate it?

AI produces knowledge faster than human judgment can verify it, collapsing epistemic confidence just as monetary hyperinflation collapses purchasing power. The gap self-reinforces because evaluation tools are themselves AI-generated, trapping the system in acceleration.
