Should we evaluate deployed agents as whole environments instead?

Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?

Note · 2026-05-28 · sourced from Work Application Use Cases

LLM systems are conventionally evaluated as models, benchmarks, or short conversational episodes. This case study argues the unit of analysis should instead be the whole human-agent environment: the researcher plus the agent runtime, durable memory files, tool access, repositories, scheduled jobs, specialized agent roles, and safety protocols, observed over time. Its PARE-M framework measures architecture, utilization, artifact production, resource use, reproducibility, and governance together.

This matters because the conventional units all factor out exactly what makes a deployed agent useful. A model benchmark holds context fixed; an episode benchmark resets state; both evaluate bounded tasks. But the case shows that the capacity gains came from accumulated context plus reusable procedures: properties that exist only across sessions, and only when a human is in the loop directing, correcting, and building up memory. Measured at the model or episode level, the most important variable is invisible.
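A minimal sketch of that invisibility, under loud assumptions: the `Session` record, its fields, and the procedure names are hypothetical illustrations, not the paper's instrumentation or the PARE-M schema. It contrasts an episode-style view, which scores each session in isolation, with an environment-style view that carries durable memory across sessions.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One working session. All fields are hypothetical, for illustration only."""
    session_id: str
    task_succeeded: bool
    procedures_added: set[str] = field(default_factory=set)
    procedures_reused: set[str] = field(default_factory=set)


def episode_view(sessions):
    """Score each session in isolation, as an episode benchmark would."""
    return [s.task_succeeded for s in sessions]


def environment_view(sessions):
    """Track state that persists across sessions (session order matters)."""
    memory: set[str] = set()
    reuse_per_session = []
    for s in sessions:
        # Only procedures already held in durable memory count as reuse.
        reuse_per_session.append(len(s.procedures_reused & memory))
        memory |= s.procedures_added
    return {"memory_size": len(memory), "reuse_per_session": reuse_per_session}


# Toy data, invented for this sketch.
sessions = [
    Session("s1", True, procedures_added={"lit-search"}),
    Session("s2", True, procedures_added={"cite-check"},
            procedures_reused={"lit-search"}),
    Session("s3", True, procedures_reused={"lit-search", "cite-check"}),
]

print(episode_view(sessions))
print(environment_view(sessions))
```

On this toy data the episode view reports three identical successes, while the environment view shows memory growing and reuse rising from session to session. The second signal only exists if state is not reset between sessions.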

The counterpoint is severe, and the paper concedes it: an n-of-1 self-observed study has no control, no generalizability, and obvious reflexivity risk. But the contribution is not the effect size; it is the argued unit of analysis. Even a single rigorously instrumented environment (75,671 de-duplicated telemetry records, 889 governance events) demonstrates that the human-agent coupling is measurable and behaves differently from what bounded benchmarks capture. The claim about the unit of analysis therefore survives the small-n objection: you cannot evaluate a lived deployment by summing model scores, because the system is the human, the agent, and their shared memory together.

— "Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study", https://arxiv.org/abs/2605.26870
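The telemetry counts cited above are described as de-duplicated, but this note does not say how the paper did it. The following is only a sketch of one common approach, content-hash de-duplication followed by a tally by event kind. The record fields (`ts`, `kind`, `tool`, `action`) and the sample values are assumptions invented for illustration.

```python
import hashlib
import json
from collections import Counter


def record_key(record: dict) -> str:
    """Stable content hash; the order of keys does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def count_by_kind(records):
    """Tally records by their (hypothetical) 'kind' field."""
    return Counter(r.get("kind", "unknown") for r in records)


# Toy records, invented for this sketch.
raw = [
    {"ts": "2026-01-05T09:00:00Z", "kind": "tool_call", "tool": "git"},
    # Same record as above with a different key order: treated as a duplicate.
    {"kind": "tool_call", "tool": "git", "ts": "2026-01-05T09:00:00Z"},
    {"ts": "2026-01-05T09:01:00Z", "kind": "governance",
     "action": "approval_requested"},
]

unique = deduplicate(raw)
counts = count_by_kind(unique)
print(len(unique), dict(counts))
```

Hashing a canonical JSON form is a design choice that treats two records as identical only when every field matches. A real pipeline might instead key on an event ID or a subset of fields, which would change the resulting counts.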

Related concepts in this collection

What should we actually measure in agent evaluation? Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
synthesizes: both reject single-number model-centric evaluation; this note enlarges the unit to the whole human-agent environment while that one enlarges what within a trajectory gets scored — complementary expansions of the same critique
Can you turn an LLM into an agent by just fine-tuning? Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.
grounds: explains why model-level evaluation factors out what matters — capability lives in the surrounding pipeline (memory, tools, integration), exactly the components PARE-M instruments
Why do production AI agents stay deliberately simple? Production AI agents operate far simpler than research suggests—most execute under 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?
exemplifies: empirical deployment evidence that the harness around a frozen model carries the system — a case for measuring the environment, not the model
Is agent memory capacity or quality the real bottleneck? While more storage seems like the obvious solution to memory problems, what if the real constraint is actually curation—deciding what to keep, discard, and retrieve without degrading performance?
extends: accumulated durable memory is the cross-session variable PARE-M says you cannot see at episode level; memory quality is one of the environment properties that only becomes measurable over time

Original note title

the right unit of llm evaluation is the coupled human-agent environment not the model or the episode
