Agentic Systems and Planning

Should we evaluate deployed agents as whole environments instead?

Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?

Note · 2026-05-28 · sourced from Work Application Use Cases

LLM systems are conventionally evaluated as models, benchmarks, or short conversational episodes. This case study argues the unit of analysis should instead be the whole human-agent environment: the researcher plus the agent runtime, durable memory files, tool access, repositories, scheduled jobs, specialized agent roles, and safety protocols, observed over time. Its PARE-M framework measures architecture, utilization, artifact production, resource use, reproducibility, and governance together.

This matters because the conventional units all factor out exactly what makes a deployed agent useful. A model benchmark holds context fixed; an episode benchmark resets state; both evaluate bounded tasks. But the case study shows that the capacity gains came from accumulated context plus reusable procedures, properties that exist only across sessions and only when a human is in the loop directing, correcting, and accumulating memory. Measured at the model or episode level, the most important variable is invisible.
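A toy sketch can make the invisibility concrete. It is not taken from the paper: the `Session` type, its fields, and the sample item names are all hypothetical, chosen only to show how resetting state before each session hides reuse that an across-session view would count.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One working session. All field names are hypothetical illustrations."""
    session_id: int
    created: set[str] = field(default_factory=set)  # memory items or procedures written
    used: set[str] = field(default_factory=set)     # memory items or procedures read


def episode_level_reuse(sessions):
    """Score each session in isolation, as if state were reset beforehand.

    Only items created inside the same session are considered available.
    """
    return sum(len(s.used & s.created) for s in sessions)


def environment_level_reuse(sessions):
    """Score sessions in order, carrying memory forward.

    Counts only uses of items created in an earlier session.
    """
    available = set()
    reused = 0
    for s in sorted(sessions, key=lambda s: s.session_id):
        reused += len(s.used & available)
        available |= s.created
    return reused


# Made-up sample data, not measurements from the case study.
sessions = [
    Session(1, created={"lit_review_notes", "citation_script"}),
    Session(2, created={"draft_outline"}, used={"lit_review_notes"}),
    Session(3, used={"citation_script", "draft_outline", "lit_review_notes"}),
]

print(episode_level_reuse(sessions))      # reuse visible when state is reset
print(environment_level_reuse(sessions))  # reuse visible across sessions
```

In this made-up data every use draws on something written in an earlier session, so the isolated view reports no reuse at all while the carried-forward view counts each one.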

The counterpoint is severe, and the paper concedes it: an n-of-1 self-observed study has no control condition, no generalizability, and an obvious reflexivity risk. The contribution, however, is not the effect size but the argued unit of analysis. Even a single rigorously instrumented environment (75,671 de-duplicated telemetry records, 889 governance events) demonstrates that the human-agent coupling is measurable and behaves differently from bounded benchmarks. The claim therefore survives the small-n objection: you cannot evaluate a lived deployment by summing model scores, because the system is the human, the agent, and their shared memory together.
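As a rough illustration of what "de-duplicated telemetry records" and "governance events" could mean operationally, the sketch below drops repeated records and tallies the rest by kind. This note does not describe the paper's actual schema or de-duplication rule, so the record fields (`timestamp`, `source`, `event`, `kind`) and the identity key are assumptions made purely for the example.

```python
from collections import Counter

# Assumed identity of a record; the paper's real rule is not given in this note.
KEY_FIELDS = ("timestamp", "source", "event")


def deduplicate(records, key_fields=KEY_FIELDS):
    """Keep the first record seen for each identity key, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def summarize(records):
    """Return environment-level counts over de-duplicated records."""
    unique = deduplicate(records)
    by_kind = Counter(rec.get("kind", "unknown") for rec in unique)
    return {
        "total": len(unique),
        "governance": by_kind["governance"],
        "by_kind": dict(by_kind),
    }


# Made-up sample records, not data from the case study.
records = [
    {"timestamp": "t1", "source": "runtime", "event": "tool_call", "kind": "utilization"},
    {"timestamp": "t1", "source": "runtime", "event": "tool_call", "kind": "utilization"},
    {"timestamp": "t2", "source": "protocol", "event": "approval", "kind": "governance"},
    {"timestamp": "t3", "source": "repo", "event": "commit", "kind": "artifact"},
]

summary = summarize(records)
print(summary["total"], summary["governance"])
```

The point of the sketch is only that such counts are taken over the whole environment's log rather than over any single model call or episode.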


— "Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study", https://arxiv.org/abs/2605.26870

Original note title

the right unit of llm evaluation is the coupled human-agent environment not the model or the episode