How do model capabilities differ from harness infrastructure in agents?
What distinct layers make up an agentic system, and how do failures in each layer differ? Understanding this decomposition helps pinpoint whether problems stem from the model, the infrastructure, or the agent's own code.
Talking about "agent code" as one thing obscures three distinct elements that the code-as-harness survey separates. First, model-internal capabilities: the reasoning, perception, planning, simulation, and evaluation abilities baked into the model's weights. Second, system-provided harness infrastructure: the predefined tools, APIs, sandboxes, memory systems, validators, permission boundaries, telemetry, and workflows that connect model outputs to external actions and feedback — this is the main focus of harness engineering. Third, agent-initiated code artifacts: the interactive code objects an agent itself creates, executes, observes, revises, persists, and shares within the execution loop. These three are coupled but governed by different design levers.
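To make the three elements concrete, here is a minimal sketch of one pass through an execution loop. It is an illustration under stated assumptions, not the survey's implementation: `stub_model`, the `Harness` class, its import-rejecting validator policy, and the artifact name `mean` are all hypothetical, and the restricted `exec` namespace is not a real sandbox.

```python
import ast
from dataclasses import dataclass, field
from typing import Dict


def stub_model(task: str) -> str:
    """Stands in for model-internal capability: maps a task to source code.

    A real model would reason and plan; this stub returns a fixed answer.
    """
    return "def solve(xs):\n    return sum(xs) / len(xs)\n"


@dataclass
class Harness:
    """System-provided infrastructure: a validator, a runner, an artifact store."""

    artifacts: Dict[str, str] = field(default_factory=dict)

    def validate(self, source: str) -> bool:
        # Hypothetical policy: reject code that does not parse or that imports anything.
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return False
        return not any(
            isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
        )

    def run(self, name: str, source: str, *args):
        if not self.validate(source):
            raise ValueError(f"artifact {name!r} rejected by validator")
        # Expose only two builtins. This narrows the namespace for illustration;
        # it is NOT a security boundary.
        namespace = {"__builtins__": {"sum": sum, "len": len}}
        exec(source, namespace)
        # Persist the agent-initiated artifact so later steps can reuse it.
        self.artifacts[name] = source
        return namespace["solve"](*args)


harness = Harness()
source = stub_model("compute the mean")       # model-internal capability
result = harness.run("mean", source, [2, 4, 6])  # harness executes the artifact
```

In this sketch the string returned by `stub_model` is the agent-initiated artifact, everything inside `Harness` is system-provided, and the quality of the returned code is the only part that training would change.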
The decomposition is useful because each element fails and improves differently. You strengthen model-internal capability by training; you strengthen the harness by engineering infrastructure; you strengthen agent-initiated artifacts by shaping how the agent generates and reuses its own code. Confusing the three leads to misattributed failures: blaming the model for what is really a harness gap, or vice versa. The counterpoint is that the boundaries blur in practice: a skill the agent writes once may be promoted into harness infrastructure, and harness validators shape what the model learns to emit. As an analytical frame, though, the decomposition still clarifies where to intervene, and it gives harness engineering a vocabulary for separating the controllable layers of an agentic system.
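The attribution step can be sketched as a small triage function that maps a failed run to a layer and to that layer's design lever. Everything here is an assumption made for illustration: the `Failure` fields, the harness-first check order, and the fallback to the model are a crude heuristic, not a procedure taken from the survey, and the blurred boundaries noted above mean real cases will not always sort this cleanly.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    MODEL = "model-internal capability"
    HARNESS = "system-provided harness infrastructure"
    ARTIFACT = "agent-initiated code artifact"


# The lever the note associates with each layer.
LEVER = {
    Layer.MODEL: "training",
    Layer.HARNESS: "engineering infrastructure",
    Layer.ARTIFACT: "shaping how the agent generates and reuses its own code",
}


@dataclass
class Failure:
    """Hypothetical observations about one failed run."""

    harness_ok: bool   # were the needed tools, permissions, and validators in place?
    artifact_ok: bool  # did the agent's own code execute and behave as intended?


def attribute(failure: Failure) -> Layer:
    # Check the cheapest-to-verify layers first so the model is not blamed
    # for what is really a harness or artifact gap.
    if not failure.harness_ok:
        return Layer.HARNESS
    if not failure.artifact_ok:
        return Layer.ARTIFACT
    # Only when both outer layers check out does the failure fall to the model.
    return Layer.MODEL


missing_tool = Failure(harness_ok=False, artifact_ok=True)
layer = attribute(missing_tool)
lever = LEVER[layer]
```

Checking the harness before the model reflects the misattribution risk described above: a missing tool or an overly strict validator can look like a reasoning failure unless it is ruled out first.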
— "Code as Agent Harness: Toward Executable, Verifiable, and Stateful Agent Systems", https://arxiv.org/abs/2605.18747
Related concepts in this collection
- Where does agent reliability actually come from?
  Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
  both isolate the harness as a layer distinct from the model and a primary source of capability
- Does a single benchmark score actually predict agent readiness?
  Single-axis benchmarks rank models by one capability (such as task success) but ignore privacy, duration, operating mode, and ecosystem fit. Can one number really capture what matters for deployment?
  parallels the move to decompose agent ability into separable components rather than one scalar
- What makes agent-created code artifacts so hard to manage?
  Agent-authored code that persists and is shared across systems raises difficult questions about what should be kept versus discarded, and how to maintain consistent state when multiple agents collaborate on the same artifacts.
  extends: same survey; this decomposition names the three elements and that companion note singles out the third (agent-initiated artifacts) as the least-studied frontier
Original note title
agent code splits into model-internal capability system-provided harness and agent-initiated artifacts