Code as Agent Harness

Paper · arXiv 2605.18747
Tool Use and Computer-Use Agents · LLM Agents · Action Models · Multi-Agent Architectures

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

Code increasingly mediates how LLM agents reason, act, and model their environments. Program-aided reasoning methods externalize intermediate computation into executable code [6, 7, 8]; robotic and embodied agents use generated programs as executable policies for interacting with physical or simulated worlds [9, 10]; and software-engineering or interactive environments use codebases, execution traces, tests, and runtime feedback as structured representations of environment state and dynamics, in which agents plan, act, and revise their behavior [11, 5, 12]. Taken together, these developments suggest a broader view: code is not only an artifact generated by LLMs, but also an executable, inspectable, and stateful medium through which agents reason, act, observe feedback, and verify progress. We refer to this view as code as agent harness.
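As a rough illustration of program-aided reasoning, the sketch below externalizes an arithmetic step into code that the harness executes, instead of relying on the model to compute the answer in text. The function name `run_generated_program`, the `answer` variable convention, and the hardcoded `GENERATED` string (standing in for model output) are all assumptions made for this example and are not taken from the survey or the cited methods.

```python
def run_generated_program(source: str) -> object:
    """Execute model-written code and return the value it binds to `answer`.

    Assumption: the generated program stores its result in a variable
    named `answer`. A real harness would run this inside a sandbox;
    plain exec() is used here only to keep the sketch short.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["answer"]


# Hypothetical model output for the question:
# "A crate holds 12 boxes of 8 widgets each; 5 widgets are defective.
#  How many widgets are good?"
GENERATED = """
boxes = 12
per_box = 8
defective = 5
answer = boxes * per_box - defective
"""

result = run_generated_program(GENERATED)
print(result)
```

The intermediate quantities live in executable, inspectable code, so the final value comes from the interpreter and can be checked or re-run.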

To clarify the role of code in this broader harness view, we distinguish three coupled elements of long-running agentic systems: model-internal capabilities, system-provided harness infrastructure, and agent-initiated code artifacts. Model-internal capabilities refer to the model’s reasoning, perception, planning, simulation, and evaluation abilities. System-provided harness infrastructure refers to the predefined tools, APIs, sandboxes, memory systems, validators, permission boundaries, telemetry, and workflows that connect model outputs to external actions and feedback, and forms the main focus of harness engineering [24, 25]. In contrast, agent-initiated code artifacts, which remain relatively underexplored, are interactive code objects that agents create, execute, observe, revise, persist, and share within the task execution loop.
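The create, execute, observe, and revise steps of an agent-initiated code artifact can be sketched as a small loop. This is a minimal sketch under stated assumptions: `CodeArtifact`, `execute`, `refine`, `check_add`, and `stub_revise` are hypothetical names, the validator stands in for system-provided harness infrastructure, and `stub_revise` replaces what would be a model call in a real system. Persisting and sharing the artifact are not shown.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CodeArtifact:
    """A code object the agent creates and revises during a task."""
    name: str
    source: str
    # Each entry is (source at that round, observation from executing it).
    history: list = field(default_factory=list)


def execute(artifact: CodeArtifact, check: Callable[[dict], None]) -> str:
    """Run the artifact and a validator; return "ok" or an error summary."""
    namespace: dict = {}
    try:
        exec(artifact.source, namespace)  # a real harness would sandbox this
        check(namespace)
        return "ok"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def refine(artifact: CodeArtifact,
           revise: Callable[[str, str], str],
           check: Callable[[dict], None],
           max_rounds: int = 3) -> bool:
    """Execute, observe, and revise until the validator passes or rounds run out."""
    for _ in range(max_rounds):
        observation = execute(artifact, check)
        artifact.history.append((artifact.source, observation))
        if observation == "ok":
            return True
        artifact.source = revise(artifact.source, observation)
    return False


def check_add(namespace: dict) -> None:
    # Validator supplied by the harness (hypothetical).
    assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"


def stub_revise(source: str, observation: str) -> str:
    # Stand-in for a model call that rewrites the code given feedback.
    return source.replace("a - b", "a + b")


artifact = CodeArtifact("add", "def add(a, b):\n    return a - b\n")
succeeded = refine(artifact, stub_revise, check_add)
print(succeeded, len(artifact.history))
```

Keeping the per-round history on the artifact is one possible design choice: it makes the execution feedback that drove each revision inspectable after the fact.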

Beyond the taxonomy, we examine how agent-initiated code interaction appears across five application domains. In coding assistance, agents author patches, tests, and issue-resolution workflows over live repositories [5, 57, 58]. In GUI and OS automation, agents synthesize and execute interface commands grounded in DOM trees, accessibility APIs, and executable evaluators [59, 60]. In scientific discovery, agents dynamically compose and execute hypothesis-testing pipelines spanning simulations, lab protocols, and data analysis [61, 62, 63, 64]. In personalization and embodied control, agents author and revise executable policies, simulators, and skill libraries in response to environment feedback [9, 10, 32]. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight, and extensions to multimodal environments. This survey provides a roadmap for studying code not only as something agents generate, but as the runtime medium through which they execute, adapt, and coordinate reliable behavior.