Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Paper · arXiv 2604.08224 · Published April 9, 2026
Design Frameworks · Memory · Tool/Computer Use · Routers · Foundation Models

Large language model (LLM) agents are increasingly built less by changing model weights than by reorganizing the runtime around them. Capabilities that earlier systems expected the model to recover internally are now externalized into memory stores, reusable skills, interaction protocols, and the surrounding harness that makes these modules reliable in practice. This paper reviews that shift through the lens of externalization. Drawing on the idea of cognitive artifacts, we argue that agent infrastructure matters not merely because it adds auxiliary components, but because it transforms hard cognitive burdens into forms that the model can handle more reliably. Under this view, memory externalizes state across time, skills externalize procedural expertise, protocols externalize interaction structure, and harness engineering serves as the unification layer that coordinates them into governed execution. We trace a historical progression from weights to context to harness, analyze memory, skills, and protocols as three distinct but coupled forms of externalization, and examine how they interact inside a larger agent system. We further discuss the trade-off between parametric and externalized capability, identify emerging directions such as self-evolving harnesses and shared agent infrastructure, and examine open challenges in evaluation, governance, and the long-term co-evolution of models and external infrastructure. The result is a systems-level framework for explaining why practical agent progress increasingly depends not only on stronger models, but on better external cognitive infrastructure.

The history of human civilization can also be read as a history of cognitive externalization. Spoken language transformed private thought into shareable symbolic form. Writing moved knowledge from fragile biological memory into persistent material records. Printing mechanized the reproduction of knowledge at social scale. Digital computation relocated arithmetic and symbolic manipulation from neural labor to programmable machines. Across these transitions, the critical change was not that humans became less capable without the artifact. Rather, the artifact reorganized the cognitive system by shifting selected burdens outward and freeing limited internal resources for planning, abstraction, and creativity [Norman, 1993]. The same pattern of outward delegation now recurs at the frontier of machine intelligence, in the design of large language model agents.

This perspective has a natural theoretical anchor in the idea of cognitive artifacts [Norman, 1991, 1993]. The central insight is that external aids do not merely amplify an unchanged internal ability; they often transform the task itself. A shopping list does not expand biological memory capacity. It changes a difficult recall problem into a recognition problem. A map does not simply make navigation “stronger.” It converts hidden spatial relations into visible structure. The power of an artifact therefore lies in representational transformation: it restructures the problem so that the agent can solve it more reliably with the competencies it already has [Norman, 1991].

We argue that the same logic now governs the most consequential design choices in LLM-based agents. Our central thesis is that externalization, the progressive relocation of cognitive burdens from the model’s internal computation into persistent, inspectable, and reusable external structures, is the transition logic that unifies recent advances in memory, skills, protocols, and harness engineering for language agents: it explains why each architectural shift has occurred and what form of reliability it sought to preserve. This is not merely a claim about engineering convenience. It is a claim about where reliable agency comes from: not from ever-larger models alone, but from the systematic restructuring of task demands so that internal capabilities and external infrastructure jointly cover the full range of competencies required [Norman, 1991, Sumers et al., 2024].

Figure 1 summarizes the argument. The upper panel traces the familiar arc of human cognitive externalization; the middle panel presents the corresponding arc for LLM agents, from weights through three externalization dimensions—memory, skills, and protocols—to the harness that unifies them; the lower panel maps the resulting literature landscape onto three capability layers—Weights, Context, and Harness. Figure 3 complements this view with an architectural overview of the externalized agent, showing the harness at the center with the three externalization dimensions and their operational elements orbiting it. Memory externalizes state across time, skills externalize procedural expertise, and protocols externalize interaction structure. The parallel between the two arcs encodes a recursive claim: LLM agents are themselves artifacts operating inside the latest major human externalization, digital computation. The common mechanism is representational transformation in Norman’s sense [Norman, 1991]: recall becomes recognition, improvised generation becomes composition, and ad hoc coordination becomes structured contract.

This lens is especially clarifying for understanding current practice. Contemporary progress is often narrated as a race for larger models, better training procedures, or more sophisticated reasoning traces. Those factors matter, but they do not fully explain the pattern observed in practical systems. Many of the largest gains in reliability do not come from changing the base model at all. They come from changing the environment around the model: adding persistent memory, organizing reusable skills, standardizing tool interfaces, constraining execution, instrumenting behavior, and routing work through explicit control logic [Sumers et al., 2024, Wang et al., 2024a, Li, 2025, Luo et al., 2025]. In practice, the question is increasingly not only “how capable is the model?” but also “what burdens have been externalized so the model no longer has to solve them internally every time?”

An unaided LLM still faces three recurrent mismatches that map directly onto the three harness dimensions. Its context window is finite and session memory is weak or absent, creating a continuity problem that memory externalization addresses. Long multi-step procedures are often rederived rather than executed consistently, creating a variance problem that skill externalization addresses. Interactions with external tools, services, and collaborators remain brittle when left to free-form prompting alone, creating a coordination problem that protocol externalization addresses [Sumers et al., 2024, Packer et al., 2023]. Externalization matters because it turns each of these burdens into a form the model can handle more reliably.

A concrete example helps fix the intuition. Consider a software engineering agent asked to implement a feature in a large repository, run tests, and open a pull request. Without externalization, the model must keep repository structure, project conventions, workflow state, and tool interactions active inside a fragile prompt. With externalization, persistent project memory supplies context, reusable skill documents encode conventions and workflow, protocolized tool interfaces enforce correct schemas, and the harness sequences steps, validates outputs, and manages failures. The base model may remain unchanged; what changes is the representation of the task it is asked to solve.

This broader perspective also aligns with the intuition behind distributed and extended cognition: once crucial parts of remembering, guiding action, and coordinating interaction are delegated to external structures, intelligence is no longer localized in the model alone [Clark and Chalmers, 1998]. We draw on this tradition for its core engineering insight—that the boundary between “agent” and “environment” is a design choice with real performance consequences—rather than committing to its stronger ontological claims. Our focus is pragmatic: we treat externalization as a design principle whose value is measured by the reliability, composability, and governability of the resulting system.

We now turn to the three dimensions of externalization that constitute the harness, each corresponding to one of the representational transformations highlighted in Figure 1 (middle panel).

Memory systems externalize state across time. Rather than relying on the context window as the sole carrier of history, memory systems allow accumulated knowledge—user preferences, prior trajectories, resolved ambiguities, domain facts—to persist beyond any single session and be selectively retrieved when relevant. The core transformation is from recall to recognition: the agent no longer needs to regenerate past knowledge from latent weights; it retrieves it from a persistent, searchable store [Lewis et al., 2020, Park et al., 2023, Packer et al., 2023, Chhikara et al., 2025, Xu et al., 2025b].
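
To make the recall-to-recognition shift concrete, the sketch below implements a minimal persistent store. It is illustrative only: the `MemoryStore` class, its JSONL layout, and the word-overlap relevance score are our own simplifying assumptions, standing in for the embedding-based retrieval that the cited systems actually use.

```python
import json
import time
from pathlib import Path

class MemoryStore:
    """Illustrative persistent memory: entries outlive any single session
    and are retrieved by relevance rather than regenerated from weights."""

    def __init__(self, path: str = "agent_memory.jsonl"):
        self.path = Path(path)

    def write(self, text: str, kind: str = "fact") -> None:
        # Append-only log: each entry persists beyond the current context window.
        entry = {"ts": time.time(), "kind": kind, "text": text}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance score: word overlap stands in for embedding similarity.
        if not self.path.exists():
            return []
        q = set(query.lower().split())
        entries = [json.loads(line) for line in self.path.open()]
        scored = sorted(
            entries,
            key=lambda e: len(q & set(e["text"].lower().split())),
            reverse=True,
        )
        return [e["text"] for e in scored[:k]]

# Recognition, not recall: the prompt is assembled from retrieved entries
# instead of asking the model to reconstruct history internally.
memory = MemoryStore()
memory.write("User prefers pytest over unittest", kind="preference")
context = memory.retrieve("user test framework preference")
```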

Skill systems externalize procedural expertise. Rather than relying on the model’s weights to regenerate task-specific know-how on every invocation, skill systems package procedures, best practices, and operating guidance into reusable artifacts. The core transformation is from generation to composition: the agent assembles behavior from pre-validated components rather than improvising each step de novo [OpenAI, 2023a, Schick et al., 2023, Wang et al., 2023a, Anthropic, 2025, 2026, Jiang et al., 2026b].
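
The following sketch illustrates generation-to-composition under similarly loose assumptions: the `Skill` dataclass, the trigger-matching rule, and the prompt layout are hypothetical, loosely inspired by the skill-document approaches cited above rather than a reproduction of any of them.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative skill artifact: a pre-validated procedure the agent
    composes into its prompt instead of rederiving the workflow each time."""
    name: str
    trigger: str       # when the harness should surface this skill
    instructions: str  # the externalized procedure itself

SKILLS = [
    Skill(
        name="open-pull-request",
        trigger="pull request",
        instructions=(
            "1. Create a feature branch. 2. Run the full test suite. "
            "3. Write a conventional commit message. 4. Open the PR with "
            "a summary of changes and test results."
        ),
    ),
]

def compose_prompt(task: str) -> str:
    # Composition, not generation: matching skills are injected verbatim,
    # so the procedure is followed consistently rather than improvised.
    matched = [s for s in SKILLS if s.trigger in task.lower()]
    skills = "\n\n".join(f"## Skill: {s.name}\n{s.instructions}" for s in matched)
    return f"{skills}\n\n## Task\n{task}" if skills else f"## Task\n{task}"

print(compose_prompt("Implement the feature and open a pull request."))
```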

Protocols externalize interaction structure. Rather than relying on ad hoc prompt-level coordination with tools, services, and other agents, protocols define explicit machine-readable contracts for discovery, invocation, delegation, and permission management. The core transformation is from ad hoc to structured: ambiguous, fragile communication becomes interoperable, governable exchange [Anthropic, 2024, Google Cloud, 2025a, Ehtesham et al., 2025c].
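
A minimal sketch of such a contract follows. The schema shape and the validator are deliberately simplified assumptions in the spirit of MCP-style tool definitions, not the actual MCP wire format; `run_tests` and its parameters are hypothetical.

```python
# Hypothetical machine-readable tool contract (simplified; not MCP's format).
RUN_TESTS_CONTRACT = {
    "name": "run_tests",
    "description": "Run the project test suite and report results.",
    "parameters": {
        "path": {"type": str, "required": True},
        "verbose": {"type": bool, "required": False},
    },
}

def validate_call(contract: dict, args: dict) -> list[str]:
    """Structured, not ad hoc: malformed invocations are rejected before
    they reach the tool, rather than hoping free-form prompting produced
    the right shape."""
    errors = []
    params = contract["parameters"]
    for name, spec in params.items():
        if spec["required"] and name not in args:
            errors.append(f"missing required parameter: {name}")
        elif name in args and not isinstance(args[name], spec["type"]):
            errors.append(f"wrong type for {name}: expected {spec['type'].__name__}")
    for name in args:
        if name not in params:
            errors.append(f"unknown parameter: {name}")
    return errors

# A model-emitted call is checked against the contract before execution.
assert validate_call(RUN_TESTS_CONTRACT, {"path": "tests/"}) == []
assert validate_call(RUN_TESTS_CONTRACT, {"pth": "tests/"}) != []
```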

The harness is the engineering layer that hosts all three dimensions and provides the orchestration logic, constraints, observability, and feedback loops that make externalized cognition cohere in practice. It is not a fourth kind of externalization alongside memory, skills, and protocols. It is the runtime environment within which these forms of externalization operate and interact.
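
The sketch below shows one way a single harness step might thread the three dimensions together; every callable in it is a stand-in we introduce for illustration, not the API of any existing framework.

```python
from typing import Callable

def harness_step(
    task: str,
    retrieve_memory: Callable[[str], str],       # memory: state across time
    load_skills: Callable[[str], str],           # skills: procedural expertise
    validate_tool_call: Callable[[dict], list],  # protocols: interaction contracts
    call_model: Callable[[str], dict],
    max_retries: int = 2,
) -> dict:
    # 1. Memory and skills restructure the task before the model sees it.
    prompt = f"{retrieve_memory(task)}\n{load_skills(task)}\n{task}"
    for attempt in range(1 + max_retries):
        proposal = call_model(prompt)            # model proposes a tool call
        errors = validate_tool_call(proposal)    # 2. Protocol contract check
        print(f"[trace] attempt={attempt} errors={errors}")  # 3. Observability
        if not errors:
            return proposal
        # 4. Feedback loop: errors go back to the model instead of failing silently.
        prompt += f"\nPrevious call rejected: {errors}. Emit a corrected call."
    raise RuntimeError("harness: no valid tool call within retry budget")

# Demo with stubs; a real deployment wires in an LLM and real validators.
result = harness_step(
    task="run the tests",
    retrieve_memory=lambda t: "# memory: user prefers pytest",
    load_skills=lambda t: "# skill: run the suite with `pytest -q`",
    validate_tool_call=lambda call: [] if "tool" in call else ["missing tool"],
    call_model=lambda prompt: {"tool": "run_tests", "args": {"path": "tests/"}},
)
```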

These dimensions do not evolve in isolation. Memory expansion can compete with skill loading for scarce context budget. Protocol standardization can improve interoperability while constraining how capabilities are packaged and invoked. Skill execution generates traces that later become memory, and memory retrieval can influence which skills and protocol paths are chosen next. The harness must mediate all of these interactions. We preview these system-level couplings here and analyze them in detail in Section 7.