From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper argues that the next major bottleneck in agentic AI is system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Recent progress in large language models (LLMs) has enabled agents that use tools, retrieve information, maintain memory, and execute long-horizon workflows. Yet evaluation remains largely model-centric, reducing agents to final-task success or benchmark accuracy while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate: agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer for tools and subagents, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, the system that translates model capability into long-horizon agent behavior. We therefore study scaling the harness through three core bottlenecks in agentic AI: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that operationalize system scaling, going beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. Alongside the framework, we develop and release CheetahClaws, a Python-native reference harness, and use it together with Claude Code and OpenClaw as concrete points of comparison that make harness-level design choices explicit.
Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.
The dominant story of recent AI progress has been model scaling: larger models, more data, stronger post-training, and higher benchmark scores. For agentic AI, this story is now incomplete. Once foundation models are embedded into tools, terminals, browsers, memory stores, and external services, their behavior is no longer determined by the model alone. It is determined by a system: how context is constructed, how memory is retrieved, how tools are invoked, how subagents are routed, how actions are verified, and how failures are audited.
These observations suggest that several parts of the agent system need rethinking. Prompt engineering remains useful for local control, but long-horizon performance increasingly depends on reusable skills, persistent memory, disciplined context construction, and verification-aware execution. The key issue is not only context size, but context governance: what should be retrieved, compressed, ordered, refreshed, trusted, and kept active at each step. Memory is not merely a storage layer; the harder problem is memory quality, including what to store, what to discard, how to retrieve the right information at the right time, and how to avoid staleness, drift, contamination, and over-generalization. Multi-agent systems are not automatically collaborative; reliable collaboration requires explicit communication protocols and uncertainty sharing. Finally, the field still lacks a mature framework for agent evolution over time, including how agents should update skills, refine memory, communicate across roles, and remain auditable as they adapt.
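To make the distinction between context size and context governance concrete, the following is a minimal Python sketch of one possible selection policy. It is not taken from CheetahClaws, Claude Code, or OpenClaw. The `MemoryItem` fields, the `build_context` function, the staleness and trust thresholds, and the relevance-times-trust ranking are all illustrative assumptions. The sketch covers only filtering, ordering, and budgeting; compression and refresh are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical memory record; every field is an illustrative assumption."""
    text: str
    relevance: float  # assumed retrieval score in [0, 1]
    trust: float      # assumed provenance score in [0, 1]
    age_steps: int    # agent steps since the item was last refreshed
    tokens: int       # approximate token cost of including the item


def build_context(items, budget_tokens, max_age_steps=50, min_trust=0.5):
    """Select items for the active context under a token budget.

    Stale or low-trust items are dropped first. The remaining items are
    ranked by relevance * trust and added greedily while they still fit.
    Returns the selected items and the number of tokens they consume.
    """
    eligible = [
        it for it in items
        if it.age_steps <= max_age_steps and it.trust >= min_trust
    ]
    ranked = sorted(eligible, key=lambda it: it.relevance * it.trust, reverse=True)

    selected, used = [], 0
    for it in ranked:
        if used + it.tokens <= budget_tokens:
            selected.append(it)
            used += it.tokens
    return selected, used


if __name__ == "__main__":
    memory = [
        MemoryItem("repo uses pytest", 0.9, 0.9, 3, 40),
        MemoryItem("old API key location", 0.8, 0.9, 400, 30),  # stale
        MemoryItem("unverified web claim", 0.95, 0.2, 1, 50),   # low trust
        MemoryItem("user prefers small diffs", 0.6, 1.0, 10, 30),
        MemoryItem("long design doc summary", 0.7, 0.8, 5, 80),  # over budget
    ]
    chosen, used = build_context(memory, budget_tokens=100)
    for it in chosen:
        print(it.text)
    print("tokens used:", used)
```

In this toy run the item with the highest raw relevance score is excluded because its trust score is low, and a relevant but stale item is excluded because of its age. This is the sense in which governance, rather than window size alone, determines what stays active. A real harness would need calibrated scores and an explicit refresh policy, which this sketch does not attempt.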
Agentic AI is moving from isolated model inference to persistent system execution. As models are embedded into tools, memory stores, repositories, browsers, subagents, and external services, their behavior is increasingly shaped by the architecture around them. This paper has shown that future progress therefore requires system scaling: improving how agents construct context, maintain trustworthy memory, route skills, verify actions, govern tools, communicate across roles, and evolve over time. Claude Code, OpenClaw, and CheetahClaws illustrate that comparable models projected onto different harnesses produce qualitatively different agents, and that the harness, not the model alone, is now a primary source of practical capability.