INQUIRING LINE

What components of agent scaffolding most impact domain-specific output quality?

This explores which parts of the 'harness' around a model — memory, skills, coordination, context handling — do the heavy lifting for quality, rather than asking whether a bigger model alone is enough.


This reads the question as: when you wrap a model in scaffolding (memory, tools, multiple agents, context managers), which of those pieces actually moves output quality in a specific domain? The corpus has a clear through-line: the surrounding system, not raw model scale, is where quality lives. One synthesis names this directly: reliable agents externalize three cognitive burdens, namely memory (state persistence), skills (reusable procedures), and protocols (structured interaction), into a harness layer so the model stops re-solving the same problems every turn (Where does agent reliability actually come from?). That's the short answer to "which components": memory, skills, and interaction protocols carry the load.
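As a rough illustration of what "externalizing" could look like, the sketch below keeps memory, skills, and a message log outside the model. The `Harness` class, its field names, and the `normalize_unit` skill are all hypothetical and not taken from any of the cited notes; a plain function stands in for a model-backed procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Hypothetical harness: holds state the model would otherwise re-derive."""
    memory: Dict[str, str] = field(default_factory=dict)  # state persistence
    skills: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # reusable procedures
    log: List[dict] = field(default_factory=list)  # protocol: structured messages

    def register_skill(self, name: str, fn: Callable[[str], str]) -> None:
        self.skills[name] = fn

    def call(self, skill: str, arg: str) -> str:
        """Run a skill, caching the result so later turns reuse it."""
        key = f"{skill}:{arg}"
        cached = key in self.memory
        if not cached:
            self.memory[key] = self.skills[skill](arg)
        # Every interaction is recorded as a structured message, not free text.
        self.log.append({"skill": skill, "arg": arg, "cached": cached})
        return self.memory[key]


harness = Harness()
harness.register_skill("normalize_unit", lambda s: s.strip().lower())
first = harness.call("normalize_unit", "  MG/L ")
second = harness.call("normalize_unit", "  MG/L ")  # served from memory
```

The second call never re-runs the skill; the log records that it was answered from memory, which is the "stop re-solving the same problem" behaviour in miniature.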

The most consistent finding is that *how agents coordinate and exchange information* matters more than how smart any single agent is. Structured artifacts beat conversation: agents that hand each other standardized engineering documents (rather than chatting) coordinate far better, because the artifact strips out noise and lets agents pull exactly what they need (Does structured artifact sharing outperform conversational coordination?). On a hard domain task, writing scientific papers, specialized multi-agent orchestration achieved 50–68% absolute win margins on literature review quality against autonomous baselines in human evaluation, largely because distributing the work avoids the context-window failures a single model hits on complex synthesis (Can specialized agents write better scientific papers than single models?). But there's a sharp caveat: roughly 80% of multi-agent performance variance turns out to track token budget, not coordination cleverness (How does test-time scaling work at the agent level?). So before you credit your orchestration design, check whether you're just spending more.
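One simple way to run that check is to compare designs only where their token spend is similar. The sketch below is a minimal version under stated assumptions: the function name `matched_budget_gain`, the 10% tolerance, and every number in the example runs are invented for illustration and are not results from the cited notes.

```python
from statistics import mean


def matched_budget_gain(baseline, candidate, tolerance=0.10):
    """Mean score gain over run pairs whose token spend differs by <= tolerance.

    Each run is a (tokens_used, score) tuple. Returns None if no pair matches.
    """
    gains = []
    for b_tokens, b_score in baseline:
        for c_tokens, c_score in candidate:
            if abs(c_tokens - b_tokens) <= tolerance * b_tokens:
                gains.append(c_score - b_score)
    return mean(gains) if gains else None


# Made-up illustrative runs: (tokens used, quality score in [0, 1]).
single_agent = [(10_000, 0.60), (40_000, 0.72)]
multi_agent = [(41_000, 0.74), (120_000, 0.85)]

# Naive comparison ignores that the multi-agent runs spent far more tokens.
raw_gain = mean(s for _, s in multi_agent) - mean(s for _, s in single_agent)
# Budget-matched comparison only uses the 40k vs 41k pair.
fair_gain = matched_budget_gain(single_agent, multi_agent)
```

In this toy data the naive gain is about 0.135, but the budget-matched gain is only about 0.02, which is the kind of gap the caveat warns about.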

Context handling is the next big lever, and it's adaptive, not one-size-fits-all. An external, RL-trained context manager can prune what a frozen agent sees, and the surprising rule is that stronger agents want high-fidelity context preserved while weaker agents need *aggressive* compression to stay reliable (Can external managers compress context better than frozen agents?). So the same scaffolding component should be tuned in opposite directions depending on the model underneath it. Relatedly, you don't need a frontier model everywhere: small language models handle most repetitive, well-defined subtasks at 10–30× lower cost, which makes the highest-quality-per-dollar design a heterogeneous one, with SLMs by default and large models only where they earn it (Can small language models handle most agent tasks?).
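The "opposite directions" point can be made concrete with a toy pruner. Note that this swaps the RL-trained manager described above for a fixed, hand-written heuristic: the relevance scores, the `KEEP_RATIO` values, and the "strong"/"weak" labels are all assumptions chosen for illustration.

```python
def prune_context(chunks, keep_ratio):
    """Keep the highest-scoring chunks, preserving their original order.

    `chunks` is a list of (text, relevance_score) pairs.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(chunks) * keep_ratio))
    ranked = sorted(range(len(chunks)), key=lambda i: chunks[i][1], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore document order
    return [chunks[i][0] for i in keep]


# Hypothetical settings: high fidelity for a strong agent, aggressive
# compression for a weak one.
KEEP_RATIO = {"strong": 0.8, "weak": 0.4}

chunks = [
    ("spec", 0.9),
    ("chitchat", 0.1),
    ("error log", 0.8),
    ("old plan", 0.4),
    ("schema", 0.7),
]
strong_view = prune_context(chunks, KEEP_RATIO["strong"])
weak_view = prune_context(chunks, KEEP_RATIO["weak"])
```

Here the strong agent sees four of the five chunks while the weak agent sees only the top two; a learned manager would choose what to keep per step rather than use a fixed ratio.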

For domain *specialization* specifically, the corpus warns that fine-tuning alone cannot stand in for scaffolding. Turning an LLM into an action-capable agent takes a four-stage pipeline: curating domain action/environment data, training for grounding, integrating memory-and-tool infrastructure, and safety evaluation. It's the surrounding system that decides whether actions are grounded or hallucinated (Can you turn an LLM into an agent by just fine-tuning?). There's a deeper ceiling too: agents trained only on static expert demonstrations can't generalize past what the curator imagined, because they never interact with the environment to learn from their own failures (Can agents learn beyond what their training data shows?). Domain quality, in other words, is bounded by whether the scaffold lets the agent *practice*, not just imitate.

The quiet thread worth taking away: scaffolding components also have failure modes that silently cap quality, and you only see them if you measure the right thing. Agentic evaluation with live evidence collection cut judge shift to 0.27%, against 31% for LLM-as-a-Judge, roughly a 100× reduction. Yet its own memory module cascaded errors, showing that even reliability-boosting components need error isolation (Can agents evaluate AI outputs more reliably than language models?). That's why one line argues evaluation itself must move past one-shot task success to score trajectory quality, memory hygiene, and context efficiency: the harness, not just the answer (What should we actually measure in agent evaluation?). And if you'd rather not hand-tune all this, representing the whole agent as a computational graph lets you optimize both the prompts and the wiring between agents automatically (Can we automatically optimize both prompts and agent coordination?).
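To make the graph view concrete, here is a minimal sketch of an agent expressed as nodes (operations) and edges (information flow). It covers only the representation and execution; the automatic optimization of prompts and wiring is not implemented. The `run_graph` helper and the node names are hypothetical, and plain functions stand in for model calls.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_graph(nodes, edges, task):
    """Execute nodes in dependency order.

    `nodes` maps a name to a function (task, parent_outputs) -> str.
    `edges` is a list of (source, destination) pairs.
    """
    parents = {name: [] for name in nodes}
    for src, dst in edges:
        parents[dst].append(src)
    outputs = {}
    for name in TopologicalSorter(parents).static_order():
        inputs = [outputs[p] for p in parents[name]]
        outputs[name] = nodes[name](task, inputs)
    return outputs


# Stand-ins for prompted model calls.
nodes = {
    "draft": lambda task, inputs: f"draft({task})",
    "critique": lambda task, inputs: f"critique({inputs[0]})",
    "revise": lambda task, inputs: f"revise({', '.join(inputs)})",
}

# Same nodes, different wiring: a plain chain versus a critique loop.
chain_nodes = {k: nodes[k] for k in ("draft", "revise")}
chain_out = run_graph(chain_nodes, [("draft", "revise")], "t")
loop_out = run_graph(
    nodes,
    [("draft", "critique"), ("draft", "revise"), ("critique", "revise")],
    "t",
)
```

Because the wiring is just data (a list of edges), an optimizer could in principle search over it in the same way it searches over node prompts, which is the appeal of this representation.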


Sources (11 notes)

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens, namely memory (state persistence), skills (procedural components), and protocols (structured interaction), into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can specialized agents write better scientific papers than single models?

PaperOrchestra's specialized agents achieved 50–68% absolute win margins on literature review quality and 14–38% on overall manuscript quality versus autonomous baselines in human evaluation. Distributed coordination prevents single-model context window failures on complex synthesis tasks.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can you turn an LLM into an agent by just fine-tuning?

Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Next inquiring lines