SYNTHESIS NOTE
Agentic Systems and Tool Use · Training, RL, and Test-Time Scaling

Do stronger models always evolve their own harnesses better?

When AI agents self-improve their prompts and tools, does raw model power help equally at writing updates versus using them? Understanding this split could reshape how we design self-evolving systems.

Synthesis note · 2026-06-03 · sourced from Evolution

Self-evolving agents edit an external harness (prompts, skills, memories, tools) based on execution evidence, without touching model parameters. The natural assumption is that stronger base models do better on both sides: writing harness updates and benefiting from them. This paper separates these two capabilities and finds that neither follows that assumption.
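To make the setup concrete, here is a minimal sketch of a harness as plain data that an evolver edits while the model stays fixed. The `Harness`, `Update`, and `apply_update` names and all field names are hypothetical and chosen for illustration; the note does not specify how the paper represents a harness.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Harness:
    """External, editable state around a fixed model (field names are illustrative)."""
    system_prompt: str = ""
    skills: dict[str, str] = field(default_factory=dict)  # skill name -> instructions
    memories: tuple[str, ...] = ()
    tools: tuple[str, ...] = ()


@dataclass(frozen=True)
class Update:
    """One persistent edit proposed by an evolver after reading execution evidence."""
    skill_name: str
    skill_text: str
    memory_note: str


def apply_update(harness: Harness, update: Update) -> Harness:
    """Return a new harness with the edit applied; model parameters are never involved."""
    skills = {**harness.skills, update.skill_name: update.skill_text}
    memories = harness.memories + (update.memory_note,)
    return replace(harness, skills=skills, memories=memories)


base = Harness(system_prompt="You are a coding agent.", tools=("shell",))
evolved = apply_update(
    base,
    Update(
        skill_name="run_tests",
        skill_text="Run the test suite before finishing.",
        memory_note="Previous run skipped the tests.",
    ),
)
```

Keeping the harness immutable means each update yields a new version, so the same agent can be evaluated on `base` and `evolved` side by side.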

Harness-updating, the ability to produce persistent edits that lead to gains, is roughly flat across base-model capability. Evolvers from different capability tiers produce updates that yield surprisingly similar gains; even a Qwen3.5-9B evolver induces gains comparable to those induced by Claude Opus 4.6. Writing a good skill or memory is apparently not bottlenecked by raw model strength.
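One way to read "gains induced by an evolver" is as the average score change across a fixed set of agents when they switch from the base harness to that evolver's harness. The sketch below assumes that reading; the function names, agent and evolver labels, and all scores are placeholders invented for illustration and are not results from the paper.

```python
from statistics import mean


def updating_gain(before: dict[str, float], after: dict[str, float]) -> float:
    """Mean score change across agents when they adopt one evolver's harness."""
    return mean(after[agent] - before[agent] for agent in before)


def gain_spread(gain_by_evolver: dict[str, float]) -> float:
    """Range of induced gains across evolvers; a small value means 'flat'."""
    return max(gain_by_evolver.values()) - min(gain_by_evolver.values())


# Placeholder scores (NOT from the paper): task success rates per agent.
base_scores = {"agent_a": 0.40, "agent_b": 0.60}
scores_with_harness = {
    "evolver_small": {"agent_a": 0.50, "agent_b": 0.70},
    "evolver_large": {"agent_a": 0.52, "agent_b": 0.68},
}

gains = {
    evolver: updating_gain(base_scores, after)
    for evolver, after in scores_with_harness.items()
}
spread = gain_spread(gains)
```

Under this definition, a flat harness-updating capability shows up as a `spread` close to zero even when the evolvers differ widely in size.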

Harness-benefit — actually improving when handed an updated harness — is non-monotonic. Weak-tier models gain little, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. Two failure modes explain the weak end: failing to activate the relevant harness artifact, and failing to follow it faithfully once activated.
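The two weak-end failure modes can be told apart from an execution trace: first check whether the relevant artifact was invoked at all, then check whether its steps were carried out. The sketch below is one hypothetical way to do that, treating "followed faithfully" as "the required steps appear in order"; the `Trace` fields, the `Outcome` labels, and that criterion are assumptions, not the paper's procedure.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    NOT_ACTIVATED = "relevant artifact never invoked"
    NOT_FOLLOWED = "artifact invoked but its steps were not followed"
    FOLLOWED = "artifact invoked and followed"


@dataclass
class Trace:
    """Hypothetical summary of one agent run."""
    invoked_artifacts: set[str]
    completed_steps: list[str]


def classify(trace: Trace, artifact: str, required_steps: list[str]) -> Outcome:
    """Separate activation failures from faithfulness failures."""
    if artifact not in trace.invoked_artifacts:
        return Outcome.NOT_ACTIVATED
    # Subsequence check: each required step must occur after the previous one.
    remaining = iter(trace.completed_steps)
    if all(step in remaining for step in required_steps):
        return Outcome.FOLLOWED
    return Outcome.NOT_FOLLOWED


steps = ["run_tests", "fix_failures"]
never_used = classify(Trace(set(), ["fix_failures"]), "test_skill", steps)
wrong_order = classify(
    Trace({"test_skill"}, ["fix_failures", "run_tests"]), "test_skill", steps
)
faithful = classify(
    Trace({"test_skill"}, ["run_tests", "read_logs", "fix_failures"]),
    "test_skill",
    steps,
)
```

Counting outcomes per model tier would show whether a small harness-benefit comes mainly from artifacts that are never activated or from artifacts that are activated but not followed.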

The practical inversion is sharp: invest capability budget in the agent that uses the harness, not in the evolver that writes it, and target agent training at harness invocation and long-horizon instruction-following rather than at generating cleverer updates. This complicates the "let a frontier model improve everything" intuition. It also connects to the note "Why do better reasoning models ignore instructions?": strong models may benefit less precisely because the bottleneck is faithful instruction-following, which scaling erodes.

Original note title

the capacity to produce useful harness updates is flat across model tiers but the capacity to benefit from them peaks at mid-tier