INQUIRING LINE

How much does agent performance depend on demonstration quantity versus curation quality?

This explores whether you make an agent better by feeding it more demonstrations or by carefully selecting which ones it sees — and the corpus comes down hard on the curation side, with an important catch.


This reads the question as a contest between two levers: pile up more demonstrations, or curate a smaller set well. The collection's strongest signal is that careful selection wins by a wide margin. The clearest case is LIMI, which reaches 73.5% on AgencyBench using only 78 hand-picked multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7% (Can careful selection of 78 demos outperform massive training datasets?). The proposed mechanism is worth sitting with: complete interaction sequences don't teach new abilities so much as *activate* agentic patterns already latent in the pretrained model. If that's right, then most demonstrations in a large set are redundant, and the few that capture the full shape of tool use plus reasoning are doing most of the work.

But the corpus doesn't let curation off the hook either: it names a ceiling. Agents trained only on static expert demonstrations cannot learn from their own failures or generalize beyond the scenarios the curator imagined, because they never interact with a live environment during training (Can agents learn beyond what their training data shows?). So the real picture is: quantity matters far less than curation quality, but even perfect curation is capped by the curator's foresight. That reframes the question. Past a point, the lever isn't "more demos" or "better demos" but "let the agent generate its own experience."

The same quality-over-quantity logic shows up far from training data, which is the lateral payoff here. In agent memory, adding storage capacity without curation actively *degrades* performance through staleness, drift, and contamination; the bottleneck is what to discard, not what to accumulate (Is agent memory capacity or quality the real bottleneck?). In skill libraries, a separately trained curator that decides what to keep and refine outperforms letting a frozen agent dump generic additions (Can a separate trained curator improve skill libraries better than frozen agents?). Across demonstrations, memory, and skills, the recurring finding is that an editorial function, choosing what survives, beats raw volume in each case.
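
As a rough illustration of the memory point above, here is a minimal sketch of a store whose main operation is discarding rather than accumulating. The `MemoryItem` fields, the `curate` function, and both thresholds are hypothetical names and values chosen for illustration; none of them come from the cited notes.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    last_used_step: int  # step at which the item was last retrieved (hypothetical field)
    usefulness: float    # score in [0, 1]; how it is estimated is left as an assumption


def curate(items, current_step, max_age=100, min_usefulness=0.3):
    """Keep only items that are both recent enough and useful enough.

    The thresholds are illustrative assumptions, not tuned values.
    """
    kept = []
    for item in items:
        age = current_step - item.last_used_step
        if age <= max_age and item.usefulness >= min_usefulness:
            kept.append(item)
    return kept


# Made-up example entries, for illustration only.
memory = [
    MemoryItem("API v2 requires a token header", last_used_step=480, usefulness=0.9),
    MemoryItem("API v1 endpoint list", last_used_step=50, usefulness=0.8),  # stale
    MemoryItem("User once mentioned the weather", last_used_step=490, usefulness=0.1),  # low value
]

curated = curate(memory, current_step=500)
print([m.text for m in curated])  # ['API v2 requires a token header']
```

The design choice worth noticing is that capacity never appears as a parameter: the store shrinks or grows depending on what passes the filter, which is one simple way to express "what to discard" as the primary decision.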

There's a deeper reframing worth knowing: much of what looks like "agent performance" isn't sitting in the demonstrations at all. One line of work argues reliability comes from externalizing memory, skills, and protocols into a harness layer, so the model isn't re-solving the same problems from scratch (Where does agent reliability actually come from?). And how you *reward* matters too: in agentic RAG, fine-grained feedback on intermediate retrieval steps beats final-answer-only rewards (Does supervising retrieval steps outperform final answer rewards?). Both suggest the quantity-vs-quality framing is itself a bit narrow: structure and signal-richness can substitute for sheer example count.
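
To make the reward point above concrete, the following toy sketch contrasts the two kinds of signal on a single trajectory. It is not the training method from the cited note (which uses preference optimization over step feedback); the function names and the example trajectory are hypothetical and only show why step-level feedback carries more information.

```python
def outcome_only_signal(step_labels, final_correct):
    """Every step receives the same reward, derived only from the final answer."""
    reward = 1.0 if final_correct else 0.0
    return [reward] * len(step_labels)


def process_level_signal(step_labels):
    """Each step receives its own reward from a per-step judgment."""
    return [1.0 if ok else 0.0 for ok in step_labels]


# Hypothetical trajectory: two good retrieval steps, one bad one, wrong final answer.
step_labels = [True, True, False]

print(outcome_only_signal(step_labels, final_correct=False))  # [0.0, 0.0, 0.0]
print(process_level_signal(step_labels))                      # [1.0, 1.0, 0.0]
```

Under the outcome-only signal the two good steps are penalized along with the bad one, whereas the step-level signal isolates the step that went wrong.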

One caution the collection adds: you can't actually tell whether curation worked if you only measure one-shot task success. Single-score evaluation hides trajectory quality, memory hygiene, and verification cost, exactly the dimensions where good curation pays off (What should we actually measure in agent evaluation?). So the honest answer to "how much does it depend on curation?" partly depends on measuring the right things in the first place.
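
A small sketch of what keeping those dimensions separate might look like. The `AgentEvalReport` class, its field scales, and every number below are invented placeholders for illustration, not measurements from any cited work.

```python
from dataclasses import dataclass, asdict


@dataclass
class AgentEvalReport:
    task_success: float        # fraction of tasks solved
    trajectory_quality: float  # hypothetical 0-1 score
    memory_hygiene: float      # hypothetical 0-1 score
    verification_cost: float   # assumed unit: reviewer minutes per task


def differing_dimensions(a, b, tolerance=1e-9):
    """Return the names of the dimensions on which two reports differ."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if abs(da[name] - db[name]) > tolerance]


# Placeholder values: identical task success, different everything else.
uncurated = AgentEvalReport(0.70, 0.40, 0.30, 12.0)
curated = AgentEvalReport(0.70, 0.80, 0.90, 4.0)

print(differing_dimensions(uncurated, curated))
# ['trajectory_quality', 'memory_hygiene', 'verification_cost']
```

In this made-up case a leaderboard reporting only `task_success` would rank the two agents as equal, which is the failure mode the paragraph above describes.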


Sources (7 notes)

Can careful selection of 78 demos outperform massive training datasets?

LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.
