Agentic and Multi-Agent Systems

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or on curation quality? LIMI achieves 73.5% on AgencyBench with only 78 training samples, versus 24-45% for models trained on 10,000+ samples, suggesting strategic demonstration design matters far more than scale.

Note · 2026-02-23 · sourced from Agents

The LIMI paper challenges the core assumption that agentic capability scales with training data volume. Using only 78 carefully designed training samples, each capturing a complete multi-turn interaction sequence including tool use, reasoning, and environmental feedback, LIMI achieves 73.5% on AgencyBench, dramatically outperforming Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI shows a 53.7% improvement over models trained on 10,000 samples.

Three innovations drive this:

  1. Agentic query synthesis — human-AI collaborative collection from real-world scenarios plus systematic GitHub PR-based synthesis, ensuring ecological validity
  2. Complete trajectory collection — full multi-turn sequences from task understanding through tool utilization to successful completion, not isolated demonstrations
  3. The Agency Efficiency Principle — machine autonomy emerges from strategic curation, not data accumulation
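The "complete trajectory" idea can be made concrete with a small sketch. The schema and curation rule below are illustrative assumptions, not the paper's actual data format: the point is that each sample is a full multi-turn sequence ending in verified completion, and that a small, strict filter replaces bulk accumulation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one LIMI-style training sample: a complete
# multi-turn trajectory rather than an isolated instruction-response pair.
# Field names here are assumptions for illustration only.

@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str   # reasoning text, a tool call, or environment feedback

@dataclass
class Trajectory:
    query: str                        # the agentic task, e.g. drawn from a real scenario or a GitHub PR
    turns: List[Turn] = field(default_factory=list)
    completed: bool = False           # did the trajectory reach successful task completion?

def curate(trajectories: List[Trajectory], budget: int = 78) -> List[Trajectory]:
    """Keep only successful multi-turn trajectories, capped at a small budget.

    A strict quality filter plus a tiny budget stands in for the paper's
    'strategic curation over data accumulation' principle.
    """
    kept = [t for t in trajectories if t.completed and len(t.turns) >= 2]
    # Prefer richer interaction sequences (more tool use and feedback captured).
    kept.sort(key=lambda t: len(t.turns), reverse=True)
    return kept[:budget]
```

Under this sketch, an incomplete or single-turn sample is discarded outright, so the final set stays tiny but every retained sample demonstrates the full loop from task understanding through tool use to completion.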

This extends a pattern now documented across three capability domains: reasoning (LIMO achieved complex math with 817 samples), instruction-following (LIMA achieved alignment with 1,000 examples), and now agency. If base models already contain hidden reasoning ability, the mechanism is likely the same here: curated demonstrations activate latent agentic patterns already embedded through pretraining on code, documentation, and workflow descriptions. The training data doesn't teach agency; it triggers the phase transition from passive language model to active agent.

The practical implication challenges the resource-intensive approach to building agentic systems. If 78 demonstrations outperform 10,000, the bottleneck is data quality and trajectory design, not data volume. Taken together with evidence that models can improve themselves on tasks without verifiable answers, a consistent principle emerges: capability activation requires showing the model what it looks like to use a capability, not exhaustive training.

