Agentic and Multi-Agent Systems

Can 78 demonstrations teach agency better than 10,000?

Does agentic capability depend on data volume or on curation quality? LIMI achieves 73.5% on AgencyBench with only 78 training samples, versus 24-45% for models trained on 10,000+ samples, suggesting strategic demonstration design matters far more than scale.

Note · 2026-02-23 · sourced from Agents

The LIMI paper challenges the core assumption that agentic capability scales with training data volume. Using only 78 carefully designed training samples, each capturing a complete multi-turn interaction sequence including tool use, reasoning, and environmental feedback, LIMI achieves 73.5% on AgencyBench, dramatically outperforming Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI shows a 53.7% improvement over models trained on 10,000 samples.

Three innovations drive this:

  1. Agentic query synthesis — human-AI collaborative collection from real-world scenarios plus systematic GitHub PR-based synthesis, ensuring ecological validity
  2. Complete trajectory collection — full multi-turn sequences from task understanding through tool utilization to successful completion, not isolated demonstrations
  3. The Agency Efficiency Principle — machine autonomy emerges from strategic curation, not data accumulation
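The "complete trajectory" idea can be made concrete with a small sketch. The schema and curation rule below are illustrative assumptions, not the paper's actual data format: the point is that each sample is a full multi-turn sequence ending in verified completion, and that a small, strict filter replaces bulk accumulation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one LIMI-style training sample: a complete
# multi-turn trajectory rather than an isolated instruction-response pair.
# Field names here are assumptions for illustration only.

@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str   # reasoning text, a tool call, or environment feedback

@dataclass
class Trajectory:
    query: str                        # the agentic task, e.g. drawn from a real scenario or a GitHub PR
    turns: List[Turn] = field(default_factory=list)
    completed: bool = False           # did the trajectory reach successful task completion?

def curate(trajectories: List[Trajectory], budget: int = 78) -> List[Trajectory]:
    """Keep only successful multi-turn trajectories, capped at a small budget.

    A strict quality filter plus a tiny budget stands in for the paper's
    'strategic curation over data accumulation' principle.
    """
    kept = [t for t in trajectories if t.completed and len(t.turns) >= 2]
    # Prefer richer interaction sequences (more tool use and feedback captured).
    kept.sort(key=lambda t: len(t.turns), reverse=True)
    return kept[:budget]
```

Under this sketch, an incomplete or single-turn sample is discarded outright, so the final set stays tiny but every retained sample demonstrates the full loop from task understanding through tool use to completion.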

This extends a pattern now documented across three capability domains: reasoning (LIMO achieved complex math with 817 samples), instruction-following (LIMA achieved alignment with 1,000 examples), and now agency. If base models already contain hidden reasoning ability, the mechanism is likely the same here: curated demonstrations activate latent agentic patterns already embedded through pretraining on code, documentation, and workflow descriptions. The training data doesn't teach agency; it triggers the phase transition from passive language model to active agent.

The practical implication challenges the resource-intensive approach to building agentic systems. If 78 demonstrations outperform 10,000, the bottleneck is data quality and trajectory design, not data volume. Taken together with evidence that models can improve themselves on tasks without verifiable answers, a consistent principle emerges: capability activation requires showing the model what it looks like to use a capability, not exhaustive training.

