Can we train better models on less data?
Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
LESS (Low-rank gradiEnt Similarity Search) selects instruction tuning data by estimating each example's influence on a target capability. Given a handful of examples embodying a specific skill (e.g., reasoning), LESS constructs a gradient datastore of low-dimensional features and selects training data whose gradient signatures are most similar to the target examples.
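The selection step can be pictured as a nearest-neighbor search in gradient-feature space. Below is a minimal, hypothetical sketch (not the paper's implementation): it assumes the low-dimensional gradient features have already been extracted into arrays, and scores each training example by its mean cosine similarity to the target examples' features.

```python
import numpy as np

def select_top_k(train_feats, target_feats, k):
    """Pick the k training examples whose low-dimensional gradient
    features are most cosine-similar to the target examples' features.
    (Illustrative sketch; array names and shapes are assumptions.)"""
    # Normalize rows so dot products become cosine similarities.
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    # Score each training example by its mean similarity to the target set.
    scores = (t @ q.T).mean(axis=1)
    # Indices of the highest-scoring examples, best first.
    return np.argsort(scores)[::-1][:k]

# Toy stand-ins: 1000 candidate examples and 8 target-task examples,
# each compressed to a 64-dim gradient feature.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 64))
target_feats = rng.normal(size=(8, 64))
selected = select_top_k(train_feats, target_feats, k=50)  # the "5%"
```

In the real method the features come from a precomputed gradient datastore, so this search is cheap relative to training.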
The headline result: training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. This is not just efficiency, it's a net improvement. The mechanism: mixed instruction tuning datasets contain examples that actively hinder specific capabilities. As "Does training data format shape reasoning strategy more than domain?" found, wrong-format examples can shift the model's reasoning strategy away from what the target task requires.
Three technical innovations make this practical for LLMs: (1) adaptation to the Adam optimizer (influence formulations traditionally assume SGD), (2) variable-length sequence handling (instruction data varies wildly in length, which derails standard gradient comparisons), and (3) low-rank gradient features that compress the storage and computation to feasible levels.
The transferability finding is striking: smaller models can select useful data for larger models, and models from different families can share data selections. This suggests the gradient-based quality signal captures something about the data's intrinsic fit with a capability — not just its fit with a particular model's current state. The qualitative analysis confirms this: LESS selects data that "goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills."
This connects to the broader pattern that data quality dominates data quantity. Can models improve themselves on tasks without verifiable answers? showed 1000 well-chosen examples can catalyze general self-improvement. Does teacher-refined data always improve student model performance? showed that data needs to match the student. LESS provides the principled mechanism for finding that match.
Source: Training Fine Tuning
Related concepts in this collection
-
Can models improve themselves on tasks without verifiable answers?
Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?
complementary: LESS finds the right 5%, catalyst data shows 1000 examples suffice
-
Does teacher-refined data always improve student model performance?
Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
LESS provides the mechanism for student-aware selection
-
Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
explains why wrong data hurts: format mismatch shifts reasoning strategy
-
Does self-generated training data improve model learning?
Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.
related: data-learner compatibility as the key variable
-
What makes test-time training actually work in practice?
Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
LESS provides a principled mechanism for TTT's first required component (task-similar finetuning): gradient-based influence estimation can identify the most relevant data subset for that stage, making it more efficient and less fragile than heuristic selection
-
Can 78 demonstrations teach agency better than 10000?
Does agentic capability depend on data volume or curation quality? LIMI achieves 73.5% on AgencyBench with 78 samples versus 24-45% for models trained on 10K+, suggesting strategic demonstration design may matter far more than scale.
LIMI's 78-trajectory result is the agentic analog of LESS's finding: strategic curation outperforms volume; LESS provides the mechanism (gradient-based selection) that could identify which agentic trajectories matter most
-
Can 1000 carefully chosen examples align models effectively?
Does alignment require massive datasets, or can strategic curation of small, high-quality examples achieve comparable performance? LIMA tests whether quality beats quantity in post-training.
LIMA demonstrates the target state (1000 curated examples suffice for alignment); LESS provides the mechanism for reaching that state (gradient-based selection operationalizes what "careful curation" means computationally)
Original note title: gradient-based influence estimation identifies the 5% of instruction data that outperforms training on the full dataset