Does gradient-based influence estimation identify which alignment examples actually matter most?
This explores whether gradient-based methods — measuring how individual training examples nudge a model toward target capabilities — can actually pick out the alignment data that does the work, and what "mattering" even means once you look closely.
The corpus suggests the answer is a qualified yes, but with a twist that reframes the whole question. The clearest evidence is LESS ("Can we train better models on less data?"), which uses low-rank gradient features to find the instruction examples whose gradients most resemble those of a target capability. Training on the selected 5% consistently beats training on the full dataset. So gradients do identify what matters, but the striking part is *why*: the gains come not just from keeping good examples, but from dropping examples that actively hurt by pulling the model's reasoning strategy away from what the task needs. "Mattering" turns out to be as much about removing interference as about finding gold.
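The selection step described above can be sketched roughly as follows. This is a minimal illustration, not LESS's actual implementation: it assumes per-example gradient vectors have already been computed, uses a plain random projection as a stand-in for the low-rank gradient features, and scores each training example by cosine similarity to the mean target gradient. The names `random_projection` and `select_top_fraction` are hypothetical.

```python
import numpy as np


def random_projection(grads, dim, seed=0):
    """Project (n_examples, n_params) gradients down to `dim` features.

    Assumption: a Gaussian random projection stands in for whatever
    low-rank feature construction the real method uses.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ proj


def select_top_fraction(train_feats, target_feats, fraction=0.05):
    """Return indices of the training examples most aligned with the target.

    train_feats:  (n_train, d) gradient features of candidate examples
    target_feats: (n_target, d) gradient features of target-capability examples
    """
    eps = 1e-12  # guards against division by zero for all-zero rows
    train_norms = np.linalg.norm(train_feats, axis=1, keepdims=True)
    train_unit = train_feats / (train_norms + eps)

    target_mean = target_feats.mean(axis=0)
    target_unit = target_mean / (np.linalg.norm(target_mean) + eps)

    # Cosine similarity between each training example and the target direction.
    scores = train_unit @ target_unit

    k = max(1, int(round(fraction * len(scores))))
    top = np.argsort(scores)[::-1][:k]  # highest similarity first
    return top, scores
```

Note that the score is signed: examples with negative similarity point away from the target direction, which is one way to read the "actively hurt" examples that selection leaves out.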
That reframing connects to LIMA ("Can careful curation replace massive alignment datasets?"), which reaches a parallel conclusion from the opposite direction: a model fine-tuned on just 1,000 hand-curated examples rivals models trained on orders of magnitude more data. LIMA's explanation is that post-training mostly *activates* capabilities the pretrained model already has rather than teaching new ones. Put the two together and a picture emerges: if alignment is mostly surfacing latent ability, then only a small, well-chosen slice of data is doing real work, and the rest is noise or active interference. Gradient influence is one automated way to find that slice; careful human curation is another.
But the corpus also raises a caution about reading too much into where alignment changes live. Proxy-tuning ("Can decoding-time tuning preserve knowledge better than weight fine-tuning?") shows that direct fine-tuning corrupts knowledge stored in a model's lower layers, while a decoding-time method that leaves the base weights untouched closes 88-91% of the alignment gap and preserves knowledge better. This complicates the influence story: an example's gradient tells you how it shifts weights, but if weight-shifting is itself partly destructive, then "most influential" and "most beneficial" may not be the same thing. An example could matter a lot in gradient magnitude precisely because it's damaging something.
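For readers unfamiliar with decoding-time tuning, here is a rough sketch of the idea for a single decoding step. It assumes the commonly described form of proxy-tuning, in which the difference between a small tuned model's logits and a small untuned model's logits is added to the large base model's logits; the note itself only says that base weights stay untouched, so treat the exact arithmetic as an assumption. The function names are hypothetical, and the logit vectors are assumed to share one vocabulary.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Next-token distribution for one decoding step.

    Assumption: the alignment signal is the logit difference between a
    small tuned model and its untuned counterpart, added to the large
    base model's logits. No weights of the base model are modified.
    """
    offset = tuned_small_logits - untuned_small_logits
    return softmax(base_logits + offset)
```

When the two small models agree, the offset is zero and the base model's distribution comes through unchanged, which is why this kind of shift leaves stored knowledge alone by construction.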
There's also a question of what these methods optimize *toward*. Work on alignment dimensions ("Do different types of alignment serve different conversational goals?") shows alignment isn't one axis: lexical alignment drives task efficiency, while emotional and prosodic alignment drive trust and warmth, and conflating them produces category errors. Gradient influence is defined relative to a target capability, so it inherits whatever you point it at; it can tell you which examples matter for a chosen objective, but it can't tell you that you picked the right objective. And the RLHF-bias findings ("Do LLMs predict persuasion based on actual dialogue or training bias?") are a reminder that alignment data can install systematic, invisible leanings, meaning the examples that "matter most" by an influence metric might be the very ones quietly skewing the model.
The thing you might not have known you wanted to know: the value of influence estimation may lie less in finding the examples that teach the model and more in finding the ones that *poison* it. LESS's biggest lever is subtraction. That flips the intuition behind "which examples matter most" — the most important examples in your alignment set might be the ones you're better off deleting.
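The gap between "most influential" and "most worth keeping" can be made concrete with a small sketch. It assumes each example already has a signed influence score (positive means it moves the model toward the target, negative means away); the helper `split_by_influence` and the threshold are hypothetical, not part of any method cited here.

```python
import numpy as np


def split_by_influence(scores, drop_below=0.0):
    """Partition examples by signed influence and rank them by magnitude.

    Returns (keep, drop, by_magnitude):
      keep         - indices with score >= drop_below
      drop         - indices with score <  drop_below
      by_magnitude - all indices, largest |score| first
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.flatnonzero(scores >= drop_below)
    drop = np.flatnonzero(scores < drop_below)
    by_magnitude = np.argsort(-np.abs(scores))
    return keep, drop, by_magnitude
```

With scores `[0.2, -0.9, 0.5, -0.1]`, the example at index 1 ranks first by magnitude yet lands in the drop set: it matters most, and the right response is to delete it.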
Sources (5 notes)
LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
LIMA demonstrates that a strong pretrained model fine-tuned on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.
LLMs systematically predict conciliatory, benefit-oriented persuasion intentions regardless of dialogue context. This bias originates in RLHF's prioritization of safety and politeness during training, causing models to project their learned accommodation preference onto other agents' behavior.