Does fine-tuning on NLI tasks reduce or amplify frequency bias?

This explores whether training a model specifically on natural language inference (NLI) — the task of judging whether one sentence implies another — actually teaches reasoning, or whether it just sharpens a shortcut the model already had.

The corpus answers clearly: NLI fine-tuning amplifies frequency bias rather than reducing it. The core finding is that fine-tuning makes a model lean harder on which words appear more often in its training corpus (hypernyms like 'animal' are more common than hyponyms like 'beagle') and treat that frequency signal as a proxy for entailment. The tell is adversarial cases: when the frequency pattern points one way but the actual entailment label points the other, fine-tuned models do worse than they did before fine-tuning. The shortcut didn't get corrected by training; it got entrenched (see: Does fine-tuning on NLI teach inference or amplify shortcuts?).
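The adversarial test described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the underlying study: the toy corpus, the word pairs, and the names `frequency_heuristic`, `split_by_frequency_agreement`, and `accuracy` are all assumptions made up for the example. It partitions labeled pairs into those where a pure frequency shortcut agrees with the gold label and those where it contradicts it.

```python
from collections import Counter

# Hypothetical toy corpus standing in for pretraining data (an assumption).
CORPUS = (
    "the animal ran . an animal slept . the animal ate . "
    "the beagle barked . a dog ran . the dog slept"
).split()
FREQ = Counter(CORPUS)  # Counter returns 0 for unseen words such as "canine"

# Hypothetical (premise_term, hypothesis_term, gold_label) triples.
EXAMPLES = [
    ("beagle", "animal", "entailment"),      # hypernym is more frequent
    ("beagle", "dog", "entailment"),         # hypernym is more frequent
    ("animal", "beagle", "non-entailment"),  # hyponym is less frequent
    ("dog", "canine", "entailment"),         # hypernym is LESS frequent here
    ("canine", "dog", "non-entailment"),     # hyponym is MORE frequent here
]

def frequency_heuristic(premise_term, hypothesis_term, freq=FREQ):
    """The shortcut: guess entailment when the hypothesis term is more frequent."""
    if freq[hypothesis_term] > freq[premise_term]:
        return "entailment"
    return "non-entailment"

def split_by_frequency_agreement(examples, freq=FREQ):
    """Separate examples where the shortcut matches the gold label from those where it fails."""
    consistent, adversarial = [], []
    for premise_term, hypothesis_term, gold in examples:
        guess = frequency_heuristic(premise_term, hypothesis_term, freq)
        bucket = consistent if guess == gold else adversarial
        bucket.append((premise_term, hypothesis_term, gold))
    return consistent, adversarial

def accuracy(predict, examples):
    """Fraction of examples where predict(premise_term, hypothesis_term) equals the gold label."""
    if not examples:
        return float("nan")
    hits = sum(predict(p, h) == gold for p, h, gold in examples)
    return hits / len(examples)

consistent, adversarial = split_by_frequency_agreement(EXAMPLES)
print(len(consistent), "frequency-consistent,", len(adversarial), "adversarial")
```

To probe a real model, one would pass its prediction function to `accuracy` in place of `frequency_heuristic` and compare the two subsets before and after fine-tuning. A drop on the adversarial subset alongside a gain on the consistent subset is the pattern the note describes.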

Why would more training make a bias stronger instead of weaker? A complementary line in the corpus suggests the bias isn't really created by fine-tuning at all: it's planted during pretraining and only nudged afterward. A causal study using random-seed variation and cross-tuning found that models sharing a pretrained backbone show similar bias patterns regardless of the data they are fine-tuned on; fine-tuning modulates, it doesn't author (see: Where do cognitive biases in language models come from?). Read together, the two notes tell one story: the frequency prior lives deep in the weights, and a task like NLI, where frequency happens to correlate with the right answer most of the time, gives the model a reason to rely on it even more.

This is an instance of a broader pattern worth seeing: strong training-time priors override what's actually in front of the model. Language models routinely ignore information in their context when parametric knowledge from training is confident, and plain prompting can't talk them out of it; you need to intervene in the representations themselves (see: Why do language models ignore information in their context?). NLI frequency bias is the same dynamic at the level of a single inference: the corpus-level prior outshouts the semantic relationship the task is supposed to test.

The amplification effect also isn't unique to NLI or to supervised fine-tuning. Reinforcement learning shows a structurally similar move: RL post-training latches onto one dominant format already present in the pretraining distribution and suppresses the alternatives, often within the first epoch, and the winning format depends on model scale rather than on which format performs best (see: Does RL training collapse format diversity in pretrained models?). Different objective, same gravitational pull toward what the base model already does. And the reason these shortcuts survive is that surface statistics genuinely capture a lot: models that ace easy NLI nonetheless fail systematically on deeper structure, misidentifying embedded clauses and other complex constructions as syntactic depth increases. That is evidence that statistical pattern-matching and real grammatical understanding are different things wearing the same score (see: Why do large language models fail at complex linguistic tasks?).

The thing you didn't know you wanted to know: a benchmark improvement after fine-tuning can mean the model learned the *task* or learned a *correlate* of the task, and the two look identical until you build adversarial cases that pry them apart. NLI is a clean place to catch the difference — but the lesson generalizes to almost any fine-tuning result you're tempted to trust.

Sources 5 notes

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of fine-tuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
