LLMs are Frequency Pattern Learners in Natural Language Inference

Paper · arXiv 2505.21011 · Published May 27, 2025
Tags: Natural Language Inference · Cognitive Models · Latent · Linguistics, NLP, NLU

While fine-tuning LLMs on NLI corpora improves their inferential performance, the underlying mechanisms driving this improvement remain largely opaque. In this work, we conduct a series of experiments to investigate what LLMs actually learn during fine-tuning. We begin by analyzing predicate frequencies in premises and hypotheses across NLI datasets and identify a consistent frequency bias: for positive instances, predicates in hypotheses occur more frequently than those in premises. To assess the impact of this bias, we evaluate both standard and NLI fine-tuned LLMs on bias-consistent and bias-adversarial cases. We find that LLMs exploit the frequency bias for inference and perform poorly on adversarial instances. Furthermore, fine-tuned LLMs exhibit significantly increased reliance on this bias, suggesting that they learn these frequency patterns from the datasets. Finally, we compute the frequencies of hyponyms and their corresponding hypernyms from WordNet, revealing a correlation between frequency bias and textual entailment. These findings help explain why learning frequency patterns can enhance model performance on inference tasks.
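The frequency-bias measurement described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the predicate pairs and unigram counts below are invented stand-ins for statistics the paper derives from NLI corpora.

```python
from collections import Counter

# Hypothetical corpus-level predicate frequencies (illustrative numbers only;
# the paper computes these from NLI datasets, not shown here).
corpus_freq = Counter({"talk": 900, "whisper": 40, "move": 1200, "run": 300})

# Toy positive (entailment) instances as (premise predicate, hypothesis predicate).
positive_pairs = [("whisper", "talk"), ("run", "move")]

def bias_consistent(premise_pred: str, hyp_pred: str, freq: Counter) -> bool:
    """An instance is bias-consistent when the hypothesis predicate
    occurs more frequently in the corpus than the premise predicate."""
    return freq[hyp_pred] > freq[premise_pred]

# Fraction of positive instances that follow the frequency bias.
rate = sum(bias_consistent(p, h, corpus_freq)
           for p, h in positive_pairs) / len(positive_pairs)
print(f"bias-consistent fraction: {rate:.2f}")
```

On real NLI data, a fraction well above 0.5 for positive instances would indicate the bias the paper reports; bias-adversarial evaluation sets are built from the instances where this check fails.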

To further investigate the relationship between predicate frequency and entailment relation, we analyze the frequency of hyponym–hypernym pairs extracted from WordNet (Miller, 1994). These pairs represent a specific form of upward semantic entailment, namely generalization, where a more specific concept (hyponym) entails its corresponding general one (hypernym). For example, “whisper” is a hyponym of “talk”, so the statement “X whisper to Y” semantically entails “X talk to Y”. Table 5 reports the average frequency of hyponyms and hypernyms separately. Hypernyms occur more frequently than their corresponding hyponyms, indicating that more general concepts are more frequent in natural language. This finding suggests that frequency bias may serve as a proxy for the generalization gradient, whereby inferences from lower-frequency to higher-frequency predicates reflect a generalization from specific to more abstract concepts. In NLI datasets, we also observe that when the label is Entail, hypotheses often contain more hypernyms than the corresponding premises, as shown in Appendix C. This suggests that most samples in NLI datasets require inference from more specific concepts to more general ones.
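The hyponym–hypernym frequency comparison can be illustrated with a small sketch. The pairs below are a hand-picked toy fragment rather than the full set extracted from WordNet, and the frequencies are invented placeholders for corpus counts, not the values behind Table 5.

```python
from statistics import mean

# Toy hyponym -> hypernym verb pairs; the paper extracts such pairs
# from WordNet itself rather than listing them by hand.
pairs = [("whisper", "talk"), ("sprint", "run"), ("gulp", "drink")]

# Hypothetical corpus frequencies (illustrative numbers only).
freq = {"whisper": 40, "talk": 900, "sprint": 15, "run": 600, "gulp": 8, "drink": 450}

avg_hyponym = mean(freq[hypo] for hypo, _ in pairs)
avg_hypernym = mean(freq[hyper] for _, hyper in pairs)
print(f"avg hyponym frequency:  {avg_hyponym:.1f}")
print(f"avg hypernym frequency: {avg_hypernym:.1f}")

# The direction of the gap mirrors the paper's finding: more general
# concepts (hypernyms) are more frequent in natural language.
assert avg_hypernym > avg_hyponym
```

Since each hyponym entails its hypernym, a systematic hypernym-frequency advantage means that low-to-high-frequency inference tracks specific-to-general entailment, which is the proposed explanation for why the frequency bias correlates with the Entail label.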

LLMs learn frequency biases from datasets during training, and these biases align with a generalization gradient that supports entailment from specific to more general concepts.