INQUIRING LINE

Why do older datasets show higher LLM performance than newer ones?

This explores whether the age of a dataset is really what drives LLM scores, or whether 'older = higher performance' is a stand-in for something else — how heavily that data overlaps with what the model was trained on.


Older benchmarks tend to flatter LLMs while newer ones expose them, but the corpus's answer is that age itself isn't the cause. What actually moves performance is distributional coverage: how well-represented a kind of data is in the training corpus. Older datasets often score higher not because they're old, but because they (or near-duplicates of them) were absorbed during pretraining, so the model may be recalling rather than reasoning.

The sharpest evidence is a case that runs the *opposite* direction, which is exactly why it's useful. In 'Why do language models struggle with historical legal cases?', models do *worse* on historical legal cases than on modern ones, because recent cases are over-represented in training and the model ends up with shallower representations of older precedent. Same underlying lever (training-corpus representation), opposite surface result (here, newer data wins). That tension is the tell: when you see 'older datasets perform better,' you're usually seeing 'data the model saw more of performs better,' and age is just a proxy for exposure.

The distributional framing gets formalized elsewhere. 'Does ordering training data by rarity actually improve language models?' reframes difficulty itself as distance from the pretraining distribution rather than conceptual hardness: rare data is hard because it's under-covered, not because it's complex. And 'Can we predict where language models will fail?' makes the prediction explicit: tasks with low-probability target outputs are systematically harder regardless of logical simplicity. A newer dataset is more likely to sit further from the training distribution and to contain lower-probability targets, so scores drop, and it can look like the model 'got worse' when really the benchmark just stopped overlapping with training.
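To make 'distance from the training distribution' concrete, here is a minimal sketch that scores a benchmark item by the fraction of its word trigrams that also appear in a reference corpus. Everything in it is an assumption for illustration: the function names `ngrams` and `coverage`, the toy corpus, and the choice of whitespace-tokenized trigrams. None of it comes from the notes cited here, and real overlap checks work on far larger corpora with proper tokenization.

```python
def ngrams(text, n=3):
    """Return the list of lowercase word n-grams in `text` (whitespace tokenized)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(item, corpus_ngrams, n=3):
    """Fraction of the item's n-grams that also occur in the corpus (0.0 if none exist)."""
    grams = ngrams(item, n)
    if not grams:
        return 0.0
    seen = sum(1 for g in grams if g in corpus_ngrams)
    return seen / len(grams)


# Hypothetical stand-in for a training corpus.
training_corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a stitch in time saves nine",
]
corpus_ngrams = set()
for doc in training_corpus:
    corpus_ngrams.update(ngrams(doc))

# An item that largely overlaps the corpus versus one that does not.
old_item = "the quick brown fox jumps over the fence"
new_item = "a purple drone hovers above the flooded quarry"

old_score = coverage(old_item, corpus_ngrams)
new_score = coverage(new_item, corpus_ngrams)
print(f"old item coverage: {old_score:.2f}")
print(f"new item coverage: {new_score:.2f}")
```

In this toy setup the first item shares most of its trigrams with the corpus and the second shares none. That is the pattern the notes describe: the score tracks exposure, and an item's age only matters to the extent that it correlates with exposure.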

There's a generalization angle worth knowing too. 'Can smaller models outperform their LLM teachers with enough data?' shows student models beating their teacher once they were trained on a large enough augmented dataset, because that exposed them to a *broader* input distribution. The lesson cuts both ways: performance is downstream of distribution coverage, so the fix for inflated old-benchmark scores isn't newer data per se; it's data that genuinely widens what the model has seen.

One honest gap: this corpus doesn't contain a dedicated note on benchmark *contamination* (test sets leaking into training), which is the most common direct explanation for the old-vs-new gap you're asking about. What it does give you is the deeper principle that makes contamination matter — performance tracks distributional proximity, and dataset age is mostly a noisy signal for it.


Sources (4 notes)

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.
