Does ordering training data by rarity actually improve language models?
Can sorting rare sentences before common ones during fine-tuning help LLMs learn more effectively? This challenges the intuition that models should see easy examples first.
Curriculum Textual Frequency Training (CTFT) is the third component of the Adam's Law framework, and it inverts the usual curriculum-learning direction. Standard curriculum learning sorts examples easy-to-hard along a conceptual difficulty axis: simple arithmetic before multi-step proofs, short translations before long ones. CTFT instead sorts examples by sentence-level corpus frequency and feeds the model rare sentences first and common sentences last. Rare comes first because rarity is where the model's prior is weak; saving the dense, well-modeled region for the end stabilizes the training trajectory.
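A minimal sketch of the ordering step, assuming a caller-supplied frequency estimator. The toy unigram-count proxy below is an illustrative assumption, not the paper's estimator; only the low-to-high frequency sort reflects the CTFT ordering described above.

```python
from typing import Callable, List

def ctft_order(sentences: List[str],
               frequency_of: Callable[[str], float]) -> List[str]:
    """CTFT ordering: rarest sentences (lowest estimated corpus
    frequency) first, most common last."""
    return sorted(sentences, key=frequency_of)  # ascending frequency

# Toy unigram-count proxy for sentence frequency -- an assumption for
# illustration only, not the paper's frequency estimate.
toy_counts = {"the": 1000, "cat": 50, "sat": 40, "gallimaufry": 1}

def toy_frequency(sentence: str) -> float:
    words = sentence.lower().split()
    prod = 1.0
    for w in words:
        prod *= toy_counts.get(w, 1)
    return prod ** (1.0 / max(len(words), 1))  # geometric-mean count

print(ctft_order(["the cat sat", "gallimaufry cat"], toy_frequency))
# -> ['gallimaufry cat', 'the cat sat']  (rare-to-common)
```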
The reframe matters more than the technique. For an LLM, "easy" and "hard" are not properties of the concept being expressed; they are properties of distance from the pre-training distribution. A formally simple sentence in a rare register can be harder for the model than a complex sentence in a textbook register. This connects to "Does gradually tightening token budgets beat fixed budget training?": both findings argue that curriculum design for LLMs is fundamentally about managing distributional pressure, not pedagogical scaffolding. It also extends "Does training data format shape reasoning strategy more than domain?": format and frequency are both statistical-position properties that drive learning more than the semantic content of the examples.
The methodological lesson generalizes beyond CTFT itself. Any curriculum-design choice for LLMs that uses the human-facing "easy/hard" gloss without checking distributional position is partly mis-specified. The replacement frame is "near/far from prior" — the model finds near-prior examples easy not because they are simple but because they are dense, and far-prior examples hard not because they are complex but because they are sparse. CTFT's contribution is operationalizing that frame into a concrete sentence-frequency ordering, with story-completion distillation (TFD) as the closed-source workaround for estimating frequencies on models whose training data we cannot see directly.
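The paper's story-completion distillation (TFD) procedure is not reproduced here. As an illustrative stand-in for "distance from prior," the sketch below uses mean per-token negative log-likelihood under a small public reference model (GPT-2 via the Hugging Face transformers library, both assumptions): higher NLL is treated as rarer, hence earlier in the CTFT order.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reference model as a stand-in prior (an assumption for illustration).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_nll(sentence: str) -> float:
    """Mean per-token negative log-likelihood under the reference
    model; a crude proxy for how far a sentence sits from the prior."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()  # cross-entropy averaged over tokens

sentences = [
    "The cat sat on the mat.",
    "Quoth the palimpsest, ultracrepidarian gloaming.",
]
# CTFT order under this proxy: far-from-prior (high NLL) first.
ctft_ordered = sorted(sentences, key=mean_nll, reverse=True)
print(ctft_ordered)
```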
Source: Natural Language Inference Paper: Adam's Law: Textual Frequency Law on Large Language Models
Related concepts in this collection
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  Relation: curriculum design as distributional pressure management
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Relation: format and frequency both override domain content
Original note title: curriculum textual frequency training reverses easy-to-hard intuition by ordering data low-to-high frequency