Why does curriculum order matter when information theory says data order is irrelevant?

This explores a real tension: classical information theory treats a dataset's total information as order-invariant, yet curriculum learning keeps showing that the sequence in which examples are fed changes the outcome. The corpus's answer is that order matters because the learner isn't a fixed container; it's a moving target that reshapes itself as it reads.

Why should sequencing matter at all when the math says a dataset carries the same information no matter how you shuffle it? The resolution running through the corpus: information theory's order-invariance assumes a *fixed* receiver absorbing a joint distribution. A model under gradient descent is not fixed; it's a path-dependent system whose current state decides what the next example can even teach it. Order matters because the learner is non-stationary, not because the data's information content changes.
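The path-dependence described above can be seen in a toy setting. The sketch below is only an illustration, not taken from any of the notes: it assumes a one-parameter model, a squared-error loss, and a hypothetical helper named `sgd_final_weight`. The same three examples are presented in two orders; the dataset's mean is identical either way, but the weight reached after one pass of per-example SGD is not.

```python
def sgd_final_weight(examples, lr=0.5, w0=0.0):
    """One pass of per-example SGD on the loss 0.5 * (w - x) ** 2.

    Hypothetical toy learner: each update depends on the current w,
    so the final state depends on the order of the examples.
    """
    w = w0
    for x in examples:
        grad = w - x      # gradient of the loss at the current state
        w -= lr * grad    # step size is fixed; the direction depends on w
    return w


data = [0.0, 4.0, 10.0]

forward = sgd_final_weight(data)                    # 6.0
backward = sgd_final_weight(list(reversed(data)))   # 2.25

# An order-free statistic of the same data, for contrast.
mean_forward = sum(data) / len(data)
mean_backward = sum(reversed(data)) / len(data)
```

With a finite learning rate and a single pass, later examples weigh more heavily on the final state than earlier ones, which is enough to make the outcome order-dependent even though nothing about the data itself has changed.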

Several notes make this concrete from different angles. The sharpest reframing is that curriculum isn't really about "easy then hard" pedagogy at all; it's about distance from where the model already is. Rare-to-common ordering beats standard curricula because rarity signals a gap in the model's existing distribution, not conceptual difficulty (Does ordering training data by rarity actually improve language models?). The same logic appears in reverse: teacher-refined data that is objectively higher quality *degrades* a student when it lands beyond the student's current learning frontier (Does teacher-refined data always improve student model performance?). "Better data" and "learnable-right-now data" are different things, and only the second one survives contact with a moving learner.
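As a rough illustration of rare-first ordering, the sketch below sorts examples by how surprising their tokens are under a reference corpus. The notes do not say how rarity is measured, so the scoring here is an assumption: average unigram surprisal with add-one smoothing, computed over whitespace tokens. The names `rarity_score` and `rare_first` are hypothetical.

```python
import math
from collections import Counter


def rarity_score(text, ref_counts, ref_total):
    """Average token surprisal under reference unigram counts (assumed proxy for rarity)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    vocab = len(ref_counts) + 1  # add-one smoothing keeps unseen tokens finite
    surprisals = [
        -math.log((ref_counts[tok] + 1) / (ref_total + vocab)) for tok in tokens
    ]
    return sum(surprisals) / len(surprisals)


def rare_first(examples, reference_corpus):
    """Order examples from most to least rare relative to the reference corpus."""
    ref_counts = Counter(
        tok for doc in reference_corpus for tok in doc.lower().split()
    )
    ref_total = sum(ref_counts.values())
    return sorted(
        examples,
        key=lambda ex: rarity_score(ex, ref_counts, ref_total),
        reverse=True,
    )


reference = ["the cat sat", "the dog sat", "the cat ran"]
examples = ["the cat sat", "quantum chromodynamics lattice", "the dog ran"]
ordered = rare_first(examples, reference)
```

In practice the reference would stand in for whatever the model has already seen, and a model-based score such as per-token loss might be a closer proxy for "distance from where the model already is" than raw token frequency.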

The most striking case is when order decides whether a signal is informative *at all*. Running imitation training first and verifiable-reward training second beats either alone, because the imitation phase produces reasonable attempts that make the later reward signal meaningful; reorder it and the reward has nothing to sharpen (Does sequencing imitation then exploration training improve reasoning?). Order also has mechanical, almost physical effects on the model's internal dynamics: training structured tasks before creative ones prevents entropy collapse from wrecking open-ended ability, a 6.2% gain over joint training that pure data content can't explain (Does training order reshape how models handle different task types?). And clever sequencing can manufacture supervision that the raw data never contained: sliding a reasoning start state backward from near-completion turns plain outcome feedback into step-level guidance (Can curriculum learning approximate expensive process supervision?).
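The start-state sliding idea can be sketched as a schedule over one worked demonstration. This is a minimal illustration under stated assumptions, not the method's actual implementation: the demonstration is assumed to be a list of step strings, and the helpers `reverse_curriculum` and `localize_failure` are hypothetical. Each stage hands the model a shorter prefix of the demonstration, and only the final outcome is assumed to be checked.

```python
def reverse_curriculum(question, demo_steps):
    """Build stages that start near completion and slide the start state backward."""
    n = len(demo_steps)
    stages = []
    for given in range(n - 1, -1, -1):  # n-1 steps given first, then fewer
        prefix = [question] + demo_steps[:given]
        stages.append(
            {
                "given_steps": given,
                "prompt": "\n".join(prefix),
                "steps_to_generate": n - given,
            }
        )
    return stages


def localize_failure(success_by_given, n_steps):
    """Return the index of the first demo step the model could not produce.

    success_by_given maps 'number of steps given' to whether the final
    outcome was correct. Walking from easiest to hardest, the first failing
    stage implicates the step the model newly had to generate on its own.
    """
    for given in range(n_steps - 1, -1, -1):
        if not success_by_given.get(given, False):
            return given
    return None  # succeeded from every start state


steps = ["compute 3 * 4 = 12", "add 5 to get 17", "answer: 17"]
stages = reverse_curriculum("What is 3 * 4 + 5?", steps)
weak_step = localize_failure({2: True, 1: True, 0: False}, len(steps))
```

The point of the schedule is that outcome-only feedback, compared across adjacent stages, says something about individual steps: succeeding with one step given but failing with none points at the first step.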

A less obvious point: this is the *opposite* of how models treat order at inference time. By default, LLMs largely ignore the temporal order of a user's interaction history when ranking, and order-sensitivity has to be prompted back in (Why do language models ignore temporal order in ranking?). So order is nearly invisible to a frozen model reading a sequence, yet decisive for a model *learning* from one. The corpus even hints at a deeper version of this asymmetry: format and presentation shape reasoning strategy far more than the underlying content does (Does training data format shape reasoning strategy more than domain?), and a curriculum can be built from activation-sparsity signals alone, without any external difficulty labels (Can representation sparsity order few-shot demonstrations effectively?). Information theory measures what's in the data; curriculum measures what a particular learner, in a particular state, can pick up next.
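Ordering demonstrations from sparsity signals, without difficulty labels, might look like the sketch below. It assumes last-layer activations are already available as plain lists of floats, one per demonstration, and it measures sparsity as the fraction of near-zero entries. Both the threshold `eps` and that definition of sparsity are assumptions made for illustration; the function names are hypothetical.

```python
def sparsity(activations, eps=1e-6):
    """Fraction of activation entries whose magnitude is at most eps (assumed metric)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return near_zero / len(activations)


def order_sparse_to_dense(demos, activations_by_demo, eps=1e-6):
    """Sort demonstrations from most sparse to most dense activations."""
    return sorted(
        demos,
        key=lambda d: sparsity(activations_by_demo[d], eps),
        reverse=True,
    )


# Toy activations standing in for real last-layer outputs.
activations_by_demo = {
    "demo_a": [0.0, 0.0, 0.0, 1.2],  # sparsity 0.75
    "demo_b": [0.9, 2.1, 0.3, 4.0],  # sparsity 0.0
    "demo_c": [0.0, 1.5, 0.0, 2.2],  # sparsity 0.5
}
ordered_demos = order_sparse_to_dense(list(activations_by_demo), activations_by_demo)
```

The ordering signal comes entirely from the model's own representations, which is what lets this kind of curriculum skip external difficulty annotation.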

Sources 8 notes

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Why do language models ignore temporal order in ranking?

LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.

Does training data format shape reasoning strategy more than domain?

Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
