Why does the order of training examples matter for what models learn?

This explores why the *sequence* in which a model sees training examples — not just which examples — changes what it ends up learning, spanning curriculum order, the structure inside individual examples, and how order reshapes a model's internal dynamics.

The corpus answers on two levels: the order of examples across training (curriculum), and the order of steps inside a single example (structure). Both turn out to matter, but for different reasons.

At the curriculum level, the surprising finding is that the intuitive 'easy-to-hard' ordering can be the wrong instinct. One line of work flips it: fine-tuning on rare data *first* beats standard curricula, because what matters isn't pedagogical difficulty but distance from the pre-training distribution, and rarity signals where the model is weakest [Does ordering training data by rarity actually improve language models?]. A complementary thread orders few-shot demonstrations by last-layer activation sparsity, from sparse (harder) to dense (easier), with no external difficulty labels needed [Can representation sparsity order few-shot demonstrations effectively?]. And in reinforcement learning, the order across *task types* mechanically reshapes the model's internals: training structured tasks before creative ones prevents entropy collapse from damaging open-ended ability, a reported 6.2% gain over training everything jointly [Does training order reshape how models handle different task types?].
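The rare-first idea can be sketched as a sort by a rarity score. This is only an illustration: the notes do not say how rarity is measured, so `unigram_rarity` (mean negative log-probability per token under a smoothed unigram model of a reference corpus) and `order_rare_first` are hypothetical stand-ins, and the toy corpus is made up.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def unigram_rarity(reference_corpus: Sequence[str]) -> Callable[[str], float]:
    """Build a toy rarity scorer from whitespace-token counts in a reference corpus.

    Assumption: rarity = mean negative log-probability per token under an
    add-one-smoothed unigram model. A real setup would more plausibly use the
    base model's own likelihood; this stand-in just keeps the sketch runnable.
    """
    counts = Counter(tok for text in reference_corpus for tok in text.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def rarity(text: str) -> float:
        tokens = text.split()
        if not tokens:
            return 0.0
        log_probs = [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]
        return -sum(log_probs) / len(tokens)

    return rarity


def order_rare_first(examples: Sequence[str], rarity: Callable[[str], float]) -> List[str]:
    """Sort examples from most to least rare; ties keep their original order."""
    return sorted(examples, key=rarity, reverse=True)


# Toy stand-in for "what the model already saw" and a small fine-tuning set.
reference = ["the cat sat", "the dog sat", "the cat ran"]
train = ["the cat sat", "quantum flux decays", "the dog ran"]
print(order_rare_first(train, unigram_rarity(reference)))
```

The example made entirely of unseen tokens is scheduled first, and the one that appears verbatim in the reference corpus comes last. Swapping in a different scorer changes the curriculum without touching the ordering code.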

Why does order leave such a lasting mark? Partly because each phase of training narrows what later phases can still learn. Staying close to the base model's distribution preserves 'plasticity' (the capacity to keep adapting), while drifting far early on causes later learning to stall when the domain shifts [Does staying close to the base model preserve learning ability?]. Order can also be actively destructive: training on near-impossible problems early teaches degenerate shortcuts that then *contaminate* capabilities the model already had [Do overly hard RLVR samples actually harm model capabilities?], and RL tends to lock onto a single dominant format within the first epoch, suppressing alternatives [Does RL training collapse format diversity in pretrained models?]. Sequence isn't neutral exposure: early steps set the channel the rest flows through.
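The note on overly hard RLVR samples attributes the damage to group-relative normalization. A minimal sketch of that arithmetic follows, assuming the common GRPO-style form (reward minus group mean, divided by group standard deviation). The function name, the `eps` term, and the binary rewards are illustrative assumptions, not details taken from the note.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# One accidental success among 8 rollouts vs. one among 64 on a harder prompt.
for group_size in (8, 64):
    rewards = [1.0] + [0.0] * (group_size - 1)
    adv = group_relative_advantages(rewards)
    print(group_size, round(adv[0], 2), round(adv[1], 2))
```

With one success in a group of n rollouts, the successful rollout's advantage works out to about sqrt(n - 1): roughly 2.65 for n = 8 and 7.94 for n = 64. The rarer the lucky success, the harder it is reinforced, regardless of whether the reasoning behind it was sound.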

The deeper twist is that order matters *inside* examples too, and here it matters more than content. Models trained on chain-of-thought tolerate 50% corrupted numbers (3.2% accuracy loss) but degrade far more when the steps are shuffled (13.3% loss): what distills is the logical architecture, how steps sequence and connect, not factual correctness [What do models actually learn from chain-of-thought training?]. The same lesson shows up where models trained on systematically irrelevant reasoning traces maintain solution accuracy and sometimes generalize better out of distribution, suggesting the trace works as computational scaffolding rather than meaningful content [Do reasoning traces need to be semantically correct?]. For in-context learning of sequential decisions, models need full or partial trajectories from the same environment, not isolated examples: the ordered structure *is* the signal [Why do trajectories matter more than individual examples for in-context learning?].
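The two perturbations contrasted above can be sketched as simple transforms over a list of reasoning steps. This is a hedged illustration, not the ablation code behind the cited numbers: `corrupt_numbers`, `shuffle_steps`, the 0 to 999 replacement range, and the toy trace are all assumptions made for the example.

```python
import random
import re
from typing import List, Sequence


def corrupt_numbers(steps: Sequence[str], fraction: float = 0.5, seed: int = 0) -> List[str]:
    """Replace roughly `fraction` of the integers in each step with random ones.

    Step order and all non-numeric text are left untouched. A replacement can
    coincidentally equal the original number.
    """
    rng = random.Random(seed)

    def maybe_replace(match: "re.Match[str]") -> str:
        if rng.random() < fraction:
            return str(rng.randint(0, 999))
        return match.group(0)

    return [re.sub(r"\d+", maybe_replace, step) for step in steps]


def shuffle_steps(steps: Sequence[str], seed: int = 0) -> List[str]:
    """Return the same steps in a random order, leaving their content intact."""
    rng = random.Random(seed)
    shuffled = list(steps)  # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    return shuffled


trace = [
    "Start with 12 apples.",
    "Give away 5, leaving 7.",
    "Buy 3 more, so 7 + 3 = 10.",
]
print(corrupt_numbers(trace, fraction=0.5, seed=1))
print(shuffle_steps(trace, seed=1))
```

The first transform damages the facts but keeps the structure; the second keeps every fact but breaks the sequence. Training on each variant separately is what lets the two effects be compared.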

A less obvious point: a model's sensitivity to order is partly latent and switchable. By default LLMs largely *ignore* temporal order when ranking from interaction histories, yet recency-focused prompts and in-context examples reactivate that order-sensitivity with no retraining at all [Why do language models ignore temporal order in ranking?]. So 'order matters' isn't one phenomenon but several: which examples come first reshapes internal dynamics and forecloses later learning, the ordering of steps within an example is the actual thing being taught, and whether a model even attends to order can be flipped on like a switch. (For the opposite extreme, how little data you sometimes need, a single well-chosen example can unlock latent reasoning [Can a single training example unlock mathematical reasoning?].)
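A recency-focused prompt might look something like the sketch below. The wording, the function name `build_ranking_prompt`, and the `emphasize_recency` flag are all hypothetical; the notes do not give the actual prompt text, so treat this as one plausible shape rather than the method used.

```python
from typing import Sequence


def build_ranking_prompt(history: Sequence[str], candidates: Sequence[str],
                         emphasize_recency: bool = False) -> str:
    """Format an interaction history (oldest first) and candidates into a ranking prompt."""
    lines = ["The user interacted with these items, listed oldest to newest:"]
    lines += [f"{i}. {item}" for i, item in enumerate(history, start=1)]
    if emphasize_recency and history:
        # Hypothetical wording: restate the latest interaction explicitly.
        lines.append(f"Note that the most recent interaction was: {history[-1]}.")
    lines.append("Rank the following candidates by how likely the user is to pick them next:")
    lines += [f"- {c}" for c in candidates]
    return "\n".join(lines)


history = ["running shoes", "water bottle", "trail map"]
candidates = ["hiking boots", "yoga mat"]
print(build_ranking_prompt(history, candidates, emphasize_recency=True))
```

The only difference between the two modes is one added sentence, which is the point: the order information is already in the prompt, and the extra line only directs attention to it.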

Sources 11 notes

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

What do models actually learn from chain-of-thought training?

Controlled ablations show models tolerate 50% corrupted numbers (3.2% accuracy loss) but fail under step shuffling (13.3% loss). What distills across reasoning demonstrations is logical architecture—how steps sequence and connect—not factual accuracy.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Why do language models ignore temporal order in ranking?

LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
