Why do smaller and larger models converge on different output formats?

This note explores why model size changes which output *shape* a model settles into (not whether the answer is right, but which of several learned formats wins out) and what that reveals about how scale interacts with training.

The corpus has a surprisingly direct answer, and it isn't about capability. Controlled experiments on RL post-training show that a model already carries *multiple* candidate formats from pretraining. RL doesn't invent a new one: it amplifies a single dominant format within the first epoch and suppresses the rest. The striking part is that which format wins depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). So convergence on different formats at different sizes is largely an artifact of which latent format each scale happened to weight most heavily before RL ever began.
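One way to make "collapse" concrete is to label a batch of sampled outputs by format before and after RL, then compare the format shares and their entropy. The sketch below is a minimal illustration under stated assumptions: the format labels (`code`, `prose`, `latex`) and the sample counts are hypothetical, and the helper names `format_shares` and `entropy_bits` are invented here rather than taken from the cited experiments.

```python
import math
from collections import Counter

def format_shares(labels):
    """Return the fraction of samples that fall into each format label."""
    if not labels:
        raise ValueError("need at least one labeled sample")
    counts = Counter(labels)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}

def entropy_bits(shares):
    """Shannon entropy (in bits) of a format distribution; lower means more collapsed."""
    return -sum(p * math.log2(p) for p in shares.values() if p > 0)

# Hypothetical labels for illustration only, not data from the cited experiments.
before_rl = ["code"] * 5 + ["prose"] * 3 + ["latex"] * 2
after_rl = ["code"] * 9 + ["prose"] * 1

shares_before = format_shares(before_rl)
shares_after = format_shares(after_rl)
dominant_after = max(shares_after, key=shares_after.get)

print("entropy before RL:", round(entropy_bits(shares_before), 3))
print("entropy after RL: ", round(entropy_bits(shares_after), 3))
print("dominant format after RL:", dominant_after)
```

A falling entropy alongside one share approaching 1 is the signature described above. Running the same measurement on checkpoints of different sizes would show whether the dominant label differs by scale.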

Underneath that sits a difference in how probability mass is distributed. Larger models concentrate their probability on a few preferred outputs, which is why, counterintuitively, smaller models around 500M parameters produce *more* distinct samples within a fixed sampling budget (see "Why aren't bigger models better for generating diverse outputs?"). A peakier distribution doesn't just reduce diversity; it also shapes which single format dominates when training collapses the alternatives. Small and large models are effectively starting from different-shaped distributions, so the format that survives the collapse differs.
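The link between peakiness and diversity can be checked with a small calculation. For a categorical distribution with probabilities p_i, the expected number of distinct outcomes in n independent draws is the sum over i of 1 - (1 - p_i)^n. The sketch below applies that formula to two distributions built from the same logits at different softmax temperatures. Treating temperature as a stand-in for model size is an assumption made purely for illustration, as are the Zipf-like logits and the vocabulary of 50 outputs.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives a peakier distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_distinct(probs, n_draws):
    """Expected number of distinct outcomes in n_draws independent samples."""
    return sum(1.0 - (1.0 - p) ** n_draws for p in probs)

# Hypothetical Zipf-like logits over 50 candidate outputs (illustration only).
logits = [-math.log(rank + 1) for rank in range(50)]

flatter = softmax(logits, temperature=1.5)  # stand-in for a smaller model
peakier = softmax(logits, temperature=0.5)  # stand-in for a larger model

budget = 20
print("flatter:", round(expected_distinct(flatter, budget), 2))
print("peakier:", round(expected_distinct(peakier, budget), 2))
```

With the same sampling budget, the flatter distribution yields more distinct outputs in expectation, which matches the direction of the diversity finding without saying anything about its magnitude in real models.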

The deeper reframe is that output format and actual knowledge are *separable*. A 1.5B model with only LoRA post-training can match much larger RL-trained models on reasoning, which suggests RL mostly teaches the *organization* of the output rather than new facts (see lora-based-reasoning-format-adaptation-achieves-competitive-reasoning-by-adaptin). If format is a relatively cheap, learnable layer sitting on top of knowledge, then it makes sense that it's the thing most sensitive to scale-dependent quirks, and the thing you can deliberately steer. DPO does exactly this for small models: training on explicit wrong-vs-right examples targets the rigid format failures that plain fine-tuning leaves behind (see "Can small models match large models on function calling?").
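The wrong-vs-right pairing is what the standard DPO objective consumes: for each prompt it compares how much the policy has raised the log-probability of the preferred response, relative to a frozen reference model, against how much it has raised the rejected one. The sketch below computes that per-pair loss in plain Python. The log-probability values and the `beta` setting are made-up numbers for illustration, and this is the generic DPO formula rather than the training code from the cited function-calling work.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities for one (correct, incorrect) pair.
# Policy identical to the reference: the loss sits at log(2).
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)
# Policy has raised the correct call and lowered the malformed one.
improved = dpo_loss(-9.0, -14.0, -12.0, -11.0)

print("untrained:", round(untrained, 4))
print("improved: ", round(improved, 4))
```

Because the rejected example enters the loss directly, a malformed output is pushed down explicitly rather than merely left unreinforced, which is the property the paragraph above attributes to DPO.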

There's a tension worth sitting with. Across many models, outputs tend to converge: an "artificial hivemind" where different systems independently produce near-identical responses because they share training data and alignment recipes (see "Do different AI models actually produce diverse outputs?"). So at the *content* level shared data and alignment push toward sameness, while at the *format* level scale pushes toward different attractors. The takeaway: format isn't a window onto how smart a model is. It's a near-arbitrary winner of a collapse process, decided partly by size, and decoupled enough from knowledge that you can train it independently of what the model actually understands.

Sources 5 notes

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
