What output distribution properties make smaller models better for wide sampling?

This explores why a model's output distribution — how it spreads probability across possible next tokens — determines whether it's good for generating many varied samples, and why smaller models tend to win at that job.

The distribution-shape question sits behind a counterintuitive result: when you want lots of *distinct* outputs from a fixed sampling budget, a small model can beat a big one. The corpus's clearest answer is about where probability mass sits. Larger models concentrate mass on a few preferred continuations; they're confident, so their samples cluster. Smaller models, around 500M parameters, keep a flatter distribution and so yield more unique outputs per sample, which is exactly what you want for synthetic data generation, where coverage matters more than any single best answer (see "Why aren't bigger models better for generating diverse outputs?").
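The flat-versus-peaked contrast above can be made concrete with a small calculation. For independent draws from a categorical distribution, the expected number of distinct outcomes after n samples is the sum over outcomes of 1 - (1 - p_i)^n. The sketch below is only an illustration: the two ten-outcome distributions are made-up toy values, not measurements from any model, and `expected_distinct` is a hypothetical helper name.

```python
def expected_distinct(probs, n_samples):
    """Expected number of distinct outcomes seen in n_samples i.i.d. draws.

    Each outcome i is seen at least once with probability 1 - (1 - p_i)^n.
    """
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)


# Toy distributions over 10 possible outputs (assumed values, for illustration only).
peaked = [0.91] + [0.01] * 9   # one dominant continuation
flat = [0.1] * 10              # mass spread evenly

n = 10
peaked_unique = expected_distinct(peaked, n)
flat_unique = expected_distinct(flat, n)

print(f"peaked: {peaked_unique:.2f} distinct outputs expected from {n} samples")
print(f"flat:   {flat_unique:.2f} distinct outputs expected from {n} samples")
```

With these toy numbers the flat distribution is expected to yield roughly 6.5 distinct outputs from ten samples, against roughly 1.9 for the peaked one. Real model distributions are over sequences rather than ten outcomes, so the gap here only shows the direction of the effect.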

The interesting part is that this isn't only about size; it's about anything that sharpens the distribution. Post-training is a mass-concentrating force. Within the first epoch, RL converges on a single dominant format inherited from pretraining and actively suppresses the alternatives, and which format wins depends on model scale, not necessarily on which one performs best (see "Does RL training collapse format diversity in pretrained models?"). So a heavily RL-tuned large model is peaked twice over: concentrated to begin with, then collapsed onto one mode. That's the opposite of what wide sampling needs.
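One way to see what "sharpening" does to a distribution is to track its entropy. The sketch below uses softmax temperature purely as a stand-in for any mass-concentrating force. That is an assumption made for illustration: it does not model how RL or other post-training actually reshapes a model, and the logits are arbitrary toy values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means probability mass is more spread out."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.0]     # toy scores for four continuations (assumed)
base = softmax(logits, 1.0)       # unsharpened
sharp = softmax(logits, 0.25)     # sharpened stand-in for a "collapsed" model

print(f"base:  top prob {max(base):.2f}, entropy {entropy(base):.2f} nats")
print(f"sharp: top prob {max(sharp):.2f}, entropy {entropy(sharp):.2f} nats")
```

The sharpened distribution puts more mass on its top choice and has lower entropy, which is the property that makes repeated samples land on the same output.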

But the diversity story has a twist worth knowing: tuning doesn't always reduce variety. Preference tuning cuts lexical diversity in code generation, where there's a right answer to converge on, yet *increases* it in creative writing, where the reward favors distinctiveness (see "Does preference tuning always reduce diversity the same way?"). So 'flatter distribution = better for sampling' is domain-conditional. What you're sampling for decides whether mass concentration helps or hurts.
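To compare diversity across domains you need some measurable proxy. A minimal sketch follows, using distinct-n (unique n-grams divided by total n-grams across a set of samples), which is one common lexical-diversity measure. It is an assumption here, not necessarily the metric used in the cited note, and the sample strings are invented for illustration.

```python
def distinct_n(samples, n=2):
    """Fraction of n-grams across all samples that are unique (whitespace tokens)."""
    total = 0
    unique = set()
    for text in samples:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Invented examples: converged code-like outputs vs. varied creative lines.
repetitive = ["return a + b", "return a + b", "return a + b"]
varied = ["the moon hums softly", "rain writes on glass", "a fox dreams of snow"]

print(f"repetitive distinct-2: {distinct_n(repetitive):.2f}")
print(f"varied distinct-2:     {distinct_n(varied):.2f}")
```

Running the same measure on samples drawn before and after tuning, separately per domain, is one way to check whether tuning narrowed or widened the outputs for the task you care about.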

There's also a parallel argument from the inference-compute side that reframes why wide sampling pays off at all. Smaller models with more inference compute can match larger ones on hard prompts (see "Can inference compute replace scaling up model size?"), and reasoning systems gain efficiently by sampling many parallel trajectories rather than only going deeper: independent paths explore the solution space without inflating variance (see "Can reasoning systems scale wider instead of only deeper?"). A model whose distribution spreads across plausible paths is the right substrate for that width-wise scaling; a peaky one keeps drawing the same trajectory.
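The payoff from width can be sketched with a standard probability identity: if each sample independently succeeds with probability p, at least one of k samples succeeds with probability 1 - (1 - p)^k. The per-sample success rate below is an assumed toy value, `coverage_at_k` is a hypothetical helper name, and the independence assumption is exactly what a peaky model undermines when its draws repeat the same trajectory.

```python
def coverage_at_k(p_success, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p_success) ** k


p = 0.2  # assumed per-sample success rate on a hard prompt (toy value)
for k in (1, 5, 10, 20):
    print(f"k={k:>2}: coverage {coverage_at_k(p, k):.3f}")
```

Under these assumptions, ten samples lift a 20% per-sample success rate to roughly 89% coverage. If the samples are near-duplicates, the effective k is much smaller than the number of draws, and the gain shrinks accordingly.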

The practical payoff: the case for small models in agent pipelines isn't just cost. Small models handle most well-defined subtasks at a fraction of the price (see "Can small language models handle most agent tasks?"), and their broader output distributions make them the better generators when you need many candidate samples to filter, for example when distilling preference pairs from varied attempts (see "Can small models match large models on function calling?"). The thing you didn't know you wanted to know: 'better sampler' and 'better single-shot answerer' pull in opposite directions on the distribution, and the property that makes a model great at one can make it worse at the other.

Sources 7 notes

Why aren't bigger models better for generating diverse outputs?

Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. Which format wins depends on model scale, not necessarily on performance, and this dynamic is largely hidden when training starts from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
