How do complexity and diversity affect model performance differently?
This explores how two properties of training and reasoning, complexity (how hard or layered a problem is) and diversity (how varied the data or outputs are), act on model performance in different and sometimes opposite ways, rather than being two flavors of the same 'difficulty' knob.
This reads the question as asking whether complexity and diversity are separate levers. The corpus says emphatically yes: they act on different parts of performance and shouldn't be collapsed into one 'difficulty' score. The cleanest statement comes from work disentangling synthetic-data properties: quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both at once (How do quality, diversity, and complexity affect synthetic data differently?). So complexity amplifies both kinds of generalization, while diversity is specifically what lets a model handle inputs unlike anything it trained on. The danger flagged there is that most evaluation collapses all three into a single quality metric, which is why self-improvement loops degrade unnoticed: they keep 'quality' up while losing diversity irreversibly.
The sharpest twist on complexity is that it may not be the real failure axis at all. Large reasoning models break not when problems cross a complexity threshold but when they hit an instance unlike those they have seen before (Do language models fail at reasoning due to complexity or novelty?). Models fit instance-level patterns rather than general algorithms, so a long, 'complex' chain succeeds if it resembles training instances, and a short, 'simple' one fails if it's novel. That reframes complexity as often a proxy for novelty, which is really a diversity-of-exposure problem wearing a complexity costume.
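One way to check whether failures track complexity or novelty is to break accuracy down along both axes instead of one. The following is a minimal sketch, assuming each evaluation record already carries a complexity bucket and a flag for whether a similar instance appeared in training. The field names, the `accuracy_by_cell` helper, and the records themselves are hypothetical and only exercise the code.

```python
from collections import defaultdict


def accuracy_by_cell(records):
    """Group eval records by (complexity bucket, seen/novel) and average correctness."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [num_correct, num_total]
    for r in records:
        cell = (r["complexity"], "novel" if r["novel"] else "seen")
        totals[cell][0] += int(r["correct"])
        totals[cell][1] += 1
    return {cell: correct / total for cell, (correct, total) in totals.items()}


# Invented records that only exercise the code; they are not experimental results.
records = [
    {"complexity": "high", "novel": False, "correct": True},
    {"complexity": "high", "novel": False, "correct": True},
    {"complexity": "low", "novel": True, "correct": False},
    {"complexity": "low", "novel": True, "correct": True},
]

for cell, acc in sorted(accuracy_by_cell(records).items()):
    print(cell, acc)
```

If accuracy differs more between the seen and novel cells within one complexity bucket than it does between buckets, that pattern would be consistent with novelty, not complexity, being the failure axis.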
Diversity, meanwhile, turns out to be fragile and direction-dependent in ways complexity isn't. Preference tuning reduces diversity in code (where convergence on a correct solution is rewarded) but increases it in creative writing (where distinctiveness pays): same procedure, opposite effect depending on domain (Does preference tuning always reduce diversity the same way?). RL post-training collapses onto a single pretraining format within the first epoch, and the winning format tracks model scale, not necessarily performance (Does RL training collapse format diversity in pretrained models?). And bigger isn't better for variety: models of around 500M parameters generate more unique samples per budget because large models concentrate probability mass on preferred outputs (Why aren't bigger models better for generating diverse outputs?).
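To make "unique samples per budget" concrete, one option is to count distinct outputs among a fixed number of samples. The sketch below is illustrative only: `unique_ratio` and `distinct_n` are hypothetical helper names, lowercase and whitespace normalization is an assumed notion of "same output", and the two sample lists are made-up strings, not outputs from any model or cited study.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Assumed notion of 'same output': case- and whitespace-insensitive."""
    return " ".join(text.lower().split())


def unique_ratio(samples: list[str]) -> float:
    """Fraction of a fixed sampling budget that yields distinct outputs."""
    if not samples:
        return 0.0
    return len({normalize(s) for s in samples}) / len(samples)


def distinct_n(samples: list[str], n: int = 2) -> float:
    """Distinct n-grams divided by total n-grams across all samples."""
    counts: Counter = Counter()
    for s in samples:
        tokens = normalize(s).split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Made-up samples: one generator repeats a favorite, the other spreads out.
concentrated = ["the cat sat", "The cat sat", "the cat sat", "a dog ran"]
spread = ["the cat sat", "a dog ran", "birds fly south", "rain fell hard"]

print(unique_ratio(concentrated), unique_ratio(spread))
print(distinct_n(concentrated), distinct_n(spread))
```

Both measures are computed over the same budget of four samples, so a generator that piles probability mass on one favorite scores lower even if every individual sample is fine.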
The two levers visibly diverge when models feed into search or selection at inference. There, training for diversity beats optimizing a single scalar score: varied-but-competent outputs let evolutionary search explore and recombine modes that an entropy-collapsed policy cannot reach at all (Should training maximize diversity when models feed into search?). Critique-in-the-loop preserves that solution diversity during training itself, counteracting the tail-narrowing that otherwise sets in across self-training rounds (Do critique models improve diversity during training itself?). But raw diversity is no free lunch. Different models converge on near-identical answers anyway (the 'Artificial Hivemind'), so naive ensembling buys less variety than you'd hope (Do different AI models actually produce diverse outputs?). And diversity only converts to better output when paired with genuine expertise or a verifiable selection signal: diverse-but-weak agents underperform a single competent one (Does cognitive diversity alone improve multi-agent ideation quality?; When can weak models match strong model performance?).
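The gap between coverage and selection can be sketched in a few lines. This is a toy under stated assumptions, not the method of any cited work: `candidates` stands in for sampled model outputs, `is_forty_two` is a hypothetical soundness check playing the role of a test or type check, and majority vote is used as the verifier-free baseline.

```python
from collections import Counter
from typing import Callable, Optional


def covered(candidates: list[str], verifier: Callable[[str], bool]) -> bool:
    """Coverage: at least one sampled candidate is correct."""
    return any(verifier(c) for c in candidates)


def select_by_majority(candidates: list[str]) -> Optional[str]:
    """Verifier-free selection: return the most frequent candidate."""
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]


def select_by_verifier(
    candidates: list[str], verifier: Callable[[str], bool]
) -> Optional[str]:
    """Selection with an external soundness signal: first candidate that passes."""
    return next((c for c in candidates if verifier(c)), None)


def is_forty_two(candidate: str) -> bool:
    """Toy soundness check: the candidate must parse to the integer 42."""
    try:
        return int(candidate) == 42
    except ValueError:
        return False


# Made-up candidate pool: the correct answer is present but unpopular.
candidates = ["41", "41", "forty-two", "42", "41"]

print(covered(candidates, is_forty_two))             # the pool contains a correct answer
print(select_by_majority(candidates))                # the popular answer, which is wrong here
print(select_by_verifier(candidates, is_forty_two))  # the answer that passes the check
```

In this toy pool, sampling achieves coverage but majority vote still returns the wrong answer. Only the external check turns the latent correct proposal into the selected one.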
The thing you didn't know you wanted to know: complexity mostly amplifies whatever generalization you already have, while diversity is the only lever that buys you *new* coverage. But diversity is also the one that silently decays under almost every standard training objective (RLHF, RL, self-improvement), and it only pays off if something downstream (search, a soundness check, real expertise) can select the good modes out of the variety you preserved.
Sources (10 notes)
Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain succeeds when the model was trained on similar instances, regardless of the chain's length.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Sampling alone amplifies coverage but cannot select correct solutions. Reliable performance matching requires external soundness signals—tests, proofs, or type checks—that convert latent correct proposals into actual selections.