Why do production systems optimize for three model classes instead of foundation models?
This reads as: why do teams shipping real products reach for several specialized model classes (small generators, preference/DPO-tuned models, inference-scaled reasoners) rather than leaning on one big foundation model to do everything? The corpus doesn't name exactly 'three classes,' but it does explain why bigger-and-singular loses to specialized-and-plural in production.
The corpus keeps landing on the same answer: scale buys less than you'd expect, and specialization buys more. The cleanest evidence is that bigger is often actively worse for specific jobs. For generating diverse outputs, models around 500M parameters produce more unique samples per budget than large ones, because large models concentrate probability mass on their few preferred answers (see: Why aren't bigger models better for generating diverse outputs?). And small models tuned with DPO on a teacher's correct/incorrect examples can match large models on function calling and reasoning, precisely because explicit negative examples fix the rigid-format failures that plain scale doesn't (see: Can small models match large models on function calling?).
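To make "unique samples per budget" concrete, here is a minimal sketch of one way to measure it. Everything in it is an assumption for illustration: the two toy categorical distributions stand in for a "peaked" and a "flatter" generator, the cost model is a flat price per sample, and none of the names come from the underlying notes.

```python
import random
from typing import Callable, Hashable, List


def unique_per_budget(
    sample_fn: Callable[[], Hashable],
    budget: float,
    cost_per_sample: float,
) -> float:
    """Distinct outputs obtained per unit of budget (illustrative metric)."""
    n_samples = int(budget // cost_per_sample)
    seen = {sample_fn() for _ in range(n_samples)}
    return len(seen) / budget


def make_sampler(weights: List[float], rng: random.Random) -> Callable[[], int]:
    """Return a sampler over item indices with the given weights."""
    items = list(range(len(weights)))
    return lambda: rng.choices(items, weights=weights, k=1)[0]


rng = random.Random(0)
N_OUTPUTS = 50  # hypothetical number of possible distinct outputs

# Toy "peaked" generator: 90% of the mass sits on two favorite outputs.
peaked_weights = [0.45, 0.45] + [0.10 / (N_OUTPUTS - 2)] * (N_OUTPUTS - 2)
# Toy "flatter" generator: mass spread evenly over all outputs.
flat_weights = [1.0 / N_OUTPUTS] * N_OUTPUTS

peaked = unique_per_budget(make_sampler(peaked_weights, rng), budget=200, cost_per_sample=1.0)
flat = unique_per_budget(make_sampler(flat_weights, rng), budget=200, cost_per_sample=1.0)

print(f"peaked: {peaked:.3f} unique per unit budget")
print(f"flat:   {flat:.3f} unique per unit budget")
```

With real models, `sample_fn` would wrap a generation call and `cost_per_sample` would differ by model size, which is where a cheaper small model gains a second advantage on top of its flatter distribution.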
The second pressure is that foundation-model scale hits ceilings that more parameters don't move. On genuine constrained optimization, LLMs plateau at ~55–60% constraint satisfaction regardless of architecture, parameter count, or training regime: a wall, not a scaling gap (see: Do larger language models solve constrained optimization better?). Reasoning variants with extended chain-of-thought don't systematically beat standard models on these numerical tasks; they produce more text, not more computation (see: Do reasoning models actually beat standard models on optimization?). Underneath, foundation models tend to learn task-specific heuristics rather than the unified world models you'd hope to reuse everywhere (see: Do foundation models learn world models or task-specific shortcuts?), and RL fine-tuning often sharpens template-matching memorization rather than installing transferable procedures (see: Do fine-tuned language models actually learn optimization procedures?). A model that's secretly a bag of slice-dependent shortcuts is a poor single foundation to build a whole product on.
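For readers who want to see what a constraint-satisfaction number might be measuring, here is a minimal sketch. It assumes one particular definition (the share of proposed solutions that satisfy every constraint, with constraints written as linear inequalities A x <= b); the notes do not say which definition the cited work uses, and the proposals below are made-up toy values, not results.

```python
from typing import List, Sequence


def is_feasible(
    x: Sequence[float],
    A: Sequence[Sequence[float]],
    b: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """True if x satisfies every row of A x <= b (within a small tolerance)."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )


def satisfaction_rate(
    solutions: List[Sequence[float]],
    A: Sequence[Sequence[float]],
    b: Sequence[float],
) -> float:
    """Fraction of proposed solutions that meet all constraints."""
    if not solutions:
        raise ValueError("need at least one proposed solution")
    return sum(is_feasible(x, A, b) for x in solutions) / len(solutions)


# Toy feasible region: x + y <= 4, x >= 0, y >= 0.
A = [[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b = [4.0, 0.0, 0.0]

# Hypothetical model outputs, invented for illustration only.
proposed = [[1.0, 2.0], [3.0, 3.0], [-1.0, 1.0], [2.0, 2.0]]

rate = satisfaction_rate(proposed, A, b)
print(f"constraint satisfaction: {rate:.0%}")
```

A stricter or looser definition (for example, the fraction of individual constraints met) would give different numbers, so any comparison against the ~55–60% figure should first check how the source defines the metric.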
The third pressure is that compute can be traded against model size, which makes a portfolio of model classes economically rational. Inference-time compute substitutes for parameter scaling on hard prompts: a smaller model thinking longer can match a bigger one, which means pretraining and inference budgets aren't independent levers (see: Can inference compute replace scaling up model size?). So a production stack can route easy work to a cheap small model, hard work to a small model with more inference compute, and reserve heavy training only where it pays.
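The routing idea above can be sketched as a small policy function. This is a sketch under stated assumptions, not a description of any real system: the difficulty score, the thresholds, the sample counts, and the model labels are all hypothetical, and "more inference compute" is modeled here simply as drawing more samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    model: str        # hypothetical model label, not a real product name
    num_samples: int  # samples drawn at inference time (proxy for compute)


def route(
    difficulty: float,
    easy_max: float = 0.3,   # assumed threshold below which work is "easy"
    hard_min: float = 0.8,   # assumed threshold above which we escalate
    max_samples: int = 16,   # assumed cap on inference-time samples
) -> Plan:
    """Pick a model class and inference budget from a difficulty score in [0, 1]."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    if difficulty <= easy_max:
        # Easy work: cheap small model, single pass.
        return Plan("small-model", 1)
    if difficulty < hard_min:
        # Harder work: same small model, more inference-time compute.
        frac = (difficulty - easy_max) / (hard_min - easy_max)
        return Plan("small-model", 2 + round(frac * (max_samples - 2)))
    # Hardest work: fall back to the model that got the heavy training.
    return Plan("heavily-trained-model", 1)


for score in (0.1, 0.5, 0.95):
    print(score, route(score))
```

In practice the difficulty score is the hard part; it might come from a lightweight classifier or from disagreement among a few cheap samples, and the thresholds would need tuning against real traffic.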
There's a quieter reason hiding here too: post-training does different, sometimes opposite, things by domain. RLHF reduces lexical-syntactic diversity in code but increases it in creative writing, because each domain rewards a different shape of output (see: Does preference tuning always reduce diversity the same way?), and RL tends to collapse onto a single dominant pretraining format while suppressing alternatives (see: Does RL training collapse format diversity in pretrained models?). A single foundation model tuned one way can't be simultaneously convergent for code and divergent for prose, so teams keep distinct tuned classes rather than one compromised generalist.
The thing you may not have expected: the case for specialized model classes isn't mostly about cost. It's that 'one big model' quietly fails at diversity, hits hard accuracy ceilings, and gets pulled in contradictory directions by its own tuning. And even when you do want one foundation, it heightens rather than removes your need for real empirical data to keep it honest (see: Do foundation models actually reduce our need for real data?).
Sources (10 notes)
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.
Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Powerful foundation models don't eliminate the need for real data—they heighten it. Without empirical anchoring, iterative prompt refinement creates epistemic circularity where users confirm their own beliefs rather than test them.