Why do smaller models favor code formats while larger models prefer natural language?
This explores whether the corpus explains why small models lean on structured/code-like outputs while large models do better with free-form natural language — and the honest answer is that no note tackles this head-on, but several circle the same territory: small models' relationship to rigid format.
This reads the question as asking about a capacity difference: why structure helps small models and constrains large ones. The corpus doesn't have a paper aimed squarely at that comparison. What it does have is a cluster of findings about *why small models cling to format in the first place*, which is the more interesting half of the story.
The strongest thread is that small models struggle most precisely where rigid output structure is required, and that the fix is teaching them the shape, not the meaning. Small models fine-tuned with DPO on correct-vs-incorrect function-calling examples beat plain supervised fine-tuning because the negative examples directly target *format* failures: the model learns the rigid schema it otherwise fumbles (Can small models match large models on function calling?). Read alongside the finding that small models are genuinely sufficient for the repetitive, well-defined slices of agent work (Can small language models handle most agent tasks?), a picture emerges: code-like formats are a *scaffold*. A constrained output space (call this function, fill these fields) does the structuring work the small model can't generate on its own.
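To make the idea of a format-targeted negative example concrete, here is a minimal sketch of how one preference pair might be assembled. It is an illustration, not the method used in the cited note: the `WEATHER_SCHEMA` tool, the `is_valid_call` and `build_pair` helpers, and the `prompt`/`chosen`/`rejected` record layout are all assumptions introduced here. The sketch also judges candidates on format alone, whereas the note describes correct and incorrect examples drawn from a teacher model.

```python
import json

# Hypothetical tool schema: the function name plus required argument types.
WEATHER_SCHEMA = {"name": "get_weather", "args": {"city": str, "days": int}}


def is_valid_call(raw, schema):
    """Return True if raw is JSON naming the right function with exactly the typed args."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # not even parseable: the most basic format failure
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema["args"]):
        return False  # missing or extra fields
    # Strict type check, so a quoted number such as "3" is rejected.
    return all(type(args[key]) is expected for key, expected in schema["args"].items())


def build_pair(prompt, candidates, schema):
    """Pair the first schema-valid candidate (chosen) with the first invalid one (rejected)."""
    valid = [c for c in candidates if is_valid_call(c, schema)]
    invalid = [c for c in candidates if not is_valid_call(c, schema)]
    if not valid or not invalid:
        return None  # a preference pair needs one example of each kind
    return {"prompt": prompt, "chosen": valid[0], "rejected": invalid[0]}


good = '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}'
bad = '{"name": "get_weather", "arguments": {"city": "Oslo", "days": "3"}}'
pair = build_pair("Weather in Oslo for the next 3 days?", [bad, good], WEATHER_SCHEMA)
```

The rejected example here differs from the chosen one only in the type of a single field, which is the kind of near-miss that a contrast between chosen and rejected outputs can expose and that a set of correct examples alone does not.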
Why might that scaffold matter less, or even get in the way, at larger scale? Two notes hint at the mechanism. MobileLLM shows tiny models (125M–350M parameters) gain accuracy from being deep-and-thin, composing abstract concepts up through layers rather than spreading capacity across width (Does depth matter more than width for tiny language models?). Abstraction lives in depth, and small models have little of it to spare. And the logit-lens work shows that models trained with hidden chain-of-thought tokens compute their actual answer in early layers, then spend the final layers *suppressing* that representation to emit format-compliant filler tokens (Do transformers hide reasoning before producing filler tokens?). Format compliance, in other words, looks like a tax paid in the late layers: a tax a large, abstraction-rich model may be able to skip in favor of open-ended language, but one a small model is happy to pay because the structure substitutes for reasoning it doesn't have.
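For readers unfamiliar with the logit-lens technique mentioned above, the core step can be sketched in a few lines: take the hidden state at each layer, project it through the unembedding matrix, and rank the vocabulary. Everything in this toy is an assumption made for illustration: the two-token vocabulary, the identity-like `unembed` matrix, and the hand-picked hidden states are invented, and the final normalization that real implementations usually apply before unembedding is omitted. It shows the mechanics only and reproduces no result from the cited note.

```python
def logit_lens(hidden_by_layer, unembed, vocab):
    """For each layer, project the hidden state onto the vocabulary and rank the tokens."""
    ranked_by_layer = []
    for hidden in hidden_by_layer:
        # One logit per vocabulary entry: dot product of its unembedding row with the state.
        logits = [sum(w * x for w, x in zip(row, hidden)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        ranked_by_layer.append([vocab[i] for i in order])
    return ranked_by_layer


# Made-up toy values: "42" stands in for an answer token, "..." for a filler token.
vocab = ["42", "..."]
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_by_layer = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]

ranking = logit_lens(hidden_by_layer, unembed, vocab)
# ranking == [["42", "..."], ["42", "..."], ["...", "42"]]
```

In this toy the answer token ranks first in the early layers and drops to second place in the last one, which is the shape of the pattern the note describes: the answer stays readable in lower-ranked predictions even when the emitted token is filler.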
There's a cautionary undercurrent worth knowing about. When models *appear* to reason inside a rigid format, they're often exploiting the format rather than thinking: defaulting to conservative or template-matched answers that look reasoned but aren't (Are models actually reasoning about constraints or just defaulting conservatively?), or emitting plausible-looking values for problems they recognize by template without actually solving them (Do large language models actually perform iterative optimization?). So a small model's preference for code formats may be less a *strength* than a tell: structure is where pattern-matching can masquerade as competence, which is exactly where a model short on real abstraction would gravitate.
If you want the direct empirical comparison — small-favors-code vs. large-favors-language, measured — this collection doesn't have it. But it gives you the better question underneath: format isn't a stylistic preference, it's a proxy for how much abstraction a model can hold, and the smaller the model, the more the structure is doing the thinking for it.
Sources (6 notes)
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1–3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.