Do instruction-tuned models learn tasks or just output format distributions?

This explores a provocative claim — that 'instruction tuning' may teach models *where* to put their answer (the output space) rather than *how* to actually do the task — and what the corpus says about how deep that goes.

This explores whether instruction-tuned models genuinely understand the tasks they're given, or mostly learn the *shape* of the expected output. The corpus has a startlingly direct answer to the literal version of the question, and then a set of adjacent findings that suggest the same pattern shows up at every stage of post-training.

The sharpest evidence is a study where models trained on semantically empty or even deliberately *wrong* instructions performed about as well as models trained on full correct ones — 43% versus a 42.6% baseline Does instruction tuning teach task understanding or output format?. In other words, the content of the instruction was nearly irrelevant; what transferred was knowledge of the output space. A related finding makes the same point from the other direction: aligned models like Llama-3-Instruct will auto-regressively generate high-quality instructions when fed *only* the pre-query formatting tokens, no prompt needed Can aligned LLMs generate their own training data?. The format scaffolding alone is doing remarkable work — which is exactly what you'd expect if format, not task semantics, is what tuning installs.

What's striking is how this 'format over substance' pattern recurs beyond supervised instruction tuning. In reinforcement learning, RL doesn't teach new behavior so much as amplify one dominant format already latent in pretraining while suppressing the alternatives — and which format wins depends on model *scale*, not on which performs best Does RL training collapse format diversity in pretrained models?. And when researchers probe whether RL installs genuine reasoning procedures, models that look strong in-distribution collapse on slight out-of-distribution variants, revealing they've sharpened template-matching rather than learned a procedure Do fine-tuned language models actually learn optimization procedures?. The same story appears in latent computation: asked to run iterative numerical methods, models recognize a problem as template-similar and emit plausible-but-wrong values instead of actually executing the steps Do large language models actually perform iterative optimization?.

But the corpus doesn't let 'it's all just format' stand as the final word — it shows where the format/task boundary gets attacked. DPO works for small models precisely *because* it supplies explicit negative examples that target rigid output-format failures SFT can't fix Can small models match large models on function calling?. Checklist-based rewards decompose instruction quality into verifiable sub-criteria, which reduces overfitting to the superficial artifacts that fool holistic reward models — an attempt to make the signal reward substance over surface Can breaking down instructions into checklists improve AI reward signals?. And the fragility of pure instruction-following shows up directly: as you stack instructions, compliance degrades predictably, with even the best models hitting only 68% at high density How does instruction density affect model performance?.

The thing you didn't know you wanted to know: the question isn't really 'tasks *or* format' — it's that format-learning is the cheap, default thing post-training installs at every stage, and the interesting research is about which training signals (explicit negatives, decomposed verifiable rewards) can force a model past surface mimicry into something that survives an out-of-distribution test.

Sources 8 notes

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

How does instruction density affect model performance?

IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models maintain ~150 instructions then fail steeply). Even best models reach only 68% accuracy at maximum density.

Do instruction-tuned models learn tasks or just output format distributions?

Sources 8 notes

Next inquiring lines