Does instruction tuning teach task understanding or output format?
Exploring whether models trained on instructions actually learn the task semantics or merely learn to match output distributions. This matters because it challenges assumptions about how fine-tuning improves model behavior.
"Do Models Really Learn to Follow Instructions?" introduces two controls. First, simplified task definitions strip all semantic content from the instruction, leaving only output-space information (e.g., "output one of: A, B, C"). Second, delusive examples pair inputs with deliberately incorrect outputs. Models trained on either achieve performance comparable to models trained on full, correct instructions; a random baseline achieves 42.6% exact-match versus instruction tuning's 43%.
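To make the two controls concrete, here is a minimal sketch. The instruction strings and examples below are hypothetical illustrations, not the paper's actual materials; only the exact-match metric and the general shape of the conditions come from the summary above:

```python
# Hypothetical illustration of the two control conditions.
# None of these strings are taken from the paper.

full_instruction = (
    "Classify the sentiment of the review as positive, negative, or "
    "neutral, based on the overall tone of the text."
)

# Control 1: simplified instruction -- all task semantics stripped,
# only the output space remains.
simplified_instruction = "Output one of: positive, negative, neutral."

# Control 2: delusive examples -- well-formed demonstrations whose
# input->output mappings are deliberately wrong.
delusive_examples = [
    ("The movie was wonderful.", "negative"),        # wrong on purpose
    ("Terrible service, never again.", "positive"),  # wrong on purpose
]

def exact_match(predictions, references):
    """Exact-match accuracy, the metric behind the 42.6% vs 43% comparison."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["positive", "negative"], ["positive", "positive"]))  # 0.5
```

The point of the sketch: both controls preserve exactly what the finding says matters (the label set and output format) while destroying what it says does not (the task description and the correctness of the mapping).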
The implication: instruction tuning primarily teaches the model to map its existing capabilities to the expected output format, not to understand or execute the task as described in the instruction. The semantic content of the instruction — what the task is, how to approach it, what constitutes a correct answer — appears largely irrelevant. What matters is the output distribution: how many classes, what format, what vocabulary.
This connects to a broader pattern. Does training data format shape reasoning strategy more than domain? showed a 7.5x stronger effect of format over domain. Can models pass tests while missing the actual grammar? showed that correct outputs can mask reliance on surface heuristics. The instruction tuning finding adds: even explicit instructions about the task are largely ignored in favor of format signals.
A complementary theory from "Are Emergent Abilities just ICL?" (arXiv 2309.01809) offers a mechanistic explanation: instruction tuning enables "implicit in-context learning", mapping instructions to the form required for ICL rather than creating new functional abilities. The evidence: purported emergent abilities are explained by a combination of in-context learning, model memory, and linguistic knowledge. The model's sensitivity to minor prompt variations and its tendency to hallucinate are inconsistent with genuine emergent functional abilities, but consistent with a model that maps prompts to ICL patterns. This reframes safety concerns: if prompts function as "training mechanisms" rather than interfaces to inherent abilities, the risk lies in which ICL patterns exist, not in which abilities have "emerged."
The IT Survey (same source) documents the concern from the other direction: "there has been an intense criticism that IT only captures surface-level patterns and styles rather than comprehending and learning the task." Combined with the False Promise finding that model imitation captures style not factuality, a clear pattern emerges: fine-tuning-based adaptation — whether through imitation, instruction tuning, or domain SFT — preferentially captures distributional and formatting information while leaving underlying capabilities largely unchanged. The capability bottleneck is in the base model, not the adaptation method.
Webson & Pavlick (2021) provide the prompting-level parallel. Evaluating 30+ manually written templates and 13 sets of target words across 390+ prompts, they find models learn identically fast from irrelevant or misleading templates as from instructive ones. Models are "much more sensitive to the choice of LM target words as opposed to the meaning of the instruction templates." Instruction-tuned models can be "too robust" — less sensitive to prompt semantics than non-IT equivalents, suggesting IT trains a form of prompt-blindness. This holds from 235M to 175B parameters. The convergence is striking: both the fine-tuning and the prompting literature arrive at the same conclusion from opposite directions — the semantic content of instructions is largely inert, and what transfers is format and output space information.
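The template/target-word distinction can be sketched concretely. The prompt strings and label words below are hypothetical illustrations, not the paper's materials; they show the experimental contrast: three templates of very different semantic quality sharing one fixed set of target words, where the reported finding is that learning speed is nearly identical across templates but highly sensitive to the target words:

```python
# Hypothetical templates in the spirit of Webson & Pavlick (2021);
# these strings are illustrative, not the paper's actual templates.
premise = "A man is playing a guitar."
hypothesis = "A man is making music."

templates = {
    # Instructive: states the inference task explicitly.
    "instructive": f'{premise} Given the above, is it true that "{hypothesis}"?',
    # Irrelevant: no connection to the task at all.
    "irrelevant": f'{premise} Is the weather nice today? "{hypothesis}"',
    # Misleading: describes a different task entirely.
    "misleading": f'{premise} Summarize the following in one word: "{hypothesis}"',
}

# Held fixed across all templates: the LM target words the model must emit.
# Performance tracks the choice of these words far more than the meaning
# of the surrounding template.
target_words = {"entailment": "yes", "non-entailment": "no"}

for name, prompt in templates.items():
    print(f"{name:11s} | {prompt} -> {target_words['entailment']}")
```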
Source: Training Fine Tuning; enriched from LLM Architecture
Related concepts in this collection
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Relation: format > domain at 7.5x; this note adds format > instruction semantics.
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Relation: the same mechanism, in the linguistic domain.
- Can small models reason well by just learning output format?
  Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
  Relation: LoRA as a format adapter aligns with instruction tuning as a format teacher.
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  Relation: SFT raises accuracy because it teaches the output format, not because it improves reasoning.
- Why do chain-of-thought examples fail across different conditions?
  Chain-of-thought exemplars show surprising sensitivity to order, complexity level, diversity, and annotator style. Understanding these brittleness dimensions could reveal what makes reasoning prompts robust or fragile.
  Relation: complementary evidence of format over substance. Instruction tuning achieves accuracy through format matching alone, while CoT exemplar brittleness shows reasoning performance depends on surface exemplar properties (order, style, complexity) rather than semantic content.
Original note title: instruction tuning teaches output format distribution not task understanding — simplified and delusive instructions achieve comparable performance