
Does instruction tuning teach task understanding or output format?

Exploring whether models trained on instructions actually learn the task semantics or merely learn to match output distributions. This matters because it challenges assumptions about how fine-tuning improves model behavior.

Note · 2026-02-22 · sourced from Training Fine Tuning

"Do Models Really Learn to Follow Instructions?" creates two devastating controls. First, simplified task definitions that strip all semantic content, leaving only output space information (e.g., "output one of: A, B, C"). Second, delusive examples containing incorrect input-output mappings. Models trained on either achieve comparable performance to models trained on full, correct instructions. A random baseline achieves 42.6% exact-match versus instruction tuning's 43%.

The implication: instruction tuning primarily teaches the model to map its existing capabilities to the expected output format, not to understand or execute the task as described in the instruction. The semantic content of the instruction — what the task is, how to approach it, what constitutes a correct answer — appears largely irrelevant. What matters is the output distribution: how many classes, what format, what vocabulary.
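To see why a label-space-only strategy can land within half a point of instruction tuning, here is a small worked sketch of two such baselines; the toy label mix below is hypothetical, and only the 42.6% vs 43% comparison comes from the paper:

```python
# Toy numbers, invented here; only the 42.6% vs 43% comparison is the paper's.
from collections import Counter

def uniform_guess_accuracy(gold_labels: list[str]) -> float:
    """Expected exact-match when guessing uniformly over the observed label space.
    Equals 1/K regardless of how the gold labels are distributed."""
    return 1.0 / len(set(gold_labels))

def majority_accuracy(gold_labels: list[str]) -> float:
    """Expected exact-match when always emitting the most frequent label."""
    top_count = Counter(gold_labels).most_common(1)[0][1]
    return top_count / len(gold_labels)

gold = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
print(uniform_guess_accuracy(gold))  # 0.333...
print(majority_accuracy(gold))       # 0.5
```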

This connects to a broader pattern. "Does training data format shape reasoning strategy more than domain?" showed a 7.5x stronger effect of format than of domain. "Can models pass tests while missing the actual grammar?" showed that correct outputs can mask reliance on surface heuristics. The instruction tuning finding adds: even explicit instructions about the task are largely ignored in favor of format signals.

A complementary theory from "Are Emergent Abilities just ICL?" (arXiv:2309.01809) provides the mechanistic explanation: instruction tuning enables "implicit in-context learning" — mapping instructions to the form required for ICL rather than creating new functional abilities. The evidence: purported emergent abilities are explained by a combination of in-context learning, model memory, and linguistic knowledge. The model's sensitivity to minor prompt variations and tendency to hallucinate are inconsistent with genuine emergent functional abilities but consistent with a model that maps prompts to ICL patterns. This reframes safety concerns: if prompts function as "training mechanisms" rather than interfaces to inherent abilities, the safety landscape changes — the risk is in what ICL patterns exist, not in what abilities have "emerged."
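As a purely illustrative contrast (the prompts are invented here, not drawn from the paper), the implicit-ICL account treats these two prompts as the same request:

```python
# Illustrative prompts, invented here (not from the paper). Under the
# implicit-ICL account, instruction tuning maps the first prompt onto the
# pattern-completion behavior the second prompt elicits directly.
instruction_prompt = (
    "Classify the sentiment of the review as positive or negative.\n"
    "Review: The plot dragged badly.\n"
    "Sentiment:"
)

icl_prompt = (
    "Review: A stunning, heartfelt film.\nSentiment: positive\n"
    "Review: Clumsy and overlong.\nSentiment: negative\n"
    "Review: The plot dragged badly.\n"
    "Sentiment:"
)

# On this view the instruction selects an existing output pattern; it does not
# install a new functional ability.
```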

The IT Survey (same source) documents the concern from the other direction: "there has been an intense criticism that IT only captures surface-level patterns and styles rather than comprehending and learning the task." Combined with the False Promise finding that model imitation captures style, not factuality, a clear pattern emerges: fine-tuning-based adaptation — whether through imitation, instruction tuning, or domain SFT — preferentially captures distributional and formatting information while leaving underlying capabilities largely unchanged. The capability bottleneck is in the base model, not the adaptation method.

Webson & Pavlick (2021) provide the prompting-level parallel. Evaluating 30+ manually written templates and 13 sets of target words across 390+ prompts, they find models learn identically fast from irrelevant or misleading templates as from instructive ones. Models are "much more sensitive to the choice of LM target words as opposed to the meaning of the instruction templates." Instruction-tuned models can be "too robust" — less sensitive to prompt semantics than non-IT equivalents, suggesting IT trains a form of prompt-blindness. This holds from 235M to 175B parameters. The convergence is striking: both the fine-tuning and the prompting literature arrive at the same conclusion from opposite directions — the semantic content of instructions is largely inert, and what transfers is format and output space information.


Source: Training Fine Tuning; enriched from LLM Architecture

Original note title: instruction tuning teaches output format distribution not task understanding — simplified and delusive instructions achieve comparable performance