How do output format constraints compare to input exemplar brittleness?

This explores two ways prompt *surface* sabotages model substance — squeezing the output into a rigid format vs. depending on hand-picked input examples — and asks whether the corpus sees them as the same underlying problem or two different ones.

The shape of a prompt can quietly wreck what the model actually does in two ways: by constraining the *output* (forcing JSON, a schema, a fixed template) or by depending on *input* examples (the few-shot demonstrations you paste in to steer it). The corpus treats both as symptoms of one deeper fact: for these models, form and content compete for the same limited budget, and neither side of the prompt is as stable as it looks.

On the output side, strict formatting measurably eats reasoning. When a schema is imposed, accuracy drops across multiple models, and loosening the format (keeping the type but dropping the rigid schema) recovers most of what was lost, which suggests compliance and reasoning draw on the same generation capacity (see "Do strict output formats hurt LLM reasoning ability?"). There's an even sharper version of this: models trained with hidden chain-of-thought tokens compute the right answer in their earliest layers (layers 1-3 under logit lens analysis), then suppress it in the final layers to emit format-compliant filler tokens. The reasoning is still recoverable from lower-ranked token predictions; the format requirement buries it rather than erasing it (see "Do transformers hide reasoning before producing filler tokens?").
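
One way to picture the "keep the type, drop the rigid schema" result is as a parsing choice: rather than requiring the whole reply to be schema-conformant, let the model reason in free text and pull a typed answer out afterwards. The sketch below is only an illustration under stated assumptions, not the procedure used in the cited note; the function names and the `Answer:` marker convention are hypothetical.

```python
import json
import re
from typing import Optional


def parse_strict(reply: str) -> Optional[str]:
    """Strict schema: the entire reply must be a JSON object with an 'answer' key."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and "answer" in payload:
        return str(payload["answer"])
    return None


def parse_loose(reply: str) -> Optional[str]:
    """Loose format: free-form reasoning, ending with an 'Answer: <value>' line.

    The 'Answer:' marker is an assumed convention for this sketch.
    """
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", reply, flags=re.MULTILINE)
    # Take the last marker so intermediate guesses in the reasoning are ignored.
    return matches[-1] if matches else None
```

A reply such as `"17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42"` is rejected by `parse_strict` but yields `"42"` from `parse_loose`, so the reasoning text is free to appear without breaking the downstream consumer.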

On the input side, exemplars turn out to be brittle along four separate axes at once: reordering them produces 3.3% swings, mismatching their complexity to the problem or giving them too little diversity degrades performance, and having a different person write them produces up to 28.2% variance. These effects compound, which is why hand-curated examples never transfer cleanly across tasks (see "Why do chain-of-thought examples fail across different conditions?"). The unsettling part is *why* exemplars work at all: logically invalid reasoning examples performed about as well as valid ones on BIG-Bench Hard, because the model is copying the *form* of reasoning, not the inference (see "Does logical validity actually drive chain-of-thought gains?"). So exemplars don't teach the model to think; they configure a surface pattern, and that pattern is fragile to cosmetic change.
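
Order sensitivity is the easiest of these axes to measure for yourself. The sketch below scores the same exemplar set under every ordering and reports the lowest and highest accuracy. It is a minimal harness under assumptions: `model` is any callable from prompt string to answer string, the `Q:`/`A:` prompt layout is invented for illustration, and `recency_stub` is a hypothetical stand-in for a real model, not a claim about how any model behaves.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Exemplar = Tuple[str, str]  # (question, worked answer)
Example = Tuple[str, str]   # (question, gold answer)


def build_prompt(exemplars: Sequence[Exemplar], question: str) -> str:
    """Lay out few-shot exemplars followed by the question to answer."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


def order_sensitivity(
    model: Callable[[str], str],
    exemplars: Sequence[Exemplar],
    eval_set: Sequence[Example],
) -> Tuple[float, float]:
    """Return (min, max) accuracy across all exemplar orderings.

    The number of orderings grows factorially, so sample orderings
    instead of enumerating them once there are more than a handful.
    """
    accuracies = []
    for ordering in permutations(exemplars):
        correct = sum(model(build_prompt(ordering, q)) == gold for q, gold in eval_set)
        accuracies.append(correct / len(eval_set))
    return min(accuracies), max(accuracies)


def recency_stub(prompt: str) -> str:
    """Hypothetical stand-in for a model: echoes the most recent exemplar's answer."""
    answers = [line[3:] for line in prompt.splitlines() if line.startswith("A: ")]
    return answers[-1]
```

A gap between the two returned numbers means the exemplar set is carrying ordering information the task never asked for; the same loop can be repeated over exemplars of different complexity or authorship to probe the other axes.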

Here's the lateral payoff: both failures are the same shape seen from two ends. Output constraints hurt because the model spends generation capacity on form instead of thought; input exemplars are brittle because they were only ever transmitting form in the first place. Neither is obvious from a benchmark because surface success routinely masks broken substance: models can hit perfect accuracy while their internal representations are fractured and won't survive a perturbation or a distribution shift (see "Can models be smart without organized internal structure?"), and many models that look like they're reasoning about constraints are really just defaulting conservatively, with twelve of fourteen scoring *worse*, by up to 38.5 percentage points, when the constraint is removed (see "Are models actually reasoning about constraints or just defaulting conservatively?").
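
The constraint-removal check can be expressed as a paired comparison: score the same model on items with the constraint present and on matched items with it removed, then report the drop in percentage points. This is a rough sketch of that bookkeeping only; the item format, the `always_hard` stub, and the labels are assumptions made for illustration, not the evaluation protocol of the cited note.

```python
from typing import Callable, Sequence, Tuple

Item = Tuple[str, str]  # (prompt, gold label)


def accuracy(model: Callable[[str], str], items: Sequence[Item]) -> float:
    """Fraction of items where the model's output equals the gold label."""
    return sum(model(prompt) == gold for prompt, gold in items) / len(items)


def constraint_gap_pp(
    model: Callable[[str], str],
    with_constraint: Sequence[Item],
    without_constraint: Sequence[Item],
) -> float:
    """Accuracy drop, in percentage points, when the constraint is removed.

    A large positive value suggests the model was defaulting to one option
    rather than evaluating the constraint.
    """
    return 100.0 * (accuracy(model, with_constraint) - accuracy(model, without_constraint))


def always_hard(prompt: str) -> str:
    """Hypothetical stub that ignores the prompt and always picks the harder option."""
    return "hard"
```

A stub like `always_hard` looks perfect while every gold label happens to be the harder option and collapses once the constraint is lifted, which is the signature the comparison is meant to expose.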

The thing you didn't know you wanted to know: format and exemplars aren't separate prompt-engineering knobs at all. Both are levers on the gap between what a model *displays* and what it *computes* — and the corpus suggests the more rigidly you control the display, on either the input or output end, the more you risk paying for it in the computation you actually wanted.

Sources (6 notes)

Do strict output formats hurt LLM reasoning ability?

Schema-specific format requirements cause measurable reasoning decline across multiple models. Removing schema constraints while keeping loose format type recovers most lost performance, suggesting format compliance and reasoning compete for the model's generation capacity.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do chain-of-thought examples fail across different conditions?

Human-written CoT exemplars degrade performance when reordered (3.3% swings), mismatched to problem complexity, lacking diversity, or written by different annotators (up to 28.2% variance). These four dimensions compound, making manual exemplar curation unreliable across tasks.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
