Why do strong models struggle more with instruction following than mid-tier ones?

This explores a counterintuitive finding: training models to reason harder (the thing that makes them "strong") often makes them worse at actually following the instructions you gave them.

The corpus has a sharp answer, and it is not about model size in the way you'd expect: it is about a trade-off baked into how reasoning is trained. MathIF and related work show that scaling reasoning capability through SFT and RL actively *degrades* instruction adherence, with advanced reasoning models dropping to around 50% compliance during math tasks (see "Why do more capable reasoning models ignore your instructions?"). The mechanism is almost spatial: the longer a model's chain-of-thought runs, the more "contextual distance" piles up between the original instruction and the point where the model finally acts, and that distance dilutes the model's attention to what you originally asked (see "Why do better reasoning models ignore instructions?").

So "strong" here doesn't mean bigger; it means more heavily optimized for reasoning depth. And that reframes your question: the reason mid-tier models can look *better* at instruction-following is that they haven't been pushed into the regime where reasoning crowds out compliance. There's a neat illustration of this in how failure *shapes* differ by model class: small models degrade linearly as you pile on instructions, mid-range models degrade exponentially, but reasoning models hold steady up to ~150 instructions and then fall off a cliff (see "How does instruction density affect model performance?"). Different machinery, different breaking points: strength buys you a high plateau but a steeper fall.
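The three failure shapes described above can be sketched as toy curves. This is only an illustration of what "linear", "exponential", and "threshold" decay look like side by side. The function names, the slope and rate values, and the starting accuracy of 1.0 are all assumptions chosen for readability, not numbers from IFScale; only the ~150-instruction threshold comes from the note.

```python
import math

# Toy accuracy-vs-instruction-count curves. All parameters are hypothetical
# and chosen for illustration; they are not fitted to any benchmark data.

def linear_decay(n, slope=0.001):
    """Accuracy drops by a fixed amount per added instruction (small models)."""
    return max(0.0, 1.0 - slope * n)

def exponential_decay(n, rate=0.005):
    """Accuracy shrinks by a fixed fraction per added instruction (mid-range)."""
    return math.exp(-rate * n)

def threshold_decay(n, threshold=150, rate=0.01):
    """Accuracy holds its plateau up to `threshold`, then falls steeply."""
    if n <= threshold:
        return 1.0
    return math.exp(-rate * (n - threshold))

if __name__ == "__main__":
    print("n  linear  exponential  threshold")
    for n in (10, 100, 150, 200, 500):
        print(n,
              round(linear_decay(n), 3),
              round(exponential_decay(n), 3),
              round(threshold_decay(n), 3))
```

With these made-up parameters the threshold curve stays ahead of the other two until the instruction count passes 150, after which it loses ground quickly, which is the "high plateau, steeper fall" shape in miniature.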

The deeper and more unsettling thread in the corpus is that following an instruction and *understanding* it are separate circuits, and strength doesn't bind them together. Models exhibit a "knowing-doing gap": they articulate the correct principle ~87% of the time but act on it only ~64% of the time, a structural disconnect between the explanation pathway and the execution pathway rather than a knowledge deficit (see "Can language models understand without actually executing correctly?" and "Why do language models fail to act on their own reasoning?"). A more capable reasoner gets *better* at the knowing half while the doing half lags, so the more you scale comprehension, the wider that visible gap can look. There's even a hint that instruction-tuning never taught "obey the instruction" in the first place: models trained on semantically empty or deliberately wrong instructions perform about as well as those trained on correct ones, suggesting what transfers is the *output format*, not the meaning of the instruction (see "Does instruction tuning teach task understanding or output format?").
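One way to make the knowing-doing gap concrete is to score each trial twice, once for whether the stated principle was right and once for whether the action matched it, then compare the two rates. The sketch below assumes that per-trial pairing. The `Trial` type, the `knowing_doing_gap` function, and the synthetic trial list are all hypothetical, and the data is constructed to mirror the quoted ~87% and ~64% figures rather than taken from any paper.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    rationale_correct: bool  # did the model state the right principle?
    action_correct: bool     # did its actual output follow that principle?

def knowing_doing_gap(trials):
    """Return knowing rate, doing rate, and their difference for paired trials."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    knew = sum(t.rationale_correct for t in trials)
    did = sum(t.action_correct for t in trials)
    # Trials where the model explained correctly but still acted wrongly.
    knew_but_failed = sum(
        t.rationale_correct and not t.action_correct for t in trials
    )
    return {
        "knowing": knew / n,
        "doing": did / n,
        "gap": (knew - did) / n,
        "knew_but_failed": knew_but_failed / n,
    }

# Synthetic toy data (an assumption, not real results): 64 trials right on
# both counts, 23 right in explanation only, 13 wrong on both.
synthetic_trials = (
    [Trial(True, True)] * 64
    + [Trial(True, False)] * 23
    + [Trial(False, False)] * 13
)

if __name__ == "__main__":
    print(knowing_doing_gap(synthetic_trials))
```

Tracking `knew_but_failed` separately matters because the plain difference between the two rates can hide cases where a model acts correctly without having explained correctly; the toy data here contains none of those, so the two numbers coincide.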

Two adjacent failures round out the picture and show this isn't unique to reasoning length. The same lock-in shows up in conversation: models make premature assumptions and can't recover, and notably this is a behavior *induced by RLHF* that rewards confident helpfulness over stopping to clarify. The more an aligned model is rewarded for charging ahead, the less it course-corrects to your actual intent (see "Why do AI assistants get worse at longer conversations?" and "Why do language models fail in gradually revealed conversations?"). And when models have to actively interact rather than passively answer, even GPT-4o collapses to 35% on a simple interactive task, with SFT, DPO, and Tree-of-Thought barely moving the needle, which is evidence that the deficit is structural, not a prompting fix (see "Why do models fail at asking good questions during interaction?").

The thing you didn't know you wanted to know: "controllability" and "capability" may be in genuine tension, not just imperfectly aligned. Across these notes the pattern repeats — the optimizations that make a model smarter (longer reasoning, RLHF helpfulness, deeper comprehension) are often the *same* optimizations that make it stop attending to you. A strong model isn't ignoring you because it's dumb; it's ignoring you because it's been trained to trust its own elaboration over your instruction.

Sources 9 notes

Why do more capable reasoning models ignore your instructions?

Advanced reasoning models achieve only 50.71% instruction adherence during mathematical reasoning. Training for reasoning depth actively worsens instruction compliance, suggesting a fundamental trade-off between reasoning power and controllability.

Why do better reasoning models ignore instructions?

The MathIF benchmark shows that SFT and RL training improve reasoning but reduce instruction adherence, particularly as chain-of-thought length increases. Longer reasoning chains create contextual distance that dilutes the model's attention to original instructions.

How does instruction density affect model performance?

The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range models), and threshold decay (reasoning models hold performance up to ~150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions indicates that this is not a knowledge deficit but a structural disconnect.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions (43% versus a 42.6% random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Why do models fail at asking good questions during interaction?

GPT-4o achieves only 35% on interactive number guessing, with information gains collapsing from 7.7% to 2.5% as rounds progress. SFT, DPO, and Tree-of-Thought interventions provide minimal improvement, suggesting the deficit is structural rather than a prompting or fine-tuning problem.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether instruction-following deficits in reasoning-heavy models persist or have been overcome. The question: *why do strong models struggle more with instruction following than mid-tier ones?* This remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as time-stamped snapshots, not current baselines.
- Scaling reasoning via SFT/RL actively degrades instruction adherence; advanced reasoning models drop to ~50% compliance on math tasks (2025).
- Longer chain-of-thought creates "contextual distance" that dilutes attention to original instructions (2025).
- Models exhibit a "knowing-doing gap": articulate correct principles ~87% of the time but execute only ~64% of the time — a structural disconnect between explanation and execution pathways, not a knowledge deficit (2025).
- Instruction-tuning may teach output *format*, not meaning; models trained on semantically empty or wrong instructions perform as well as those trained on correct ones (2025).
- RLHF rewards confident helpfulness over clarification, inducing premature assumptions; even GPT-4o collapses to 35% on simple interactive tasks (2026).

Anchor papers (verify; mind their dates):
- arXiv:2305.11383 (2023-05): Do Models Really Learn to Follow Instructions?
- arXiv:2505.14810 (2025-05): Scaling Reasoning, Losing Control
- arXiv:2507.10624 (2025-07): Comprehension Without Competence
- arXiv:2602.07338 (2026-02): Intent Mismatch Causes LLMs to Get Lost

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether newer model releases, post-training methods (DPO variants, novel RL objectives), architectural changes (sparse routing, modular reasoning), or evaluation harnesses have since RELAXED or OVERTURNED the deficit. Separate the durable question (does capability-controllability tension exist?) from perishable limitations (does ~50% compliance still hold for latest reasoning models?). Cite what resolved it; flag where constraints remain.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months — papers showing instruction-following *improves* with scale, or reasoning + compliance co-scale, or architectural tricks that sever the trade-off.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., *have post-training interventions (auxiliary losses, instruction-aware RL) finally decoupled reasoning depth from listening deficit?* or *do agentic architectures (critic loops, clarification agents) move the interaction paradigm beyond passive compliance?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
