LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Why do better reasoning models ignore instructions?

As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?

Note · 2026-02-22 · sourced from Reasoning Critiques

As models get better at reasoning, they get worse at doing what you ask. This is not a tuning oversight — it appears to be a structural consequence of how reasoning capability is developed.

The MathIF benchmark evaluates instruction-following in mathematical reasoning tasks — not just whether the answer is correct, but whether the model complied with specified constraints while solving. The central finding: compliance drops as reasoning capability increases, and it drops further as the chain of thought grows longer.

The mechanism: longer reasoning chains create a contextual gap between the original instruction and the final answer. The instruction is specified at the beginning of the context. The answer emerges at the end of a long chain. As the chain grows, the model's attention to the original instruction diminishes — the reasoning process itself dilutes the directive.
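
To make the dilution intuition concrete, here is a toy calculation (my own simplification, not the benchmark's analysis): if attention mass were spread roughly uniformly across the context, the fraction landing on instruction tokens falls hyperbolically with chain length.

```python
# Toy model of attention dilution. This uniform-attention assumption is an
# illustration only; real attention is far from uniform, but the direction
# of the effect is the same: more CoT tokens, less mass on the instruction.

def instruction_attention_share(instruction_tokens: int, cot_tokens: int) -> float:
    """Fraction of a uniform attention budget that falls on the instruction."""
    return instruction_tokens / (instruction_tokens + cot_tokens)

for cot_len in (0, 256, 1024, 4096, 16384):
    share = instruction_attention_share(instruction_tokens=64, cot_tokens=cot_len)
    print(f"CoT length {cot_len:>6}: instruction share = {share:.3f}")
```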

The trade-off is explicit: enforcing brevity by limiting CoT length recovers instruction-following performance, but at the cost of reasoning depth and accuracy.
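
One way to operationalize the brevity intervention is to cap the reasoning budget and then restate the constraints immediately before the final answer. A minimal sketch, where `llm` is a hypothetical completion function `(prompt, max_tokens) -> str` standing in for whatever client is in use, and the prompt wording is illustrative rather than from the paper:

```python
def answer_with_cot_budget(llm, instruction: str, budget_tokens: int) -> str:
    """Cap chain-of-thought length, then force a constraint-aware final answer.

    `llm` is a hypothetical (prompt, max_tokens) -> str completion function.
    """
    # Stage 1: reason under a hard token budget.
    reasoning = llm(f"{instruction}\nThink step by step.", max_tokens=budget_tokens)
    # Stage 2: re-surface the full instruction next to the answer, so the
    # constraints sit adjacent to generation rather than a long chain away.
    final_prompt = (
        f"{instruction}\n\nDraft reasoning (possibly truncated):\n{reasoning}\n\n"
        "Now give only the final answer, satisfying every constraint above."
    )
    return llm(final_prompt, max_tokens=256)
```

Restating the instruction after the chain attacks the contextual-gap mechanism directly: the constraints are once again close to the answer tokens, rather than separated from them by the entire reasoning trace.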

This creates an alignment problem distinct from those studied in the sycophancy and values-misalignment literature. The model is not disagreeing with the instructions; it is forgetting them while reasoning. The longer it thinks, the more thoroughly it has moved on.

The practical implication: for instruction-critical applications (task management, decision support, customer-facing agents), high-capability reasoning models may perform worse on the compliance dimension that matters most. The "upgrade to a stronger model" instinct may backfire.

Constraint attention as measurable mechanism (When Thinking Fails): A 15-model evaluation across IFEval (simple rule-verifiable constraints) and ComplexBench (compositional constraints) confirms the degradation across model families and sizes. The proposed mechanism is "constraint attention" — attention scores directed toward constraint tokens in the instruction. When CoT prompting is applied, constraint attention measurably decreases: the reasoning process diverts the model's attention from instruction-relevant tokens toward content planning.
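
The metric itself is easy to sketch. Assuming attention weights averaged over layers and heads (the aggregation choice is my assumption, not necessarily the paper's), constraint attention is the mass that answer-side tokens place on the constraint tokens:

```python
import numpy as np

def constraint_attention(attn: np.ndarray,
                         constraint_idx: list[int],
                         answer_start: int) -> float:
    """Mean attention mass that generated tokens direct at constraint tokens.

    attn: [seq_len, seq_len] attention weights, rows attend to columns,
          averaged over layers and heads (aggregation is an assumption here).
    constraint_idx: positions of constraint tokens within the instruction.
    answer_start: index where generated tokens begin.
    """
    answer_rows = attn[answer_start:]                    # generated tokens only
    mass = answer_rows[:, constraint_idx].sum(axis=1)    # mass on constraints, per token
    return float(mass.mean())
```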

CoT helps in two cases: (1) satisfying formatting/structural requirements, and (2) enforcing lexical constraints that override default tendencies. It hurts in two cases: (1) over-focusing on high-level content while neglecting simple constraints (word counts, case requirements), and (2) introducing well-intentioned content that unintentionally violates constraints.
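
The constraints CoT tends to neglect are precisely the rule-verifiable kind that can be checked programmatically. Two illustrative IFEval-style checkers (function names are mine, not from the benchmark):

```python
import re

def within_word_count(text: str, max_words: int) -> bool:
    """Word-count constraint: the sort of simple rule CoT tends to violate."""
    return len(re.findall(r"\b\w+\b", text)) <= max_words

def respects_lowercase(text: str) -> bool:
    """Case constraint: the answer must contain no uppercase characters."""
    return not any(ch.isupper() for ch in text)
```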

Four mitigation strategies were evaluated: in-context learning (corrective examples), self-reflection (model evaluates its own response), self-selective reasoning (model decides when to reason), and classifier-selective reasoning (trained classifier decides). Classifier-selective reasoning consistently delivered the best performance across both benchmarks — the model needs an external signal for when to reason, because its own judgment about when reasoning helps is unreliable.
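
A minimal sketch of the winning strategy, assuming a trained `classifier` that maps an instruction to the probability that reasoning will help; both callables are hypothetical stand-ins:

```python
def respond(llm, classifier, instruction: str, threshold: float = 0.5) -> str:
    """Classifier-selective reasoning: an external signal gates CoT.

    `classifier(instruction)` returns P(reasoning improves compliance);
    `llm(prompt)` returns a completion. Both are hypothetical stand-ins.
    """
    if classifier(instruction) >= threshold:
        prompt = f"{instruction}\nThink step by step, then answer."
    else:
        prompt = f"{instruction}\nAnswer directly, without explanation."
    return llm(prompt)
```

The design point is that the gate sits outside the model: per the evaluation, the model's own estimate of when reasoning helps is exactly the unreliable part.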

LogicIFEval (2025) extends the deficit to logic-rich instructions specifically. When instructions contain rich logic structures — conditionals, nesting, recursion, function calls — most popular LLMs correctly follow fewer than 60% of instructions. Open-source models lag significantly behind frontier models, and accuracy drops further as logical complexity increases. An important asymmetry: incorporating explicit thinking (extended reasoning) before responding enhances instruction-following for large LLMs but not for smaller ones, suggesting the thinking-instruction interaction depends on model scale. This adds a second dimension to the instruction-following deficit: not just that reasoning training degrades compliance, but that logic-rich instructions are inherently harder to follow even without reasoning training. Source: arXiv/Evaluations.
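
For flavor, a constructed example (not drawn from LogicIFEval) of the kind of conditional instruction the benchmark targets, together with a programmatic compliance check:

```python
def follows_conditional_format(answer: int, rendered: str) -> bool:
    """Check a constructed logic-rich instruction: 'If the answer is prime,
    wrap it in square brackets; otherwise print it twice, comma-separated.'"""
    is_prime = answer > 1 and all(answer % d for d in range(2, int(answer**0.5) + 1))
    expected = f"[{answer}]" if is_prime else f"{answer}, {answer}"
    return rendered.strip() == expected
```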


Source: Reasoning Critiques
