When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

Paper · arXiv 2505.11423 · Published May 16, 2025

Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks, IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.

We evaluate 15 language models of varying sizes and training paradigms, including general-purpose models (e.g., Llama, Mixtral) and reasoning-tuned models (e.g., Claude 3.7, DeepSeek-R1), on two complementary instruction-following benchmarks. IFEval [Zhou et al., 2023] consists of prompts with simple, independently verifiable constraints (e.g., “write at least 400 words” or “mention AI three times”). In contrast, ComplexBench [Wen et al., 2024] includes instructions formed through compositional logic, combining multiple dependent constraints via operations like chaining, selection, and nesting.

We find that reasoning helps in two common scenarios: (1) satisfying formatting or structural requirements, and (2) enforcing lexical constraints that override default tendencies. However, it often hurts performance by: (1) over-focusing on high-level content and neglecting simple constraints, or (2) introducing redundant or well-intentioned content that unintentionally violates constraints.

We propose a quantitative measure: constraint attention, which is based on attention scores directed toward constraint tokens within instructions. Visualizations of attention patterns during response generation consistently demonstrate reduced constraint awareness when CoT prompting is employed, an effect observed across different datasets, models, and layers. This shift in attention might explain, at least partially, why reasoning can diminish instruction adherence.
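
To make this measurement concrete, below is a minimal sketch of a constraint-attention-style computation, assuming a Hugging Face causal LM that exposes attention weights. The model name, the character span marking the constraint, and the layer/head averaging are illustrative choices, not the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM that can return attention weights works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def constraint_attention(prompt: str, response: str, constraint_span: tuple) -> float:
    """Mean attention mass that response tokens place on constraint tokens.

    constraint_span is a (start, end) character range inside `prompt` covering
    the constraint text, e.g. the span of "write at least 400 words".
    """
    enc = tok(prompt + response, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        out = model(**enc, output_attentions=True)

    start, end = constraint_span
    constraint_idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s]
    response_idx = [i for i, (s, e) in enumerate(offsets) if s >= len(prompt)]

    # Shape (layers, heads, seq, seq) after dropping the batch dim; average over layers and heads.
    attn = torch.stack(out.attentions).squeeze(1).mean(dim=(0, 1))
    # For each response token, sum its attention to constraint tokens, then average.
    return attn[response_idx][:, constraint_idx].sum(dim=-1).mean().item()
```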

Following these insights, we propose and evaluate four methods to mitigate the adverse impacts of reasoning on instruction-following accuracy: (1) In-context learning, where we identify and correct typical reasoning-induced failures, incorporating these examples into prompts; (2) Self-reflection, prompting models to evaluate and adjust their reasoning processes and candidate responses; (3) Self-selective reasoning, allowing models to autonomously decide when reasoning is beneficial; and (4) Classifier-selective reasoning, where a trained classifier determines when reasoning is necessary. Among these methods, we find that classifier-selective reasoning strategies offer substantial gains across benchmarks, while self-reflection is particularly helpful for larger models and simple instructions. We thoroughly discuss the comparative strengths and limitations of these mitigation strategies in Section 5.
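
As one concrete example, here is a minimal sketch of the self-reflection strategy (2), assuming a `generate` callable that wraps whatever model is being evaluated; the wording of the reflection prompt is an assumption, not the paper's exact prompt.

```python
from typing import Callable

def self_reflect(instruction: str, generate: Callable[[str], str]) -> str:
    """Draft a response, then ask the model to check it against the constraints and revise."""
    draft = generate(instruction)
    reflection_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Candidate response:\n{draft}\n\n"
        "Check whether the candidate response satisfies every explicit constraint "
        "in the instruction. Then output a corrected response that satisfies all "
        "constraints, and nothing else."
    )
    return generate(reflection_prompt)
```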

Reasoning Helps ✓

  1. Formatting and Structural Adherence: Reasoning improves compliance with structural constraints, such as producing valid JSON, wrapping the output in double quotes, or following markdown syntax.

  2. Lexical and Keyword Precision: Reasoning enhances adherence to lexical requirements, including inserting rare characters (e.g., the letter q six times), omitting final punctuation, or using exactly 15 capitalized words.

Reasoning Hurts ✗

  1. Over-Focusing on High-Level Content and Neglecting Simple Constraints: When multiple constraints are present, reasoning often emphasizes content planning at the expense of simpler mechanical constraints. Common issues include exceeding word count limits, failing to repeat prompts exactly, using capital letters in lowercase-only tasks, or appending unnecessary content after required phrases.

  2. Introducing Redundant or Well-Intentioned Content: Reasoning often adds extra explanations or helpful additions that unintentionally violate constraints.

Based on our findings, we propose the following decision pipeline: first, estimate the complexity of the instruction, either through simple heuristics or a trained classifier. For simpler tasks (e.g., IFEval), we recommend Self-Reflection or Classifier-Selective Reasoning; for more complex or compositional tasks (e.g., ComplexBench), Self-Selective Reasoning or Classifier-Selective Reasoning is more effective. Overall, Classifier-Selective Reasoning consistently delivers the best performance across both benchmarks, albeit at the cost of model-specific training.
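
A minimal sketch of this routing idea is shown below, assuming a `generate` callable and using a crude keyword heuristic in place of a trained complexity classifier; the marker list and threshold are illustrative.

```python
from typing import Callable

CONSTRAINT_MARKERS = [
    "at least", "at most", "exactly", "json", "lowercase",
    "uppercase", "words", "format", "first", "then",
]

def estimate_complexity(instruction: str) -> int:
    """Crude proxy for instruction complexity: count likely constraint markers."""
    text = instruction.lower()
    return sum(text.count(m) for m in CONSTRAINT_MARKERS)

def route(instruction: str, generate: Callable[[str], str], threshold: int = 3) -> str:
    """Apply CoT prompting only when the instruction looks complex enough to benefit."""
    if estimate_complexity(instruction) >= threshold:
        prompt = instruction + "\n\nThink step by step, then give only the final answer."
    else:
        prompt = instruction  # simple, rule-verifiable constraints: answer directly
    return generate(prompt)
```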

We use two benchmark datasets, IFEval and ComplexBench, to comprehensively evaluate the instruction-following capabilities of language models:

IFEval is a synthetic dataset consisting of 541 prompts, each associated with one to three verifiable constraints drawn from 25 types (e.g., word count, formatting, keyword usage). We adopt instruction-level loose accuracy, which allows minor formatting deviations (e.g., markdown wrappers) to avoid penalizing responses for stylistic differences, thereby better reflecting practical robustness.
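
To illustrate what "loose" checking means here, the sketch below runs a verifier over a few relaxed variants of the response and counts the constraint as satisfied if any variant passes; the specific variants and the example verifier are illustrative, not IFEval's exact rules.

```python
from typing import Callable

def loose_pass(response: str, check: Callable[[str], bool]) -> bool:
    """Return True if the response, or a relaxed variant of it, passes the verifier."""
    lines = response.splitlines()
    variants = [
        response,
        response.replace("*", ""),   # strip markdown emphasis/wrappers
        "\n".join(lines[1:]),        # drop a leading preamble line
        "\n".join(lines[:-1]),       # drop a trailing sign-off line
    ]
    return any(check(v) for v in variants if v.strip())

# Example verifier for the constraint "write at least 400 words".
def at_least_400_words(text: str) -> bool:
    return len(text.split()) >= 400
```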

ComplexBench is a manually curated dataset designed to assess models on complex compositional instructions formed through four operations: And, Chain, Selection, and Nested. It contains 1,150 instructions and over 5,300 scoring questions, covering 4 constraint types across 19 dimensions (e.g., lexical, semantic, formatting). Evaluation combines rule-based and LLM-based assessments. For our experiments, we translated all scoring rules into English and manually verified them to construct a fully English-compatible version.
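
To illustrate the four composition operations, here is a hypothetical sketch of how such a compositional instruction could be represented as a constraint tree; this is not ComplexBench's actual schema, and the example instruction and scoring questions are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Constraint:
    dimension: str    # e.g., "lexical", "semantic", "formatting"
    question: str     # natural-language scoring question

@dataclass
class Composite:
    op: Literal["And", "Chain", "Selection", "Nested"]
    children: list = field(default_factory=list)  # Constraint or Composite nodes

# "Write a product review: first list pros, then cons (Chain), and keep it under 150 words (And)."
example = Composite(op="And", children=[
    Composite(op="Chain", children=[
        Constraint("semantic", "Does the review list pros before anything else?"),
        Constraint("semantic", "Does it list cons after the pros?"),
    ]),
    Constraint("lexical", "Is the review under 150 words?"),
])
```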