Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models

Paper · arXiv 2505.14810 · Published May 20, 2025

Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models.

Although LRMs excel at mathematical reasoning, they often fail to follow even simple instructions. This raises an important question: as reasoning scales, do models become more intelligent yet less controllable? Unfortunately, existing instruction-following benchmarks are ill-suited for answering this question, as most are designed for general-purpose scenarios rather than for reasoning-intensive tasks such as mathematics.

Surprisingly, most models fail to reliably follow instructions, and performance does not consistently improve with larger model sizes. Even the best-performing model, Qwen3-14B, achieves only 50.71% accuracy on strict instruction-following. Furthermore, performance deteriorates with increasing task difficulty and constraint complexity, revealing substantial headroom for improvement.
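To make the strict metric concrete, the sketch below computes an "all constraints satisfied" accuracy over a toy set of responses. The constraint checkers, sample data, and function names are illustrative assumptions for exposition, not the paper's released evaluation code.

```python
# Minimal sketch of a strict ("every attached constraint satisfied")
# instruction-following metric. Checkers and data below are hypothetical.
from typing import Callable, Dict, List

ConstraintChecker = Callable[[str], bool]  # predicate over the model response

def ends_with_boxed_answer(response: str) -> bool:
    """Hypothetical checker: response must contain and end with a \\boxed{...} answer."""
    return "\\boxed{" in response and response.rstrip().endswith("}")

def under_n_words(n: int) -> ConstraintChecker:
    """Hypothetical checker factory: response must contain at most n words."""
    return lambda response: len(response.split()) <= n

def strict_accuracy(samples: List[Dict]) -> float:
    """Fraction of samples whose response satisfies *all* of its constraints."""
    if not samples:
        return 0.0
    satisfied = sum(
        all(check(s["response"]) for check in s["constraints"]) for s in samples
    )
    return satisfied / len(samples)

if __name__ == "__main__":
    toy_samples = [
        {"response": "The answer is \\boxed{42}",
         "constraints": [ends_with_boxed_answer, under_n_words(50)]},
        {"response": "It is 42, but no box was given.",
         "constraints": [ends_with_boxed_answer, under_n_words(50)]},
    ]
    print(f"strict accuracy: {strict_accuracy(toy_samples):.2%}")  # 50.00%
```

A looser variant would credit each satisfied constraint individually rather than requiring all of them at once, which is why strict scores sit well below per-constraint scores.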

Our deeper analysis further uncovers a mutual interference between instruction-following and reasoning capabilities, observed at both training and inference stages. Common reasoning-oriented training strategies (e.g., SFT and RL) enhance reasoning ability but degrade instruction adherence. This degradation becomes more pronounced as the CoT length increases, likely because longer reasoning paths widen the contextual gap between the original instruction and the final answer, making it harder for the model to retain and execute directives. Conversely, enforcing brevity by limiting CoT length improves instruction-following performance, but at the cost of reasoning depth and accuracy.
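As a rough illustration of the length-limiting intervention described above, the sketch below caps the number of reasoning tokens and then prompts the model to produce its final answer, shortening the gap between the instruction and the answer. The model name, prompt template, and wrap-up cue are placeholder assumptions, not the paper's exact procedure.

```python
# Sketch of a "cap the CoT, then force a wrap-up" intervention (assumed setup).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model, not the paper's
COT_BUDGET = 256                          # hard cap on reasoning tokens

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

question = "What is 17 * 24? Answer in exactly one sentence."  # toy constrained query
prompt = f"Question: {question}\nReasoning:"

# Stage 1: generate at most COT_BUDGET reasoning tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
cot_ids = model.generate(**inputs, max_new_tokens=COT_BUDGET, do_sample=False)
cot_text = tokenizer.decode(cot_ids[0], skip_special_tokens=True)

# Stage 2: append a wrap-up cue so the final answer is produced with the
# instruction still close by in the context window.
wrap_up = cot_text + "\nFinal answer (follow the instruction exactly):"
answer_inputs = tokenizer(wrap_up, return_tensors="pt").to(model.device)
answer_ids = model.generate(**answer_inputs, max_new_tokens=64, do_sample=False)

new_tokens = answer_ids[0][answer_inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Under this kind of budget, instruction adherence tends to improve because the directive is less diluted by a long reasoning trace, but hard problems that need longer chains of thought lose accuracy, which is the trade-off the analysis highlights.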

These observations reveal a consistent pattern: improving reasoning capability often comes at the cost of instruction adherence, suggesting an inherent trade-off between the two abilities. This trade-off highlights a crucial challenge in LRM development: training for stronger reasoning alone may undermine alignment, and future methods must account for this tension to build models that are both capable and controllable.