How Many Instructions Can LLMs Follow at Once?

Paper · arXiv 2507.11538 · Published July 15, 2025
Tags: Flaws · Prompts · Prompting · Inference-time scaling

Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities have not yet been characterized, as existing benchmarks only evaluate models on tasks with a single instruction or a few instructions. We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task, to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the maximum density of 500 instructions. Our analysis reveals three distinct performance degradation patterns that correlate with model size and reasoning capability, a bias towards earlier instructions, and distinct categories of instruction-following errors. Our insights can help inform the design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs.

As large language models (LLMs) are increasingly being deployed in production systems requiring precise specification adherence, understanding their limitations is essential for reliable operation (Ouyang et al., 2022; Sanh et al., 2022; Wei et al., 2022; Song et al., 2025). From content generation systems that must adhere to style guidelines and factual requirements, to automated workflows that integrate dozens of business rules and compliance standards, to agentic systems requiring robust memory layers and tool usage, modern applications demand models that can execute complex tasks while satisfying multiple simultaneous instructions (de Langis et al., 2024; Kulkarni, 2025; Xu et al., 2025). Real-world failures, such as chatbots inventing non-existent policies or providing misleading advice, highlight the operational and legal risks of imperfect instruction following.

This challenge has become pressing as recent advances dramatically expand what we can feasibly ask models to handle. Context windows have grown from thousands to millions of tokens (Team, 2024), and reasoning capabilities over extended contexts have improved (OpenAI, 2024; DeepSeek-AI, 2025). These developments theoretically enable single-call requests with many simultaneous instructions, rather than the standard paradigm requiring careful decomposition or retrieval (Chung et al., 2025; Chan et al., 2025; Maamari et al., 2024). To confidently move towards increased instruction density, we must first answer: how many instructions can models actually handle before performance meaningfully degrades?
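The keyword-inclusion setup described above can be made concrete with a small sketch. The function and prompt wording below are illustrative assumptions, not the paper's exact implementation: the benchmark asks for a business report that must contain each of N required terms verbatim, and accuracy is the fraction of terms that actually appear.

```python
import re

def build_prompt(keywords: list[str]) -> str:
    """Assemble a single report-writing prompt listing one keyword-inclusion
    instruction per required term (hypothetical wording, not IFScale's)."""
    listing = "\n".join(f"{i + 1}. Include the word \"{kw}\"."
                        for i, kw in enumerate(keywords))
    return ("Write a professional business report. "
            "Follow every instruction below:\n" + listing)

def score_keyword_inclusion(report: str, keywords: list[str]) -> float:
    """Fraction of required keywords that appear verbatim (case-insensitive,
    whole-word) in the generated report."""
    text = report.lower()
    hits = 0
    for kw in keywords:
        # Whole-word match so a variant like "strategic" does not
        # count for the required term "strategy".
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", text):
            hits += 1
    return hits / len(keywords) if keywords else 0.0
```

Sweeping the length of `keywords` from 10 up to 500 and plotting accuracy against density reproduces the kind of degradation curve the paper studies.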

Threshold decay: Performance remains stable until a threshold, then transitions to a different (steeper) degradation slope and displays increased variance. The top two models (gemini-2.5-pro, o3) demonstrate this clearly, maintaining near-perfect performance through 150 or more instructions before declining. Notably, these are both reasoning models, indicating that deliberative processing architectures provide robust instruction tracking up to critical thresholds, beyond which systematic degradation occurs.

Primacy effects display an interesting pattern across all models: they start low at minimal instruction densities, indicating almost no bias towards earlier instructions, peak around 150-200 instructions, then level off or decrease at extreme densities. This mid-range peak suggests that models exhibit the most bias as they begin to struggle under cognitive load at moderate densities. However, at extreme densities (300+ instructions), primacy effects uniformly diminish across all models, with most ratios converging toward 1.0-1.5 (see Appendix A and Fig. 3). This convergence indicates a shift from selective instruction satisfaction to more uniform failure patterns when models are completely overwhelmed, suggesting an instruction saturation point. Therefore, while packing more important instructions towards the beginning of a prompt may help, it becomes a less effective strategy once extreme densities are reached.
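A primacy ratio of the kind discussed above can be sketched as the satisfaction rate over the first half of the instruction list divided by the rate over the second half (an assumed definition for illustration; the paper's exact ratio may differ). Values above 1.0 indicate bias towards earlier instructions.

```python
def primacy_ratio(satisfied: list[bool]) -> float:
    """Ratio of the instruction-satisfaction rate in the first half of the
    prompt to the rate in the second half. `satisfied[i]` is True if the
    i-th instruction (in prompt order) was followed."""
    mid = len(satisfied) // 2
    first, second = satisfied[:mid], satisfied[mid:]
    first_rate = sum(first) / len(first)
    second_rate = sum(second) / len(second)
    # If nothing in the second half was satisfied, the ratio is unbounded.
    return first_rate / second_rate if second_rate else float("inf")
```

Under this definition, the reported convergence toward 1.0-1.5 at extreme densities means early and late instructions fail at nearly equal rates.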

We evaluate two types of instruction violations:

Omission errors: Complete failure to include required terms in the generated text. For example, when instructed to include "accountability" but the term appears nowhere in the output.

Modification errors: Inclusion of morphological variants rather than exact required terms. For example, including "accountable" or "accounts" when "accountability" was required, or "strategic" when "strategy" was required.
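The two violation types can be distinguished mechanically. The sketch below uses a crude shared-prefix stem check to spot morphological variants; this is an assumed heuristic for illustration, not the paper's actual matcher.

```python
import re

def classify_violation(report: str, required: str, stem_len: int = 6) -> str:
    """Classify how a required term fared in the report: 'satisfied' if it
    appears exactly, 'modification' if only a morphological variant appears
    (approximated by a shared prefix of `stem_len` letters), else 'omission'."""
    words = re.findall(r"[a-z]+", report.lower())
    req = required.lower()
    if req in words:
        return "satisfied"
    prefix = req[:stem_len]
    if any(w.startswith(prefix) for w in words):
        return "modification"   # e.g. "accountable" when "accountability" was required
    return "omission"           # the term appears nowhere, in any form
```

For instance, a report containing "accountable" but not "accountability" is flagged as a modification error, matching the example above.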