How does instruction density affect model performance?
As language models are asked to track more simultaneous instructions, does their ability to follow them degrade predictably? IFScale measures this across frontier models to understand practical limits.
Production LLM systems routinely require adherence to dozens or hundreds of simultaneous instructions — style guidelines, business rules, compliance standards, tool usage protocols. IFScale measures how performance degrades as instruction density increases, scaling up to 500 keyword-inclusion instructions attached to a business report writing task.
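A minimal sketch of how such a density sweep could be set up, assuming a keyword-pool/prompt-template design; the keyword list, prompt wording, and density levels below are illustrative, not IFScale's actual materials:

```python
import random

# Illustrative keyword pool; IFScale draws on a much larger set (up to 500 instructions).
KEYWORDS = ["accountability", "liquidity", "stakeholder", "compliance", "margin"]

def build_prompt(keywords: list[str]) -> str:
    """Assemble a report-writing prompt with one keyword-inclusion instruction per keyword."""
    rules = "\n".join(
        f'{i + 1}. Include the exact word "{kw}" somewhere in the report.'
        for i, kw in enumerate(keywords)
    )
    return "Write a professional business report.\nFollow every instruction below:\n" + rules

def density_sweep(pool: list[str], densities=(10, 50, 100, 250, 500), seed=0):
    """Yield (density, prompt) pairs, sampling a keyword subset at each density level."""
    rng = random.Random(seed)
    for n in densities:
        subset = rng.sample(pool, k=min(n, len(pool)))
        yield n, build_prompt(subset)
```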
Key findings across 20 SOTA models from 7 providers:
Three degradation patterns correlate with model size and reasoning capability (a curve-fitting sketch follows the list):
- Linear decay — steady degradation from the start (smaller models)
- Exponential decay — accelerating degradation as density increases (mid-range models)
- Threshold decay — near-perfect performance maintained up to a breaking point, then steep decline (reasoning models such as gemini-2.5-pro and o3 hold through ~150 instructions)
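One way to tell the three patterns apart empirically is to fit candidate decay curves to accuracy-versus-density measurements and keep the best-fitting family. The functional forms and the sample data below are illustrative assumptions, not the paper's fitting procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def linear_decay(n, a, b):
    return a - b * n                          # steady decline from the start

def exponential_decay(n, a, k):
    return a * np.exp(-k * n)                 # accelerating decline as density grows

def threshold_decay(n, a, n0, k):
    return a / (1.0 + np.exp(k * (n - n0)))   # near-flat until ~n0, then a steep drop

# Hypothetical accuracy measurements for one model at increasing instruction counts.
density = np.array([10, 50, 100, 150, 200, 300, 400, 500], dtype=float)
accuracy = np.array([0.99, 0.98, 0.97, 0.95, 0.85, 0.74, 0.70, 0.68])

fits = {}
for name, fn, p0 in [
    ("linear", linear_decay, (1.0, 0.001)),
    ("exponential", exponential_decay, (1.0, 0.002)),
    ("threshold", threshold_decay, (1.0, 200.0, 0.02)),
]:
    params, _ = curve_fit(fn, density, accuracy, p0=p0, maxfev=10_000)
    residuals = accuracy - fn(density, *params)
    fits[name] = float(np.sum(residuals ** 2))

print(min(fits, key=fits.get), fits)  # decay family with the lowest squared error
```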
Primacy effects follow a non-obvious pattern: minimal bias at low density, peak at 150-200 instructions (where models begin to struggle), then converge toward 1.0 at extreme density (300+). The convergence indicates a shift from selective instruction satisfaction to uniform failure — an "instruction saturation point" where the model is completely overwhelmed.
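A simple way to quantify the primacy effect is the ratio of accuracy on the first half of the instruction list to accuracy on the second half; the first-half/second-half split below is an assumed definition for illustration and may differ from IFScale's exact metric:

```python
def primacy_ratio(followed: list[bool]) -> float:
    """Accuracy on the earlier half of the instructions divided by accuracy on the later half.

    `followed[i]` is True if the i-th instruction (in prompt order) was satisfied.
    Values above 1.0 indicate a bias toward earlier instructions; a value near 1.0
    at extreme density can reflect uniform success or uniform failure (saturation).
    """
    mid = len(followed) // 2
    first, second = followed[:mid], followed[mid:]
    first_acc = sum(first) / len(first)
    second_acc = sum(second) / len(second)
    return first_acc / second_acc if second_acc > 0 else float("inf")
```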
Errors fall into two types: omission errors (a required term is missing entirely) and modification errors (a morphological variant appears instead, such as "accountable" when "accountability" was required). The distinction has practical implications for prompt design — models may recognize the concept but fail at exact specification.
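The two error types can be separated mechanically with a stem check: an exact token match counts as satisfied, a shared stem without an exact match counts as a modification error, and no match at all counts as an omission. The stemming-based scorer below is an assumption for illustration, not IFScale's actual grading code:

```python
import re
from nltk.stem import PorterStemmer  # rule-based stemmer; requires `pip install nltk`

stemmer = PorterStemmer()

def classify(required: str, report: str) -> str:
    """Classify one keyword-inclusion instruction as satisfied, modification, or omission."""
    tokens = re.findall(r"[a-z']+", report.lower())
    if required.lower() in tokens:
        return "satisfied"                     # exact required term is present
    required_stem = stemmer.stem(required.lower())
    if any(stemmer.stem(t) == required_stem for t in tokens):
        return "modification"                  # morphological variant only, e.g. "accountable"
    return "omission"                          # term and its variants are absent entirely

# "accountability" was required but the report only uses "accountable" -> modification error.
print(classify("accountability", "Managers must remain accountable to shareholders."))
```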
Even the best frontier models achieve only 68% accuracy at maximum density. Deliberative processing architectures (reasoning models) provide robust tracking up to critical thresholds, extending the useful range significantly but not eliminating the ceiling.
Source: Flaws
Related concepts in this collection
- Why do better reasoning models ignore instructions?
  As models develop stronger reasoning abilities through training, they appear to become worse at following specified constraints. Is this an unavoidable trade-off, and what causes it?
  Relation: the training-time trade-off (reasoning scales while instruction-following degrades); IFScale quantifies the instruction-following dimension at inference time.
- Does reasoning ability actually degrade with longer inputs?
  Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
  Relation: instruction density degradation may partly be an input-length effect with instruction-specific characteristics.
- Do strict output formats hurt LLM reasoning ability?
  When LLMs must produce structured JSON or XML with specific schemas, does this constrain their capacity for complex reasoning? This matters because production systems often enforce strict formats for parsing convenience.
  Relation: format constraints are one type of instruction; IFScale generalizes to arbitrary instruction density.
Original note title: instruction following performance degrades predictably with instruction density — reasoning models show threshold decay at 150 instructions