Do larger language models solve constrained optimization better?
Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
When evaluated on real constrained-optimization problems — optimal power flow, financial portfolio constraints, cyber-security feasibility — LLMs cluster around 55-60% constraint satisfaction across virtually all conditions tested. The plateau is robust to changes in architecture, parameter count, and training regime. Reasoning models, despite extended chain-of-thought, do not systematically beat their non-reasoning counterparts on these tasks.
The flatness of the plateau is the finding. Most LLM capability work assumes that the relevant axis is performance versus scale, and that closing a gap is a matter of training on more or better data. Constrained optimization does not behave that way. The benchmark singles out problems that demand four things at once: interpreting structured input, doing multi-step arithmetic, satisfying interacting physical constraints, and converging to a feasible solution. On that joint task, the model class itself appears to be near a ceiling.
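To make the 55-60% figure concrete, here is a minimal sketch of how a constraint-satisfaction score could be computed for one candidate answer. It assumes the score is the fraction of constraints a proposed solution meets, which is only one plausible reading; the benchmark's exact metric is not specified in this note. The asset names, limits, and the `satisfaction_rate` helper are all hypothetical.

```python
from typing import Callable, Dict, List

Solution = Dict[str, float]
Constraint = Callable[[Solution], bool]


def satisfaction_rate(solution: Solution, constraints: List[Constraint]) -> float:
    """Fraction of constraints the candidate solution satisfies (assumed metric)."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    satisfied = sum(1 for check in constraints if check(solution))
    return satisfied / len(constraints)


# Hypothetical portfolio constraints; the limits are illustrative only.
TECH_ASSETS = {"tech_a", "tech_b"}

constraints: List[Constraint] = [
    # Weights must sum to 1 (fully invested), within a small tolerance.
    lambda w: abs(sum(w.values()) - 1.0) < 1e-9,
    # No short positions.
    lambda w: all(x >= 0.0 for x in w.values()),
    # No single asset above 40% of the portfolio.
    lambda w: max(w.values()) <= 0.40,
    # Combined technology exposure capped at 50%.
    lambda w: sum(x for name, x in w.items() if name in TECH_ASSETS) <= 0.50,
]

# A candidate answer of the kind a model might return (made up for illustration).
candidate = {"tech_a": 0.45, "tech_b": 0.25, "energy": 0.20, "health": 0.10}

rate = satisfaction_rate(candidate, constraints)
feasible = rate == 1.0
print(f"satisfaction rate: {rate:.2f}, feasible: {feasible}")
```

The candidate meets the budget and no-shorting constraints but breaks both concentration limits, so it scores 0.5 while being entirely infeasible. A partial score of this kind says nothing about whether the answer is usable: the constraints interact, and a solution is only deployable when all of them hold together.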
This is distinct from general reasoning benchmarks (MMLU, GPQA) and from logical reasoning benchmarks (ARC-AGI, SATBench, ZebraLogic). Those measure either broad knowledge or synthetic constraint puzzles. Real engineering optimization requires the model to execute iterative numerical procedures over physical constraints, and that procedural execution is where the plateau lives.
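As an illustration of what "executing an iterative numerical procedure over physical constraints" means, the sketch below runs Newton-Raphson on a deliberately tiny power-flow problem: finding the voltage angle across a lossless two-bus line that delivers a target power, using the standard relation P = (V1·V2/X)·sin(δ). The function name, per-unit values, and tolerances are assumptions chosen for illustration; real optimal power flow involves many coupled equations, not one.

```python
import math


def solve_power_angle(p_target: float, v1: float = 1.0, v2: float = 1.0,
                      x: float = 0.5, tol: float = 1e-12,
                      max_iter: int = 50) -> float:
    """Newton-Raphson for the angle delta with (v1*v2/x)*sin(delta) = p_target."""
    p_max = v1 * v2 / x  # physical transfer limit of the line
    if abs(p_target) >= p_max:
        raise ValueError("target power is at or beyond the line's transfer limit")

    delta = 0.0  # flat start
    for _ in range(max_iter):
        residual = p_max * math.sin(delta) - p_target
        if abs(residual) < tol:
            return delta
        slope = p_max * math.cos(delta)  # derivative of the residual
        if slope == 0.0:
            raise ZeroDivisionError("Newton step undefined at this angle")
        # Each update depends on the exact value produced by the previous one.
        delta -= residual / slope
    raise RuntimeError("Newton-Raphson did not converge")


angle = solve_power_angle(1.0)
print(f"angle: {angle:.6f} rad ({math.degrees(angle):.2f} degrees)")
```

Every iteration consumes the precise output of the one before it, so a small arithmetic slip early on propagates through the remaining steps. That dependency is what separates running the procedure from recognizing what a plausible answer looks like.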
The deployment implication is sharp: telling executives that "LLMs will optimize the grid" or "LLMs will solve constrained portfolio problems" is currently an overclaim. The same finding suggests the productive direction is not "wait for the next model" but "change the paradigm" — restrict the LLM to abstraction tasks and hand numeric work to solvers.
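The division of labor can be sketched as follows. This is a toy under stated assumptions, not the method from the source: `formalize_with_llm` is a hypothetical stand-in that returns a hard-coded specification instead of calling any model, and the solver is a simple merit-order dispatch that is exact only for this restricted case (linear costs, zero minimum output, a single demand constraint, no network limits). Generator names and numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Generator:
    name: str
    cost_per_mwh: float
    p_max_mw: float


@dataclass
class DispatchProblem:
    generators: List[Generator]
    demand_mw: float


def formalize_with_llm(request: str) -> DispatchProblem:
    """Hypothetical abstraction step: natural language in, formal spec out.

    A real system would call a model here and validate its output; this stub
    returns a fixed specification so the example stays self-contained.
    """
    return DispatchProblem(
        generators=[
            Generator("hydro", cost_per_mwh=5.0, p_max_mw=50.0),
            Generator("gas", cost_per_mwh=40.0, p_max_mw=100.0),
            Generator("coal", cost_per_mwh=25.0, p_max_mw=80.0),
        ],
        demand_mw=150.0,
    )


def solve_dispatch(problem: DispatchProblem) -> Dict[str, float]:
    """Deterministic numeric step: fill demand from the cheapest units first."""
    if problem.demand_mw < 0:
        raise ValueError("demand must be non-negative")
    capacity = sum(g.p_max_mw for g in problem.generators)
    if problem.demand_mw > capacity:
        raise ValueError("infeasible: demand exceeds total capacity")

    remaining = problem.demand_mw
    dispatch: Dict[str, float] = {}
    for gen in sorted(problem.generators, key=lambda g: g.cost_per_mwh):
        output = min(gen.p_max_mw, remaining)
        dispatch[gen.name] = output
        remaining -= output
    return dispatch


problem = formalize_with_llm("Meet 150 MW of demand at minimum cost.")
dispatch = solve_dispatch(problem)
print(dispatch)  # {'hydro': 50.0, 'coal': 80.0, 'gas': 20.0}
```

In this arrangement the language model never produces a number that must satisfy a constraint. Its output is a specification that can be checked before solving, and feasibility is guaranteed (or infeasibility reported) by the solver.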
Related concepts in this collection
- Do reasoning models actually beat standard models on optimization?
  Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.
  same paper, the reasoning-model specific finding
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  same paper, the mechanism for the plateau
- Should LLMs handle abstraction only in optimization?
  What if LLMs worked exclusively on translating problems to formal constraints, while deterministic solvers handled the numeric work? Explores whether this division of labor could overcome LLM failures in iterative computation.
  same paper, the proposed solution
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  adjacent: chain-of-thought has its own ceiling
- Can large language models translate natural language to logic faithfully?
  This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.
  adjacent: NL → formal translation limits
Original note title
LLMs plateau at 55 to 60 percent constraint satisfaction on genuine optimization regardless of scale architecture or training