Do reasoning models actually beat standard models on optimization?
Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.
Reasoning models have been treated as a generalized capability upgrade: more thinking tokens at test time, broadly better performance. On constraint-bound numerical optimization the upgrade does not materialize. Reasoning variants do not systematically outperform their non-reasoning counterparts on power-grid, financial-operations, or cyber-security feasibility problems. A longer reasoning trace does not translate into more iterations of a solution procedure.
The reason this matters: extended chain-of-thought looks like it should help. The problem involves multi-step arithmetic, interacting constraints, and convergence-style reasoning — exactly the regime where "think more" is supposed to pay. The data say it does not. Whatever extended CoT is doing on these tasks, it is not running a Newton-Raphson iteration or a primal-dual update in latent space; it is producing more text without producing more computation.
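For contrast, here is a minimal sketch of what an actual iterative computation looks like: a scalar Newton-Raphson loop in which each step is a numeric update and the residual measurably shrinks. The function name `newton_raphson` and the toy equation x² − 2 = 0 are illustrative assumptions, not taken from the underlying paper.

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Solve f(x) = 0 for a scalar x; return (root, residual history)."""
    x = x0
    residuals = [abs(f(x))]          # |f(x)| before any update
    for _ in range(max_iter):
        if residuals[-1] < tol:      # converged: stop iterating
            break
        x = x - f(x) / df(x)         # one Newton update
        residuals.append(abs(f(x)))  # record progress after the update
    return x, residuals


# Toy problem (assumption for illustration): find sqrt(2) as the root of x^2 - 2.
root, residuals = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
for k, r in enumerate(residuals):
    print(f"iteration {k}: residual = {r:.3e}")
```

Each pass through the loop performs computation whose effect is visible in the residual. The claim in this note is that additional chain-of-thought tokens do not appear to play that role.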
This is consistent with a growing view that reasoning models excel where the bottleneck is exploration over reasoning paths (math contests, code, multi-hop QA) but stall where the bottleneck is numeric procedure. Constraint satisfaction over real physical systems is the latter. Adding chain length adds search over verbal restatements of the problem, not iterations of the algorithm that would solve it.
The implication for product decisions: choosing a reasoning model for an optimization-heavy workflow is not automatically the right call. The relevant question is whether the bottleneck is verbal reasoning or numeric computation. If it is numeric, the cost-effective path is a hand-off to a solver, not more thinking tokens.
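A hedged sketch of the hand-off pattern: the model's job is limited to producing a structured problem description, and a deterministic routine does the numeric work and checks feasibility. Everything here is a hypothetical toy (the `Generator` type, `dispatch`, `is_feasible`, and the three-unit example), and merit-order dispatch is exact only for this simplified case of linear costs, box limits, and a single demand-balance constraint. A real power-grid problem would need a proper optimization solver.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    name: str
    p_min: float  # minimum output, MW
    p_max: float  # maximum output, MW
    cost: float   # linear marginal cost, $/MWh


def dispatch(generators, demand):
    """Merit-order dispatch: fill demand from the cheapest units first."""
    total_min = sum(g.p_min for g in generators)
    total_max = sum(g.p_max for g in generators)
    if not (total_min <= demand <= total_max):
        raise ValueError("demand cannot be met within generator limits")
    output = {g.name: g.p_min for g in generators}  # start every unit at its minimum
    remaining = demand - total_min
    for g in sorted(generators, key=lambda g: g.cost):
        step = min(g.p_max - g.p_min, remaining)    # headroom on this unit
        output[g.name] += step
        remaining -= step
    return output


def is_feasible(generators, demand, output, tol=1e-9):
    """Check the balance constraint and every unit's box limits."""
    balanced = abs(sum(output.values()) - demand) <= tol
    within = all(g.p_min - tol <= output[g.name] <= g.p_max + tol for g in generators)
    return balanced and within


# Structured problem spec, e.g. extracted by a model from a text request (assumed data).
fleet = [
    Generator("A", 0.0, 100.0, 20.0),
    Generator("B", 10.0, 80.0, 35.0),
    Generator("C", 0.0, 50.0, 50.0),
]
plan = dispatch(fleet, demand=150.0)
print(plan, is_feasible(fleet, 150.0, plan))
```

The design point is the split of responsibilities: the constraint check is done by code, so feasibility does not depend on how long the model reasons.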
Related concepts in this collection
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  same paper, the parent finding
- Do large language models actually perform iterative optimization?
  Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
  same paper, the mechanism
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  adjacent: CoT ceilings in general
- Why does chain of thought accuracy eventually decline with length?
  Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability.
  adjacent: more thinking is not monotonically better
Original note title
reasoning models do not systematically outperform non-reasoning models on real numerical optimization — extended chain-of-thought is not a substitute for iterative computation