Can optimization algorithms exploit the shift between procedural and planning bottlenecks?

This explores a split the corpus keeps circling: whether wrapping an LLM in a smarter optimization algorithm helps depends on which kind of bottleneck you're hitting. When the hard part is *planning* — searching through a wide space of possible solution paths — externalized algorithms pay off dramatically. When the hard part is *procedure* — actually executing an iterative numerical method — they don't, because the model can't do the inner work no matter how you orchestrate it.

On the planning side, the wins are striking. Evolutionary search with LLM-generated mutations solves 98% of planning tasks by keeping a diverse population alive instead of refining a single guess, beating both Best-of-N sampling and sequential revision (see: Can evolutionary search beat sampling and revision at inference time?). Separating a 'decomposer' from a 'solver' improves accuracy because planning skill turns out to transfer across domains while solving skill does not (see: Does separating planning from execution improve reasoning accuracy?). Treating an agent as an optimizable computational graph lets you tune both the prompts and the wiring between steps automatically (see: Can we automatically optimize both prompts and agent coordination?), and embedding the LLM inside an explicit algorithm that hides step-irrelevant context turns a messy reasoning task into debuggable sub-tasks (see: Can algorithms control LLM reasoning better than LLMs alone?). All of these are algorithms exploiting a *planning* bottleneck: they restructure the search so that the model's strength (proposing plausible moves) is used and its weakness (holding everything at once) is routed around.
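The population idea above can be sketched in a few lines. This is a toy illustration, not the Mind Evolution implementation: `propose_mutation` is a hypothetical stand-in for an LLM call (here it only swaps two positions), `fitness` is an invented objective, and the island count, population size, and migration interval are arbitrary assumptions.

```python
import random
from typing import List, Tuple

Candidate = List[int]

def fitness(plan: Candidate) -> int:
    """Toy objective (assumed): adjacent pairs already in ascending order."""
    return sum(1 for a, b in zip(plan, plan[1:]) if a <= b)

def propose_mutation(plan: Candidate, rng: random.Random) -> Candidate:
    """Hypothetical stand-in for an LLM rewriting a candidate plan."""
    i, j = rng.sample(range(len(plan)), 2)
    child = plan[:]
    child[i], child[j] = child[j], child[i]
    return child

def evolve(islands: List[List[Candidate]], generations: int,
           rng: random.Random, migrate_every: int = 5) -> Candidate:
    """Evolve several populations separately, with occasional ring migration."""
    for gen in range(1, generations + 1):
        for idx, pop in enumerate(islands):
            children = [propose_mutation(rng.choice(pop), rng) for _ in pop]
            # Elitist selection: keep the best of parents plus children.
            merged = sorted(pop + children, key=fitness, reverse=True)
            islands[idx] = merged[:len(pop)]
        if gen % migrate_every == 0:
            # Copy each island's best into its neighbour, replacing the worst.
            bests = [pop[0] for pop in islands]
            for idx, pop in enumerate(islands):
                pop[-1] = bests[idx - 1][:]
    return max((p for pop in islands for p in pop), key=fitness)

def run_demo(seed: int = 0) -> Tuple[int, Candidate]:
    rng = random.Random(seed)
    # Three islands of six random permutations of 0..7.
    islands = [[rng.sample(range(8), 8) for _ in range(6)] for _ in range(3)]
    initial_best = max(fitness(p) for pop in islands for p in pop)
    best = evolve(islands, generations=30, rng=rng)
    return initial_best, best
```

Because each island keeps its own elite and only exchanges one candidate at a time, a single early winner cannot take over every population at once, which is the property the note credits for avoiding premature convergence.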

But several notes show the opposite wall, and it doesn't move. On genuine constrained optimization, LLMs plateau at ~55–60% constraint satisfaction regardless of scale, architecture, or training: a ceiling, not a gap you can close with a bigger model (see: Do larger language models solve constrained optimization better?). The mechanism is in the mode of failure: models recognize a problem as template-similar and emit plausible-but-wrong numbers instead of running the iterative method (see: Do large language models actually perform iterative optimization?). Extended reasoning doesn't rescue this. It produces more text, not more iterative computation, which is why reasoning variants show no consistent edge on numeric tasks like optimal power flow (see: Do reasoning models actually beat standard models on optimization?). This is the *procedural* bottleneck, and no amount of clever orchestration around the model fixes an inner step the model literally can't perform.
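One way to make "plausible-but-wrong numbers" concrete is to check emitted answers against the constraints instead of trusting how they look. The sketch below is illustrative only: the notes do not say how constraint satisfaction was scored, so counting an answer as satisfied only when every constraint holds is an assumption, and the variable names and limits are invented.

```python
from typing import Callable, Dict, List

Answer = Dict[str, float]
Constraint = Callable[[Answer], bool]

def satisfaction_rate(answers: List[Answer],
                      constraints: List[Constraint]) -> float:
    """Fraction of answers that satisfy every constraint (assumed scoring)."""
    if not answers:
        return 0.0
    feasible = sum(1 for a in answers if all(c(a) for c in constraints))
    return feasible / len(answers)

# Invented example: two generator outputs must meet demand within limits.
constraints: List[Constraint] = [
    lambda a: abs(a["p1"] + a["p2"] - 100.0) < 1e-6,  # supply equals demand
    lambda a: 0.0 <= a["p1"] <= 60.0,                 # unit 1 limit
    lambda a: 0.0 <= a["p2"] <= 60.0,                 # unit 2 limit
]
answers = [
    {"p1": 50.0, "p2": 50.0},  # feasible
    {"p1": 70.0, "p2": 30.0},  # looks balanced, violates unit 1 limit
]
rate = satisfaction_rate(answers, constraints)
```

The second answer sums to the right total and reads as sensible, yet it is infeasible, which is the kind of error that only an explicit check catches.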

So the honest answer to the question is: an optimization algorithm exploits the *shift* by recognizing which side it's on. Where the bottleneck is planning, it should externalize the search (evolve candidates, decompose, optimize the graph) and it will win. Where the bottleneck is procedure, the right algorithmic move is to *not* ask the LLM to compute at all, but to hand the numeric kernel to a real solver and let the model do the planning around it. The interesting and slightly uncomfortable corner is marked by two notes (Can recursive subtask trees overcome context window limits? and Can non-reasoning models catch up with more compute?): structure and training *can* make extra tokens productive for reasoning, meaning the procedural ceiling may be specific to numerical execution in latent space, not a blanket limit on everything that looks like 'iteration.' The shift the question names isn't one line; it's a map of where orchestration helps and where it's just rearranging deck chairs.
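The routing rule in this paragraph can be sketched as a small dispatcher. Everything here is hypothetical: `BoxQP`, `solve_numeric`, and `route` are illustrative names, `plan_with_llm` stands in for whatever model call a system uses, and the projected gradient loop is a deliberately tiny substitute for an established solver library.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class BoxQP:
    """Minimize sum((x_i - target_i)^2) subject to lo_i <= x_i <= hi_i."""
    target: List[float]
    bounds: List[Tuple[float, float]]

def solve_numeric(problem: BoxQP, steps: int = 200, lr: float = 0.1) -> List[float]:
    """Projected gradient descent: the iterative kernel runs in code."""
    # Start from 0 clipped into each variable's box.
    x = [min(max(0.0, lo), hi) for lo, hi in problem.bounds]
    for _ in range(steps):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, problem.target)]
        # Gradient step, then project back onto the box constraints.
        x = [min(max(xi - lr * g, lo), hi)
             for xi, g, (lo, hi) in zip(x, grad, problem.bounds)]
    return x

def route(task_kind: str, payload: Any,
          plan_with_llm: Callable[[Any], Any]) -> Any:
    """Send numeric work to the solver and everything else to the model."""
    if task_kind == "numeric":
        return solve_numeric(payload)
    return plan_with_llm(payload)

# Usage: the model (stubbed here) plans; the solver produces the numbers.
numbers = route("numeric", BoxQP([3.0, -1.0], [(0.0, 2.0), (0.0, 5.0)]),
                plan_with_llm=lambda p: None)
steps = route("planning", "dispatch units",
              plan_with_llm=lambda p: ["step: " + p])
```

The design choice is that the model never emits the numeric result itself. It can decide that a problem is numeric and how to set it up, but the values come from a loop that actually iterates.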

Sources (9 notes)

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
