Can closed-form solutions compete with gradient descent optimization?

This explores whether one-shot analytical answers (closed-form, the kind LLMs reach for by pattern-matching) can hold their own against iterative, gradient-style optimization that refines an answer step by step — and the corpus reframes the question around what happens inside language models.

Put in model terms, the question is whether a model can just *recall the answer* (closed-form, one shot) instead of *working toward it* (iterative optimization). The surprising thread across this collection is that language models try the closed-form route by default, and it quietly fails. When you hand an LLM an optimization problem, it doesn't actually run the iterations in its head; it recognizes the problem as similar to ones it has seen and emits a plausible-looking answer (see "Do large language models actually perform iterative optimization?"). That looks like a closed-form shortcut, but it's really memorized template-matching, and it plateaus hard: around 55–60% constraint satisfaction no matter how big the model gets (see "Do larger language models solve constrained optimization better?"). Fine-tuning doesn't rescue it either: supervised training makes the *output* look correct without making it actually feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?").
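
The gap between an output that looks correct and one that is feasible can be made concrete with a small sketch. Everything here is a hypothetical illustration, not taken from the cited notes: the toy item weights, the `CAPACITY` value, the `"chosen"` key, and both checker functions are assumptions chosen only to show a surface check passing while a feasibility check fails.

```python
import json

# Hypothetical toy problem: choose items whose total weight fits a capacity.
ITEMS = {"a": 4, "b": 3, "c": 5}  # item id -> weight (made-up values)
CAPACITY = 7


def looks_correct(raw: str) -> bool:
    """Surface check: valid JSON, expected key, known item identifiers."""
    try:
        answer = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(answer, dict):
        return False
    chosen = answer.get("chosen")
    if not isinstance(chosen, list):
        return False
    return all(isinstance(item, str) and item in ITEMS for item in chosen)


def is_feasible(raw: str) -> bool:
    """Substantive check: no repeated items, total weight within capacity."""
    if not looks_correct(raw):
        return False
    chosen = json.loads(raw)["chosen"]
    no_repeats = len(set(chosen)) == len(chosen)
    return no_repeats and sum(ITEMS[item] for item in chosen) <= CAPACITY


template_answer = '{"chosen": ["a", "c"]}'  # well-formed, but weight 9 exceeds 7
valid_answer = '{"chosen": ["a", "b"]}'     # well-formed, weight 7 fits

print(looks_correct(template_answer), is_feasible(template_answer))
print(looks_correct(valid_answer), is_feasible(valid_answer))
```

A metric that only runs the first check would score both answers identically, which is one way a constraint-satisfaction ceiling can stay hidden behind well-formatted output.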

The deeper reason cuts to architecture. Real optimization, whether gradient descent or a constraint solver, depends on *taking things back*: revising or discarding a bad partial answer and trying again. Autoregressive generation can't retract a token once it's emitted, so it structurally lacks the one primitive that iterative search relies on (see "Why does autoregressive generation fail at constraint satisfaction?"). That's why a one-shot closed-form pass isn't just weaker here; it's missing the machinery to compete.
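
As a loose analogy for the retraction argument, the sketch below solves the 4-queens puzzle two ways. It is not a model of a transformer: the puzzle, the function names, and the "first non-conflicting column" rule are assumptions picked for illustration. The commit-only routine never undoes a placement, while the backtracking routine is the same loop plus one `pop()`.

```python
from typing import List, Optional


def conflicts(placed: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks any placed queen."""
    row = len(placed)
    return any(c == col or abs(c - col) == row - r for r, c in enumerate(placed))


def commit_only(n: int) -> Optional[List[int]]:
    """Place one queen per row and never revisit an earlier choice."""
    placed: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if not conflicts(placed, c)), None)
        if choice is None:
            return None  # dead end, and there is no way to undo earlier rows
        placed.append(choice)
    return placed


def backtracking(n: int, placed: Optional[List[int]] = None) -> Optional[List[int]]:
    """Same placement rule, but a failed branch is retracted and retried."""
    placed = [] if placed is None else placed
    if len(placed) == n:
        return list(placed)
    for col in range(n):
        if not conflicts(placed, col):
            placed.append(col)
            result = backtracking(n, placed)
            if result is not None:
                return result
            placed.pop()  # the retraction step that commit_only lacks
    return None


print(commit_only(4))   # None: stuck after committing to columns 0 and 2
print(backtracking(4))  # [1, 3, 0, 2]
```

The only difference between the two routines is the ability to discard a partial assignment, which is the primitive the paragraph above says autoregressive decoding does not have.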

So what wins? Bringing the iteration back. Energy-Based Transformers reintroduce gradient descent *at inference time*, assigning an energy score to candidate predictions and minimizing it, and gain meaningfully on both training scaling and out-of-distribution generalization compared to standard transformers (see "Can energy minimization unlock reasoning without domain-specific training?"). Evolutionary search at inference does the same thing with a different engine: it keeps a diverse population of candidate solutions and mutates them, solving 98% of planning tasks and beating one-shot Best-of-N sampling precisely because it refuses to commit to a single trajectory (see "Can evolutionary search beat sampling and revision at inference time?"). Tree search (MCTS) lands in the same camp, iterating over solution paths rather than guessing one (see "Can tree search replace human feedback in LLM training?").
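
The idea of scoring a candidate prediction with an energy and descending on it at inference time can be sketched in a few lines. This is a minimal toy under stated assumptions, not the Energy-Based Transformer method: the energy here is a hand-written function E(x, y) = (y^2 - x)^2 rather than a learned network, and the step size, step count, and starting guess are arbitrary illustrative choices.

```python
def energy(x: float, y: float) -> float:
    """Hand-written toy energy: zero exactly when y is a square root of x."""
    return (y * y - x) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic derivative of the toy energy with respect to the candidate y."""
    return 4.0 * y * (y * y - x)


def refine(x: float, y0: float, steps: int = 200, lr: float = 0.05) -> float:
    """Refine a candidate prediction by plain gradient descent on the energy."""
    y = y0
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y


x = 2.0
one_shot = 1.0               # stand-in for a single unrefined guess
refined = refine(x, one_shot)

print(energy(x, one_shot))   # 1.0
print(energy(x, refined) < energy(x, one_shot))  # True
```

The one-shot guess is kept as the starting point rather than thrown away; the extra inference-time compute goes into moving it downhill on the energy, which is the general shape of the approach the paragraph above describes.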

The thing you might not have expected to learn: the contest isn't really "closed-form vs. gradient descent" as competing answers to the same math problem. It's that language models *pretend* to do closed-form and the corpus keeps showing the cure is to graft iterative optimization back on top of them — search, energy minimization, or an external symbolic solver. The closed-form instinct is the failure mode, not the rival.

Sources (7 notes)

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
