Can language models execute iterative numerical methods in latent space?
This explores whether LLMs can genuinely run step-by-step numerical procedures inside their hidden activations (latent space), or whether they only look like they're computing while actually doing something else.
The corpus has a sharp, direct answer: no, and the reason is more interesting than the verdict. Research finds that when you hand an LLM an optimization problem, it doesn't iterate toward a solution the way Newton's method or gradient descent would. Instead it recognizes the problem as template-similar to things it has seen and emits plausible-looking but wrong values, a failure that persists as models get bigger and training approaches change (see "Do large language models actually perform iterative optimization?").
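For contrast, here is a minimal sketch of what "iterating toward a solution" means in the Newton's-method sense. The function name `newton`, the tolerance, and the example equation x^2 = 2 are illustrative assumptions, not taken from the cited research. The point is only that each iterate is computed from the previous one, so the procedure has to carry and revise state across steps.

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's method: refine x repeatedly, each step using the previous iterate.

    Illustrative helper (hypothetical name and defaults).
    """
    x = x0
    history = [x]  # intermediate state that the loop holds and revises
    for _ in range(max_iter):
        step = f(x) / df(x)   # Newton update: x_next = x - f(x) / f'(x)
        x = x - step
        history.append(x)
        if abs(step) < tol:   # stop once the update is negligible
            break
    return x, history


# Example: solve x^2 - 2 = 0 starting from x0 = 1.0
root, history = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, len(history))
```

No step in `history` can be produced without the one before it. That dependency is the property the research says is missing when a model emits an answer from resemblance.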
That single finding sits inside a larger pattern the collection keeps surfacing: LLMs pattern-match where we expect them to compute. On genuine constrained-optimization tasks, models plateau at roughly 55–60% constraint satisfaction no matter the architecture, parameter count, or whether they're billed as 'reasoning' models. That is a ceiling, not a gap you can scale your way out of (see "Do larger language models solve constrained optimization better?"). And you can predict this in advance: if you treat an LLM as an autoregressive probability machine rather than a calculator, the tasks it fails are exactly the low-probability ones (counting letters, reversing the alphabet) that are logically trivial but statistically rare (see "Can we predict where language models will fail?"). Iterative numerical work is the same kind of trap: easy to state, but it requires actual procedure-following rather than recall.
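To make "constraint satisfaction" concrete, the sketch below scores a proposed solution by the fraction of constraints it meets. This definition is an assumption for illustration; the source notes do not spell out how the 55–60% figure is measured. The helper `satisfaction_rate` and the toy constraints are hypothetical.

```python
def satisfaction_rate(candidate, constraints, tol=1e-9):
    """Fraction of constraints g(candidate) <= 0 that the candidate meets.

    Assumed metric for illustration only; not the cited benchmark's definition.
    """
    satisfied = sum(1 for g in constraints if g(candidate) <= tol)
    return satisfied / len(constraints)


# Toy problem: each constraint is written in the form g(p) <= 0
constraints = [
    lambda p: p[0] + p[1] - 4.0,  # x + y <= 4
    lambda p: -p[0],              # x >= 0
    lambda p: -p[1],              # y >= 0
    lambda p: p[0] - 3.0,         # x <= 3
]

feasible = (1.0, 2.0)             # meets all four constraints
plausible_but_wrong = (3.5, 1.0)  # right order of magnitude, violates two

print(satisfaction_rate(feasible, constraints))             # 1.0
print(satisfaction_rate(plausible_but_wrong, constraints))  # 0.5
```

A checker like this is cheap to run outside the model, which is one reason a partial satisfaction rate is informative: the answer looks reasonable but fails when the constraints are actually evaluated.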
What makes this more than a 'models are bad at math' story is that the same limitation shows up far from arithmetic. Models misparse deeply nested grammatical clauses, degrading predictably as structural depth grows, which points to surface statistics rather than deep rules (see "Why do large language models fail at complex linguistic tasks?"). And long-context models can match retrieval systems on semantic lookup yet collapse on relational queries that need joins across structured tables, another case where genuine multi-step manipulation, not recognition, is required (see "Can long-context LLMs replace retrieval-augmented generation systems?"). The common thread: wherever a task demands executing a procedure rather than retrieving a pattern, the latent-space machinery falls back to matching.
The corpus also gestures at what 'real latent computation' might require, which is the interesting twist for a curious reader. Latent-thought language models add a separate vector of 'thought', fit by fast local learning alongside a slowly learned decoder, whose size scales independently of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). Neural-memory architectures like Titans carve out a distinct module for storing and updating information over time instead of folding everything into attention (see "Can neural memory modules scale language models beyond attention limits?"). These hint that iterative computation may need dedicated structure, a place to hold and revise intermediate state, rather than emerging for free from a bigger next-token predictor. There's even evidence that models do spontaneously build structured internal geometry, with syntactic relations encoded in polar coordinates (see "How do language models encode syntactic relations geometrically?"), so the latent space is not formless. It just doesn't, on its own, host the kind of loop an iterative numerical method needs.
The thing you didn't know you wanted to know: the failure isn't that LLMs can't do the arithmetic; it's that they don't realize they should be doing arithmetic at all. They see a problem that looks familiar and answer from resemblance, which is exactly why scaling, the usual fix, leaves the ceiling untouched.
Sources (8 notes)
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.