Why do LLMs struggle more when only numerical values change?

This explores why swapping out just the numbers in a problem — leaving the wording and structure intact — can tank an LLM's accuracy, and what that reveals about whether models are reasoning or pattern-matching.

The corpus has a sharp answer to why a number swap hurts more than you'd expect if the model were actually doing the math: the model was never doing the math. It was recognizing the problem as similar to ones it had seen and emitting a plausible-looking answer shaped by that template. When the numbers change, the surface pattern still matches, so the model stays confident, but the arithmetic underneath was never genuinely computed, so the answer drifts. The cleanest evidence is GSM-Symbolic, which found that LLMs show high variance across reformulations of the same question and decline sharply when only the numbers move. That is the signature of probabilistic pattern-matching rather than symbolic reasoning (see "Does LLM math reasoning truly generalize or just pattern match?").
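The number-swap setup described above can be sketched as a fixed template whose numbers are resampled while the wording stays put. This is a minimal illustration of the idea, not GSM-Symbolic's actual code or data: the template text, `make_variants`, `ground_truth`, and `accuracy` are all hypothetical names made up for this example.

```python
import random

# Hypothetical template: the wording is fixed; only the numbers are placeholders.
TEMPLATE = (
    "A shop sells pencils at {price} cents each. Maya buys {count} pencils "
    "and hands over {paid} cents. How many cents of change does she get?"
)


def ground_truth(price: int, count: int, paid: int) -> int:
    """Compute the correct answer directly, independent of any model."""
    return paid - price * count


def make_variants(n: int, seed: int = 0) -> list[tuple[str, int]]:
    """Return n (question, answer) pairs that differ only in their numbers."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        price = rng.randint(2, 50)
        count = rng.randint(2, 20)
        # Keep the problem well-posed: she always pays at least the total cost.
        paid = price * count + rng.randint(0, 100)
        question = TEMPLATE.format(price=price, count=count, paid=paid)
        variants.append((question, ground_truth(price, count, paid)))
    return variants


def accuracy(answer_fn, variants: list[tuple[str, int]]) -> float:
    """Fraction of variants where answer_fn(question) equals the ground truth."""
    correct = sum(1 for question, answer in variants if answer_fn(question) == answer)
    return correct / len(variants)
```

Here `answer_fn` stands in for whatever system is being evaluated. A system that really computes the answer should score the same on every resampled batch; one that matches the template will see its score move with the seed.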

Why can't the model just run the numbers? Because there is growing evidence that it has no internal procedure to run. One study shows that LLMs cannot execute iterative numerical methods in latent space: they recognize an optimization problem as template-similar to memorized cases and emit values that look right but aren't, a failure that persists across model scale and training approaches (see "Do large language models actually perform iterative optimization?"). So the number is not a variable being plugged into a computation; it is part of the surface pattern being matched. Change it and you have changed the pattern without giving the model any compensating mechanism.
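To make "an internal procedure to run" concrete, here is a small iterative numerical method written out explicitly. Newton's method for square roots is used purely as a stand-in; it is an assumption of this sketch, not the procedure examined in the cited study. The point is that every step consumes the result of the previous one, so the input number drives the whole computation rather than decorating it.

```python
def newton_sqrt(a: float, tol: float = 1e-12, max_iter: int = 100) -> tuple[float, int]:
    """Approximate sqrt(a) with Newton's method.

    Returns (estimate, steps_taken). Each iterate depends on the previous
    one, so the loop cannot be skipped by recalling a similar-looking case.
    """
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0, 0
    x = a if a >= 1 else 1.0  # any positive starting guess converges
    for step in range(1, max_iter + 1):
        next_x = 0.5 * (x + a / x)  # Newton update for f(x) = x*x - a
        if abs(next_x - x) <= tol * next_x:
            return next_x, step
        x = next_x
    return x, max_iter
```

Changing `a` from 2.0 to 1234567.0 changes both the result and the number of steps taken, which is exactly the kind of input-dependent work that template matching cannot reproduce.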

This fits a broader shape that recurs across the collection: LLMs learn surface heuristics that work until structure stresses them. Grammatical competence degrades predictably as sentences get more deeply nested, suggesting the model learned shortcuts rather than real grammar rules (see "Does LLM grammatical performance decline with structural complexity?"). The gap shows up vividly in 'Potemkin understanding,' where a model can correctly explain a concept and then fail to apply it: the explaining pathway and the doing pathway are functionally disconnected (see "Can LLMs understand concepts they cannot apply?"). A model that can describe how to solve a class of problem but can't reliably execute it on fresh numbers is the same disconnect wearing a math hat. Researchers catalog these as distinct, repeatable epistemic failure modes, not random wrongness (see "How do LLMs fail to know what they seem to understand?").

The most useful turn in the corpus is what to do about it. One line of work argues the productive architecture is to stop asking LLMs to be calculators at all: let them do what they're genuinely good at, translating messy natural language into formal structure, and hand the numeric iteration to a deterministic solver. The same paper notes that LLMs plateau at constraint satisfaction regardless of scale, which is the same wall the number-swapping experiments hit from a different angle (see "Should LLMs handle abstraction only in optimization?"). The takeaway: the fix for fragile arithmetic probably isn't a bigger model, it's a division of labor where the model never touches the arithmetic.
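One way to picture that division of labor: the model's only job is to emit a formal description of the problem, and ordinary deterministic code does the arithmetic. The sketch below assumes a made-up output format (a dict with an `expression` string and a `values` mapping); neither that format nor the `evaluate` helper comes from the cited work. The evaluator uses Python's standard `ast` module and accepts only numbers, named variables, and basic arithmetic.

```python
import ast
import operator

# Arithmetic operators the evaluator is willing to execute.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate(expression: str, values: dict[str, float]) -> float:
    """Deterministically evaluate an arithmetic expression over named values."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only; bool is a subclass of int, so exclude it.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name):
            if node.id not in values:
                raise KeyError(f"unknown variable: {node.id}")
            return values[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


# Hypothetical model output: structure only, no computed answer.
spec = {
    "expression": "paid - price * count",
    "values": {"price": 35, "count": 12, "paid": 500},
}
change = evaluate(spec["expression"], spec["values"])  # 500 - 35 * 12 = 80
```

Swapping the numbers now only changes the `values` mapping. The expression the model produced is untouched, and the arithmetic is correct by construction for any inputs.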

Sources 6 notes

Does LLM math reasoning truly generalize or just pattern match?

GSM-Symbolic found that LLMs show high variance across question reformulations, decline sharply when numbers change, and fail when irrelevant but related clauses are inserted. These failures indicate probabilistic pattern-matching rather than true symbolic reasoning.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Should LLMs handle abstraction only in optimization?

LLMs plateau at constraint satisfaction regardless of scale, but excel at natural-language-to-formal-structure translation. The productive architecture restricts LLMs to reading input and emitting solver code, leaving numeric iteration to deterministic solvers.
