SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation

Does LLM math reasoning truly generalize or just pattern match?

This note explores whether high scores on math benchmarks reflect genuine reasoning ability or merely template familiarity. The question matters because it determines how much we should trust LLMs on novel numerical problems.

Synthesis note · 2026-06-03 · sourced from Reasoning Critiques

GSM8K's near-saturated scores suggest that LLM math reasoning has genuinely advanced. GSM-Symbolic tests whether that advance is real by regenerating the same questions from symbolic templates, and the findings deflate the headline. Models show notable variance across different instantiations of the same question, so a single-point accuracy figure is unreliable. Performance declines when only the numerical values change (changing only proper names hurts less), and it degrades further as question complexity rises. Most damning is GSM-NoOp: adding a clause that seems relevant but has no bearing on the answer causes large drops, which shows that models cannot reliably separate relevant from irrelevant information. The conclusion is that the reasoning on display is probabilistic pattern-matching, not formal reasoning.
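The perturbations described above can be illustrated with a minimal Python sketch. The template wording, the name list, the value ranges, and the helper `instantiate` are all hypothetical and chosen only for illustration; they are not taken from the GSM-Symbolic templates themselves.

```python
import random

# Hypothetical template: wording and value ranges are illustrative only,
# not copied from GSM-Symbolic.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)
NAMES = ["Ava", "Liam", "Noor", "Mateo"]

# A NoOp-style clause: topically related, but irrelevant to the answer.
NOOP_CLAUSE = "Five of the apples are slightly smaller than average. "


def instantiate(rng, with_noop=False):
    """Return (question, answer) for one random instantiation of the template."""
    a = rng.randint(2, 60)
    b = rng.randint(2, 60)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        a=a,
        b=b,
        noop=NOOP_CLAUSE if with_noop else "",
    )
    # The ground-truth answer depends only on a and b, never on the clause.
    return question, a + b


rng = random.Random(0)
variants = [instantiate(rng) for _ in range(3)]
noop_variants = [instantiate(rng, with_noop=True) for _ in range(3)]
```

Because the answer is computed from the sampled values alone, any accuracy gap between `variants` and `noop_variants` would have to come from the distracting clause rather than from a change in the underlying arithmetic.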

What is worth keeping is the diagnostic method (controlled symbolic perturbation) and the verdict: benchmark gains can reflect template familiarity rather than reasoning, and the fragility is structural, not a tuning gap.
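One way to see why a single accuracy number can mislead is to score the same solver on many freshly sampled instantiation sets and look at the spread. The sketch below does this with a toy setup: `brittle_solver` is a hypothetical stand-in for a model that is only right on "familiar" small operands, and the set size, operand range, and threshold are arbitrary assumptions, not figures from the paper.

```python
import random
import statistics


def make_set(rng, size):
    """One evaluation set: `size` instantiations of a two-operand addition template."""
    return [(rng.randint(2, 60), rng.randint(2, 60)) for _ in range(size)]


def brittle_solver(a, b):
    """Toy stand-in for a model: correct only when both operands are at most 40."""
    return a + b if max(a, b) <= 40 else a + b + 1


def accuracy(solver, problems):
    """Fraction of problems where the solver's output equals the true sum."""
    correct = sum(solver(a, b) == a + b for a, b in problems)
    return correct / len(problems)


rng = random.Random(0)
# Score the same solver on 50 independently sampled sets of 100 questions each.
scores = [accuracy(brittle_solver, make_set(rng, 100)) for _ in range(50)]

summary = {
    "mean": statistics.mean(scores),
    "stdev": statistics.stdev(scores),
    "min": min(scores),
    "max": max(scores),
}
```

Reporting the `min`, `max`, and `stdev` alongside the `mean` treats benchmark accuracy as a distribution over instantiations, which is the framing the symbolic-perturbation method encourages.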

This is a landmark anchor for the vault's reasoning-fragility cluster. It converges with "Do language models fail at reasoning due to complexity or novelty?" and "Do large language models reason symbolically or semantically?", and its number-sensitivity finding echoes "Do large language models actually perform iterative optimization?".

Original note title

LLM math reasoning is fragile pattern-matching — accuracy drops when only numbers change and irrelevant clauses derail it