Do current math benchmarks measure outcomes or rhetorical plausibility?

This explores a worry beneath math benchmarks: when a model scores well, is it measuring a correct outcome, or is it rewarding text that merely *sounds* like reasoning? The corpus suggests the field has caught itself grading the rhetoric more than once — and that the fix is to be ruthless about what counts as a passing signal.

The sharpest evidence is that the *form* of reasoning and the *fact* of reasoning come apart cleanly. Logically invalid chain-of-thought exemplars matched valid ones on BIG-Bench Hard, which suggests the model picks up the shape of a reasoning trace, not the inference inside it (see "Does logical validity actually drive chain-of-thought gains?"). In the same spirit, controlled A* maze experiments find that trace length tracks difficulty only in-distribution and decouples from it out-of-distribution, so a long, elaborate-looking derivation can be recall dressed as deliberation (see "Does longer reasoning actually mean harder problems?"). Both findings say the persuasive surface of a solution is a poor proxy for the work underneath.

That's exactly why *how you grade* changes what you measure. LR²Bench scores only the final answer against deterministic ground truth, not the reasoning steps, on the argument that trace-based scoring counts stylistic mimicry of reasoning as real capability. Scored that way, the benchmark reveals a 20% ceiling that trace-based evaluation would inflate (see "Should reasoning benchmarks score final answers or reasoning traces?"). Outcome verification is the antidote to rhetorical plausibility: reward the answer, not the performance of getting there.
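A minimal sketch of answer-only grading may make the distinction concrete. It is an illustration under stated assumptions, not LR²Bench's actual scorer: the function names, the "last number in the response" extraction rule, and the toy responses are all hypothetical.

```python
import re
from fractions import Fraction

# Assumption: the final answer is the last integer, decimal, or fraction
# that appears in the response. Real benchmarks define their own format.
_NUMBER = re.compile(r"-?\d+(?:/\d+|\.\d+)?")

def extract_final_answer(response):
    """Return the last number-like token in a response, or None."""
    matches = _NUMBER.findall(response)
    return matches[-1] if matches else None

def to_exact(value):
    """Parse to an exact Fraction so '0.5' and '1/2' compare equal."""
    try:
        return Fraction(value)
    except (ValueError, ZeroDivisionError):
        return None

def outcome_accuracy(responses, gold_answers):
    """Score only the final answer; the reasoning text is never read."""
    correct = 0
    for response, gold in zip(responses, gold_answers):
        raw = extract_final_answer(response)
        pred = to_exact(raw) if raw is not None else None
        if pred is not None and pred == to_exact(gold):
            correct += 1
    return correct / len(gold_answers)

# Toy data (invented): a flawed trace with a right answer, a polished
# trace with a wrong answer, and an equivalent-form answer.
responses = [
    "First, 3 + 4 = 8, so the answer is 7",
    "Therefore, by a careful and elegant argument, the answer is 12",
    "The answer is 0.5",
]
gold_answers = ["7", "10", "1/2"]
score = outcome_accuracy(responses, gold_answers)
```

The grader gives the polished but wrong second response no credit, which is the point of scoring outcomes. It also passes the first response despite its faulty step, so answer-only scoring says nothing about whether the trace was valid.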

But even outcome scores can lie if the outcomes leaked. RLVR's apparent gains on math collapse once you control for contamination: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench, meaning the 'reasoning improvement' was memorization wearing a results-shaped costume. On clean benchmarks, only correct rewards improve performance, while random and inverse rewards fail or degrade it (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). This isn't unique to math. A similar pattern shows up in theory-of-mind tests, where supervised fine-tuning matches reinforcement learning because distribution biases and templated artifacts let pattern-matching ace the benchmark without any mental-state reasoning (see "Can language models solve ToM benchmarks without real reasoning?"). The disease is general; math is just where it's most measurable.
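The partial-prompt check can be sketched roughly as follows. This is a hedged illustration rather than the cited study's protocol: the 60% prefix fraction, the exact-match criterion, and the `complete` callable (standing in for a model's continuation function) are assumptions introduced here.

```python
def reconstruction_rate(problems, complete, prefix_frac=0.6):
    """Fraction of problems whose withheld tail the model reproduces exactly.

    `complete` is a hypothetical callable mapping a text prefix to the
    model's continuation. Comparison is word-level exact match over the
    length of the withheld tail.
    """
    hits = 0
    eligible = 0
    for text in problems:
        words = text.split()
        cut = max(1, int(len(words) * prefix_frac))
        prefix_words, tail_words = words[:cut], words[cut:]
        if not tail_words:
            continue  # nothing withheld, so nothing to reconstruct
        eligible += 1
        continuation = complete(" ".join(prefix_words)).split()
        if continuation[:len(tail_words)] == tail_words:
            hits += 1
    return hits / eligible if eligible else 0.0

# Toy stand-in for a model that has memorized one benchmark item.
problems = [
    "Find the sum of the first ten positive integers and report it",
    "Compute the area of a circle with radius three in exact form",
]
memorized = {problems[0]}

def fake_complete(prefix):
    for item in memorized:
        if item.startswith(prefix):
            return item[len(prefix):].strip()
    return "I am not sure"

rate = reconstruction_rate(problems, fake_complete)
```

A high rate on a pre-release benchmark paired with a near-zero rate on a post-release one is the signature the paragraph describes. A model that had never seen the test items would have little reason to complete them verbatim.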

The encouraging counterweight is that when the signal is honest, it's powerful: a single training example in RLVR can lift math performance from 36% to 73.6%, and test accuracy keeps improving for 1,400 steps after training accuracy reaches 100%, which only makes sense if the benchmark is reading out a latent capability rather than rewarding surface form (see "Can a single training example unlock mathematical reasoning?"). So the answer isn't that benchmarks *can't* measure outcomes. It's that they measure rhetorical plausibility by default and outcomes only under discipline: verify solutions not traces, decontaminate the test set, and stop trusting numerical scores that can't tell you *why* a model failed (see "Can natural language feedback overcome numerical reward plateaus?").
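For the decontamination step, one common approach (not described in the source notes, so treat it as an assumption) is to drop test items that share a long word n-gram with any training document. The sketch below uses hypothetical function names and an illustrative window of 8 words. Real pipelines tune the window and normalize text more carefully.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(test_items, training_docs, n=8):
    """Split test items into (clean, flagged) by n-gram overlap with training text."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    clean, flagged = [], []
    for item in test_items:
        if word_ngrams(item, n) & train_grams:
            flagged.append(item)
        else:
            clean.append(item)
    return clean, flagged

# Toy data (invented): one test item also appears inside a training document.
training_docs = [
    "forum post: what is the remainder when 2 to the power 100 "
    "is divided by 7 answer below",
]
test_items = [
    "What is the remainder when 2 to the power 100 is divided by 7",
    "How many lattice points lie strictly inside the triangle "
    "with vertices at the origin",
]
clean, flagged = decontaminate(test_items, training_docs)
```

Items shorter than the window are never flagged, and paraphrased leaks slip through, so a filter like this complements a post-release benchmark rather than replacing one.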

Sources 7 notes

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
