INQUIRING LINE

Why do benchmark scores rise while reasoning quality declines?

This explores why a model can score higher on benchmarks while the actual quality of its reasoning gets worse — and the corpus shows this gap comes from at least three distinct mechanisms: contaminated tests, shortcut-rewarding training, and the fact that standard metrics only grade the final answer.


The corpus traces the gap to three separate failure points, and they compound. The first is that the benchmark itself may be measuring memory, not thought. Qwen2.5-Math-7B can reconstruct 54.6% of MATH-500 from partial prompts yet scores 0.0% on the post-release LiveMathBench, so gains attributed to 'reasoning' are partly the model recalling answers it has already seen (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). This doesn't mean training never works: activation of genuine reasoning patterns and benchmark improvement are separable phenomena that can occur side by side, which is exactly why a rising score is ambiguous evidence (Can genuine reasoning activation coexist with contaminated benchmarks?).
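A partial-prompt probe of the kind described above can be sketched in a few lines. This is a minimal illustration, not the method used in the cited note: the `complete` callable is a hypothetical stand-in for a model call, the 60% prefix cut is an arbitrary choice, and whitespace-normalized exact match on the held-back tail is an assumed criterion.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,
) -> float:
    """Fraction of problems whose held-back tail the model reproduces verbatim.

    `complete` is a hypothetical model call: it receives the visible prefix
    and returns the model's continuation.
    """
    hits = 0
    total = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, tail = text[:cut], text[cut:]
        if not tail.strip():
            continue  # nothing held back, so nothing to test
        total += 1
        continuation = complete(prefix)
        # Assumed criterion: whitespace-normalized exact match on the tail.
        if " ".join(continuation.split()).startswith(" ".join(tail.split())):
            hits += 1
    return hits / total if total else 0.0


# Toy stand-in for a contaminated model: it has memorized `seen` only.
seen = ["What is the sum of the first ten positive integers?"]
unseen = ["How many primes lie strictly between 40 and 60?"]


def memorizing_model(prefix: str) -> str:
    for item in seen:
        if item.startswith(prefix):
            return item[len(prefix):]
    return "I am not sure."


print(reconstruction_rate(seen, memorizing_model))    # 1.0
print(reconstruction_rate(unseen, memorizing_model))  # 0.0
```

A large gap between the rate on an older benchmark and the rate on items released after the model's training cutoff is the signal to look for. The absolute numbers depend heavily on the cut point and the match criterion.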

The second mechanism is that the training that lifts scores can actively hollow out reasoning. Supervised fine-tuning raises final-answer accuracy while cutting the information gain of the reasoning steps by 38.9% on average: the model arrives at correct answers through post-hoc rationalization and pattern-matching shortcuts rather than genuine inference, and becomes less auditable in the process (Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). The degradation stays invisible because standard metrics measure only final correctness. Scoring traces instead is not an automatic fix, though: LR²Bench scores final answers against deterministic ground truth and finds a 20% ceiling, which trace-based evaluation would inflate by counting stylistic 'reasoning mimicry' as real capability (Should reasoning benchmarks score final answers or reasoning traces?). And mimicry is cheap: on BIG-Bench Hard, illogical chain-of-thought exemplars matched the performance of valid ones, because the model is learning the *form* of reasoning, not the inference itself (Does logical validity actually drive chain-of-thought gains?).
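The blind spot of final-answer grading is easy to show with a toy grader. The sketch below is illustrative only: `final_answer` and `grade_final_answer` are hypothetical helpers, and taking the last number in the trace as the answer is a simplifying assumption rather than how any particular benchmark extracts answers.

```python
import re
from typing import Optional


def final_answer(trace: str) -> Optional[str]:
    """Return the last number in the trace, treated here as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", trace)
    return numbers[-1] if numbers else None


def grade_final_answer(trace: str, gold: str) -> bool:
    """Score a response on its final answer alone, ignoring every step."""
    return final_answer(trace) == gold


valid_trace = "There are 3 boxes with 4 pens each. 3 * 4 = 12. Answer: 12"
invalid_trace = (
    "There are 3 boxes with 4 pens each. 3 + 4 = 9, and 9 is odd, "
    "so double it. Answer: 12"
)

print(grade_final_answer(valid_trace, "12"))    # True
print(grade_final_answer(invalid_trace, "12"))  # True
```

Both traces earn full credit, so a benchmark built on this grader cannot tell a model that reasons from one that reaches the right number through steps that do not follow.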

The third mechanism is that the knobs we turn to push scores up have non-monotonic effects: past a point, more is not better. Increasing thinking tokens from ~1,100 to ~16K dropped accuracy from 87.3% to 70.3%, because models overthink easy problems and underthink hard ones (Does more thinking time always improve reasoning accuracy?). Optimal chain-of-thought length follows an inverted U, rising with task difficulty and falling with model capability. Tellingly, RL training naturally gravitates toward *shorter* chains as models improve. Simplicity emerges from good reward signals, so a model padding its reasoning to look thorough is often a worse model, not a better one (Why does chain of thought accuracy eventually decline with length?). Errors also snowball step by step regardless of which reasoning framework you use, so the appearance of elaborate deliberation doesn't buy reliability (Does the choice of reasoning framework actually matter for test-time performance?).

What ties this together, and the part you might not expect, is *where* the real reasoning signal lives. Only about 20% of tokens are high-entropy 'forking points' where the model makes a genuine decision; training on just those matches or exceeds full-gradient training (Do high-entropy tokens drive reasoning model improvements?). Most of the visible reasoning trace is filler around a few load-bearing moments, which is why a longer, more impressive-looking trace can coexist with worse decisions at the points that matter. And the fragility is real: reasoning accuracy falls from 92% to 68% with just 3,000 tokens of irrelevant padding, far below the context limit, even with chain-of-thought prompting (reasoning-performance-degrades-with-input-length-even-far-below-context-length-l).
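To make 'forking points' concrete, here is a minimal sketch of how high-entropy positions could be picked out of a sequence. It assumes you already have a next-token probability distribution at each position. The helper names are hypothetical, and selecting the top 20% by rank within one sequence is an assumption for illustration; the cited note does not say how the threshold is set.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(entropies: Sequence[float], fraction: float = 0.2) -> List[bool]:
    """Mark the top `fraction` of positions by entropy as forking tokens."""
    if not entropies:
        return []
    k = max(1, round(len(entropies) * fraction))
    # Rank positions by entropy, highest first; ties are broken by position.
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


# Five toy positions: only the second is a real choice among alternatives.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in distributions]
print(forking_mask(entropies))  # [False, True, False, False, False]
```

In a training loop, a mask like this would gate which positions contribute to the loss, so the update concentrates on the few tokens where the model actually chooses between continuations.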

The through-line: a benchmark score is a single number measuring the final answer, while reasoning quality lives in the steps, the decision points, and robustness to distraction. Optimize hard for the number and you can get all three forms of decay at once (memorized test items, shortcut-trained answers, and bloated traces), each of which lifts the score for reasons that have nothing to do with thinking better. If you want one provocative thread to pull, it's whether 'content-independent' reasoning is even the right target, since humans and LLMs succeed and fail along the same content-sensitivity axis on tasks like Wason tests and natural language inference (Do language models fail reasoning tests that humans pass?).


Sources 12 notes

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Next inquiring lines