Can benchmark improvements hide degradation of deliberative reasoning?

This explores whether a model's score going up on a benchmark can mask a real loss in its underlying step-by-step reasoning — the corpus suggests benchmark gains and reasoning quality are measured at different levels and can move in opposite directions.

The corpus says yes, and the cleanest case is that benchmark improvement and genuine reasoning are *separable phenomena*. One study shows that reinforcement learning with verifiable rewards (RLVR) can activate authentic reasoning patterns while the benchmark number climbs for an entirely different reason: memorization of contaminated test data (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). The score and the skill live at different measurement levels, so a rising score is not proof that the reasoning got better; the two may not even be measuring the same thing.

The deeper worry is that fluent-looking reasoning can be hollow. Chain-of-thought traces degrade predictably once you step outside the training distribution, producing text that *imitates the form* of reasoning while the underlying logic is invalid (see "Does chain-of-thought reasoning actually generalize beyond training data?"). A benchmark that samples only in-distribution problems will reward this fluent imitation and never reveal the rot underneath. The mismatch also runs the other way: some apparent reasoning failures are not failures of reasoning at all but of procedural execution. Models that 'collapse' on hard problems often know the algorithm and simply cannot carry it out at scale in text alone, a bandwidth limit that masquerades as a reasoning cliff (see "Are reasoning model collapses really failures of reasoning?").

There's also a counterintuitive trap: more deliberation can make things worse where you would expect it to help. Accuracy peaks and then *declines* as thinking tokens grow. In one study, raising the thinking budget from about 1,100 to about 16K tokens cut benchmark accuracy from 87.3% to 70.3%, with models overthinking easy problems and underthinking hard ones (see "Does more thinking time always improve reasoning accuracy?"). Optimal chain length follows an inverted U, and more capable models do best with shorter chains (see "Why does chain of thought accuracy eventually decline with length?"). So a system that looks like it is reasoning harder may be reasoning worse. Reasoning models also wander into invalid exploration and abandon promising paths prematurely, a structural disorganization that compute alone doesn't fix (see "Why do reasoning models abandon promising solution paths?").
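One way to make the non-monotonic pattern concrete is to sweep the thinking budget and check where accuracy peaks. The sketch below is only an illustration: the function name `peak_and_drop` is hypothetical, and the only real data points used are the two endpoints quoted in the source note (about 1,100 tokens at 87.3% and about 16K tokens at 70.3%). A genuine inverted-U check would need measurements at intermediate budgets, which the notes do not provide.

```python
from typing import Sequence, Tuple


def peak_and_drop(budgets: Sequence[int],
                  accuracies: Sequence[float]) -> Tuple[int, float]:
    """Return (budget at peak accuracy, accuracy lost from the peak to the largest budget).

    A drop greater than zero means accuracy is NOT monotonically increasing
    in the thinking budget, i.e. more thinking eventually hurt.
    """
    if not budgets or len(budgets) != len(accuracies):
        raise ValueError("budgets and accuracies must be non-empty and equal length")
    # Sort by budget so the last pair is the largest budget tested.
    pairs = sorted(zip(budgets, accuracies))
    # max() keeps the first maximum, so ties resolve to the smaller budget.
    peak_budget, peak_acc = max(pairs, key=lambda pair: pair[1])
    final_acc = pairs[-1][1]
    return peak_budget, peak_acc - final_acc


# The two endpoints reported in the source note (percent accuracy).
budget, drop = peak_and_drop([1_100, 16_000], [87.3, 70.3])
print(f"peak at ~{budget} tokens, {drop:.1f} points lost at the largest budget")
```

Reporting the peak budget alongside the headline score is one way to avoid reading "thinks longer" as "reasons better".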

The sharpest blind spot is that some degradations are *uncorrelated with the metrics we usually trust.* Reasoning accuracy drops from 92% to 68% with just 3,000 tokens of irrelevant padding. The effect appears far below the context limit, holds across tasks, and is uncorrelated with language-modeling performance (see reasoning-performance-degrades-with-input-length-even-far-below-context-length). A model can ace a benchmark of short, clean problems and quietly fall apart on the longer, messier inputs of real use, and no standard score would warn you.
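A padding check of this kind can be sketched as a small harness that scores the same items twice, once clean and once behind irrelevant filler. Everything here is an assumption made for illustration: `model` is any callable that maps a prompt string to an answer string, `padding_gap` and `accuracy` are hypothetical helper names, exact string match stands in for a real grader, and whitespace-separated words are used as a crude proxy for tokens.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, str]  # (question, expected answer)


def accuracy(model: Callable[[str], str], items: List[Item], prefix: str = "") -> float:
    """Fraction of items answered exactly right when `prefix` precedes each question."""
    correct = sum(model(prefix + question).strip() == answer
                  for question, answer in items)
    return correct / len(items)


def padding_gap(model: Callable[[str], str], items: List[Item],
                filler_sentence: str, n_words: int = 3000) -> Tuple[float, float]:
    """Return (clean accuracy, padded accuracy) for the same items.

    The padding repeats `filler_sentence` until it reaches `n_words` words.
    Word count is only an approximation of token count.
    """
    words = filler_sentence.split()
    repeats = n_words // len(words) + 1
    padding = " ".join((words * repeats)[:n_words]) + "\n\n"
    return accuracy(model, items), accuracy(model, items, padding)


# Toy stand-in for a model: it only answers correctly on short prompts.
def toy_model(prompt: str) -> str:
    return "4" if len(prompt) < 100 else "?"


clean, padded = padding_gap(toy_model, [("2+2=", "4")], "The sky was grey that day.")
print(clean, padded)
```

A large gap between the two numbers is the kind of degradation that a benchmark of short, clean problems would never surface.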

What ties this together: the same mechanism (extended thinking) can be either helpful or harmful depending on training, and the headline number does not tell you which. Vanilla models use thinking mode in a way that induces self-doubt, while RL training turns the same mechanism into productive gap analysis, so training shapes the quality of thinking and not just its quantity (see "Does extended thinking help or hurt model reasoning?"). If you want to detect hidden degradation rather than be fooled by it, the corpus points to diagnostics that read the *process*, not the score: confidence variance and overconfidence can serve as signals that distinguish overthinking from underthinking (see "Can confidence patterns reveal overthinking versus underthinking?"). The takeaway: a benchmark measures whether the answer is right, but deliberative reasoning is a property of *how* the answer was reached, and those two can drift apart silently.
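A process-level diagnostic along these lines might look like the sketch below. It is not ReBalance's actual method, which the notes describe only at a high level and which applies steering vectors rather than labels. The mapping used here (high variance in per-step confidence suggests overthinking, sustained high confidence suggests underthinking) is an assumption read off the source note, and the function name and both thresholds are hypothetical values chosen for illustration.

```python
from statistics import mean, pvariance
from typing import Sequence


def classify_trace(step_confidences: Sequence[float],
                   var_threshold: float = 0.04,
                   overconf_threshold: float = 0.9) -> str:
    """Label a reasoning trace from per-step confidence scores in [0, 1].

    Assumed heuristic (thresholds are illustrative, not tuned):
      - confidence that swings widely between steps -> possible overthinking
      - confidence that is uniformly very high      -> possible underthinking
    """
    if len(step_confidences) < 2:
        raise ValueError("need at least two steps to measure variance")
    variance = pvariance(step_confidences)
    average = mean(step_confidences)
    if variance > var_threshold:
        return "possible overthinking"
    if average > overconf_threshold:
        return "possible underthinking"
    return "no flag"


print(classify_trace([0.9, 0.2, 0.8, 0.3]))   # oscillating confidence
print(classify_trace([0.97, 0.98, 0.99]))     # uniformly overconfident
```

The point of the sketch is the shape of the signal: it is computed from the trace itself, so it can change even when the final-answer accuracy does not.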

Sources 9 notes

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
