Does deep-thinking ratio measure computational effort better than chain-of-thought length?
This explores whether the 'deep-thinking ratio' (the share of a model's compute spent in genuine adaptive reasoning) is a more honest measure of computational effort than simply counting chain-of-thought tokens — and the corpus suggests CoT length is a famously leaky proxy for effort.
The corpus doesn't use the exact phrase 'deep-thinking ratio,' but it lands hard on the underlying premise: CoT length is a surprisingly bad proxy for effort. The cleanest evidence comes from controlled maze experiments showing that trace length tracks how close a problem is to the training distribution, not how hard the problem actually is (see 'Does longer reasoning actually mean harder problems?'). In-distribution, longer traces do go with harder problems; out-of-distribution the correlation collapses entirely. So a long chain often signals 'I'm recalling a familiar schema,' not 'I'm computing harder.'
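The in-distribution versus out-of-distribution contrast can be checked with a per-split correlation. The sketch below is only an illustration: the record layout `(split, difficulty, trace_length)`, the function names, and the numbers are all made up for the example and are not taken from the maze experiments.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

def length_difficulty_correlation(records):
    """Correlate difficulty with trace length separately for each split.

    `records` is a hypothetical list of (split, difficulty, trace_length) tuples.
    """
    by_split = {}
    for split, difficulty, length in records:
        diffs, lengths = by_split.setdefault(split, ([], []))
        diffs.append(difficulty)
        lengths.append(length)
    return {s: pearson(d, l) for s, (d, l) in by_split.items()}

# Synthetic numbers, chosen only to illustrate the pattern described above.
records = [
    ("in_dist", 1, 10), ("in_dist", 2, 20), ("in_dist", 3, 30), ("in_dist", 4, 40),
    ("ood", 1, 30), ("ood", 2, 10), ("ood", 3, 40), ("ood", 4, 20),
]
result = length_difficulty_correlation(records)
print(result)
```

If length really tracked difficulty, both splits would show a strong positive correlation. A high value in-distribution next to a near-zero value out-of-distribution is the signature the note describes.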
If length measured useful effort, more length would mean more accuracy, but it doesn't. Accuracy follows an inverted-U: it peaks at an intermediate CoT length and declines past it, with the optimum shifting longer for harder tasks and shorter as models get more capable (see 'Why does chain of thought accuracy eventually decline with length?'). Push thinking tokens from ~1,100 to ~16K and benchmark accuracy can fall from 87.3% to 70.3%, because models overthink easy problems and underthink hard ones (see 'Does more thinking time always improve reasoning accuracy?'). Token count and useful computation are clearly not the same quantity.
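One way to make the inverted-U concrete is to sweep CoT length, record accuracy at each setting, and test whether accuracy rises to a single interior peak and then falls. This is a minimal sketch: the helper names are hypothetical, and the sweep values are synthetic placeholders rather than results from any cited experiment.

```python
def peak_length(sweep):
    """Return the CoT length with the highest accuracy.

    `sweep` is a list of (cot_length, accuracy) pairs.
    """
    return max(sweep, key=lambda point: point[1])[0]

def is_inverted_u(sweep):
    """True if accuracy strictly rises to one interior peak, then strictly falls."""
    accs = [acc for _, acc in sorted(sweep)]
    peak = accs.index(max(accs))
    rising = all(a < b for a, b in zip(accs[:peak], accs[1:peak + 1]))
    falling = all(a > b for a, b in zip(accs[peak:], accs[peak + 1:]))
    return 0 < peak < len(accs) - 1 and rising and falling

# Synthetic sweep, for illustration only.
sweep = [(256, 0.60), (1024, 0.80), (4096, 0.85), (16384, 0.70)]
print(peak_length(sweep), is_inverted_u(sweep))

# A monotonically improving sweep is not an inverted-U.
monotone = [(256, 0.60), (1024, 0.70), (4096, 0.80)]
print(is_inverted_u(monotone))
```

Under the inverted-U picture, the peak would be expected to sit at a shorter length for a more capable model and at a longer one for a harder task.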
The deeper point is that *what's happening inside* the tokens matters more than how many there are. The same 'thinking mode' machinery can be counterproductive self-doubt in a vanilla model and productive gap-analysis after RL training: training changes the quality of reasoning, not just its quantity (see 'Does extended thinking help or hurt model reasoning?'). A shift-cipher decomposition splits CoT performance into three independent factors (output probability, memorization, and genuine, error-accumulating reasoning), meaning two equally long chains can be doing wildly different amounts of real work (see 'What three separate factors drive chain-of-thought performance?'). And outside the training distribution, chains stay fluent while the logic underneath quietly breaks (see 'Does chain-of-thought reasoning actually generalize beyond training data?'). Length keeps flowing even as effort stops paying off.
This is exactly why a *ratio* (effort spent on real reasoning versus total output) is conceptually sharper than a raw count. Some of the strongest results decouple the two completely: a 27M-parameter latent-recurrent model solved Sudoku-Extreme and 30×30 mazes with *zero* visible CoT tokens, computing in hidden space, while token-emitting CoT methods scored zero (see 'Can models reason without generating visible thinking steps?'). There, visible chain length measures nothing about the computation. Conversely, non-reasoning models can't close the gap just by being handed more inference budget: the training regime is what makes additional tokens productive in the first place (see 'Can non-reasoning models catch up with more compute?'). And when you do want to spend more compute well, allocating it across diverse abstractions (structured breadth) beats simply extending depth-only chains (see 'Can abstractions guide exploration better than depth alone?').
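As a rough illustration of why a ratio separates cases that a raw count cannot, the sketch below assumes a hypothetical per-token flag marking whether a token carries real computation. How such a flag would be obtained is left open; neither it nor the function name comes from the corpus.

```python
from typing import Optional, Sequence

def deep_thinking_ratio(is_deep: Sequence[bool]) -> Optional[float]:
    """Share of visible tokens flagged as carrying real computation.

    `is_deep` is a hypothetical per-token label. With no visible tokens the
    ratio is undefined, so None is returned rather than a misleading zero.
    """
    if len(is_deep) == 0:
        return None
    return sum(is_deep) / len(is_deep)

# Two chains of identical length but very different content (made-up flags).
padded = [True] * 2 + [False] * 8   # mostly recall or self-doubt
dense = [True] * 8 + [False] * 2    # mostly real reasoning steps
print(len(padded) == len(dense))
print(deep_thinking_ratio(padded), deep_thinking_ratio(dense))

# A latent-reasoning model that emits no visible chain at all.
print(deep_thinking_ratio([]))
```

Token count treats the two ten-token chains as identical, while the ratio does not. The empty case returns `None` because any measure defined over visible tokens has nothing to say about computation done in hidden space.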
The takeaway you didn't know you wanted: 'thinking longer' and 'thinking harder' are different axes, and the field is steadily moving from counting tokens toward measuring what fraction of them carry real computation — because length can be inflated by familiarity, padded by self-doubt, or bypassed entirely by hidden reasoning.
Sources (9 notes)
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.