Can we measure how deeply a model actually reasons?
What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
Token count is an unreliable proxy for reasoning quality. Longer reasoning does not consistently correlate with accuracy and may signal overthinking that degrades performance. Confidence-based metrics fare no better. The question is: how do you measure whether a model is actually thinking rather than merely generating?
Deep-thinking ratio (DTR) operationalizes this by looking inside the model. At each token position, intermediate-layer hidden states are projected into the vocabulary space and compared to the final-layer prediction distribution. Tokens whose predictions stabilize early — where shallow layers already predict the same thing as deep layers — reflect low computational effort. Tokens whose predictions undergo sustained revision through deeper layers before converging are "deep-thinking tokens" — the model is genuinely computing something at that position.
DTR is the proportion of deep-thinking tokens in a generated sequence. Across AIME 24/25, HMMT 25, and GPQA-diamond with GPT-OSS, DeepSeek-R1, and Qwen3, DTR exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines.
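To make the mechanism concrete, here is a minimal sketch of how a DTR-style score could be computed with a Hugging Face causal LM by projecting each layer's hidden states through the output head (a logit-lens-style readout). The function name, the depth threshold `tau`, and the convergence rule (the earliest layer whose top-1 prediction matches the final layer's and never changes again) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def deep_thinking_ratio(model, tokenizer, text, tau=0.5):
    """Sketch: fraction of positions whose next-token prediction is still being
    revised in the deeper part of the layer stack (illustrative criterion)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states                    # (num_layers + 1) tensors of [1, T, d]
    lm_head = model.get_output_embeddings()       # maps hidden states to vocab logits
    # Top-1 prediction from every layer at every position. A faithful logit lens
    # would apply the model's final layer norm first; skipped here for brevity.
    tops = torch.stack([lm_head(h).argmax(dim=-1)[0] for h in hidden])  # [L+1, T]
    final_pred = tops[-1]
    num_layers = tops.shape[0] - 1
    deep_tokens = 0
    for t in range(tops.shape[1]):
        agree = tops[:, t] == final_pred[t]
        # Earliest layer from which the prediction agrees with the final layer
        # and stays unchanged through the rest of the stack.
        stable_from = num_layers
        for layer in range(num_layers, -1, -1):
            if agree[layer]:
                stable_from = layer
            else:
                break
        if stable_from / num_layers >= tau:       # converged late: a deep-thinking token
            deep_tokens += 1
    return deep_tokens / tops.shape[1]

# Example usage (model name is just an illustration):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# print(deep_thinking_ratio(lm, tok, "Therefore the answer is 42."))
```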
The practical application is Think@n: a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. Rather than standard self-consistency (generate n samples, majority vote), Think@n selects samples where the model was genuinely reasoning rather than pattern-matching. Think@n matches or exceeds self-consistency performance while significantly reducing inference costs by enabling early rejection of unpromising generations based on short prefixes.
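Here is a sketch of how a Think@n-style selection loop could sit on top of a DTR scorer. The two-phase generation, the prefix length, the rejection threshold, and the majority vote over survivors are all assumptions about one way to realize the early-rejection idea, not the paper's exact procedure; `generate_prefix`, `continue_generation`, `score_dtr`, and `extract_answer` are hypothetical callables.

```python
from collections import Counter

def think_at_n(generate_prefix, continue_generation, score_dtr, extract_answer,
               prompt, n=8, prefix_tokens=128, min_prefix_dtr=0.3):
    """Sketch: sample n candidates, reject those whose short prefix shows little
    deep thinking, finish only the survivors, then majority-vote their answers."""
    answers = []
    for _ in range(n):
        prefix = generate_prefix(prompt, max_new_tokens=prefix_tokens)
        if score_dtr(prefix) < min_prefix_dtr:
            continue                      # early rejection: skip the cost of finishing
        full_trace = continue_generation(prompt, prefix)
        answers.append(extract_answer(full_trace))
    if not answers:
        return None                       # could fall back to plain self-consistency
    return Counter(answers).most_common(1)[0][0]
```

A variant could also weight the final vote by each surviving sample's full-sequence DTR rather than voting uniformly.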
DTR complements several existing measurement approaches. In relation to "Do reflection tokens carry more information about correct answers?": MI peaks identify which tokens matter, while DTR identifies how deeply the model computes at each token; the two are orthogonal measurements of the same underlying phenomenon. In relation to "Does chain-of-thought reasoning reflect genuine thinking or performance?": DTR provides a token-level mechanistic explanation for that sequence-level observation, since performative reasoning should show low DTR (early layer stabilization) while genuine reasoning should show high DTR (deep revision).
The deeper implication aligns with "Where does LLM reasoning actually happen during generation?": DTR measures what is happening in latent-state dynamics (H1), not in the surface trace (H2). Two sequences with identical token counts and near-identical surface text could have radically different DTR: one genuinely reasoning, the other pattern-matching. This is exactly the kind of metric the H1 framework calls for.
The shift from "how long they think" to "how deeply they think" reframes efficiency: the goal is not shorter chains but denser computation per token.
Source: Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
Related concepts in this collection
- Do reflection tokens carry more information about correct answers?
  Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
  Relation: MI peaks identify which tokens matter; DTR identifies how deeply the model computes; the two are orthogonal.
- Does chain-of-thought reasoning reflect genuine thinking or performance?
  When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
  Relation: DTR provides a token-level mechanism for the sequence-level Reasoning Theater finding.
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  Relation: DTR explains why: tokens past the threshold have low DTR (filler, not thinking).
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  Relation: DTR provides a measurement tool to test the myth directly.
- Where does LLM reasoning actually happen during generation?
  Does multi-step reasoning emerge from visible chain-of-thought text, hidden layer dynamics, or simply more computation? Three competing hypotheses make different predictions and can be empirically tested.
  Relation: DTR is an H1-native metric: it measures latent dynamics, not surface form.
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Relation: DTR explains the mechanism: correct traces are shorter because they contain higher-DTR tokens (genuine computation) with less low-DTR filler.
- Which tokens in reasoning chains actually matter most?
  Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
  Relation: Complementary token-level measurement: greedy pruning identifies causal importance, DTR identifies computational depth; both reveal that tokens are not created equal.
- Can reasoning steps be dynamically pruned without losing accuracy?
  This explores whether chain-of-thought reasoning contains redundant steps that can be identified and removed during inference. Understanding which steps matter could improve efficiency while maintaining correctness.
  Relation: DTR could operationalize the redundancy measurement: redundant steps should show low DTR (early layer stabilization).
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  Relation: Layer separation provides architectural grounding: deep-thinking tokens are those where higher reasoning layers actively revise lower-layer predictions.
Original note title: deep-thinking ratio measures genuine reasoning effort by tracking layer-wise prediction stabilization — outperforming length and confidence as accuracy predictors