Can we measure how deeply a model actually reasons?

What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?

Note · 2026-04-20 · sourced from Cognitive Models Latent

Token count is an unreliable proxy for reasoning quality. Longer reasoning does not consistently correlate with accuracy and may signal overthinking that degrades performance. Confidence-based metrics fare no better. The question is: how do you measure whether a model is actually thinking rather than merely generating?

Deep-thinking ratio (DTR) operationalizes this by looking inside the model. At each token position, intermediate-layer hidden states are projected into the vocabulary space and compared to the final-layer prediction distribution. Tokens whose predictions stabilize early — where shallow layers already predict the same thing as deep layers — reflect low computational effort. Tokens whose predictions undergo sustained revision through deeper layers before converging are "deep-thinking tokens" — the model is genuinely computing something at that position.
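
To make the mechanism concrete, here is a minimal logit-lens sketch in PyTorch. It assumes a Llama-style Hugging Face model (so `model.model.norm` and `model.lm_head` exist) and uses argmax agreement with the final layer as the stabilization criterion; the paper's exact criterion may differ.

```python
import torch

@torch.no_grad()
def stabilization_depths(model, tokenizer, text):
    """Per-token depth at which the logit-lens prediction stabilizes.

    Assumes a Llama-style HF model exposing `model.model.norm` and
    `model.lm_head`; the criterion (argmax agreement with the final
    layer, never broken afterwards) is one plausible choice.
    """
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    # hidden_states: (num_layers + 1) tensors of shape [1, seq, d_model];
    # drop the embedding layer and stack the rest.
    hs = torch.stack(out.hidden_states[1:])          # [L, 1, seq, d]
    # Logit lens: read every layer through the final norm + unembedding.
    logits = model.lm_head(model.model.norm(hs))     # [L, 1, seq, vocab]
    preds = logits.argmax(dim=-1).squeeze(1)         # [L, seq]
    agree = (preds == preds[-1]).long()              # matches final layer?
    # Length of the trailing run of agreement at each position: flip over
    # layers, cumprod zeroes out once agreement breaks, sum counts the run.
    trailing = agree.flip(0).cumprod(0).sum(0)       # [seq]
    return agree.size(0) - trailing                  # 0 = stable from layer 1
```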

DTR is the proportion of deep-thinking tokens in a generated sequence. Across AIME 24/25, HMMT 25, and GPQA-diamond with GPT-OSS, DeepSeek-R1, and Qwen3, DTR exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines.
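
Aggregating the per-token depths to a sequence-level score is then a one-liner; the depth cutoff below (half the layer stack) is an illustrative assumption, not the paper's calibrated threshold:

```python
import torch

def deep_thinking_ratio(depths: torch.Tensor, num_layers: int,
                        frac: float = 0.5) -> float:
    """DTR = fraction of tokens whose prediction is still being revised
    past `frac * num_layers` layers (`frac` is an assumed placeholder)."""
    return (depths.float() > frac * num_layers).float().mean().item()
```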

The practical application is Think@n: a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. Rather than standard self-consistency (generate n samples, majority vote), Think@n selects samples where the model was genuinely reasoning rather than pattern-matching. Think@n matches or exceeds self-consistency performance while significantly reducing inference costs by enabling early rejection of unpromising generations based on short prefixes.
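
A hedged sketch of the selection step: given n sampled traces with extracted answers and DTR scores, keep the high-DTR subset and majority-vote among the survivors. The keep fraction is an assumption, and the paper's exact procedure (top-k selection vs. weighting, and the prefix length used for early rejection) may differ.

```python
from collections import Counter

def think_at_n(answers: list[str], dtrs: list[float],
               keep: float = 0.5) -> str:
    """Majority vote restricted to the highest-DTR traces.

    `answers[i]` is the final answer parsed from trace i and `dtrs[i]`
    its deep-thinking ratio; `keep`, the retained fraction, is an
    assumption for illustration.
    """
    ranked = sorted(zip(dtrs, answers), key=lambda p: p[0], reverse=True)
    top = ranked[: max(1, int(len(ranked) * keep))]
    return Counter(ans for _, ans in top).most_common(1)[0][0]
```

In practice the DTR can be scored on a short prefix of each generation and low-scoring traces aborted before completion, which is where the inference savings over plain self-consistency come from.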

DTR complements several existing measurement approaches. Per "Do reflection tokens carry more information about correct answers?", mutual-information peaks identify which tokens matter, while DTR identifies how deeply the model computes at each token: orthogonal measurements of the same underlying phenomenon. Per "Does chain-of-thought reasoning reflect genuine thinking or performance?", DTR provides a token-level mechanistic explanation for that sequence-level observation: performative reasoning should show low DTR (early layer stabilization), while genuine reasoning should show high DTR (sustained deep revision).

The deeper implication aligns with "Where does LLM reasoning actually happen during generation?": DTR measures what is happening in latent-state dynamics (H1), not in the surface trace (H2). Two traces with identical token counts and superficially similar text could have radically different DTR, one genuinely reasoning, the other pattern-matching. This is exactly the kind of metric the H1 framework calls for.

The shift from "how long they think" to "how deeply they think" reframes efficiency: the goal is not shorter chains but denser computation per token.


Source: Cognitive Models Latent · Paper: "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens"


deep-thinking ratio measures genuine reasoning effort by tracking layer-wise prediction stabilization — outperforming length and confidence as accuracy predictors