Can models recognize question difficulty before they reason?
Do reasoning language models encode implicit knowledge of problem difficulty in their hidden states, even before generating solution steps? And if so, why don't they act on this knowledge?
S1-Bench's probing analysis indicates that difficulty information is already present in LRM representations. A single-layer MLP trained on the final-layer hidden state of the last token in an encoded question predicts difficulty with monotonically increasing accuracy across difficulty levels. The structure is implicit but linearly decodable: the model itself needs no extra training, and the probe needs no specialized architecture or auxiliary signal. The model knows.
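To make the probing setup concrete, here is a minimal sketch of a linear probe. It is not S1-Bench's code: the "hidden states" are synthetic vectors in which one random direction carries a binary easy/hard signal, and the names `make_synthetic_states`, `train_linear_probe`, and `probe_accuracy` are hypothetical. A real probe would read the final-layer state of the last question token and would likely cover more than two difficulty levels.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic_states(n, dim, rng):
    # ASSUMPTION: synthetic stand-in for hidden states. One fixed random
    # direction carries the easy/hard signal; the rest is Gaussian noise.
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 0 = easy, 1 = hard
        sign = 1.0 if label else -1.0
        data.append(([rng.gauss(0, 1) + sign * d for d in direction], label))
    return data

def score(w, b, x):
    # Linear readout: a dot product plus a bias, nothing else.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_probe(data, epochs=20, lr=0.1):
    # Logistic regression fitted with plain SGD. Only the probe's weights
    # are updated; the model that produced the states is never touched.
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(score(w, b, x)) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def probe_accuracy(w, b, data):
    hits = sum((score(w, b, x) > 0) == bool(y) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
data = make_synthetic_states(300, 16, rng)
train, held_out = data[:200], data[200:]
w, b = train_linear_probe(train)
print(f"held-out probe accuracy: {probe_accuracy(w, b, held_out):.2f}")
```

High held-out accuracy here only shows that the synthetic signal is linearly separable by construction. With real hidden states, the same readout is what would count as evidence that difficulty is linearly decodable.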
The behavioral result contradicts this internal knowledge. On simple questions that the linear probe correctly classifies as easy, LRMs still produce redundant solution rounds, repeatedly reverify already-correct answers, and emit higher average token entropy than necessary. The hidden-state signal that says "this is easy" is overridden during generation by exploratory behavior that says "let me check again."
The authors' interpretation, and the most plausible mechanism, is that models exhibit self-doubt about their own early difficulty judgments. The model perceives that the question is simple, then second-guesses that perception, then engages in exploratory generation to compensate for the imagined possibility that its initial assessment was wrong. On this reading the failure mode is structural: the architecture lacks a mechanism to commit to an early difficulty assessment and act on it.
The deeper insight is that LRM overthinking is not a perception failure (the model fails to recognize a simple question) but an action failure (the model recognizes the question is simple but cannot translate that recognition into terminating behavior). This distinction matters for fixes: prompt-engineering for "shorter answers on easy questions" treats it as a perception problem and produces brittle results. Mechanistic fixes that route generation through the difficulty representation — for example, conditioning continued-thinking decisions on the probe output — treat it as the action problem it appears to be.
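One way to picture "conditioning continued-thinking decisions on the probe output" is a gate that caps the number of solution rounds when the probe is confident the question is easy. This is a hypothetical sketch, not a method from S1-Bench: the function names, the 0.9 threshold, and the round limits are all illustrative assumptions, and the loop body stands in for real generation.

```python
def max_rounds_for(p_easy, easy_threshold=0.9, easy_rounds=1, default_rounds=8):
    # ASSUMPTION: p_easy is the probe's probability that the question is easy.
    # Threshold and round limits are illustrative values, not tuned ones.
    return easy_rounds if p_easy >= easy_threshold else default_rounds

def should_continue_thinking(p_easy, rounds_done):
    # The decision to keep thinking depends on the probe output,
    # not only on what the model has generated so far.
    return rounds_done < max_rounds_for(p_easy)

def run_rounds(p_easy):
    rounds = 0
    while should_continue_thinking(p_easy, rounds):
        rounds += 1  # placeholder for one solution or verification round
    return rounds

print(run_rounds(0.95))  # confident "easy": stops after 1 round
print(run_rounds(0.40))  # not confidently easy: uses the default limit of 8
```

The design choice worth noting is that the gate commits to the early assessment: once the probe clears the threshold, later exploratory impulses cannot extend the budget.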
The methodology generalizes. A linear probe on a hidden state is a cheap diagnostic for any property the model is suspected to track implicitly. If the probe succeeds and the behavior contradicts it, the gap localizes the failure to the perception-to-action interface — not to representation, not to capacity.
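The diagnostic can be reduced to one number: among questions the probe correctly marks as easy, how often does the behavior still look like overthinking? The sketch below is an assumption-laden illustration. The `Record` fields, the `perception_action_gap` name, the token-count proxy for overthinking, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Record:
    probe_easy: bool   # probe predicts "easy"
    truly_easy: bool   # ground-truth difficulty label
    tokens: int        # length of the generated response

def perception_action_gap(records, easy_token_budget):
    # Keep only questions where perception succeeded (probe is right about "easy").
    perceived = [r for r in records if r.probe_easy and r.truly_easy]
    if not perceived:
        return None  # the probe never succeeded, so no gap can be measured
    # ASSUMPTION: exceeding a fixed token budget is the proxy for overthinking.
    overthought = sum(r.tokens > easy_token_budget for r in perceived)
    return overthought / len(perceived)

# Made-up records for illustration only.
records = [
    Record(True, True, 900),
    Record(True, True, 1200),
    Record(True, True, 150),
    Record(True, True, 700),
    Record(False, True, 1000),   # perception failure: excluded from the gap
    Record(False, False, 2000),  # hard question: excluded from the gap
]
print(perception_action_gap(records, easy_token_budget=300))  # 0.75
```

A high value with an accurate probe points at the perception-to-action interface. A probe that rarely succeeds would instead point back at representation.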
Related concepts in this collection
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  the broader phenomenon of overthinking that S1-Bench instantiates; the linear-probe finding is the architectural deepening of that picture
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  another action-failure case: models perceive ill-posed-ness but cannot translate perception into rejection
- Does chain-of-thought reasoning reflect genuine thinking or performance?
  When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
  complementary finding from a different angle: easy-question behavior is performative even when the model knows the question is easy
- Do reasoning models actually use the hints they receive?
  This explores whether language models acknowledge reasoning hints in their explanations when those hints causally influence their answers. Understanding this gap matters for evaluating whether chain-of-thought explanations can be trusted for safety monitoring.
  parallel perception-action gap: models perceive hints but do not verbalize their influence
Original note title
problem difficulty is linearly decodable from LRM hidden states before formal reasoning begins — yet models override this signal with exploratory overthinking suggesting architectural self-doubt