How do question difficulty and breadth affect what models learn to reason?

This explores how the difficulty of training problems — and the variety of sources models learn from — shapes what kind of reasoning they actually acquire, not just whether their answers get more accurate.

The corpus's sharpest finding is that difficulty is not a single dial that turns reasoning up or down: different difficulty levels reinforce different internal behaviors. Easy problems reward shortcuts and actively suppress deliberation; hard problems strengthen real reasoning only on the rare occasions the model succeeds; and medium-difficulty problems are the sweet spot where deliberation and answer-finding are reinforced together (see "What reasoning features does each difficulty level reinforce?"). The unsettling corollary: two training runs can show identical accuracy gains while having pushed the model's internals in opposite directions.
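The difficulty split described above can be made concrete with a small sketch. It assumes, purely for illustration, that difficulty is approximated by a problem's empirical pass rate over several sampled attempts; the function name `bucket_by_pass_rate` and both thresholds are hypothetical and are not taken from the notes.

```python
from typing import Dict, List


def bucket_by_pass_rate(
    outcomes: Dict[str, List[bool]],
    easy_above: float = 0.8,  # hypothetical cutoff, not from the notes
    hard_below: float = 0.2,  # hypothetical cutoff, not from the notes
) -> Dict[str, List[str]]:
    """Group problem ids into easy / medium / hard by empirical pass rate."""
    buckets: Dict[str, List[str]] = {"easy": [], "medium": [], "hard": []}
    for problem_id, attempts in outcomes.items():
        if not attempts:
            continue  # no samples means no difficulty estimate
        pass_rate = sum(attempts) / len(attempts)
        if pass_rate > easy_above:
            buckets["easy"].append(problem_id)
        elif pass_rate < hard_below:
            buckets["hard"].append(problem_id)
        else:
            buckets["medium"].append(problem_id)
    return buckets


# Toy outcomes: eight sampled attempts per problem (made-up data).
sampled = {
    "p1": [True] * 8,                                            # always solved
    "p2": [True, False, True, False, False, True, False, True],  # solved half the time
    "p3": [False] * 7 + [True],                                  # rarely solved
}
buckets = bucket_by_pass_rate(sampled)
```

A bucketing like this only measures accuracy-side difficulty. It says nothing about which internal behaviors each bucket reinforces, which is the part the notes argue accuracy alone cannot reveal.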

What actually makes a problem 'hard' for a model turns out not to be complexity in the human sense. Reasoning breaks down at the boundary of *unfamiliarity*, not at some complexity threshold: models fit patterns tied to specific instances rather than learning general algorithms, so a long reasoning chain succeeds if the model trained on similar instances, and a short one fails if the instance is novel (see "Do language models fail at reasoning due to complexity or novelty?"). This reframes 'breadth' as the real lever: the more varied the instances a model has seen, the wider the territory it can reason across. That connects directly to where reasoning ability comes from in the first place. Broad, transferable *procedural* knowledge scattered across diverse pretraining documents drives reasoning generalization, whereas narrow factual recall just memorizes specific documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Breadth of process beats depth of any single fact.

Difficulty also governs how *much* reasoning is worth doing. Optimal chain-of-thought length follows an inverted U: accuracy peaks at an intermediate length that grows with task difficulty but shrinks as the model becomes more capable (see "Why does chain of thought accuracy eventually decline with length?"). Harder questions genuinely warrant more thinking, but models are bad at calibrating this on their own. They can *detect* a question's difficulty, which is linearly decodable from their hidden states before they start reasoning, yet they override that signal and overthink easy problems anyway (see "Can models recognize question difficulty before they reason?"). The failure is one of action, not perception. The same blind spot shows up with ill-posed questions or questions missing a premise: reasoning models churn out elaborate chains instead of recognizing that there is nothing to solve, because training rewarded *producing* reasoning steps and never taught them when to disengage (see "Why do reasoning models overthink ill-posed questions?").
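To make "linearly decodable" concrete, here is a minimal sketch of a linear probe. Everything in it is synthetic and assumed: the hidden states are random vectors with a hypothetical "difficulty direction" planted in them, not activations from a real model, and the probe is plain least squares rather than whatever the cited work used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 400 questions, 32-dimensional hidden states.
n_questions, hidden_dim = 400, 32
difficulty = rng.uniform(0.0, 1.0, size=n_questions)

# Plant a hypothetical difficulty direction in the hidden states, plus noise.
direction = rng.normal(size=hidden_dim)
noise = rng.normal(scale=0.3, size=(n_questions, hidden_dim))
hidden = np.outer(difficulty, direction) + noise

# Hold out the last 100 questions to evaluate the probe.
train, test = slice(0, 300), slice(300, None)


def add_bias(x: np.ndarray) -> np.ndarray:
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


# The probe is a single linear map from hidden state to difficulty.
weights, *_ = np.linalg.lstsq(add_bias(hidden[train]), difficulty[train], rcond=None)
predicted = add_bias(hidden[test]) @ weights

# Correlation between probe output and true difficulty on held-out questions.
probe_corr = float(np.corrcoef(predicted, difficulty[test])[0, 1])
```

A high held-out correlation means the information is present in the representation. It does not show that the model uses it, which is exactly the gap between perception and action described above.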

The deepest worry is that difficulty and breadth can quietly change what 'learning to reason' even means. Supervised fine-tuning can raise benchmark scores while *cutting* the inferential quality of intermediate steps, measured as Information Gain, by 38.9%: models learn to reach correct answers through post-hoc rationalization rather than genuine inference, and standard accuracy metrics never catch it (see "Does supervised fine-tuning improve reasoning or just answers?"). This is why some researchers argue that the texture of the training signal matters as much as the difficulty: training on messy exploration (failed attempts, backtracking, self-correction) teaches more robust reasoning than feeding models clean shortcut solutions (see "Can models learn better by training on messy exploration paths?"). And it matters because depth punishes brittle reasoning brutally: models that 'wander' rather than search systematically see their success probability fall exponentially as problems get deeper, so medium problems stay tractable while deep ones become catastrophically hard (see "Why do reasoning LLMs fail at deeper problem solving?").
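The exponential falloff with depth can be illustrated with a toy model. The assumption here, which is mine and not the cited note's, is that each step of a solution succeeds independently with the same probability, so a problem needing `depth` correct steps succeeds with probability `step_success ** depth`. The function name and the 0.9 figure are hypothetical.

```python
def success_probability(step_success: float, depth: int) -> float:
    """Chance of getting `depth` independent steps right in a row (toy model)."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be a probability in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_success ** depth


# Hypothetical per-step success rate of 0.9 across increasing depths.
curve = {depth: success_probability(0.9, depth) for depth in (5, 10, 20, 40)}
```

Under these assumptions a depth-5 problem succeeds about 59% of the time while a depth-40 problem succeeds under 2% of the time, which matches the qualitative picture of medium problems staying tractable while deep ones collapse.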

Put together, the corpus suggests a counterintuitive picture: much of the reasoning that models appear to 'learn' is already latent in the base model, and training mostly *selects* which reasoning gets surfaced (see "Do base models already contain hidden reasoning ability?"). Difficulty determines which latent behaviors get reinforced versus suppressed; breadth determines how far that reasoning transfers. The takeaway: if you want models that reason rather than rationalize, the difficulty mix and the diversity of process you train on matter more than how many more problems the model gets right.

Sources (10 notes)

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) break not at complexity thresholds but at instance-novelty boundaries. They fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can models recognize question difficulty before they reason?

Linear probes successfully decode difficulty from LRM representations before reasoning begins, yet models still overthink simple questions. This reveals an action-commitment failure rather than a perception failure.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can models learn better by training on messy exploration paths?

Training on messy trajectories (failed attempts, self-correction, and backtracking) teaches more robust reasoning than training only on shortcut solutions. This approach treats o1-style deep reasoning as search internalization rather than solution memorization.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
