Does task difficulty alone determine how many thinking tokens a model should use?
This explores whether the right amount of reasoning a model should spend is set by how hard the task is, or whether other factors — the model's own skill, how familiar the problem looks, how it was trained — matter just as much.
The corpus says no, and fairly emphatically. Difficulty is one input, but it shares the steering wheel with at least three other forces (the model's capability, how familiar the problem looks, and how the model was trained), and one paper argues it isn't even the right variable to be measuring. The cleanest statement of the difficulty effect comes from work showing that optimal chain-of-thought length follows an inverted-U: accuracy peaks at some middle length, and that sweet spot does stretch longer as problems get harder (Why does chain of thought accuracy eventually decline with length?). So difficulty matters. But the same finding adds a twist: the optimal length *shrinks* as the model gets more capable. A stronger model wants fewer tokens on the same problem. Difficulty and capability pull in opposite directions, so you can't read the right token budget off difficulty alone.
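To make the two opposing pulls concrete, here is a minimal toy model in Python. It is an illustrative assumption, not the cited paper's formulation: the function names, the Gaussian-style per-step success curve, and the `step_noise` value are all hypothetical. The sketch splits a task of fixed `difficulty` into `n_steps` equal chunks; each step succeeds less often when it carries more work relative to `capability`, and every extra step adds a small fixed chance of error.

```python
import math


def toy_accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy end-to-end accuracy for a chain of n_steps (illustrative assumption)."""
    # Work each step must absorb when the task is split evenly.
    work_per_step = difficulty / n_steps
    # A step fails more often as its workload grows relative to capability,
    # and every step also carries a fixed chance of a slip (step_noise).
    p_step = (1.0 - step_noise) * math.exp(-((work_per_step / capability) ** 2))
    # All steps must succeed for the final answer to be right.
    return p_step ** n_steps


def best_length(difficulty, capability, max_steps=200):
    """Chain length in 1..max_steps that maximizes toy_accuracy."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: toy_accuracy(n, difficulty, capability),
    )


baseline = best_length(difficulty=10, capability=5)
harder_task = best_length(difficulty=20, capability=5)
stronger_model = best_length(difficulty=10, capability=10)
print(baseline, harder_task, stronger_model)
```

Under these assumptions, too few steps overload each step and too many steps accumulate slips, so accuracy peaks in between. Doubling `difficulty` pushes the peak to a longer chain, while doubling `capability` pulls it to a shorter one, which is the qualitative pattern described above.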
The more unsettling result is that thinking length often doesn't track difficulty at all. It tracks how close the problem sits to what the model was trained on. In controlled maze experiments, trace length correlated with difficulty only for in-distribution problems and decoupled completely once the problems drifted out of distribution; the length was really reflecting recall of familiar training schemas, not adaptive effort (Does longer reasoning actually mean harder problems?). A companion finding reframes "difficulty" itself: models don't break down at some complexity threshold, they break down at *unfamiliarity*. Instance-level novelty, not task-level complexity, is what predicts failure (Do language models fail at reasoning due to complexity or novelty?). Two problems of identical difficulty can need wildly different handling if one looks like the training data and the other doesn't.
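One way to check for this kind of decoupling on your own traces is to compute the difficulty-to-length correlation separately per split rather than over the pooled data. The sketch below is a hedged illustration: `correlation_by_split` and the record layout are hypothetical, and the numbers are made-up toy values chosen only to show the pattern. They are not data from the cited experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes nonzero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_by_split(records):
    """records: iterable of (split_name, difficulty, trace_length) tuples."""
    groups = {}
    for split, difficulty, length in records:
        difficulties, lengths = groups.setdefault(split, ([], []))
        difficulties.append(difficulty)
        lengths.append(length)
    return {split: pearson(ds, ls) for split, (ds, ls) in groups.items()}


# Made-up toy values for illustration only (not experimental data).
toy_records = [
    ("in_dist", 1, 100), ("in_dist", 2, 210), ("in_dist", 3, 290),
    ("in_dist", 4, 405), ("in_dist", 5, 500),
    ("out_dist", 1, 300), ("out_dist", 2, 120), ("out_dist", 3, 310),
    ("out_dist", 4, 110), ("out_dist", 5, 300),
]
by_split = correlation_by_split(toy_records)
print(by_split)
```

In the toy data the in-distribution correlation is close to 1 and the out-of-distribution correlation is close to 0. A pooled correlation would blur the two, which is why splitting first matters.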
Then there's the simple fact that more thinking can actively hurt. Pushing thinking tokens from ~1,100 up to ~16K dropped benchmark accuracy from 87.3% to 70.3%. Models overthink easy problems and underthink hard ones, so the relationship between budget and accuracy is non-monotonic: a bigger budget does not reliably buy a better answer (Does more thinking time always improve reasoning accuracy?). Quantity is the wrong knob when quality of thinking isn't fixed. One study makes this vivid: untrained models use their thinking budget to spiral into self-doubt, while RL training redirects the *same* mechanism into productive gap analysis. The mechanism didn't change; what the tokens were doing did (Does extended thinking help or hurt model reasoning?).
The direction the field seems to be heading is to stop legislating a budget from difficulty and instead let the model decide per instance. Thinkless trains a single model to route between extended reasoning and a direct answer, learning when each is warranted. Notably, it does this *without* explicit difficulty labels, calibrating itself from outcomes instead (Can models learn when to think versus respond quickly?). That's the tell: if difficulty alone determined the budget, you could label problems by difficulty and set the dial. The fact that self-calibrated routing works without such labels suggests the real signal is something the model senses about a specific instance (familiarity, confidence, whether it's already converging) that a difficulty rating can't capture.
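As a rough illustration of per-instance routing, the sketch below decides between a direct answer and extended thinking from the agreement among a few cheap probe answers. This is a swapped-in heuristic, not Thinkless's method, which learns its routing policy through RL with DeGRPO. The `route` function, the `sample_direct` callable, and both default parameter values are hypothetical.

```python
from collections import Counter
from typing import Callable


def route(
    sample_direct: Callable[[], str],
    n_probes: int = 5,
    agree_threshold: float = 0.8,
) -> str:
    """Return "direct" if cheap probe answers mostly agree, else "think".

    sample_direct is a hypothetical callable that returns one quick,
    no-reasoning answer to the current instance each time it is called.
    """
    answers = [sample_direct() for _ in range(n_probes)]
    # Fraction of probes that gave the single most common answer.
    top_count = Counter(answers).most_common(1)[0][1]
    agreement = top_count / n_probes
    return "direct" if agreement >= agree_threshold else "think"


# Deterministic stand-ins for a model's quick answers (illustration only).
confident_probes = iter(["4", "4", "4", "4", "4"])
scattered_probes = iter(["4", "5", "6", "4", "7"])
confident_mode = route(lambda: next(confident_probes))
scattered_mode = route(lambda: next(scattered_probes))
print(confident_mode, scattered_mode)
```

The routing signal here is a property of the specific instance (how stable the model's quick answers are), not a difficulty label attached to the task. That is the distinction the paragraph above draws, though a learned router would replace this hand-set threshold.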
If you want to follow this somewhere unexpected: a separate line of work suggests the thinking tokens may not need to be visible (or even verbalized) at all, with reasoning scaling in continuous latent space instead (Can models reason without generating visible thinking tokens?). That would make "how many thinking tokens" the wrong unit of measurement entirely.
Sources (7 notes)
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.