
Does longer reasoning actually mean harder problems?

Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.

Note · 2026-02-22 · sourced from Reasoning Critiques

A prevailing assumption: longer reasoning traces indicate more thinking effort, so more complex problems should produce longer traces. Controlled experiments undercut this.

Training transformer models from scratch on derivational traces of the A* search algorithm, a setting where problem complexity is precisely controllable and verifiable, reveals the decoupling: trace length tracks how close a test problem sits to the training distribution, not how much search the problem actually requires.
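
To make the setup concrete, here is a minimal sketch of the kind of trace-generating environment such experiments rely on: an A* run whose derivational trace length is directly verifiable against ground-truth search effort. The function name, grid encoding, and trace format are illustrative assumptions, not the original experimental pipeline.

```python
import heapq

def astar_trace(grid, start, goal):
    """A* over a 4-connected grid of 0 (free) / 1 (wall) cells.

    Returns the derivational trace: one entry per node expansion.
    len(trace) is then a verifiable, ground-truth measure of search
    effort for the instance, which is what lets these experiments
    compare a model's emitted trace length against true difficulty.
    """
    def h(p):  # admissible Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start)]          # (f, g, node)
    best_g = {start: 0}
    trace = []
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue                           # stale heap entry
        trace.append(("expand", node, f))
        if node == goal:
            return trace
        r, c = node
        for nbr in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nbr
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if in_bounds and grid[nr][nc] == 0 and g + 1 < best_g.get(nbr, float("inf")):
                best_g[nbr] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nbr), g + 1, nbr))
    return trace                               # goal unreachable
```

Sweeping grid size or wall density then yields a family of problems whose true difficulty (expansions required) is known exactly, so a trained model's emitted trace length can be checked against len(trace) instance by instance.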

The interpretation: intermediate token sequence length reflects approximate recall from the training distribution, not problem-adaptive computation. When a problem is close to training examples, the model retrieves a matching schema whose length reflects the training data's length distribution for that problem type. When a problem is far from training, the model has no calibrated schema to retrieve — trace length becomes arbitrary.
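
One way to probe this interpretation is to correlate emitted trace length with both ground-truth difficulty and distance to the nearest training example. Below is a minimal sketch using synthetic data wired to mimic the claimed pattern; all names and the data-generating process are assumptions for illustration, not measurements from the source.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins (illustrative assumptions only):
difficulty = rng.integers(5, 200, size=n)     # true search effort, e.g. A* expansions
train_dist = rng.uniform(0.0, 1.0, size=n)    # distance to nearest training example
near = train_dist < 0.5
trace_len = np.where(
    near,
    rng.normal(120, 10, n),    # near: schema length drawn from the training length distribution
    rng.uniform(20, 600, n),   # far: no calibrated schema to retrieve, length is arbitrary
)

rho_diff, _ = spearmanr(trace_len, difficulty)
rho_dist, _ = spearmanr(trace_len, train_dist)
print(f"trace length vs true difficulty:   rho = {rho_diff:+.2f}")  # ~0
print(f"trace length vs training distance: rho = {rho_dist:+.2f}")  # clearly > 0
```

On real traces, the same pair of correlations would separate the retrieval story from the adaptive-computation story: problem-adaptive computation predicts the first correlation is strong, approximate recall predicts the second dominates.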

This challenges the entire anthropomorphic framing of "thinking time." When DeepSeek-R1 or similar models produce long chains, the conventional interpretation is that the problem is hard and the model is "working through it." The A* evidence suggests the length may primarily indicate how close the problem is to the training distribution, not how much genuine computation is occurring.

The practical implication: trace length is not a reliable proxy for problem difficulty. Length-based scaling heuristics ("add more tokens for harder problems") may be calibrating to the wrong signal, as the sketch below illustrates. A related note, "Does more thinking time always improve reasoning accuracy?", supports this: beyond a certain point, more tokens do not reliably help.
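
Such a heuristic often amounts to something like the following (the function name and constants are hypothetical, used only to make the failure mode concrete):

```python
def token_budget(estimated_difficulty: float,
                 base: int = 256, per_unit: int = 64, cap: int = 4096) -> int:
    """Length-based scaling heuristic under critique: spend more
    reasoning tokens on problems estimated to be harder.

    The catch: if estimated_difficulty is inferred from how long the
    model's traces tend to be, the estimate tracks proximity to the
    training distribution instead of hardness. Budget then flows to
    out-of-distribution problems whether or not they are genuinely
    hard, while hard but familiar-looking problems get starved.
    """
    return min(cap, base + int(per_unit * estimated_difficulty))
```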

This also deepens "Does chain-of-thought reasoning reveal genuine inference or pattern matching?": if trace length reflects training distribution proximity, then even the amount of imitation is calibrated to training similarity, not to actual inferential needs.


Source: Reasoning Critiques

Original note title: CoT trace length reflects training distribution proximity, not problem difficulty