Why does target probability matter more than task logical complexity?
This explores why a model's likelihood of producing a given answer (output/target probability) predicts performance better than how logically hard the task itself is — and what that reveals about what LLM 'reasoning' actually rests on.
Target probability, meaning how likely a model is to generate a particular output given its training, turns out to be a stronger lever on accuracy than the logical complexity of the task. The sharpest single piece of evidence comes from a shift-cipher study that decomposed chain-of-thought performance into three independent factors: output probability, memorization, and genuine (noisy) reasoning (see "What three separate factors drive chain-of-thought performance?"). With the task held fixed, varying only how probable the target answer was swung accuracy from 26% to 70%. The logic of the cipher never changed; what moved the needle was whether the answer sat in a high-probability region of the model's output space. That alone reframes the question: the model isn't solving harder or easier logic, it's reaching for more or less likely strings.
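To make "the logic of the cipher never changed" concrete, here is a minimal sketch of a shift cipher in Python. It is not the study's code: the function names, the shift of 13, and the two example targets are illustrative assumptions. The point is only that decoding runs the same per-letter procedure whether the plaintext is a familiar phrase or a scrambled one.

```python
import string

ALPHABET = string.ascii_lowercase


def shift(text: str, k: int) -> str:
    """Shift each lowercase letter by k positions (wrapping around);
    every other character is passed through unchanged."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + k) % 26])
        else:
            out.append(ch)
    return "".join(out)


def encode(plaintext: str, k: int = 13) -> str:
    return shift(plaintext, k)


def decode(ciphertext: str, k: int = 13) -> str:
    return shift(ciphertext, -k)


# Hypothetical targets: same words and letters, so decoding takes the same
# number of identical letter-level steps. Only the word order differs, and
# with it how plausible the string is as ordinary English.
likely_target = "the cat sat on the mat"
unlikely_target = "mat the on sat cat the"

for target in (likely_target, unlikely_target):
    # The round trip succeeds for both; the algorithm has no notion of
    # which plaintext is more probable.
    assert decode(encode(target)) == target
```

A deterministic decoder like this is indifferent to how plausible the answer is, which is what makes a 26% to 70% accuracy swing on a fixed task informative about the model rather than the task.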
A cluster of corpus findings converges on the same point from different angles. Reasoning failures don't cluster at complexity thresholds; they cluster at instance-level unfamiliarity. Models fit instance-based patterns rather than general algorithms, so a long, 'hard' chain succeeds if similar instances were seen in training, and a short, 'easy' one fails if it is novel (see "Do language models fail at reasoning due to complexity or novelty?"). Controlled maze experiments make the mechanism visible: trace length tracks difficulty only in-distribution and decouples entirely out-of-distribution, because length reflects recall of training schemas, not adaptive computation (see "Does longer reasoning actually mean harder problems?"). In both cases the operative variable is proximity to the training distribution (a probability story), not intrinsic task hardness.
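One way to check the decoupling claim on your own traces is to correlate trace length with problem difficulty separately for each data split. The sketch below assumes a hypothetical record layout of `(split, difficulty, trace_length)` tuples and uses plain Pearson correlation; neither is taken from the maze experiments themselves.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.
    Returns NaN when either sequence has zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def length_difficulty_correlation(records):
    """Group (split, difficulty, trace_length) records by split and
    return {split: correlation between difficulty and trace length}."""
    by_split = defaultdict(lambda: ([], []))
    for split, difficulty, trace_length in records:
        difficulties, lengths = by_split[split]
        difficulties.append(difficulty)
        lengths.append(trace_length)
    return {split: pearson(d, l) for split, (d, l) in by_split.items()}
```

If the claim holds for a given model, the in-distribution split should show a clearly positive correlation while the out-of-distribution split sits near zero.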
The most striking corner is that the logical structure of reasoning can be wrong and it barely matters. Invalid chain-of-thought exemplars perform nearly as well as valid ones on BIG-Bench Hard, meaning the model is learning the *form* of reasoning, not genuine inference (see "Does logical validity actually drive chain-of-thought gains?"). The same lesson shows up one level out: instruction tuning on semantically empty or deliberately incorrect instructions performs comparably to tuning on full, correct instructions, and instruction tuning scores 43% against a random baseline's 42.6%, because what transfers is knowledge of the output space, not task understanding (see "Does instruction tuning teach task understanding or output format?"). If logical validity and instruction content can be corrupted without hurting performance, then performance was never resting on logic; it was resting on hitting the right output distribution.
There's a revealing flip side: when logical complexity *does* bite, the model copes by leaning even harder on probability. As tasks get harder, from NLI to syllogisms to Wason selection, content effects intensify, and both humans and models fall back on semantic priors instead of logical form once working capacity is exceeded (see "Do harder reasoning tasks trigger more semantic bias?"). So complexity doesn't engage some separate logic engine; it pushes the system further toward its prior, i.e. toward whatever is probable. That's why probability keeps winning the comparison: it's the thing the model actually does, and difficulty only deepens the dependence.
For a curious reader, the unexpected takeaway is practical: if probability dominates logic, you should be able to improve 'reasoning' without making the model smarter, by reshaping what it's likely to output. The corpus bears this out. Optimal CoT length emerges from reward signals nudging the model toward shorter, higher-probability chains rather than from explicit training on difficulty (see "Why does chain of thought accuracy eventually decline with length?"), and the durable gap between reasoning and non-reasoning models comes from a training protocol that makes extra tokens productive, not from raw capability unlocked at inference time (see "Can non-reasoning models catch up with more compute?"). The lever is the output distribution the model was shaped to prefer; task complexity is mostly a passenger.
Sources (8 notes)
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and instruction tuning reaches 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Content effects intensify as task difficulty increases—from NLI to syllogisms to Wason selection—in both humans and language models. As working capacity is exceeded, both systems fall back on semantic priors instead of logical form.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.