What limits external scaling when a model lacks a reasoning foundation?

This inquiry explores why adding external resources to a model (more inference compute, more parameters, longer chains of thought) hits a wall when the model was not trained with a genuine reasoning protocol underneath. The corpus is unusually pointed on this: the ceiling is set not by how much compute you add but by whether the model has the internal structure to use it.

The sharpest claim is that extra inference budget only pays off if training already installed a way to spend it. Non-reasoning models don't catch up to reasoning models no matter how many tokens you let them generate, because training is what makes additional tokens *productive* rather than merely making the output longer (Can non-reasoning models catch up with more compute?). This is the flip side of the well-known result that inference compute can substitute for parameter scaling on hard prompts (Can inference compute replace scaling up model size?): the substitution works *because* the model knows what to do with the compute. Strip out that foundation and the trade-off collapses.

What fills the gap when the foundation is missing turns out to be more text, not more thinking. On genuine constrained-optimization and numerical tasks, models plateau at around 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime (Do larger language models solve constrained optimization better?), and reasoning variants with extended chain-of-thought show no consistent edge: they produce more words, not more iterative computation (Do reasoning models actually beat standard models on optimization?). The bottleneck is a missing numeric procedure, and you can't scale your way across it.
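To make "iterative computation" concrete, here is a minimal sketch of a numeric procedure in which every step depends on the exact numeric result of the previous one. It is a toy projected-gradient loop on a box-constrained least-squares problem, not the optimal power flow setting the notes discuss; the function names, step size, and problem instance are all assumptions chosen for illustration.

```python
def projected_gradient(target, lower, upper, step=0.25, iters=50):
    """Minimize sum((x - target)^2) subject to lower <= x_i <= upper.

    Toy problem (an assumption for illustration): the projection onto a
    box is just element-wise clipping.
    """
    x = [lower] * len(target)  # start at a feasible corner
    for _ in range(iters):
        # Gradient of the squared distance to the target.
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        # Take a gradient step, then clip back into the feasible box.
        x = [min(upper, max(lower, xi - step * gi)) for xi, gi in zip(x, grad)]
    return x


def is_feasible(x, lower, upper, tol=1e-9):
    """Check that every coordinate satisfies the box constraint."""
    return all(lower - tol <= xi <= upper + tol for xi in x)


# The unconstrained optimum (3, 2) violates x_0 <= 2, so the constraint binds.
solution = projected_gradient(target=[3.0, 2.0], lower=0.0, upper=2.0)
print(solution, is_feasible(solution, 0.0, 2.0))
```

The point of the sketch is the loop itself: feasibility comes from repeating an exact update and projection, which is a different kind of work from generating a longer explanation of the problem.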

The corpus also reframes what "lacking a foundation" even means. Several notes argue the failures we blame on reasoning are really failures of *execution*: text-only models can't carry out long multi-step procedures even when they know the algorithm, and giving them tools lets them sail past the supposed reasoning cliff (Are reasoning model collapses really failures of reasoning?). Others locate the limit in *organization*: models wander and abandon promising paths prematurely (Why do reasoning models abandon promising solution paths?), or carry all the linearly decodable features for a task while their internal representations are fractured and brittle under perturbation and distribution shift (Can models be smart without organized internal structure?). And a deeper note suggests models succeed by fitting per-instance patterns rather than learning general algorithms, so a reasoning chain only works if it resembles something seen in training, which is exactly why external scaling on novel instances stalls (Do language models fail at reasoning due to complexity or novelty?).

The quietly useful takeaway: scaling that *does* work tends to smuggle in a foundation rather than replace one. Self-improvement loops stall on their own and only progress by importing external anchors such as judges, tool feedback, and prior model versions (Can models reliably improve themselves without external feedback?). And the productive direction for scaling reasoning isn't just longer serial chains but sampling parallel latent trajectories or iterating in continuous hidden space (Can reasoning systems scale wider instead of only deeper?; Can models reason without generating visible thinking tokens?), both of which presuppose a model that can actually *use* the extra width or depth. External scaling, in short, is a multiplier on whatever reasoning structure is already there; multiply by zero and you get more tokens, not more thought.
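The "multiply by zero" intuition can be made concrete with the standard best-of-N calculation. This is a simplified sketch, not a model taken from the notes: it assumes each sample solves the task independently with a fixed probability `p`, and that a perfect verifier picks out any correct sample. The name `pass_at_n` and the sample values of `p` are illustrative.

```python
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumes a fixed per-sample success probability p and a perfect
    verifier; both are simplifying assumptions for illustration.
    """
    return 1.0 - (1.0 - p) ** n


# Compare a model with no usable procedure (p = 0) against models with
# a small or moderate per-sample success rate as the sample budget grows.
for p in (0.0, 0.05, 0.3):
    curve = [round(pass_at_n(p, n), 3) for n in (1, 4, 16, 64)]
    print(f"p={p}: {curve}")
```

Under these assumptions, extra samples amplify whatever per-sample competence exists, and a model with `p = 0` stays at zero for any budget. Real systems break the independence and perfect-verifier assumptions, so this is only the most optimistic case for external scaling.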

Sources (11 notes)

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals that pretraining and inference compute are not independent resources.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
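A decoding-level thought-switching penalty can be sketched roughly as follows. This is an assumption-laden illustration rather than the method from the underlying work: the function names, the `alpha` and `beta` parameters, their default values, and the idea of identifying switch tokens by a fixed set of token ids are all hypothetical.

```python
def apply_switch_penalty(logits, switch_token_ids, steps_since_thought_start,
                         alpha=3.0, beta=600):
    """Return a copy of logits with thought-switch tokens penalized.

    alpha (penalty size) and beta (window length in decoding steps) are
    hypothetical placeholder values. The penalty applies only while the
    current thought is younger than beta steps.
    """
    if steps_since_thought_start >= beta:
        return list(logits)
    return [
        logit - alpha if token_id in switch_token_ids else logit
        for token_id, logit in enumerate(logits)
    ]


def argmax(values):
    """Index of the largest value (greedy decoding choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens; token 1 stands in for a word such as
# "Alternatively" that opens a new line of attack.
logits = [1.0, 2.0, 0.5]
early = apply_switch_penalty(logits, {1}, steps_since_thought_start=0)
late = apply_switch_penalty(logits, {1}, steps_since_thought_start=600)
print(argmax(logits), argmax(early), argmax(late))
```

The design choice worth noting is that nothing about the model changes: the intervention only makes it more expensive to abandon a path early, which matches the note's observation that accuracy improves without fine-tuning.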

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
