INQUIRING LINE

Can depth scaling and breadth scaling unlock independent capability axes?

This explores whether scaling a model's reasoning 'deeper' (longer serial chains) and 'wider' (more parallel paths) are genuinely separate levers that buy you different capabilities — or just two settings on the same dial.


The corpus leans toward yes: depth (longer serial reasoning) and breadth (parallel exploration) guard against different failure modes, though with sharp caveats about when each pays off. The cleanest case for independence comes from work showing that reasoning systems can scale in width by sampling parallel latent trajectories instead of only stacking serial depth, which sidesteps depth's latency cost while sampling the solution space without inflating variance (Can reasoning systems scale wider instead of only deeper?). Breadth here isn't just 'more depth'; it's a different way of covering possibility space.

The most pointed argument that the two axes aren't interchangeable is the finding that depth-only reasoning chains suffer an 'underthinking' failure: they commit early and miss the branch they needed. Allocating test-time compute to trained abstractions that enforce breadth-first exploration outperforms plain parallel solution sampling at large compute budgets, because structured breadth catches what a single deep chain walks past (Can abstractions guide exploration better than depth alone?). So breadth buys you coverage; depth buys you the sustained working memory to follow one line all the way down, as in recursive subtask trees that prune their own KV cache to reason far past the context window (Can recursive subtask trees overcome context window limits?). These solve different problems, which is what 'independent axes' should mean.
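The claim that breadth buys coverage can be made concrete with a minimal sketch. It assumes, purely for illustration, that each parallel path succeeds independently with the same probability; samples from a single model are usually correlated, so this is an optimistic estimate and not a result from the cited work. The function name and the numbers are hypothetical.

```python
def coverage(p_single: float, k: int) -> float:
    """Probability that at least one of k independent paths succeeds.

    Assumption: paths are independent and share the same success rate.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Complement of "every one of the k paths fails".
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    # Hypothetical per-path success rate of 20%.
    for k in (1, 2, 4, 8, 16):
        print(f"k={k:2d}  coverage={coverage(0.2, k):.3f}")
```

Under this assumption coverage climbs quickly at first and then flattens, which is one way to see why unstructured sampling alone can run out of headroom at large budgets.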

But independence is conditional, and the most interesting wrinkle is that which axis dominates depends on scale. At sub-billion-parameter sizes (125M–350M), depth beats width outright: deep-and-thin architectures compose abstract concepts through layers and gain 2.7–4.3% accuracy over balanced designs (Does depth matter more than width for tiny language models?). So the 'right' axis isn't a property of the task alone; it's entangled with model size. That entanglement shows up again in skill scaling: reasoning and knowledge skills keep improving with size, while metacognition saturates around 7B parameters and logical efficiency plateaus around 30B, meaning different capabilities live on different curves and won't all respond to the same scaling move (Do all AI skills improve equally as models scale?).
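To see what a deep-and-thin versus shallower-and-wider comparison at a fixed budget looks like, here is a rough sketch using a common back-of-envelope parameter count for transformer blocks. It assumes standard multi-head attention and a non-gated MLP with 4x expansion, and it ignores embeddings, biases, and normalization weights. Both configurations are hypothetical and are not taken from the cited work.

```python
def block_params(d_model: int, n_layers: int, mlp_ratio: int = 4) -> int:
    """Approximate parameter count of the transformer blocks only.

    Assumptions: full multi-head attention (no GQA), non-gated MLP,
    embeddings, biases, and norm weights ignored.
    """
    attn = 4 * d_model * d_model              # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # up and down projections
    return n_layers * (attn + mlp)


# Two hypothetical shapes with nearly the same block budget.
deep_thin = block_params(d_model=576, n_layers=30)
shallow_wide = block_params(d_model=912, n_layers=12)

if __name__ == "__main__":
    print(f"deep-and-thin : {deep_thin:,} parameters")
    print(f"shallow-wide  : {shallow_wide:,} parameters")
```

The two shapes land within about 0.3% of each other, so any accuracy difference between them would come from how the parameters are arranged, not from how many there are.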

There's also a deeper claim lurking here that you might not expect: a trade-off you assume is fundamental can turn out to be an artifact of how you measure. Hidden-state analysis finds near-zero correlation between exploration and exploitation; the apparent tension only emerges at the token level, and a system can enhance both at once (Is the exploration-exploitation trade-off actually fundamental?). That's the strongest version of the question's premise: axes can look coupled mainly because the usual measurements force them onto one scale. Relatedly, training order itself becomes an axis: scheduling structured tasks before creative ones prevents entropy collapse from damaging open-ended skills, a gain you can't get from depth or width alone (Does training order reshape how models handle different task types?).
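The sources name Effective Rank as the hidden-state metric behind this finding but do not give its formula. One common definition is the exponential of the Shannon entropy of the normalized singular values; the sketch below assumes that definition, which may differ from what the cited work computes. The singular values would come from an SVD of a hidden-state matrix, which is left outside this example.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """Effective rank as exp(entropy) of the normalized singular values.

    Assumption: this is one common definition, not necessarily the
    exact metric used in the cited analysis.
    """
    if any(s < 0 for s in singular_values):
        raise ValueError("singular values must be non-negative")
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    entropy = 0.0
    for s in singular_values:
        p = s / total
        if p > 0:  # treat 0 * log(0) as 0
            entropy -= p * math.log(p)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))  # evenly spread spectrum
    print(effective_rank([5.0, 0.0, 0.0]))       # one dominant direction
```

An evenly spread spectrum gives an effective rank equal to the number of directions, while a single dominant direction gives 1, so the value reads as "how many directions the representation is really using".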

The takeaway the reader probably didn't come looking for: 'scaling' is no longer one dimension with a depth/width slider. Architecture-aware scaling laws treat the trade between inference efficiency and accuracy as a separate knob (Can architecture choices improve inference efficiency without sacrificing accuracy?), and several threads now argue that memory structure, not parameter count, has become the live scaling frontier (Has memory architecture replaced parameter count as the scaling frontier?). Depth and breadth are real and partly independent, but they're two members of a growing family of axes, and which one unlocks capability depends on scale, task, and what you're measuring.
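The source notes list GQA configuration among the architecture choices that scaling laws can account for. A minimal sketch of why that choice acts as an inference-cost knob is the size of the key-value cache, which scales with the number of key-value heads. The formula is the standard count of cached keys and values; every number below is hypothetical and not drawn from the cited work.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumes 16-bit cache entries
) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    keys_and_values = 2  # one tensor for keys, one for values
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 16 layers, head_dim 64, 4096-token context.
full_mha = kv_cache_bytes(n_layers=16, n_kv_heads=32, head_dim=64, seq_len=4096)
grouped = kv_cache_bytes(n_layers=16, n_kv_heads=8, head_dim=64, seq_len=4096)

if __name__ == "__main__":
    print(f"32 KV heads: {full_mha / 2**20:.0f} MiB")
    print(f" 8 KV heads: {grouped / 2**20:.0f} MiB")
```

Cutting the key-value heads from 32 to 8 shrinks the cache fourfold in this example without changing depth or width, which is the sense in which such choices form a separate axis.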


Sources (9 notes)

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.

Next inquiring lines