INQUIRING LINE

Does architectural design matter more than model scale for reasoning tasks?

This explores whether *how* a reasoning system is structured — splitting planning from execution, adding recurrence, externalizing steps — outperforms simply making the model bigger, and the corpus comes down hard on the side of structure.


On the question of whether architectural design beats raw model scale for reasoning, the collection leans strongly toward structure over size. The most striking single data point: the Hierarchical Reasoning Model reaches near-perfect performance on Sudoku and mazes where chain-of-thought fails completely, using just 27M parameters and 1,000 training examples (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). It does this by coupling slow abstract planning with fast detailed computation across two timescales, escaping a depth ceiling that no amount of widening a fixed-depth transformer can fix. Scale doesn't buy you out of an architectural limit.
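
To make the two-timescale idea concrete, here is a toy sketch of a nested recurrence. It is not the Hierarchical Reasoning Model's implementation: the function name, the scalar states, and the 0.5 / 0.3 weights are all assumptions chosen for illustration. What it shows is that the number of sequential updates grows with the cycle counts while the set of "weights" stays fixed.

```python
import math

def two_timescale_run(x, n_cycles, t_fast):
    """Toy nested recurrence (illustrative only, not HRM itself).

    A fast state updates every step; a slow state updates once per cycle.
    Returns the final slow state, final fast state, and total fast steps.
    """
    slow, fast = 0.0, 0.0
    steps = 0
    for _ in range(n_cycles):
        for _ in range(t_fast):
            # Fast module: conditioned on the input and the current slow state.
            fast = math.tanh(0.5 * fast + 0.3 * slow + x)
            steps += 1
        # Slow module: updates once per cycle from the fast module's result.
        slow = math.tanh(0.5 * slow + 0.5 * fast)
    return slow, fast, steps

slow, fast, steps = two_timescale_run(x=0.1, n_cycles=4, t_fast=8)
print(steps)  # 32
```

The same fixed coefficients are reused for all 32 fast updates, so sequential depth is set by `n_cycles * t_fast` rather than by parameter count.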

A recurring theme is *separation of concerns*. Pulling the planner apart from the executor improves both accuracy and generalization. Notably, the decomposition skill transfers across domains while the solving skill does not, suggesting these are genuinely different capabilities that monolithic models tangle together (see "Does separating planning from execution improve reasoning accuracy?"). The same instinct shows up in the claim that reasoning systems should decouple *when* to reason from the *capability* to reason: RL post-training mostly teaches timing for mechanisms pretraining already installed (see "How should reasoning systems actually be architected?"). And externalizing reasoning into knowledge-graph triples gives GPT-4o *mini* a 29% improvement on GAIA Level 3 tasks (see "Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?"): structure substituting for size again.
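
As a rough picture of what "externalizing reasoning into triples" can mean, the sketch below keeps intermediate facts in a small store of (subject, predicate, object) triples that can be inspected and queried between steps. The `TripleStore` class, its methods, and the example facts are hypothetical; this is not the KGoT codebase or its API.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

class TripleStore:
    """Hypothetical minimal store for externalized reasoning state."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        # None acts as a wildcard; results are sorted for stable output.
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )

# Illustrative facts only; a real system would add these iteratively.
store = TripleStore()
store.add("task", "requires", "population_of_city")
store.add("population_of_city", "found_via", "web_search")
store.add("task", "requires", "unit_conversion")

pending = [t.obj for t in store.query(subject="task", predicate="requires")]
print(pending)  # ['population_of_city', 'unit_conversion']
```

Because the state lives outside the model's context, each step can be checked or corrected independently of the model that produced it.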

Here's the part you might not expect: several notes argue the bottleneck often isn't reasoning *capacity* at all, which reframes the whole scale question. Many dramatic 'reasoning collapses' are actually execution failures: models that know the algorithm but can't run it step-by-step in text. Give them tools and they sail past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Other failures are structural disorganization: models wander down invalid paths or abandon good ones too early, and a simple decoding-level thought-switching penalty improves accuracy without any fine-tuning (see "Why do reasoning models abandon promising solution paths?"). If a tweak to decoding recovers accuracy, the missing ingredient was never parameters.
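
One plausible form of a decoding-level thought-switching penalty is sketched below: logits for tokens that signal a change of approach are lowered while the current line of thought is still young. The token list, the `alpha` strength, the `beta` window, and the function names are assumptions for illustration, not the exact method from the note.

```python
def penalize_switch_tokens(logits, switch_tokens, steps_in_thought,
                           alpha=3.0, beta=20):
    """Return a copy of logits with switch tokens down-weighted.

    alpha: amount subtracted from each switch token's logit (assumed value).
    beta:  number of steps into a thought during which the penalty applies.
    """
    if steps_in_thought >= beta:
        return dict(logits)  # the thought has had its chance; no penalty
    return {
        tok: (score - alpha if tok in switch_tokens else score)
        for tok, score in logits.items()
    }

def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)

logits = {"Alternatively": 2.0, "so": 1.5, "x": 0.5}
switch = {"Alternatively"}

early = greedy_pick(penalize_switch_tokens(logits, switch, steps_in_thought=5))
late = greedy_pick(penalize_switch_tokens(logits, switch, steps_in_thought=40))
print(early, late)  # so Alternatively
```

Nothing about the model's weights changes here; only the choice among tokens it already scored is adjusted.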

The lateral tension worth sitting with: scale clearly *does* matter in one specific way. Reasoning models beat non-reasoning ones no matter how much inference compute the latter are given, because training instills a protocol that makes extra tokens productive rather than merely more numerous (see "Can non-reasoning models catch up with more compute?"). But that's a *training-structure* advantage, not a parameter-count one, and it has limits: extended chains show no reliable edge on constraint-bound numerical optimization such as optimal power flow, where the bottleneck is the numeric procedure itself (see "Do reasoning models actually beat standard models on optimization?"). Meanwhile small models can be lifted into the big-model range through better training signal (DPO with explicit negative examples) rather than growth (see "Can small models match large models on function calling?").
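
To show how DPO uses explicit negative examples, here is a minimal sketch of the standard per-pair DPO loss, computed from summed log-probabilities of a correct ("chosen") and an incorrect ("rejected") response under the policy and under a frozen reference model. The function name, argument names, and the numbers are illustrative assumptions, not values from the note.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is a log-probability of a full response. The margin is
    how much more the policy prefers the chosen response over the rejected
    one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-z))

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-5.0, -7.0, -5.0, -7.0), 4))  # 0.6931

# Policy raises the correct call and lowers the incorrect one: loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0) < dpo_loss(-5.0, -7.0, -5.0, -7.0))  # True
```

The rejected term is what distinguishes this from plain supervised fine-tuning: the loss falls only when the incorrect output becomes less likely relative to the correct one.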

If there's one thing to walk away knowing, it's that the corpus treats reasoning ability as less a smooth function of size than a question of whether the right *structure* is present: separated planning, recurrent depth, externalized state, the right exploration shape (see "Can abstractions guide exploration better than depth alone?"). Many apparent reasoning failures are really unfamiliar *instances* the model never patterned on, not complexity it lacked the scale to handle (see "Do language models fail at reasoning due to complexity or novelty?"). Architecture and training regime keep winning the comparisons; scale alone keeps hitting ceilings it can't widen its way past.


Sources (11 notes)

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

How should reasoning systems actually be architected?

Research shows RL post-training teaches models *when* to use reasoning mechanisms that pre-training already provides. Decoupled architectures, latent reasoning in continuous space, and interleaved action-grounding all outperform monolithic chain-of-thought approaches.

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Knowledge Graph of Thoughts (KGoT) achieves 29% improvement on GAIA Level 3 tasks using GPT-4o mini by externalizing reasoning into iteratively constructed KG triples. The approach improves transparency, reduces bias, and enables quality control over reasoning steps.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed when the model was trained on similar instances.
