
Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

In verifier-free inference-time compute experiments (Think Deep, Think Fast), non-reasoning models fall substantially behind reasoning models even when given an extremely high inference budget. The gap doesn't close with more compute — it just stays there.
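
For concreteness, here is a minimal sketch of one common verifier-free scaling scheme, self-consistency (majority voting over parallel chains). The `sample_fn` callable and `extract_answer` helper are illustrative placeholders, not the actual setup from Think Deep, Think Fast:

```python
from collections import Counter
from typing import Callable

def extract_answer(completion: str) -> str:
    """Placeholder: pull the final answer out of a sampled completion."""
    return completion.strip().splitlines()[-1]

def verifier_free_scaling(sample_fn: Callable[[str], str], prompt: str, budget_n: int) -> str:
    """Spend the inference budget by drawing budget_n independent chains and
    majority-voting their final answers (self-consistency). No verifier or
    reward model is involved, so extra budget only helps if the model's own
    samples use the added tokens productively."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(budget_n)]
    return Counter(answers).most_common(1)[0][0]
```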

This sets a hard limit on Can inference compute replace scaling up model size? The substitution works within a training regime, but not across training regimes: a standard instruction-tuned model given more inference compute cannot replicate what a model trained specifically for extended reasoning can do, even at equivalent token budgets.

Why? Reasoning models have internalized the reasoning process through training — they know how to use additional tokens productively. Non-reasoning models don't have this structure, so additional tokens degrade into noise or verbosity rather than improved reasoning. The training regime instills the reasoning protocol that makes inference compute usable.

Qualification from targeted activation steering (Base Models paper): the gap is substantially closable through targeted steering of base model activations, with no weight updates. A hybrid that keeps the base model's weights and borrows only the thinking model's deployment decisions recovers 91% of the performance gap while steering just 12% of tokens. This doesn't invalidate the finding (non-reasoning models without steering still fall behind), but it significantly changes what "non-reasoning model" means in practice: if the capability already exists latently and steering can surface it, the gap is about deployment mechanisms, not raw capability. See Does RL teach reasoning or just when to use it?.
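
To make the steering mechanism concrete, a schematic sketch of token-selective activation steering using a PyTorch forward hook. The steering vector, layer index, gating mask, and scale `alpha` are illustrative assumptions, not the Base Models paper's recipe:

```python
import torch

def make_steering_hook(steer_vec: torch.Tensor, token_mask: torch.Tensor, alpha: float = 4.0):
    """Returns a forward hook that adds a fixed direction to the residual stream,
    but only at positions selected by token_mask (e.g. ~12% of tokens).
    The model's weights are never updated; only activations are nudged."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        steered = hidden + alpha * token_mask[..., None].to(hidden.dtype) * steer_vec
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Illustrative wiring, assuming a HuggingFace-style decoder stack (hypothetical names):
# layer = base_model.model.layers[20]
# handle = layer.register_forward_hook(make_steering_hook(steer_vec, token_mask))
# ...generate with the unmodified base weights...
# handle.remove()
```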

The imitation learning ceiling (Tutorial on LLM Reasoning): SFT/imitation learning creates an intelligence upper bound: the model is bounded by the quality of demonstrations it learns from, unable to surpass the skill level present in training data. RL + world models is the path beyond this ceiling, because RL allows discovery of strategies that exceed any individual demonstration. This provides the mechanism for why reasoning-specific training matters: it is not merely "more training" but training that enables exceeding the imitation ceiling.
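
To see why this is more than "more training", a toy sketch of the objective-level difference between imitation and RL. The `model.log_prob`, `model.sample`, and `reward_fn` names are hypothetical placeholders, and the RL update shown is plain REINFORCE with a mean baseline rather than any specific paper's algorithm:

```python
def sft_step(model, optimizer, demo_prompt, demo_response):
    """Imitation: maximize the likelihood of the demonstrator's response.
    The target is fixed, so performance is capped by demonstration quality."""
    loss = -model.log_prob(demo_response, given=demo_prompt)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def rl_step(model, optimizer, prompt, reward_fn, n_samples=8):
    """RL: sample the model's own responses and reinforce those that score well
    (REINFORCE with a mean baseline). A strategy gets rewarded even if it never
    appeared in any demonstration, which is what lets RL exceed the ceiling."""
    samples = [model.sample(prompt) for _ in range(n_samples)]
    rewards = [reward_fn(prompt, s) for s in samples]
    baseline = sum(rewards) / len(rewards)
    loss = -sum((r - baseline) * model.log_prob(s, given=prompt)
                for s, r in zip(samples, rewards)) / n_samples
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```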

This is a strong argument for the necessity of reasoning-specific post-training, not just inference-time tricks. Compute can amplify capability but cannot manufacture it. The dependency on training regime appears to be capability-specific. Compare Can language models learn grammar from child-scale data?: syntactic competence scales down readily, achievable with human-scale data and the right composition. Reasoning capability requires the opposite: specialized training that instills the reasoning protocol itself. The lesson is not "you need a bigger model" but "you need the right training for the capability you want."


Source: Test Time Compute

Original note title: non-reasoning models cannot match reasoning models even with unlimited inference budget