
Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

In verifier-free inference-time compute experiments (Think Deep, Think Fast), non-reasoning models fall substantially behind reasoning models even when given an extremely high inference budget. The gap doesn't close with more compute — it just stays there.
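
For concreteness, here is a minimal sketch of one common verifier-free scaling scheme, self-consistency (majority voting over parallel chains). The `sample_fn` callable and `extract_answer` helper are illustrative placeholders, not the actual setup from Think Deep, Think Fast:

```python
from collections import Counter
from typing import Callable

def extract_answer(completion: str) -> str:
    """Placeholder: pull the final answer out of a sampled completion."""
    return completion.strip().splitlines()[-1]

def verifier_free_scaling(sample_fn: Callable[[str], str], prompt: str, budget_n: int) -> str:
    """Spend the inference budget by drawing budget_n independent chains and
    majority-voting their final answers (self-consistency). No verifier or
    reward model is involved, so extra budget only helps if the model's own
    samples use the added tokens productively."""
    answers = [extract_answer(sample_fn(prompt)) for _ in range(budget_n)]
    return Counter(answers).most_common(1)[0][0]
```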

This sets a hard limit on Can inference compute replace scaling up model size? The substitution works within a training regime, but not across training regimes: a standard instruction-tuned model given more inference compute cannot replicate what a model trained specifically for extended reasoning can do, even at equivalent token budgets.

Why? Reasoning models have internalized the reasoning process through training — they know how to use additional tokens productively. Non-reasoning models don't have this structure, so additional tokens degrade into noise or verbosity rather than improved reasoning. The training regime instills the reasoning protocol that makes inference compute usable.

Qualification from targeted activation steering (Base Models paper): the gap is substantially closable through targeted steering of base model activations, with no weight updates. A hybrid that keeps the base model's weights and borrows only the thinking model's deployment decisions recovers 91% of the performance gap while steering just 12% of tokens. This doesn't invalidate the finding (non-reasoning models without steering still fall behind), but it significantly changes what "non-reasoning model" means in practice: if the capability already exists latently and steering can surface it, the gap is about deployment mechanisms, not raw capability. See Does RL teach reasoning or just when to use it?.
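
To make the steering mechanism concrete, a schematic sketch of token-selective activation steering using a PyTorch forward hook. The steering vector, layer index, gating mask, and scale `alpha` are illustrative assumptions, not the Base Models paper's recipe:

```python
import torch

def make_steering_hook(steer_vec: torch.Tensor, token_mask: torch.Tensor, alpha: float = 4.0):
    """Returns a forward hook that adds a fixed direction to the residual stream,
    but only at positions selected by token_mask (e.g. ~12% of tokens).
    The model's weights are never updated; only activations are nudged."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        steered = hidden + alpha * token_mask[..., None].to(hidden.dtype) * steer_vec
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Illustrative wiring, assuming a HuggingFace-style decoder stack (hypothetical names):
# layer = base_model.model.layers[20]
# handle = layer.register_forward_hook(make_steering_hook(steer_vec, token_mask))
# ...generate with the unmodified base weights...
# handle.remove()
```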

The imitation learning ceiling (Tutorial on LLM Reasoning): SFT/imitation learning creates an intelligence upper bound: the model is bounded by the quality of demonstrations it learns from, unable to surpass the skill level present in training data. RL + world models is the path beyond this ceiling, because RL allows discovery of strategies that exceed any individual demonstration. This provides the mechanism for why reasoning-specific training matters: it is not merely "more training" but training that enables exceeding the imitation ceiling.
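
To see why this is more than "more training", a toy sketch of the objective-level difference between imitation and RL. The `model.log_prob`, `model.sample`, and `reward_fn` names are hypothetical placeholders, and the RL update shown is plain REINFORCE with a mean baseline rather than any specific paper's algorithm:

```python
def sft_step(model, optimizer, demo_prompt, demo_response):
    """Imitation: maximize the likelihood of the demonstrator's response.
    The target is fixed, so performance is capped by demonstration quality."""
    loss = -model.log_prob(demo_response, given=demo_prompt)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def rl_step(model, optimizer, prompt, reward_fn, n_samples=8):
    """RL: sample the model's own responses and reinforce those that score well
    (REINFORCE with a mean baseline). A strategy gets rewarded even if it never
    appeared in any demonstration, which is what lets RL exceed the ceiling."""
    samples = [model.sample(prompt) for _ in range(n_samples)]
    rewards = [reward_fn(prompt, s) for s in samples]
    baseline = sum(rewards) / len(rewards)
    loss = -sum((r - baseline) * model.log_prob(s, given=prompt)
                for s, r in zip(samples, rewards)) / n_samples
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```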

This is a strong argument for the necessity of reasoning-specific post-training, not just inference-time tricks. Compute can amplify capability but cannot manufacture it. The dependency on training regime appears to be capability-specific. Compare Can language models learn grammar from child-scale data?: syntactic competence scales down readily, achievable with human-scale data and the right composition. Reasoning capability requires the opposite: specialized training that instills the reasoning protocol itself. The lesson is not "you need a bigger model" but "you need the right training for the capability you want."


Source: Test Time Compute

Original note title: non-reasoning models cannot match reasoning models even with unlimited inference budget