Can inference compute replace scaling up model size?
Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
Snell et al. (2024) demonstrated that spending a fixed but non-trivial amount of inference-time compute can be more effective than scaling model parameters, at least on hard prompts. This suggests pretraining and inference compute are not fully independent: they trade off against each other.
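A back-of-the-envelope calculation makes the trade-off concrete. The sketch below uses the standard approximation of roughly 2·N FLOPs per generated token for an N-parameter model at inference; the model sizes and token counts are illustrative assumptions, not figures from Snell et al.

```python
# Back-of-the-envelope FLOPs comparison: a small model with extra
# inference-time tokens vs. a large model answering directly.
# Approximation: inference ~ 2*N FLOPs per generated token.
# All concrete numbers below are illustrative assumptions.

def inference_flops(n_params: float, tokens: int) -> float:
    """Approximate FLOPs to generate `tokens` tokens with an n_params model."""
    return 2 * n_params * tokens

SMALL, LARGE = 7e9, 70e9      # parameter counts (assumed)
direct_answer_tokens = 200    # large model answers without extended reasoning

# FLOPs-matched budget: how many tokens can the small model spend
# (reasoning + answer) for the cost of the large model's direct answer?
budget = inference_flops(LARGE, direct_answer_tokens)
small_tokens = budget / (2 * SMALL)
print(f"Large model, {direct_answer_tokens} tokens: {budget:.2e} FLOPs")
print(f"Small model gets {small_tokens:.0f} tokens for the same FLOPs")
# -> the token budget scales with the parameter ratio (10x here):
#    roughly 2000 tokens of "thinking" for the 7B model.
```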
The practical implication is for deployment economics: running a smaller model with more inference compute may be capability-equivalent to running a larger model with less. Inference compute is elastic (adjustable per query); pretraining compute is a sunk cost. This creates an optimization lever that did not exist when compute budgets lived only in training.
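As a toy illustration of that lever, suppose the smaller model solves a given prompt with probability p per independent sample and a perfect verifier selects a correct answer whenever one exists; then the samples needed to match a larger model's single-pass accuracy q follow from 1 - (1-p)^n >= q. Both the i.i.d.-sampling model and the perfect verifier are simplifying assumptions introduced here, not results from the source.

```python
import math

# Toy lever: how many best-of-N samples does a small model need to match
# a large model's single-pass accuracy? Assumes i.i.d. samples with
# per-sample success probability p and a perfect verifier (both are
# simplifying assumptions, not claims from the source note).

def samples_to_match(p: float, q: float) -> int:
    """Smallest n with 1 - (1-p)**n >= q, for 0 < p < 1 and 0 <= q < 1."""
    return math.ceil(math.log(1 - q) / math.log(1 - p))

# Illustrative numbers: small model 30% per sample, large model 80% single-pass.
n = samples_to_match(p=0.30, q=0.80)
print(n)                      # -> 5 samples
print(1 - (1 - 0.30) ** n)    # -> ~0.83, which clears the 0.80 target
```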
However, the substitution has limits. Base model capabilities set a floor — inference compute can extend performance within the model's existing capability frontier, but cannot create capabilities the model lacks entirely. See Can non-reasoning models catch up with more compute? for evidence of where this limit becomes visible.
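The same toy model shows where the floor bites: best-of-N success is 1 - (1-p)^N, which climbs toward 1 for any p > 0 but stays pinned at 0 when the base model cannot solve the problem at all. Again a sketch under the i.i.d.-samples and perfect-verifier assumptions above.

```python
# The capability floor in the toy best-of-N model: extra samples help
# whenever the base model has nonzero per-sample success, and do nothing
# when it has none. (Same i.i.d./perfect-verifier assumptions as above.)

def best_of_n(p: float, n: int) -> float:
    """P(at least one of n i.i.d. samples is correct) = 1 - (1-p)^n."""
    return 1 - (1 - p) ** n

for p in (0.00, 0.05, 0.30):
    row = "  ".join(f"N={n:>2}: {best_of_n(p, n):.2f}" for n in (1, 4, 16, 64))
    print(f"p={p:.2f}  {row}")
# p=0.00 stays at 0.00 for every N: inference compute extends the
# existing capability frontier but cannot create a missing capability.
```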
Source: Test Time Compute
Related concepts in this collection
- Can we allocate inference compute based on prompt difficulty? (the strategy for exploiting this substitution)
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
- Can non-reasoning models catch up with more compute? (the limit of this substitution)
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
- Can architecture choices improve inference efficiency without sacrificing accuracy? (formalizes the substitution)
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets. Conditional scaling laws separate training compute from inference efficiency, quantifying how architectural choices (attention patterns, cache strategies) determine how much test-time compute can substitute for parameter scaling.
- Can models reason without generating visible thinking tokens? (an orthogonal substitution mechanism)
  Explores whether intermediate reasoning must be verbalized as text tokens, or whether models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities. Depth-recurrence in latent space adds inference compute without adding parameters or tokens, providing a third lever, beyond test-time tokens and model size, for the same hard-prompt substitution.
- Can models learn when to think versus respond quickly? (operationalizes the prompt-difficulty selectivity this note implies)
  Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems. Hybrid reasoning learns the difficulty estimator that decides which prompts deserve the substitution and which don't.
Original note title: test-time compute can substitute for model parameter scaling on hard prompts