
Can inference compute replace scaling up model size?

Explores whether smaller models, given more thinking time at inference, can match larger models. This matters because it reshapes deployment economics and compute-allocation strategies.

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Snell et al. (2024) demonstrated that granting a model a fixed but non-trivial budget of inference-time compute can be more effective than scaling model parameters — at least on hard prompts. This suggests pretraining and inference compute are not fully independent: they trade off against each other.
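One common way to spend extra inference compute is repeated sampling with a scorer (best-of-N). A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a real sampler and verifier:

```python
import random

def generate(prompt, seed):
    # Stand-in for sampling one candidate answer from a small model.
    random.seed(seed)
    return (f"answer-{seed}", random.random())  # (text, quality proxy)

def score(prompt, candidate):
    # Stand-in for a verifier / reward model scoring a candidate.
    _, quality = candidate
    return quality

def best_of_n(prompt, n):
    # More inference compute = more samples; keep the highest-scoring one.
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("2+2?", 16))
```

Because each extra sample can only raise the maximum score, expected answer quality is monotone in N — which is exactly why inference compute behaves like an adjustable dial.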

The practical implication matters for deployment economics. Running a smaller model with more inference compute may be capability-equivalent to a larger model running with less. Inference is elastic (adjustable per query); pretraining is a sunk cost. This creates a new optimization lever that didn't exist when compute budgets only lived in training.
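The trade-off can be made concrete with a rough cost model: decoding costs roughly 2 FLOPs per parameter per generated token. A sketch, with illustrative model sizes and token counts not taken from the source:

```python
def inference_flops(params, tokens_out, samples=1):
    # Rough decode cost: ~2 FLOPs per parameter per generated token.
    return 2 * params * tokens_out * samples

small, large = 7e9, 70e9   # hypothetical 7B vs 70B models
tokens = 512

# The FLOPs for one pass of the 70B model buy ~10 best-of-N samples
# from the 7B model: the same budget, allocated two different ways.
cost_large = inference_flops(large, tokens)
equivalent_samples = cost_large / inference_flops(small, tokens)
print(equivalent_samples)  # → 10.0
```

Whether those ten small-model samples actually match one large-model pass is an empirical question, which is the substance of the substitution claim above.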

However, the substitution has limits. Base model capabilities set a floor — inference compute can extend performance within the model's existing capability frontier, but cannot create capabilities the model lacks entirely. See Can non-reasoning models catch up with more compute? for evidence of where this limit becomes visible.

