Can test-time compute on smaller models replace larger model inference?

This explores whether you can spend more compute at inference time on a smaller model to get the performance of a bigger one — and the corpus says it works, but only within limits set by how the smaller model was trained.

The short answer from the corpus is a qualified yes, and the qualifications are the interesting part. Snell et al. (2024) found that on hard prompts a smaller model given more thinking time can match a much larger one, which means pretraining compute and inference compute aren't separate budgets but can be traded against each other (Can inference compute replace scaling up model size?). The catch is that the trade only pays off when compute is spent where it matters: spending the *same* total budget adaptively, little on easy prompts and lots on hard ones, beats a bigger model running on a flat budget (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?).
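To make the adaptive-versus-flat comparison concrete, here is a minimal sketch, not a method taken from the cited notes. It assumes each prompt has a known per-sample success rate (in practice this would have to be estimated), that samples are independent, and that a prompt counts as solved if any one sample is correct. The names `solve_prob`, `allocate`, and `expected_solved`, and the example rates, are hypothetical.

```python
def solve_prob(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def allocate(success_rates: list[float], total_budget: int) -> list[int]:
    """Hand out samples one at a time to the prompt with the largest marginal gain."""
    budget = [0] * len(success_rates)
    for _ in range(total_budget):
        gains = [
            solve_prob(p, k + 1) - solve_prob(p, k)
            for p, k in zip(success_rates, budget)
        ]
        best = max(range(len(gains)), key=gains.__getitem__)
        budget[best] += 1
    return budget


def expected_solved(success_rates: list[float], budget: list[int]) -> float:
    """Expected number of prompts solved under a given per-prompt sample budget."""
    return sum(solve_prob(p, k) for p, k in zip(success_rates, budget))


# Hypothetical per-sample success rates: two easy prompts, two hard ones.
rates = [0.9, 0.8, 0.1, 0.05]
total = 16

uniform = [total // len(rates)] * len(rates)  # flat budget: 4 samples each
adaptive = allocate(rates, total)             # same total, spent where it helps

print("uniform ", uniform, round(expected_solved(rates, uniform), 3))
print("adaptive", adaptive, round(expected_solved(rates, adaptive), 3))
```

Under these assumptions the greedy allocator moves samples away from the easy prompts, where a second or third attempt adds almost nothing, and toward the hard ones. The sketch compares two ways of spending one model's budget; it says nothing by itself about how a larger model would score.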

But there's a ceiling, and it's set by training, not compute. A model that was never trained to reason can't be rescued by an unlimited inference budget: the gap between reasoning and non-reasoning models persists no matter how many tokens you spend, because training installs a protocol that makes those extra tokens *productive* rather than just longer (Can non-reasoning models catch up with more compute?). This is the crucial boundary on the substitution: extra compute extracts capability the model already has; it doesn't manufacture capability it lacks. A cleaner way to see it is the field's main taxonomic split. *Internal* scaling (training the model to reason on its own) builds the capability, while *external* scaling (search, sampling, verification at inference) extracts performance from whatever capability exists. They're complements, not rivals (How do internal and external test-time scaling compare?).

There's also a sobering question of what extra inference compute is actually *doing*. One line of work argues that longer thinking traces don't reason better; they just widen the output distribution so it covers the right answer more often, and past a threshold the distribution gets too diffuse and accuracy drops again (Does extended thinking actually improve reasoning or just increase variance?). A related information-theoretic result finds that the fancy framework barely matters: best-of-N and tree search (MCTS) converge once you control for total compute, and what's left depends on search scope and the reliability of the reward signal rather than the specific algorithm (Does the choice of reasoning framework actually matter for test-time performance?; Can reasoning systems scale wider instead of only deeper?). So the substitution is real, but it's closer to "sampling the solution space harder" than "thinking more deeply", which is why it helps most on problems where the model can occasionally get the answer right but isn't reliable.
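The "sampling harder" reading can be illustrated with the standard at-least-one-success formula. This is a simplified sketch under stated assumptions: samples are independent, a perfect verifier picks out a correct sample whenever one exists, and the accuracy drop from an overly diffuse distribution is not modeled. The function name `coverage` and the example rates are hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-sample success rates for three kinds of problem.
cases = [
    ("never right", 0.0),
    ("occasionally right", 0.05),
    ("already reliable", 0.95),
]

for label, p in cases:
    one, many = coverage(p, 1), coverage(p, 32)
    print(f"{label:>18}: 1 sample={one:.3f}  32 samples={many:.3f}  gain={many - one:.3f}")
```

In this toy model the gain from extra samples is largest in the middle case. A model that never produces the right answer stays at zero however many samples it draws, and a model that is already reliable has little headroom left.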

The most efficient frontier isn't pure inference scaling at all; it's deciding *when* to spend. Thinkless trains a model to route between extended reasoning and quick answers, avoiding wasted compute without difficulty labels (Can models learn when to think versus respond quickly?). A striking result folds the inference trick back into training: augmenting pretraining data with generated reasoning traces gives 3B models a 3x data-efficiency gain, with harder tokens automatically getting longer traces (Can training data augmentation match test-time compute scaling benefits?). If you're starting from a strong teacher, there are other cheap routes to small-model competence too. DPO on a teacher's correct and incorrect function-calling examples lets small models reach high accuracy on structured tasks (Can small models match large models on function calling?), and decoding-time proxy tuning shifts a model's output distribution while leaving the base weights, and the knowledge stored in them, untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?).
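The decoding-time route can be sketched as plain logit arithmetic. The formulation below is an assumption about how proxy tuning is usually described, not something spelled out in the notes: the difference between a tuned and an untuned proxy model's next-token logits is added to the base model's logits before the softmax. All names and the toy logit values are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(
    base: list[float], tuned_proxy: list[float], untuned_proxy: list[float]
) -> list[float]:
    """Shift the base logits by the (tuned - untuned) proxy difference, then normalize."""
    shifted = [b + (t - u) for b, t, u in zip(base, tuned_proxy, untuned_proxy)]
    return softmax(shifted)


# Hypothetical next-token logits over a 3-token vocabulary.
base_logits = [2.0, 1.0, 0.5]
tuned_proxy_logits = [0.0, 1.5, 0.2]
untuned_proxy_logits = [0.5, 0.5, 0.2]

probs = proxy_tuned_distribution(base_logits, tuned_proxy_logits, untuned_proxy_logits)
print([round(p, 3) for p in probs])
```

Because the shift is applied only at decoding time, the base model's weights are never updated, which is the property the note credits for preserved knowledge. When the two proxies agree, the shift is zero and the base distribution comes back unchanged.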

The thing you might not have known you wanted to know: "replace larger model inference" isn't one question but two. On reasoning-heavy, hard prompts where the smaller model already has the latent skill, yes — adaptive test-time compute genuinely substitutes for size. On tasks demanding capability the small model was never trained for, no amount of inference compute closes the gap, and the real lever is what you put into training, not what you spend at inference.

Sources (12 notes)

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
