How do internal versus external test-time scaling approaches differ from precomputation strategies?
This explores the map of where reasoning compute gets spent — inside the model (internal), at inference via search and verification (external), or shifted off the critical path entirely (precomputation like sleep-time or post-completion compute) — and how those three are actually different axes, not rival techniques.
The corpus treats internal vs. external as the main split. Internal scaling trains the model to reason autonomously, building capability into the weights, while external scaling leaves the model fixed and squeezes more out of it at inference time through search, sampling, and verification (see "How do internal and external test-time scaling compare?"). The key reframe is that these complement rather than compete: internal raises the ceiling, external extracts performance up to it. Precomputation strategies are a *third* axis that cuts across both: they don't change how much compute you spend, they change *when* you spend it. Sleep-time and post-completion approaches do the thinking before the query arrives or after the answer is delivered, moving cost off the latency-critical path (see "How should test-time scaling methods be categorized and designed?").
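As a minimal sketch of the scheduling idea, the toy class below runs the same reasoning function either ahead of time or while the user waits. Everything here is hypothetical and for illustration only: `SleepTimeCache`, `slow_reason`, and the assumption that future queries can be anticipated exactly are not taken from the sources, and `slow_reason` is a stand-in for an expensive model call.

```python
from typing import Callable, Dict, List


class SleepTimeCache:
    """Toy illustration: move reasoning off the latency-critical path."""

    def __init__(self, reason: Callable[[str], str]) -> None:
        self.reason = reason              # hypothetical stand-in for a costly model call
        self.cache: Dict[str, str] = {}
        self.online_calls = 0             # reasoning performed while the user waits

    def precompute(self, anticipated_queries: List[str]) -> None:
        # "Sleep-time" phase: spend the compute before any query arrives.
        for query in anticipated_queries:
            self.cache[query] = self.reason(query)

    def answer(self, query: str) -> str:
        # Query-time phase: reuse precomputed work when it exists.
        if query in self.cache:
            return self.cache[query]
        self.online_calls += 1
        result = self.reason(query)
        self.cache[query] = result
        return result


def slow_reason(query: str) -> str:
    # Placeholder for a long reasoning chain; it simply reverses the text.
    return query[::-1]
```

In this toy, the reasoning function runs once per distinct query whether or not it was precomputed, so the amount of work is unchanged; only the share of it that the user has to wait for (`online_calls`) drops.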
The cleanest way to see the difference is to notice that the internal/external distinction is about *whose* compute it is (training that builds capability into the weights vs. inference-time work around a fixed model), while precomputation is about *scheduling*. A striking case is thinking-augmented pretraining: you generate reasoning traces and bake them into the training data, so what looks like a training-time investment is really test-time reasoning precomputed and amortized. It delivers 3x data efficiency, and harder tokens automatically attract longer traces, reproducing test-time scaling's adaptive allocation inside the pretraining loop (see "Can training data augmentation match test-time compute scaling benefits?"). That blurs the line: precomputation can turn an external trick into an internal capability.
What makes this more than taxonomy is a humbling result on the external side: the specific framework you pick (best-of-N vs. tree search such as MCTS) matters far less than your total compute budget and the quality of your reward signal. Best-of-N and MCTS converge in reasoning accuracy once total compute is controlled for (see "Does the choice of reasoning framework actually matter for test-time performance?"). The same lesson recurs at the agent level, where roughly 80% of multi-agent performance variance comes from token budget, not coordination cleverness (see "How does test-time scaling work at the agent level?"). So 'which external method' is often the wrong question and 'how is compute allocated' is the right one; allocating adaptively per prompt beats fixed budgets (see "How should we allocate compute budget at inference time?").
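To make "adaptive vs. fixed budgets" concrete, here is a minimal sketch that splits one fixed sample budget across prompts, either evenly or in proportion to an estimated difficulty score. The function names, the `min_samples` floor, and the proportional rule are assumptions chosen for illustration; the sources do not prescribe a specific allocation rule, and obtaining a reliable difficulty estimate is the hard part this sketch takes as given.

```python
from typing import List


def uniform_allocation(num_prompts: int, total_budget: int) -> List[int]:
    """Fixed budget: every prompt gets (almost) the same number of samples."""
    base, extra = divmod(total_budget, num_prompts)
    return [base + (1 if i < extra else 0) for i in range(num_prompts)]


def adaptive_allocation(
    difficulties: List[float], total_budget: int, min_samples: int = 1
) -> List[int]:
    """Split the same budget in proportion to estimated difficulty (hypothetical rule)."""
    n = len(difficulties)
    if total_budget < n * min_samples:
        raise ValueError("budget too small to give every prompt min_samples")
    total_difficulty = sum(difficulties)
    if total_difficulty <= 0:
        # No usable difficulty signal: fall back to the uniform split.
        return uniform_allocation(n, total_budget)

    remaining = total_budget - n * min_samples
    shares = [d * remaining / total_difficulty for d in difficulties]
    floors = [int(s) for s in shares]
    allocation = [min_samples + f for f in floors]

    # Hand out samples lost to rounding, largest fractional share first.
    leftover = remaining - sum(floors)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation
```

For example, with difficulty scores `[1, 3, 6]` and a budget of 13 samples, the uniform split gives `[5, 4, 4]` while the proportional split gives `[2, 4, 7]`: the total spend is identical, but the hardest prompt receives most of it.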
Underneath external scaling sits a structural trade-off worth knowing about: parallel (sample many short attempts, then vote) vs. sequential (one long accumulating chain). They aren't interchangeable. On genuinely compositional problems like graph connectivity, sequential chain-of-thought has an *exponential* advantage because the answer requires accumulating intermediate results that short parallel chains can't reach (see "How should we balance parallel versus sequential compute at test time?" and "When does sequential reasoning beat parallel voting?"). The whole framing also generalizes beyond reasoning: in deep-research agents, search steps follow the same scaling curve as reasoning tokens, so retrieval becomes just another compute axis you can scale (see "How does search scale like reasoning in agent systems?").
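The graph-connectivity intuition can be illustrated with a toy analogy. This is not the construction from the cited analysis and it does not reproduce the exponential gap: the helper names, the random-walk "short attempt", and the path graph are all assumptions made for illustration. The sketch only shows why adding more short samples cannot substitute for depth when the answer lies further away than any single short chain can travel.

```python
import random
from collections import Counter, deque
from typing import Dict, List

Graph = Dict[int, List[int]]


def path_graph(n: int) -> Graph:
    """Nodes 0..n-1 joined in a single line: 0 - 1 - 2 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def sequential_connected(graph: Graph, start: int, goal: int) -> bool:
    """One long chain: breadth-first search that keeps accumulating visited nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def short_walk_reaches(
    graph: Graph, start: int, goal: int, max_steps: int, rng: random.Random
) -> bool:
    """One short attempt: a random walk with a hard step limit and no shared memory."""
    node = start
    for _ in range(max_steps):
        if node == goal:
            return True
        node = rng.choice(graph[node])
    return node == goal


def parallel_vote(
    graph: Graph, start: int, goal: int, num_samples: int, max_steps: int, seed: int = 0
) -> bool:
    """Many independent short attempts, combined by majority vote."""
    rng = random.Random(seed)
    votes = Counter(
        short_walk_reaches(graph, start, goal, max_steps, rng)
        for _ in range(num_samples)
    )
    return votes.most_common(1)[0][0]
```

On a 20-node path graph, `sequential_connected(g, 0, 19)` finds the connection, while `parallel_vote(g, 0, 19, num_samples=1000, max_steps=5)` votes "not connected" no matter how many samples are drawn, because no 5-step walk can cover 19 hops.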
The thing you didn't know you wanted to know: these aren't three competing camps but three knobs that trade against each other: *whose* compute (internal/external), *when* it runs (precomputation), and *how it's shaped* (parallel vs. sequential, adaptive vs. uniform). Snell et al. (2024) showed inference compute can substitute for raw model size on hard prompts, meaning a small model that thinks longer can match a big one that doesn't. Pretraining and inference are not separate budgets but exchangeable currency (see "Can inference compute replace scaling up model size?").
Sources (10 notes)
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
Research identifies internal vs external as the primary taxonomic split for test-time scaling, with training-side constraints (policy entropy collapse) and novel directions that shift *when* compute happens (sleep-time, post-completion) rather than just *how much*. Methods like consensus games and recursive LMs sidestep traditional scaling tradeoffs.
Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.