How should we categorize different test-time scaling approaches?
Test-time scaling research spans multiple strategies for improving model performance at inference. Understanding how these approaches differ—and how they relate—helps researchers and practitioners choose the right method for their constraints.
Every test-time scaling approach belongs to one of two categories:
- Internal TTS: Train the model so it generates long chain-of-thought reasoning autonomously, without external scaffolding. Requires SFT on long CoT data, RL to reinforce reasoning, or TTT (parameter updates at inference). The model self-organizes compute allocation. Examples: o1, DeepSeek-R1, QwQ.
- External TTS: Use inference-time infrastructure — search algorithms, verifiers, reward models — to steer a base model toward better outputs. The model's parameters are unchanged; compute is spent on search and evaluation. Examples: Best-of-N with PRM, MCTS, beam search, majority voting.
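To make the external-TTS mechanics concrete, here is a minimal Best-of-N sketch in Python. It is a sketch under stated assumptions, not any specific system's implementation: `generate` stands in for any sampling-capable model API, and `score` for any verifier or reward model (e.g., a PRM reduced to a single scalar per candidate).

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # stand-in: samples one candidate completion
    score: Callable[[str, str], float],  # stand-in: verifier/reward model, (prompt, completion) -> scalar
    n: int = 16,
) -> Tuple[str, float]:
    """External TTS: the model's weights stay fixed; extra compute goes into
    sampling n candidates and ranking them with an external scorer."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored: List[Tuple[str, float]] = [(c, score(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])  # best candidate and its score
```

The sample count `n` is the compute dial: quality typically improves as `n` grows, until the scorer's ability to discriminate among candidates saturates.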
Internal and external TTS are complementary, not competing: internal TTS makes models better reasoners; external TTS extracts more performance from whatever reasoning capability exists. Combining them (e.g., using Best-of-N to boost a long-CoT model with a PRM) often outperforms either alone.
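As a hedged illustration of that combination, the sketch below layers self-consistency (majority voting) on top of a model already trained for long chain-of-thought. `sample_reasoning` and `extract_answer` are hypothetical stand-ins for sampling a full reasoning trace at nonzero temperature and parsing out its final answer.

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    sample_reasoning: Callable[[str], str],  # hypothetical: long-CoT model sampled with temperature > 0
    extract_answer: Callable[[str], str],    # hypothetical: pulls the final answer out of a reasoning trace
    k: int = 8,
) -> str:
    """Internal + external TTS combined: the model reasons at length on its own
    (internal), and k independent traces are aggregated by majority vote (external)."""
    answers = [extract_answer(sample_reasoning(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```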
The practical distinction matters for deployment: internal scaling is a training cost paid once; external scaling is an inference cost paid per query. The economics push toward internal scaling at scale, but external scaling remains essential during development when training is expensive.
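The one-time versus per-query trade-off can be made concrete with a toy break-even calculation; every number below is an illustrative placeholder, not a measurement.

```python
# Toy break-even analysis: after how many queries does a one-time training
# investment (internal TTS) beat paying extra inference cost per query
# (external TTS)? All figures are hypothetical placeholders.

training_cost = 500_000.0            # one-time cost of reasoning post-training, USD (assumed)
external_overhead_per_query = 0.030  # extra cost of Best-of-N sampling + verifier calls, USD (assumed)
internal_overhead_per_query = 0.008  # extra cost of longer CoT outputs after training, USD (assumed)

# Internal scaling wins once the amortized training cost drops below the
# per-query savings relative to external scaffolding.
break_even_queries = training_cost / (external_overhead_per_query - internal_overhead_per_query)
print(f"Internal TTS pays for itself after ~{break_even_queries:,.0f} queries")
```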
The finding explored in "Can non-reasoning models catch up with more compute?" illustrates the limits of external TTS alone: you need the internal foundation before external scaling can amplify it.
Source: Test Time Compute
Related concepts in this collection
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  Relation: the limit of external TTS without an internal foundation.
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  Relation: a cross-cutting axis that applies within each category.
- Can retrieval be scaled like reasoning at test time?
  Standard RAG retrieves once, but multi-hop tasks need adaptive retrieval. Can we train models to plan retrieval chains and vary their length at test time to improve accuracy, the way test-time scaling works for reasoning?
  Relation: CoRAG is a hybrid that escapes the internal/external binary. Training teaches chain generation (internal), while the compute dials (chain length and count) are applied at inference (external); retrieval-intensive tasks have their own TTS curve that this taxonomy did not originally capture.
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  Relation: sleep-time compute fractures the dichotomy by adding a third temporal position. Pre-interaction compute is neither internal (trained weights) nor external (inference-time search) but amortized pre-computation; the binary taxonomy needs a third category.
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  Relation: challenges the taxonomy. Latent recurrent depth-scaling is internal (architectural recurrence) but applied at inference (an external compute dial), occupying a hybrid position the binary did not anticipate; verbalization is orthogonal to the internal/external split.
- Does RL teach reasoning or teach when to use it?
  Post-training RL gets credit for building reasoning into language models, but emerging evidence suggests base models already possess this capability. The question is whether RL creates new reasoning skills or simply teaches deployment timing.
  Relation: reframes "internal TTS". If RL teaches *when* to activate latent capability rather than how to reason, then internal TTS is more accurately deployment-timing optimization than capability instillation; the foundation that external TTS amplifies was already in the base model.
- Can modular cognitive tools boost LLM reasoning without training?
  Does structuring reasoning as discrete, sandboxed tool calls elicit stronger problem-solving in language models compared to monolithic prompting approaches, and can this approach match specialized reasoning models?
  Relation: a third-category instance. Cognitive tools elicit reasoning at inference time without weight updates and without external search infrastructure, neither internal nor external in the original sense; the taxonomy needs to distinguish "trained to reason" from "scaffolded to reason".
Original note title: internal vs external TTS is the primary taxonomic split in test-time scaling research.