Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The key finding from Snell et al. is that inference-time compute effectiveness varies dramatically based on how hard the prompt is relative to the base LLM's capabilities. A fixed compute budget applied uniformly across prompts is inefficient — easy prompts don't need much, hard ones need disproportionately more.

This motivates "compute-optimal" scaling: an adaptive, prompt-dependent allocation strategy rather than a blanket one. The implication is significant: the same inference budget, reallocated adaptively, can let a smaller model substantially outperform a much larger one given uniform compute. The question isn't how much total compute to spend, but how to spend it, and the answer depends on the prompt.

This shifts the design question from "how much inference compute?" to "which prompts should get more compute, and by how much?" A harder question, but one that becomes tractable once you have a difficulty estimator.
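
To make the loop concrete, here is a minimal sketch under stated assumptions: `estimate_difficulty`, `generate`, and `score` are hypothetical placeholders (Snell et al. estimate difficulty from the base model's own success rate or a verifier's predictions), and the proportional split is an illustrative allocation rule, not the paper's exact method.

```python
def allocate_budget(prompts, estimate_difficulty, total_budget, min_samples=1):
    """Split a fixed sample budget across prompts in proportion to
    estimated difficulty, instead of spending it uniformly."""
    scores = [estimate_difficulty(p) for p in prompts]  # each in (0, 1]
    total = sum(scores) or 1.0  # guard against an all-zero estimate
    # Rounding means the allocations only approximately sum to total_budget.
    return [max(min_samples, round(s / total * total_budget)) for s in scores]


def best_of_n(prompt, n, generate, score):
    """Spend one prompt's share: sample n candidates, keep the scorer's pick."""
    return max((generate(prompt) for _ in range(n)), key=score)


# Usage: with a 256-sample budget over a batch,
#   budgets = allocate_budget(prompts, estimate_difficulty, total_budget=256)
#   answers = [best_of_n(p, n, generate, score) for p, n in zip(prompts, budgets)]
```

The estimator is the crux: if it needs probe samples of its own, that cost comes out of the same budget.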

Sub-token granularity via byte-level models: BLT (Byte Latent Transformer) implements adaptive compute at a fundamentally finer grain than prompt-level allocation. By operating on raw bytes and grouping them into variable-length patches based on next-byte entropy, BLT allocates more computation to high-entropy (surprising, information-dense) byte sequences and less to predictable ones. This is adaptive compute at sub-token granularity, realized without an explicit difficulty estimator: the entropy of the byte stream is itself the difficulty signal. Combined with latent-recurrence approaches that enable per-token adaptive depth, compute-optimal allocation now spans three granularity levels: prompt-level (Snell et al.), token-level (latent recurrence), and sub-token-level (BLT byte entropy). See Can byte-level models match tokenized performance with better efficiency?.
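
A sketch of one simple variant of the entropy-threshold patching rule, under stated assumptions: `entropy_model` stands in for the small byte-level LM that BLT uses to score next-byte uncertainty, and the threshold and patch-length cap are illustrative values, not the paper's.

```python
import math


def next_byte_entropy(probs):
    """Shannon entropy (bits) of a 256-way next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def patch_bytes(data, entropy_model, threshold=2.0, max_patch_len=16):
    """Group raw bytes into variable-length patches. A boundary is placed
    where predicted next-byte entropy crosses the threshold, so surprising
    regions end up in more, shorter patches."""
    patches, current = [], []
    for i, b in enumerate(data):
        current.append(b)
        # entropy_model: hypothetical small byte LM returning P(next byte | prefix)
        h = next_byte_entropy(entropy_model(data[: i + 1]))
        if h > threshold or len(current) >= max_patch_len:
            patches.append(bytes(current))
            current = []
    if current:
        patches.append(bytes(current))
    return patches
```

Shorter patches in high-entropy regions mean the large latent transformer runs more often exactly where the byte stream is hardest to predict.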

Model routing as a complementary optimization axis: RouteLLM, Hybrid-LLM, and Avengers-Pro (from Arxiv/Routers) demonstrate that which model handles a query is an optimization dimension independent of how much compute each query receives. Avengers-Pro routes via embedding-cluster scoring and either surpasses GPT-5-medium by 7% or matches it at 27% lower cost. Hybrid-LLM adds a quality threshold tunable at test time. The two axes, compute allocation and model selection, are independent and composable: route to a smaller model and give it less compute on easy queries, or route to a larger model and give it more compute on hard ones. Compute-optimal allocation now spans four dimensions: prompt-level budget (Snell et al.), token-level depth (latent recurrence), sub-token granularity (BLT), and model selection (routing). See Can routers select the right model before generation happens? and Can routing beat building one better model?.
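
A sketch of how the two axes compose in code. Every name here (`estimate_difficulty`, `small-model`, `large-model`, the budget values) is hypothetical; Hybrid-LLM's actual router is a learned quality-gap predictor and Avengers-Pro scores embedding clusters, but any scorer exposing a scalar difficulty slots into the same shape.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str       # which model serves the query (model selection axis)
    n_samples: int   # how much inference compute it gets (budget axis)


def route_and_budget(query, estimate_difficulty, quality_threshold=0.5,
                     easy_budget=1, hard_budget=16):
    """Drive both decisions from one difficulty estimate. The threshold is
    tunable at test time, mirroring Hybrid-LLM: lower it to favor quality
    (more traffic to the large model), raise it to favor cost."""
    d = estimate_difficulty(query)  # hypothetical scorer returning a value in [0, 1]
    if d < quality_threshold:
        return Route(model="small-model", n_samples=easy_budget)
    return Route(model="large-model", n_samples=hard_budget)
```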


Source: Test Time Compute
