
Does prompt optimization without inference strategy fail?

Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?

Note · 2026-02-23 · sourced from Inference time scaling
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

The standard practice treats prompt optimization and inference scaling as independent. Optimize the prompt first (via reward-based search, instruction tuning, etc.), then separately decide the inference strategy (best-of-N sampling, majority voting, etc.). IAPO demonstrates this decoupling is a methodological error with measurable cost.

The mechanism: different prompts generate responses with different distributional properties. Some prompts produce outputs that are individually strong but don't benefit from aggregation — their variance is low, so generating N samples and voting adds compute without improving quality. Other prompts produce outputs with higher variance but better centering — individually weaker, but under majority voting or best-of-N with a reward model, the aggregation exploits the variance to select high-quality responses. A prompt optimized at N=1 will favor the first type. But if the deployment uses N=8 with majority voting, the second type is strictly better.
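The variance-vs-aggregation tradeoff can be made concrete with a toy binomial model. This is a sketch, not IAPO itself: the per-sample accuracies (0.70 and 0.60) are hypothetical, and treating the first prompt's samples as perfectly correlated (so voting is a no-op) is a simplifying assumption.

```python
from math import comb

def vote_accuracy(p: float, n: int) -> float:
    """Accuracy of majority voting over n independent samples,
    each correct with probability p (odd n, so no ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Hypothetical per-sample accuracies for two prompt styles:
# prompt A: strong single-shot answers but near-deterministic output,
#   so extra samples are redundant and voting cannot help.
p_a = 0.70
# prompt B: weaker single-shot answers, but samples vary independently,
#   so majority voting exploits the variance.
p_b = 0.60

for n in (1, 9):
    acc_a = p_a                      # correlated samples: voting is a no-op
    acc_b = vote_accuracy(p_b, n)    # independent samples: voting helps
    print(f"N={n}:  prompt A = {acc_a:.3f}   prompt B = {acc_b:.3f}")
```

At N=1 the first prompt wins (0.700 vs 0.600); at N=9 majority voting lifts the second prompt above it (roughly 0.733), which is exactly the ranking flip described above.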

This creates "deceiving prompts" — prompts that appear optimal in single-shot evaluation but become suboptimal (or harmful) under inference scaling. The PSST algorithm addresses this by treating prompt selection and inference scale as a joint contextual best-arm identification problem, exploring prompt-inference configurations together rather than sequentially.
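To see why joint exploration matters, here is a minimal sketch of evaluating (prompt, sample-count) configurations together versus sequentially. The `pull` environment and its probabilities are hypothetical stand-ins, and the uniform-exploration loop is a deliberately naive placeholder for PSST's contextual best-arm identification, not the actual algorithm.

```python
import random

random.seed(0)

# Toy environment (hypothetical numbers): one pull returns 1 if the
# aggregated answer is correct. Prompt "A" is near-deterministic, so
# extra samples are redundant; prompt "B" varies across samples, so
# majority voting over n samples helps it.
def pull(prompt: str, n: int) -> float:
    if prompt == "A":
        return 1.0 if random.random() < 0.70 else 0.0
    votes = sum(random.random() < 0.60 for _ in range(n))
    return 1.0 if votes > n / 2 else 0.0

def mean_reward(prompt: str, n: int, pulls: int = 20_000) -> float:
    return sum(pull(prompt, n) for _ in range(pulls)) / pulls

# Disjoint pipeline: pick the best prompt at n=1 first, then scale it.
best_single = max(("A", "B"), key=lambda p: mean_reward(p, 1))

# Joint pipeline: explore (prompt, n) configurations together.
configs = [(p, n) for p in ("A", "B") for n in (1, 9)]
means = {c: mean_reward(*c) for c in configs}
best_joint = max(means, key=means.get)

print("disjoint picks prompt:", best_single)  # locks in the deceiving prompt
print("joint picks config:   ", best_joint)
```

The disjoint pipeline selects prompt "A" (the deceiving prompt) because it dominates at n=1, then gains nothing from scaling; the joint search finds that ("B", 9) yields higher reward.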

The empirical results across six tasks: IAPO outperforms disjoint optimization by up to 25% and prompt-only optimization by up to 50%. The gains are consistent across mathematical reasoning, commonsense reasoning, and multi-objective text generation.

The practical implication for inference system design: any pipeline that separately optimizes prompts and inference strategies is leaving significant performance on the table. This builds on the earlier question of whether we can allocate inference compute based on prompt difficulty; the IAPO finding adds a second dimension — not just how much inference compute per prompt, but which prompt given the inference strategy. The two must be co-optimized.


Source: Inference time scaling


prompt optimization decoupled from inference scaling produces systematic misalignment — joint optimization outperforms disjoint by up to 50 percent