Why do some prompts benefit from aggregation while others do not?

This reads 'aggregation' as inference-time strategies that pool many samples — majority voting, best-of-N, self-consistency — and asks why pooling helps some prompts but wastes effort on others.

The corpus points to one underlying variable: how much genuine uncertainty the model has on that specific prompt. Aggregation only helps when there is a spread of answers to pool, and that spread depends on the prompt, the model, and the question's difficulty. It is not a property of aggregation being universally good.

The clearest mechanism comes from work on prompt sensitivity and confidence (Does model confidence predict robustness to prompt changes?). When a model is highly confident, its output barely moves under rephrasing, and repeated sampling tends to return nearly the same answer every time, so voting just returns the same thing at N times the cost. Aggregation earns its keep on the low-confidence prompts where outputs swing from run to run. The same logic surfaces in persona simulation (Why do LLM persona prompts produce inconsistent outputs across runs?), where run-to-run variance can match or exceed the differences between distinct personas. There, the 'spread' is noise rather than signal, so aggregating it doesn't recover a stable answer; it just averages confusion. Variance is therefore necessary but not sufficient: it has to be the right kind of variance.
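The point about confidence and spread can be made concrete with a minimal sketch of majority voting. The function name `majority_vote` and the three sample lists are hypothetical and invented for illustration; they are not taken from any of the cited notes.

```python
from collections import Counter

def majority_vote(samples):
    """Return (winning answer, share of samples that agree with it)."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical answers from 8 runs of the same prompt (illustrative only).
confident = ["42"] * 8                                        # no spread
uncertain = ["42", "41", "42", "40", "42", "41", "42", "42"]  # spread with a majority
scattered = ["A", "B", "C", "D", "E", "F", "G", "H"]          # spread with no majority

print(majority_vote(confident))     # ('42', 1.0)
print(majority_vote(uncertain))     # ('42', 0.625)
print(majority_vote(scattered)[1])  # 0.125
```

In the first case voting adds cost and nothing else; in the second it stabilizes an answer that a single run would miss about a third of the time; in the third there is plenty of variance but nothing for the vote to recover.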

The most direct answer is that aggregation can't be chosen in isolation from the prompt. One study found that prompts optimized without knowledge of the inference strategy systematically underperform, and that jointly tuning the prompt and the aggregation method yields improvements of up to 50% (Does prompt optimization without inference strategy fail?). A prompt that's great for a single greedy answer is not necessarily the prompt that's great for best-of-N, which means 'does aggregation help here' is partly a property of how the prompt was written for it.
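A toy sketch of why decoupled tuning can backfire: pick the prompt that wins under greedy decoding, then apply voting, and compare with choosing the prompt and strategy together. The prompt names and every score below are made up for illustration, and the exhaustive search is an assumption for the sake of the example, not the method of the cited study.

```python
# Hypothetical validation accuracies, scores[prompt][strategy].
# All numbers are invented for illustration.
scores = {
    "prompt_a": {"greedy": 0.70, "majority_vote": 0.71},
    "prompt_b": {"greedy": 0.64, "majority_vote": 0.78},
}

def best_prompt_for(strategy):
    """Prompt with the highest score under one fixed inference strategy."""
    return max(scores, key=lambda p: scores[p][strategy])

def best_joint():
    """(prompt, strategy) pair with the highest score overall."""
    pairs = ((p, s) for p in scores for s in scores[p])
    return max(pairs, key=lambda ps: scores[ps[0]][ps[1]])

# Decoupled: tune the prompt for greedy decoding, then bolt voting on.
decoupled_prompt = best_prompt_for("greedy")
decoupled_score = scores[decoupled_prompt]["majority_vote"]

# Joint: search prompt and strategy together.
joint_prompt, joint_strategy = best_joint()
joint_score = scores[joint_prompt][joint_strategy]

print(decoupled_prompt, decoupled_score)          # prompt_a 0.71
print(joint_prompt, joint_strategy, joint_score)  # prompt_b majority_vote 0.78
```

In this toy table the greedy-optimal prompt gains almost nothing from voting, while the prompt that looks worse under greedy decoding is the one that benefits.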

Difficulty is the other lever. Compute-optimal scaling shows that the effectiveness of extra inference compute varies sharply by prompt: hard prompts reward more samples, easy ones don't, and reallocating the same budget toward the hard cases beats spending it uniformly (Can we allocate inference compute based on prompt difficulty?). Instance-adaptive prompting sharpens this: for simple questions, a direct question-to-answer path beats elaborate reasoning, and forcing extra structure (the kind aggregation amplifies) can actively hurt (Why do some questions perform better without step-by-step reasoning?). There's even a hint of where the variance lives mechanistically: only ~20% of tokens are high-entropy 'forking points' where the model could branch (Do high-entropy tokens drive reasoning model improvements?). Prompts whose answers hinge on many such forks have real branching to aggregate over; prompts that don't, don't.
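The reallocation idea can be sketched as a fixed sample budget split in proportion to an estimated difficulty score. The function `allocate_samples`, the difficulty values, and the proportional rule are assumptions chosen for illustration; the cited work may estimate difficulty and allocate compute quite differently.

```python
def allocate_samples(difficulty, budget):
    """Give every prompt one sample, then split the rest in proportion to difficulty.

    `difficulty` holds non-negative scores (hypothetical estimates, higher = harder).
    The returned counts always sum to `budget`.
    """
    n = len(difficulty)
    if budget < n:
        raise ValueError("budget must cover at least one sample per prompt")
    spare = budget - n
    total = sum(difficulty)
    # Fall back to a uniform split if no prompt is rated as difficult.
    weights = [d / total for d in difficulty] if total > 0 else [1 / n] * n
    raw = [w * spare for w in weights]
    extra = [int(r) for r in raw]
    # Hand the samples lost to rounding to the largest fractional remainders.
    leftover = spare - sum(extra)
    by_remainder = sorted(range(n), key=lambda i: raw[i] - extra[i], reverse=True)
    for i in by_remainder[:leftover]:
        extra[i] += 1
    return [1 + e for e in extra]

# Two easy prompts and one hard one, 12 samples in total.
print(allocate_samples([0.1, 0.1, 0.8], 12))  # [2, 2, 8]
```

A uniform split of the same budget would give each prompt 4 samples, spending most of them on prompts where extra samples change little.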

The thing you didn't know you wanted to know: aggregation isn't a quality booster you bolt onto every prompt. It's a bet that the prompt sits in a high-uncertainty, high-difficulty regime where the model's own branching produces a recoverable majority. Spend it there, and skip it where the model already knows. All of this is model-dependent too: the prompt techniques that help cheap models often hurt strong ones (Do prompt techniques work the same across all LLM tiers?), so the same prompt can be worth aggregating on one model and a waste on another.

Sources (7 notes)

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Does prompt optimization without inference strategy fail?

Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Why do some questions perform better without step-by-step reasoning?

Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
