INQUIRING LINE

Which prompt properties determine whether variance helps under majority voting?

This explores what makes sampling variance an asset rather than a liability when you take the majority answer across many runs — i.e., which features of a prompt determine whether spread across samples converges on truth or just amplifies noise.


The corpus converges on a single underlying principle: variance helps only when the model's answer distribution is already centered on the right answer, so that consensus pulls toward truth instead of away from it. The clearest statement is a hard threshold: majority-vote reward works only when prior accuracy is above roughly 50%; below that line, the same voting mechanism silently amplifies wrong answers, because the consensus is confidently incorrect (see "When does majority-vote reward actually help test-time learning?"). So the first prompt property is simply whether the prompt sits in the regime where the model is more right than wrong. The same logic is what makes unlabeled self-improvement work at all: consensus answers tend to be correct, which is why bootstrapping on majority votes can train a model with no ground truth (see "Can models improve themselves using only majority voting?").
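The roughly-50% threshold can be illustrated with a small calculation. The sketch below is not taken from the source notes. It assumes a deliberately simplified model in which every sample is independently either right (with probability p) or wrong, and an odd number of samples n is drawn so there are no ties. The function name majority_correct_prob is hypothetical. Real prompts usually have many possible answers and correlated samples, so this shows only the direction of the effect, not a prediction for any particular model.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n samples is correct.

    Assumes a binary right/wrong outcome per sample, independent samples,
    and per-sample accuracy p. These are simplifying assumptions.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    need = n // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1)
    )


if __name__ == "__main__":
    # Compare a prompt just below, at, and just above the 50% line.
    for p in (0.4, 0.5, 0.6):
        row = [round(majority_correct_prob(p, n), 3) for n in (1, 5, 25, 101)]
        print(f"p={p}: {row}")
```

Under these assumptions, adding samples pushes the majority answer toward correct when p is above 0.5 and toward incorrect when p is below 0.5, while at exactly 0.5 extra samples change nothing. That is the sense in which the same voting mechanism can either concentrate on the truth or amplify a wrong consensus.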


Sources (8 notes)

When does majority-vote reward actually help test-time learning?

Test-time RL via consensus succeeds when prior accuracy exceeds ~50%, but below that threshold it silently amplifies wrong answers. Safe deployment requires gated probing per prompt class to confirm the favorable regime before training.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Does prompt optimization without inference strategy fail?

Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.

Why does majority voting outperform more complex inference methods?

Across benchmarks, majority voting empirically outperforms or matches Best-of-N and sequential revision approaches. Its robustness stems from avoiding unreliable verifiers, poor self-assessment, and unnecessary complexity—making it the right baseline for evaluating reasoning model improvements.
