Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?

Note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

For reasoning models, majority voting across independent samples is a surprisingly strong baseline that sophisticated inference-time methods struggle to beat. Think Deep, Think Fast finds it generally competitive with, and often better than, Best-of-N (which requires an external reward model) and sequential revision (which requires the model to self-evaluate).

The robustness comes from what majority voting doesn't do: it doesn't require a verifier (which can be wrong), it doesn't require self-assessment (which reasoning models are poor at), and it doesn't rely on trace length (which is negatively correlated with correctness). It just exploits statistical redundancy across independent samples.

This doesn't mean majority voting is optimal — it's ceiling-limited: it can never surface a correct answer the model produces only rarely. But it's the right default: simple, interpretable, and hard to beat without investing significantly in verifier quality. The research implication is that gains from more complex methods should be benchmarked against majority voting, not against single-sample baselines. Many reported improvements in the literature may not survive this comparison.

Extreme decomposition + voting at million-step scale (MAKER): The MAKER framework pushes majority voting to its logical extreme by decomposing complex tasks into atomic subtasks executed by microagents, each validated by voting. At scale (1000+ steps), this achieves error-free execution that no single-agent approach matches. MAKER also reveals scaling laws for multi-agent systems: more agents improve performance on complex tasks but hurt simple tasks (communication overhead exceeds the benefit), and there's a critical complexity threshold below which single agents dominate.

This extends the majority-voting baseline finding: voting's robustness is not just a property of independent sampling at the problem level — it works at every level of decomposition, from whole-problem voting down to atomic-subtask voting. The practical implication: when individual subtask accuracy is high (>95%), voting over decomposed subtasks compounds reliability multiplicatively, as the back-of-envelope calculation below illustrates. See Can extreme task decomposition enable reliable execution at million-step scale?.
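To see the compounding concretely, here is a back-of-envelope model. The assumptions are mine, not MAKER's published analysis (the note doesn't specify the exact voting rule): attempts at a subtask are independent, wrong attempts never agree on the same incorrect answer, and a step succeeds when a strict majority of its votes is correct.

```python
from math import comb

def voted_step_accuracy(p: float, votes: int) -> float:
    """P(strict majority of `votes` independent attempts is correct),
    given per-attempt accuracy p. Assumes odd `votes` and that wrong
    attempts never coincide on the same incorrect answer."""
    return sum(comb(votes, k) * p**k * (1 - p) ** (votes - k)
               for k in range(votes // 2 + 1, votes + 1))

p, steps = 0.95, 1_000_000
for votes in (1, 5, 11, 21):
    q = voted_step_accuracy(p, votes)
    # A run is error-free only if every sequential step succeeds: q**steps.
    print(f"{votes:2d} votes: step accuracy {q:.8f}, "
          f"million-step success {q**steps:.3g}")
```

At 95% per-attempt accuracy, a single attempt per step fails a million-step run almost surely, while 21 votes per step pushes end-to-end success near 99.9%; that is the sense in which per-subtask voting compounds reliability multiplicatively.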


Source: Test Time Compute
