Can models improve themselves using only majority voting?
Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.
The standard assumption in RL for LLMs is that ground-truth labels or a trained reward model are required. TTRL (Test-Time Reinforcement Learning) challenges this: by using majority voting across repeated samples as the reward signal, the model can train on unlabeled data at test time.
The logic is elegant: sample many responses to a question, and if a particular answer emerges as the majority, it is likely to be correct. That majority answer serves as a pseudo-label for computing rewards. The reward is noisy, but it is consistent enough to drive genuine policy improvement.
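In code, the reward computation is simple enough to sketch in a few lines. This is a minimal illustration, assuming final answers have already been extracted from the raw completions; the function name is mine, not from the paper:

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Treat the most frequent answer among N rollouts as a pseudo-label,
    then score each rollout 1.0 if it matches and 0.0 otherwise."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    return pseudo_label, [1.0 if a == pseudo_label else 0.0 for a in answers]

# Example: 8 sampled answers to one unlabeled question.
answers = ["42", "42", "17", "42", "42", "9", "42", "17"]
label, rewards = majority_vote_rewards(answers)
print(label, rewards)  # 42 [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```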
This opens a path toward model self-evolution that doesn't depend on human annotation or pre-trained reward models. The model uses its own inference-time behavior (its tendency to agree with itself) as a training signal. This is a form of bootstrapping: test-time compute enables reward estimation, which enables training, which improves the model.
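The loop structure might look like the sketch below. The policy interface (`sample`, `reinforce`) is hypothetical, named only for illustration; the point is how test-time compute feeds reward estimation, which feeds a training update:

```python
from collections import Counter

def extract_answer(completion: str) -> str:
    """Toy extractor: take the last whitespace-separated token.
    A real setup would parse a structured final answer (assumption)."""
    return completion.strip().split()[-1]

def ttrl_step(policy, prompt: str, n_rollouts: int = 64) -> None:
    """One bootstrapping step on a single unlabeled prompt. `policy` is a
    hypothetical object exposing .sample() and .reinforce(); the names
    are illustrative, not a real library API."""
    completions = policy.sample(prompt, n=n_rollouts)     # test-time compute
    answers = [extract_answer(c) for c in completions]
    pseudo_label, _ = Counter(answers).most_common(1)[0]  # reward estimation
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    policy.reinforce(prompt, completions, rewards)        # training update
```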
The economic implication: as real-world tasks grow more complex, large-scale annotation for RL becomes impractical, which makes reward estimation from unlabeled data an increasingly important scaling strategy.
Source: Test Time Compute
Related concepts in this collection
- Why does majority voting outperform more complex inference methods?
  Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
  Relation: majority voting here serves as a reward signal, not just an aggregation strategy.
- Can tree search replace human feedback in LLM training?
  Explores whether Monte Carlo Tree Search can generate quality signals for self-improvement without expensive human annotations. Matters because annotation bottlenecks currently limit LLM scaling.
  Relation: a parallel approach; MCTS derives quality signals from tree-search outcomes, TTRL from majority votes, and both sidestep the annotation bottleneck without human labels.
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  Relation: in tension with this note. Both use sample agreement as a reward but differ on robustness: TTRL claims surprisingly effective policy improvement, while the self-consistency analysis shows that confident-but-wrong consensus gets reinforced, predicting an upper bound on TTRL's gains and a hidden failure mode where the model becomes confidently incorrect on items its prior already got wrong.
- Why does self-rewarding training collapse when responses improve?
  Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?
  Relation: extends this note. TTRL's majority-vote pseudo-labels suffer the same gradient-collapse pathology once the model converges to a single answer and no preference signal remains (see the sketch after this list); temporal anchoring to past and future model versions provides a fix that majority voting alone cannot.
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they already could solve.
  Relation: bounds the claim. TTRL operates inside the base model's reasoning boundary, because the majority-vote signal is constrained by the base model's modal answer; "self-evolution" in TTRL is a sampling-efficiency improvement, not capability expansion.
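On the gradient-collapse point raised in the list above: a minimal sketch of why the signal vanishes, assuming a GRPO-style group-normalized advantage (an assumption about the training setup, not something this note specifies). Once every rollout matches the majority answer, rewards are uniform and the advantage is zero everywhere, whether the consensus is right or wrong:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within a rollout group (GRPO-style baseline)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every rollout earned the same reward: no preference signal,
        # so the policy-gradient update is zero for this prompt.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

print(group_advantages([1.0, 1.0, 0.0, 1.0]))  # disagreement -> nonzero signal
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # consensus -> [0.0, 0.0, 0.0, 0.0]
```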
Original note title: test-time rl on unlabeled data is possible using majority-vote reward estimation