Reinforcement Learning for LLMs

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.

Note · 2026-02-20 · sourced from Test Time Compute

The standard assumption in RL for LLMs is that ground-truth labels or a trained reward model are required. TTRL (Test-Time Reinforcement Learning) challenges this: by using majority voting across repeated samples as the reward signal, the model can train on unlabeled data at test time.

The logic is elegant: if you sample many answers to a question and one emerges as the majority, it is likely to be correct. That majority answer can then serve as a pseudo-label for computing rewards, as sketched below. The reward signal is noisy, but it is surprisingly effective: consistent enough to drive genuine policy improvement.
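A minimal sketch of this reward construction in Python, assuming final answers have already been extracted and normalized from the full completions (`majority_vote_reward` is an illustrative name, not from the TTRL paper):

```python
from collections import Counter

def majority_vote_reward(answers: list[str]) -> tuple[str, list[float]]:
    """Use the most frequent answer as a pseudo-label and score each
    sampled answer against it: 1.0 on a match, 0.0 otherwise."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Eight samples for one question: "42" wins the vote (5 of 8),
# so the agreeing rollouts receive reward 1.0.
samples = ["42", "41", "42", "42", "7", "42", "41", "42"]
label, rewards = majority_vote_reward(samples)
print(label, rewards)  # 42 [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```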

This opens a path toward model self-evolution that doesn't depend on human annotation or pre-trained reward models. The model uses its own inference-time behavior (its tendency to agree with itself) as a training signal. This is a form of bootstrapping: test-time compute enables reward estimation, which enables training, which improves the model.
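To make that loop concrete, here is a schematic sketch of one TTRL-style step. `sample_answers` and `policy_update` are hypothetical stubs standing in for temperature sampling from the current policy and an RL update (e.g. a GRPO- or PPO-style policy-gradient step); this shows the structure of the bootstrap, not the paper's implementation:

```python
import random
from collections import Counter

def sample_answers(question: str, n: int) -> list[str]:
    """Hypothetical stub: in TTRL this would be n temperature-sampled
    completions from the current policy, with final answers extracted."""
    return [random.choice(["42", "41", "7"]) for _ in range(n)]

def policy_update(question: str, answers: list[str], rewards: list[float]) -> None:
    """Hypothetical stub: a real implementation would run a
    policy-gradient step on the rewarded rollouts."""
    pass

def ttrl_step(question: str, n_samples: int = 8) -> None:
    # 1. Spend test-time compute: sample many candidate answers.
    answers = sample_answers(question, n_samples)
    # 2. Estimate a reward: the majority answer becomes the pseudo-label.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    # 3. Train: update the policy on its own self-labeled rollouts.
    policy_update(question, answers, rewards)

for question in ["What is 6 * 7?"]:  # unlabeled test questions
    ttrl_step(question)
```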

The economic implication: as real-world tasks increase in complexity, large-scale annotation for RL becomes impractical. TTRL's approach to reward estimation from unlabeled data becomes increasingly important as a scaling strategy.


