Reinforcement Learning for LLMs

Does the choice of reasoning framework actually matter for test-time performance?

Explores whether different slow-thinking methods such as Best-of-N (BoN) and Monte Carlo Tree Search (MCTS) produce meaningfully different outcomes, or whether total compute budget is the dominant factor in reasoning success.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search
How should we allocate compute budget at inference time?

"Rethinking External Slow-Thinking" provides the information-theoretic foundation for why different test-time scaling frameworks converge in effectiveness.

The mechanism is snowball errors: each reasoning step carries some probability of error, and an error propagates, corrupting every downstream step. The probability that a chain is fully correct therefore decays geometrically with its length. External slow-thinking methods (BoN, MCTS, Tree of Thoughts) mitigate this by expanding the search scope: generating multiple candidate paths and selecting among them. But the degree of mitigation is determined by total compute budget, not by the specific framework.
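Under the simplest version of this model (independent per-step errors, a perfect final verifier), the snowball decay and its BoN mitigation reduce to two one-line formulas. A minimal sketch; the function names and the 0.95 / 30-step numbers are illustrative, not from the paper:

```python
def p_chain_correct(p_step: float, n_steps: int) -> float:
    """Snowball model: one wrong step corrupts everything downstream,
    so a chain is correct only if every step is. Assumes independent
    per-step success probability p_step."""
    return p_step ** n_steps

def p_best_of_n(p_step: float, n_steps: int, n_chains: int) -> float:
    """Best-of-N with an ideal verifier: succeed if at least one of
    n_chains independent complete chains is fully correct."""
    p = p_chain_correct(p_step, n_steps)
    return 1.0 - (1.0 - p) ** n_chains

# Per-chain correctness decays geometrically with length,
# and sampling more chains claws some of it back:
print(round(p_chain_correct(0.95, 30), 3))   # ~0.215
print(round(p_best_of_n(0.95, 30, 8), 3))    # ~0.855
```

Note that the second formula depends only on how many chains you sample and how long they are, i.e. on total steps, with no term for which framework generated them.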

The analysis compares BoN and MCTS formally. BoN generates N complete chains in parallel and selects the best; MCTS uses tree search to allocate compute more strategically across branches. Yet in both the "best case" for MCTS (maximally efficient branching) and the "worst case" (degenerate branching), the probability of correct reasoning converges to that of BoN once the total number of reasoning steps is controlled for.

The implication: the specific framework matters far less than (a) how much total compute you allocate, and (b) how reliable your value function is for path selection. An inaccurate reward function introduces selection costs that can decrease the probability of correct reasoning — the additional compute is wasted on bad selections.
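The selection cost is easy to make concrete with a toy Monte Carlo, a sketch under the same snowball assumptions as above; the Gaussian score noise is my stand-in for an inaccurate reward model, not the paper's formulation:

```python
import random

def bon_noisy_verifier(p_step, n_steps, n_chains, noise,
                       trials=4000, seed=0):
    """Best-of-N where the verifier scores each chain as its true
    correctness (1.0 or 0.0) plus Gaussian noise, and we keep the
    top-scoring chain. noise=0 recovers ideal BoN; large noise makes
    selection near-random, so the extra chains buy nothing."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = [all(rng.random() < p_step for _ in range(n_steps))
                   for _ in range(n_chains)]
        scores = [float(c) + rng.gauss(0.0, noise) for c in correct]
        best = max(range(n_chains), key=scores.__getitem__)
        wins += correct[best]
    return wins / trials
```

With the same toy numbers as before (p_step=0.95, 30 steps, N=8), a noiseless verifier lands near the ideal-BoN rate, while heavy noise collapses realized accuracy back toward the single-chain baseline: the compute spent on the other seven chains is wasted on bad selections.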

This is the test-time analog of "Does the choice of RL algorithm actually matter for reasoning?". That finding showed training-time RL algorithm choice doesn't matter because the pretrained prior sets the ceiling. This one shows test-time framework choice doesn't matter because total compute and value-function quality set the ceiling. The same "algorithm is interchangeable" principle operates at both levels.

The practical consequence: rather than investing in more sophisticated test-time frameworks, invest in (a) expanding the total inference budget, (b) improving the reward/value function used for selection, or (c) improving the model's base reasoning capacity. Those produce sustained improvements; framework engineering does not. This complements "Can we allocate inference compute based on prompt difficulty?": compute-optimal scaling determines how to distribute budget across prompts (adaptively, by difficulty), while this finding says that within the allocated budget the specific framework is irrelevant. Together they define the optimization space: allocate adaptively across prompts, then use any framework within each allocation.
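That two-level split can be sketched in a few lines. The proportional-to-difficulty rule and the function name here are illustrative assumptions, not the compute-optimal policy from the paper: divide a fixed sample budget across prompts by estimated difficulty, then run whichever framework you like inside each prompt's share.

```python
def allocate_samples(difficulties, total_samples, min_samples=1):
    """Split a fixed sample budget across prompts in proportion to an
    estimated difficulty score (e.g., 1 - predicted solve rate).
    Harder prompts get more candidate chains; the framework used
    within each allocation (BoN, MCTS, ...) is deliberately left open."""
    weights = [max(d, 1e-9) for d in difficulties]
    total_w = sum(weights)
    return [max(min_samples, round(total_samples * w / total_w))
            for w in weights]

# e.g. three prompts of increasing difficulty sharing 30 samples:
print(allocate_samples([0.1, 0.5, 0.9], 30))   # [2, 10, 18]
```

Rounding can drift from the exact budget by a sample or two on awkward inputs; a largest-remainder pass would fix that, but the point here is only the shape of the policy.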



Original note title

external slow-thinking efficacy depends on total reasoning budget not framework choice — snowball error mitigation is compute-determined