Agentic Systems as Boosting Weak Reasoning Models

Paper · arXiv 2605.14163 · Published May 13, 2026
Test-Time Compute

Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that “more agents help”: samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks.
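The proposer-side ceiling described above can be illustrated with a small numerical sketch. It assumes independent samples and a task distribution split into slices, each with its own per-sample probability of producing a correct proposal. The slice weights and probabilities below are hypothetical illustration values, not figures from the paper, and the function names are ours.

```python
def oracle_best_of_k(slices, k):
    """Expected oracle best-of-k solve rate.

    slices: list of (weight, p) pairs, where weight is the share of tasks
    in the slice and p is the per-sample probability of a correct proposal.
    Assumes the k samples are independent (a simplifying assumption).
    """
    return sum(w * (1.0 - (1.0 - p) ** k) for w, p in slices)


def coverage_ceiling(slices):
    """Limit of oracle best-of-k as k grows: mass of slices with p > 0."""
    return sum(w for w, p in slices if p > 0.0)


# Hypothetical slices: easy, hard-but-reachable, and unreachable (p = 0).
SLICES = [(0.5, 0.6), (0.3, 0.05), (0.2, 0.0)]

if __name__ == "__main__":
    for k in (1, 4, 16, 64, 256):
        print(k, round(oracle_best_of_k(SLICES, k), 4))
    print("ceiling", coverage_ceiling(SLICES))
```

Under these assumptions the solve rate rises with k but never exceeds the mass of slices with nonzero per-sample success probability: the slice with p = 0 contributes nothing no matter how many samples are drawn.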

Introduction. Boosting turns weak predictors into strong predictors by repeatedly combining imperfect but useful signals [1–3]. Modern language-model systems use a related idea at inference time: they sample several candidates, check or compare them, search over partial states, and select a final output [4–9]. However, reasoning is not ordinary supervised boosting. In supervised prediction, each weak learner returns a label that can be evaluated against training examples. In reasoning, the system must instead generate an intermediate move, decide whether that move is useful, and avoid letting small local errors accumulate into a wrong final answer. We study this mechanism for verifier-backed reasoning tasks such as code repair, theorem proving, and program synthesis. These domains provide tests, proof checkers, type checkers, execution, or constraint solvers that can supply local soundness signals [10–15].

Discussion / Conclusion. The main lesson is that weak-model failure is often not a lack of information, but a failure to select it. On SWE-bench Verified, hidden-test-passing patches often appear in a pool of GPT-5.4 nano proposals even when a single nano call fails. Thus the central question is not only whether weak models can generate correct solutions, but whether an inference-time harness can identify them among several imperfect candidates. Our results show that this selection problem is substantially solvable, but not for free. Critics remove candidates with clear local defects, while comparators rank the remaining plausible patches. Together, they recover most of the oracle best-of-k gap, turning latent proposal-pool capability into actual solve rate. This explains how a committee of weak nano calls can approach much stronger standalone models: sampling exposes correct patches, and local selection makes them usable. The same decomposition also identifies the ceiling. When a correct patch is in the pool but the harness chooses another one, the bottleneck is identifiability: better critics, comparators, tests, or aggregation can help. When no correct patch appears, the bottleneck is proposal coverage: no selector can recover an absent solution.
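The critic-then-comparator selection and the two failure modes above can be sketched as follows. This is a minimal illustration, not the paper's harness: the `Patch` fields, the scalar comparator score (a stand-in for whatever pairwise or rank-based comparator is actually used), and the example pools are all hypothetical. The hidden-test flag is used only for after-the-fact diagnosis, never for selection.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patch:
    name: str
    passes_local_checks: bool  # visible signal, e.g. builds and passes visible tests
    score: float               # hypothetical comparator score (higher is better)
    passes_hidden_tests: bool  # ground truth, used only for diagnosis


def select(pool: List[Patch]) -> Optional[Patch]:
    """Critic step drops patches with clear local defects; comparator ranks the rest."""
    survivors = [p for p in pool if p.passes_local_checks]
    if not survivors:
        return None
    return max(survivors, key=lambda p: p.score)


def diagnose(pool: List[Patch], chosen: Optional[Patch]) -> str:
    """Attribute the outcome to success, identifiability, or proposal coverage."""
    if chosen is not None and chosen.passes_hidden_tests:
        return "solved"
    if any(p.passes_hidden_tests for p in pool):
        return "identifiability"  # a correct patch existed but was not chosen
    return "coverage"             # no correct patch in the pool at all


# Hypothetical pools illustrating the three outcomes.
POOL_A = [Patch("a1", False, 0.9, False), Patch("a2", True, 0.4, True),
          Patch("a3", True, 0.2, False)]
POOL_B = [Patch("b1", True, 0.8, False), Patch("b2", True, 0.3, True)]
POOL_C = [Patch("c1", True, 0.5, False)]

if __name__ == "__main__":
    for pool in (POOL_A, POOL_B, POOL_C):
        chosen = select(pool)
        print(chosen.name if chosen else None, diagnose(pool, chosen))
```

In the first pool the critic removes the highest-scoring but locally defective patch, so the comparator can pick the correct one. In the second, the comparator prefers a wrong patch even though a correct one is present. In the third, no selector could succeed.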