When can weak models match strong model performance?

Can sampling many weak-model calls replicate strong-model results? This note explores whether more attempts plus a selection mechanism can bridge the performance gap without fundamentally stronger reasoning.

Synthesis note · 2026-06-03 · sourced from Test Time Compute

Can a committee of weak reasoning-model calls reach the performance of a much stronger model? The honest answer is "yes, but not because more agents help." The mechanism is boosting: sampling exposes latent correct solutions in the proposal pool, but critics and comparators must then recover them without access to the hidden verifier.

The paper separates four quantities (proposal coverage, local identifiability, progress, and diversity) and proves a sharp limit. Repeated sampling can amplify coverage, but coverage alone cannot create useful critics or comparators. Reliable amplification requires an additional local soundness signal: execution, proof checking, type checking, tests, or constraint solving. With it, rank-based bounds show when local selection errors compose into reliable trajectories. Without it, weak-model failure turns out to be a selection failure rather than an information failure: on SWE-bench Verified, patches that pass the hidden tests often appear in a pool of nano-model proposals even when a single call fails.
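The gap between coverage and selection can be made concrete with a small calculation. The sketch below is an idealization that is not taken from the paper: it assumes each weak-model call independently yields a correct proposal with probability `p`, and it compares a selector with no soundness signal (a uniform random pick) against one backed by a sound and complete local check. All function names are hypothetical.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent proposals is correct.

    Assumes independent calls, each correct with probability p.
    """
    return 1.0 - (1.0 - p) ** k


def blind_selection_accuracy(p: float, k: int) -> float:
    """Accuracy when one proposal is picked uniformly at random from the pool.

    Every proposal is correct with probability p, so a selector that cannot
    tell proposals apart stays at p no matter how large k grows.
    """
    return p


def checked_selection_accuracy(p: float, k: int) -> float:
    """Accuracy when a sound and complete local check flags correct proposals.

    Under that (strong) assumption the selector succeeds whenever the pool
    contains a correct proposal, so accuracy equals coverage.
    """
    return coverage(p, k)


if __name__ == "__main__":
    p = 0.1  # illustrative per-call success rate, not a measured value
    for k in (1, 10, 50):
        print(k, coverage(p, k), blind_selection_accuracy(p, k),
              checked_selection_accuracy(p, k))
```

Under these assumptions coverage climbs quickly with `k` while blind selection stays flat at `p`, which is one way to read the claim that coverage alone cannot create useful critics.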

This identifies two distinct ceilings. When a correct patch is in the pool but the harness picks another, the bottleneck is identifiability, and better critics, tests, or aggregation help. When no correct patch appears at all, the bottleneck is coverage, and no selector can recover an absent solution. The result disciplines the "scale up agents" intuition: it pins the gain to verifiable domains and to the presence of a sound local check. It extends the note "What limits how much models can improve themselves?" to the committee setting: the verification advantage is what turns latent coverage into solve rate.
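The two ceilings can be told apart in an offline analysis once hidden-verifier results are available for every proposal in the pool. The following is a minimal sketch under that assumption; the function name `diagnose` and its labels are hypothetical, and the hidden results are used only for post-hoc diagnosis, never by the selector itself.

```python
from typing import Optional, Sequence


def diagnose(pool_passes_hidden: Sequence[bool],
             selected_index: Optional[int]) -> str:
    """Classify one task outcome as solved, identifiability-bound, or coverage-bound.

    pool_passes_hidden[i] says whether proposal i passes the hidden verifier
    (known only after the fact). selected_index is the harness's pick, or
    None if it selected nothing.
    """
    if selected_index is not None and pool_passes_hidden[selected_index]:
        return "solved"
    if any(pool_passes_hidden):
        # A correct proposal existed but was not chosen: a better critic,
        # test suite, or aggregation rule could have recovered it.
        return "identifiability"
    # No correct proposal in the pool: no selector could have succeeded.
    return "coverage"


if __name__ == "__main__":
    print(diagnose([False, True, False], 1))
    print(diagnose([False, True, False], 0))
    print(diagnose([False, False, False], 2))
```

Tallying these labels across a benchmark would indicate whether effort is better spent on the selector or on the proposal distribution.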

Related concepts in this collection 3

What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
the local soundness signal is the verification side of that gap operating at inference time
Can large language models actually create executable plans? Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.
why self-critique without an external sound verifier fails: no local soundness signal
When does adding more agents actually help systems? Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
"more agents" is conditional; this note supplies the verifiable-domain condition
