When can weak models match strong model performance?
Can sampling many weak model calls replicate strong model results? This explores whether more attempts and selection mechanisms can bridge the performance gap without fundamentally stronger reasoning.
Can a committee of weak reasoning-model calls reach the performance of a much stronger model? The honest answer is "yes, but not because more agents help." The mechanism is boosting: sampling exposes latent correct solutions in the proposal pool, but critics and comparators must then recover them without access to the hidden verifier.
The paper separates four quantities — proposal coverage, local identifiability, progress, and diversity — and proves a sharp limit. Repeated sampling can amplify coverage, but coverage alone cannot create useful critics or comparators. Reliable amplification requires an additional local soundness signal: execution, proof checking, type checking, tests, or constraint solving. With it, rank-based bounds show when local selection errors compose into reliable trajectories. Without it, weak-model failure is revealed as a selection failure, not an information failure — on SWE-bench Verified, hidden-test-passing patches often appear in a pool of nano-model proposals even when a single call fails.
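The gap between coverage and solve rate can be made concrete with a small calculation. The sketch below is not taken from the paper. It assumes each call independently yields a correct proposal with probability `p` (an idealization), and it models the local soundness signal as a perfect check that accepts exactly the correct proposals. All function names are hypothetical.

```python
def coverage(p: float, k: int) -> float:
    """Chance that at least one of k independent proposals is correct."""
    return 1.0 - (1.0 - p) ** k


def solve_rate_random_pick(p: float, k: int) -> float:
    """Solve rate when the harness picks one of k proposals uniformly at random.

    With no selection signal the pick is correct with probability p,
    independent of the pool size k.
    """
    return p


def solve_rate_sound_check(p: float, k: int) -> float:
    """Solve rate with an idealized check that accepts exactly the correct proposals.

    Such a check recovers a correct proposal whenever one is in the pool,
    so the solve rate equals coverage. This perfect check is an assumption.
    """
    return coverage(p, k)


if __name__ == "__main__":
    p = 0.1  # hypothetical per-call success probability
    for k in (1, 10, 50):
        print(k,
              round(coverage(p, k), 3),
              solve_rate_random_pick(p, k),
              round(solve_rate_sound_check(p, k), 3))
```

Under these assumptions, with `p = 0.1` and `k = 10` the pool contains a correct proposal about 65% of the time, yet a selector with no signal still solves only 10% of tasks. Sampling raises the ceiling; only the check converts it into solve rate.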
This identifies two distinct ceilings. When a correct patch is in the pool but the harness picks another, the bottleneck is identifiability: better critics, tests, or aggregation help. When no correct patch appears at all, the bottleneck is coverage: no selector can recover an absent solution. The result disciplines the "scale up agents" intuition by pinning the gain to verifiable domains and to the presence of a sound local check. It extends the note "What limits how much models can improve themselves?" to the committee setting, where the verification advantage is what turns latent coverage into solve rate.
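The two ceilings can be told apart in an offline evaluation where hidden-verifier outcomes are known after the fact. The following is a minimal sketch under that assumption. The `Attempt` record, `diagnose`, and `tally` are hypothetical names for illustration, not part of any benchmark harness.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Attempt:
    """One task attempt: a proposal pool plus the harness's choice (hypothetical record)."""
    passes_hidden: Sequence[bool]  # hidden-verifier outcome for each proposal in the pool
    selected: int                  # index of the proposal the harness picked


def diagnose(attempt: Attempt) -> str:
    """Label an attempt as solved, coverage-limited, or identifiability-limited."""
    if not any(attempt.passes_hidden):
        # No correct proposal exists, so no selector could have succeeded.
        return "coverage"
    if attempt.passes_hidden[attempt.selected]:
        return "solved"
    # A correct proposal was in the pool, but the harness picked another one.
    return "identifiability"


def tally(attempts: Iterable[Attempt]) -> Counter:
    """Count how many attempts fall under each label."""
    return Counter(diagnose(a) for a in attempts)


if __name__ == "__main__":
    attempts = [
        Attempt(passes_hidden=[False, True, False], selected=1),
        Attempt(passes_hidden=[False, True, False], selected=0),
        Attempt(passes_hidden=[False, False, False], selected=2),
    ]
    print(tally(attempts))
```

A tally dominated by "identifiability" suggests investing in critics, tests, or aggregation, while one dominated by "coverage" suggests that better selection alone will not help.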
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- How can expensive models efficiently support cheap models in production?
- Can weak models supervise the alignment of stronger models effectively?
- How much does workflow architecture matter compared to raw model capability in forecasting?
- Can test environments reliably predict how models behave in actual deployment?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
- How do complexity and diversity affect model performance differently?
Related concepts in this collection
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  the local soundness signal is the verification side of that gap operating at inference time
- Can large language models actually create executable plans?
  Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.
  why self-critique without an external sound verifier fails: no local soundness signal
- When does adding more agents actually help systems?
  Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.
  "more agents" is conditional; this note supplies the verifiable-domain condition
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Agentic Systems as Boosting Weak Reasoning Models
- Automated Alignment Researchers: Using large language models to scale scalable oversight
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Large Language Model Reasoning Failures
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Original note title
a committee of weak model calls matches strong models only when a local soundness signal converts latent correct solutions into selections