Can weaker models match stronger ones with sufficient search and reasoning budget?

This explores whether more inference compute (longer search, bigger reasoning budgets) lets a smaller or weaker model close the gap with a stronger one, or whether the gap is baked in earlier and no amount of compute can buy it back.

Can a weaker model simply *think longer* to match a stronger one? The corpus splits sharply on what "weaker" even means. The cleanest answer is that **how a model was trained matters more than how much compute you give it at inference time.** Non-reasoning models don't catch up to reasoning models no matter how large their inference budget, because training instills a protocol that makes extra tokens *productive* rather than just more numerous; without it, the extra tokens are wasted motion (see "Can non-reasoning models catch up with more compute?"). So budget alone is the wrong lever; the lever is what kind of model is spending it.

But once you fix the training side, weaker-by-size models genuinely can match stronger ones. A small model trained with DPO on a large teacher's correct and incorrect function-calling examples reaches high accuracy on logical and mathematical tasks, because the negative examples target exactly the rigid output-format failures that plain supervised fine-tuning leaves in (see "Can small models match large models on function calling?"). Even more striking: student models trained on *pruned* reasoning chains, where the symbolic-computation tokens are preferentially kept and the grammar and meta-discourse are stripped first, outperform students trained on chains compressed by a frontier model (see "Which tokens in reasoning chains actually matter most?"). The small model didn't need the big model's budget; it needed the big model's *signal*, distilled.

The deeper surprise is that for many tasks, the extra reasoning budget buys the *strong* models almost nothing either. On constrained optimization, LLMs plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, and reasoning variants show no systematic edge over standard ones: extended chain-of-thought produces more text, not more actual iterative computation (see "Do larger language models solve constrained optimization better?" and "Do reasoning models actually beat standard models on optimization?"). Frontier reasoning models such as DeepSeek-R1 and o1-preview reach only 20–23.6% exact match on constraint-satisfaction problems that require real backtracking (see "Can reasoning models actually sustain long-chain reflection?"). Where there's a ceiling, both weak and strong models sit under it, so the question "can the weak one catch up?" partly dissolves: there's less to catch up *to* than the leaderboards suggest.

And some of what looks like a reasoning gap turns out not to be reasoning at all. Model collapses on long procedures are often *execution* failures: the model knows the algorithm but can't carry out the steps at scale in pure text, and once it is given a tool it solves problems past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Failures also track instance *novelty*, not task complexity: models pattern-match to instances they've seen rather than running a general algorithm (see "Do language models fail at reasoning due to complexity or novelty?"). The unsettling implication is that a weaker model armed with the right tool or trained on the right instances can leapfrog a stronger model reasoning unaided. The bottleneck is execution bandwidth and familiarity, not raw horsepower.

The ceiling that compute can't break is more fundamental: a model can't reliably improve itself past the gap between *generating* an answer and *verifying* it. Every dependable fix needs something external to validate it, so no amount of internal search or metacognition closes the loop alone (see "What stops large language models from improving themselves?"). That's why the productive moves in this corpus aren't "spend more budget" but "add a better signal": teacher preferences, confidence as a reward (see "Can model confidence work as a reward signal for reasoning?"), pruned traces, and learned routing between thinking hard and answering fast (see "Can models learn when to think versus respond quickly?"). A weaker model can match a stronger one, but through what it's trained on and what it's allowed to call, not through sheer search.

Sources (11 notes)

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
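
To make the role of the negative examples concrete, here is a minimal sketch of the standard DPO objective for a single (chosen, rejected) pair. It is not the note's actual training code: the function names, the `beta` default, and the log-probability values are illustrative assumptions, and the inputs are taken to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is a sequence log-probability. `beta` (an assumed
    default here) scales how strongly the policy is pushed away from
    the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The rejected (incorrect) example enters with a negative sign, so the
    # loss falls when the policy moves probability away from it.
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) is the same as softplus(-z).
    return softplus(-z)


# Hypothetical log-probabilities for a correct and a malformed function call.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -9.0, -5.0, -5.0)
```

With the policy identical to the reference the loss is log 2. It drops once the policy raises the correct call and lowers the malformed one, which is the pressure that plain supervised fine-tuning on correct examples alone does not apply.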

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
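
One way to picture greedy likelihood-preserving pruning is the toy sketch below. It assumes a scoring function that returns the log-likelihood of the final answer given a candidate chain, and at each step drops the token whose removal hurts that score least. The name `greedy_prune`, the `keep` parameter, and the stand-in scorer are hypothetical; the note's actual procedure and scoring model are not specified here.

```python
from typing import Callable, List, Sequence


def greedy_prune(tokens: Sequence[str],
                 answer_logprob: Callable[[List[str]], float],
                 keep: int) -> List[str]:
    """Drop tokens one at a time until only `keep` remain.

    At each step, remove the token whose deletion leaves the highest
    answer log-likelihood (ties go to the earliest position).
    """
    keep = max(keep, 0)
    kept = list(tokens)
    while len(kept) > keep:
        best_idx, best_score = 0, float("-inf")
        for i in range(len(kept)):
            candidate = kept[:i] + kept[i + 1:]
            score = answer_logprob(candidate)
            if score > best_score:
                best_idx, best_score = i, score
        del kept[best_idx]
    return kept


def toy_answer_logprob(chain: List[str]) -> float:
    """Stand-in scorer (an assumption): rewards surviving symbolic tokens.

    A real setup would query a language model for the answer's likelihood.
    """
    symbolic = {"3", "*", "4", "=", "12"}
    return float(sum(1 for tok in chain if tok in symbolic))


chain = ["so", "we", "compute", "3", "*", "4", "=", "12", "obviously"]
pruned = greedy_prune(chain, toy_answer_logprob, keep=5)
```

Under this stand-in scorer the filler words are removed first and the arithmetic survives, mirroring the pattern the note reports. Each step rescans every remaining token, so a real implementation would need batching or caching to be practical on long chains.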

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
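
The ranking step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the RLSF implementation: confidence is taken to be the mean token probability over the answer span, and `Trace`, `answer_confidence`, `preference_pairs`, and `min_gap` are hypothetical names introduced for the example.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Trace(NamedTuple):
    text: str                        # full reasoning trace plus answer
    answer_token_probs: List[float]  # model probabilities of the answer tokens


def answer_confidence(trace: Trace) -> float:
    """Mean probability over the answer span (an assumed confidence measure)."""
    probs = trace.answer_token_probs
    return sum(probs) / len(probs) if probs else 0.0


def preference_pairs(traces: List[Trace],
                     min_gap: float = 0.0) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) text pairs from confidence rankings.

    A pair is kept only when the confidence gap exceeds `min_gap`, so
    near-ties do not become training signal.
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    # combinations() preserves order, so `hi` never ranks below `lo`.
    for hi, lo in combinations(ranked, 2):
        if answer_confidence(hi) - answer_confidence(lo) > min_gap:
            pairs.append((hi.text, lo.text))
    return pairs


samples = [
    Trace("trace B", [0.5, 0.7]),
    Trace("trace A", [0.9, 0.8]),
    Trace("trace C", [0.2, 0.4]),
]
all_pairs = preference_pairs(samples)
wide_pairs = preference_pairs(samples, min_gap=0.4)
```

The resulting pairs could then feed a preference-based objective without human labels. The sketch leaves open how confidence is calibrated, which is the property the note says this approach is meant to protect.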

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
