What does pass@k reveal about base model reasoning capacity?
This note explores what the pass@k metric (letting a model take many attempts and counting it a success if any one lands) tells us about whether reasoning already lives inside base models or gets installed by training.
The underlying question is whether reasoning is something a base model already has or something post-training builds from scratch, and pass@k is the lens that exposes the difference. The corpus comes down hard on the "already has it" side. The clearest statement is that base models contain latent reasoning capability that minimal training merely unlocks: five independent techniques (RL steering, critique fine-tuning, decoding tweaks, feature steering, and RLVR) all surface reasoning that was sitting in base-model activations the whole time (see "Do base models already contain hidden reasoning ability?"). The punchline for pass@k is that post-training *selects* reasoning rather than *creating* it. A base model sampled enough times often produces the correct chain on its own; what RLVR-style training does is raise the odds that the first sample is the good one. So a high pass@k for the base model alongside a low pass@1 isn't a contradiction: it is evidence that the capability was latent and that the bottleneck is elicitation, not acquisition.
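To make the pass@1 versus pass@k gap concrete, here is a minimal sketch of the standard unbiased pass@k estimator used in code-generation evaluation: draw n samples per problem, count the c correct ones, and compute the chance that a random size-k subset contains at least one correct sample. The notes summarized here do not say which estimator they rely on, so treat the formula choice as an assumption; the function name `pass_at_k` and the sample counts are purely illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a uniformly chosen subset of k samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical base model: 4 correct chains out of 200 samples.
    n, c = 200, 4
    for k in (1, 10, 100):
        print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```

With these made-up counts the single-sample success rate is 2%, yet the estimate at k = 100 is well above 90%. That is the "latent but hard to elicit" signature the paragraph describes, not a measurement of any real model.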
That reframes a lot of the apparent "reasoning ceilings" you read about elsewhere in the collection. Several notes argue that when reasoning models collapse, the failure isn't a missing capability but something narrower. One shows collapses are execution failures, not reasoning failures: a text-only model can know an algorithm yet be unable to grind through its steps, and giving it tools lets it sail past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Another finds breakdowns track instance-level *unfamiliarity* rather than task complexity: models succeed on any chain resembling something they've seen and stumble on novel instance structures (see "Do language models fail at reasoning due to complexity or novelty?"). Both fit the pass@k picture: capacity exists but is unevenly retrievable, and a single sample undersells what's actually in there.
But the corpus won't let you read pass@k as proof of deep symbolic competence either. If base models can produce correct reasoning across enough samples, *what* are they producing? The skeptical notes suggest a lot of it is fluent imitation of reasoning form rather than genuine inference: chain-of-thought reproduces familiar schemata and degrades predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?"). Models lean on semantic associations, not formal logic; strip the familiar semantics away and performance collapses even when the rules are handed to them (see "Do large language models reason symbolically or semantically?"). And reasoning traces themselves turn out to be unreliable witnesses: invalid logical steps drive nearly the same performance gains as valid ones (see "Do reasoning traces show how models actually think?").
Put those together and pass@k reveals something double-edged. It shows base models carry a large reservoir of latent, distribution-bounded reasoning that training elicits rather than invents — which is why a base model with enough tries can match a tuned one. But it also inflates how much *genuine* reasoning you'd attribute to the model, because some fraction of those winning samples are well-shaped imitations that happen to land. The reservoir is real; its contents are a mix of competence and convincing pattern-completion.
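The inflation effect can be sketched with a toy calculation. Assume, purely for illustration, that each sample is independent, that it reaches the right answer through valid reasoning with probability `p_valid`, and that it lands on the right answer through a well-shaped imitation with probability `p_lucky`. None of these names or numbers come from the notes; they are hypothetical.

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """Chance that at least one of k independent samples succeeds,
    given a per-sample success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


def answer_vs_verified(p_valid: float, p_lucky: float, k: int):
    """Compare pass@k when only the final answer is checked against
    pass@k when the reasoning chain must also be valid."""
    answer_only = pass_at_k_iid(p_valid + p_lucky, k)  # counts lucky hits
    verified = pass_at_k_iid(p_valid, k)               # valid chains only
    return answer_only, verified, answer_only - verified


if __name__ == "__main__":
    # Hypothetical rates: 1% valid chains, 3% lucky imitations.
    for k in (1, 16, 64):
        a, v, gap = answer_vs_verified(0.01, 0.03, k)
        print(f"k={k:>2}  answer-only={a:.3f}  verified={v:.3f}  gap={gap:.3f}")
```

Under these assumptions the gap between answer-only and verified pass@k widens as k grows over this range, which is one way to read the claim that pass@k overstates genuine reasoning when only final answers are scored.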
If you want to push on the boundary, two notes sharpen it. One shows frontier reasoning models reach only about 20 to 23.6% exact match on constraint-satisfaction problems demanding real backtracking, a ceiling pass@k can't sample its way past when the capability genuinely isn't latent (see "Can reasoning models actually sustain long-chain reflection?"). Another argues that training regime, not inference budget, is what makes extra tokens productive, so simply cranking up k on a non-reasoning model won't close the gap with a model trained to reason (see "Can non-reasoning models catch up with more compute?"). The reservoir is deep, but it has a bottom.
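A small sketch shows why more samples cannot lift that kind of ceiling. Assuming independent samples and a fixed per-problem success probability (a simplification, with made-up probabilities and a hypothetical helper name), benchmark-level pass@k saturates at the fraction of problems the model can solve at all:

```python
def mean_pass_at_k(per_problem_p, k: int) -> float:
    """Average pass@k over problems, where per_problem_p[i] is the
    per-sample success probability on problem i (samples independent)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not per_problem_p:
        raise ValueError("need at least one problem")
    return sum(1.0 - (1.0 - p) ** k for p in per_problem_p) / len(per_problem_p)


if __name__ == "__main__":
    # Hypothetical benchmark: two problems the model never solves
    # (p = 0), one it rarely solves, one it solves fairly often.
    probs = [0.0, 0.0, 0.02, 0.3]
    for k in (1, 8, 64, 1024):
        print(f"pass@{k} = {mean_pass_at_k(probs, k):.3f}")
```

Here the curve climbs from 8% at k = 1 toward 50% and stops there, because the two problems with zero success probability contribute nothing at any k. Sampling recovers what is latent; it cannot supply what is absent.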
Sources (9 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.