What distinguishes minimal-pair asymmetry from standard accuracy evaluation?

This explores a diagnostic trick — comparing a model's score on near-identical inputs that differ by one element — and why that gap reveals things a single accuracy number cannot.

A minimal-pair test runs the model on two almost-identical prompts that differ by one element and watches how the score *moves*; standard accuracy reports only the final score on one version. That movement is what exposes mechanism. The clearest example in the corpus is constraint reasoning: when researchers stripped the constraints out of problems, twelve of fourteen models got *worse*, dropping by up to 38.5 percentage points. That is the opposite of what genuine reasoning predicts, since removing a constraint should, if anything, make a problem easier (see "Are models actually reasoning about constraints or just defaulting conservatively?"). On a standard benchmark those same models look like they're reasoning about the constraints. The asymmetry between the paired conditions is what gives the game away: they were defaulting to the harder-looking answer, not evaluating anything. Accuracy measures whether the model got it right; the minimal pair measures *why*.
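The paired comparison described above can be sketched in a few lines. This is a minimal illustration, not the cited study's harness: `PairedItem`, `minimal_pair_gap`, and the toy `always_hard` model are hypothetical names introduced here, and the model is assumed to be any callable that maps a prompt string to an answer string.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]  # assumption: prompt in, answer string out


@dataclass(frozen=True)
class PairedItem:
    """One minimal pair: the same problem with and without its constraint."""
    with_constraint: str
    without_constraint: str
    gold_with: str
    gold_without: str


def accuracy(model: Model, prompts: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of prompts whose answer exactly matches the gold label."""
    correct = sum(model(p) == g for p, g in zip(prompts, golds))
    return correct / len(prompts)


def minimal_pair_gap(model: Model, items: Sequence[PairedItem]) -> Tuple[float, float, float]:
    """Return (accuracy with constraints, accuracy without, difference)."""
    acc_with = accuracy(
        model,
        [i.with_constraint for i in items],
        [i.gold_with for i in items],
    )
    acc_without = accuracy(
        model,
        [i.without_constraint for i in items],
        [i.gold_without for i in items],
    )
    return acc_with, acc_without, acc_with - acc_without


def always_hard(prompt: str) -> str:
    """Toy stand-in for a shortcutting model: ignores the prompt entirely."""
    return "hard"


# Hypothetical items: the constraint makes "hard" correct; without it, "easy" is.
items = [
    PairedItem("task A [constraint]", "task A", gold_with="hard", gold_without="easy"),
    PairedItem("task B [constraint]", "task B", gold_with="hard", gold_without="easy"),
]

acc_with, acc_without, gap = minimal_pair_gap(always_hard, items)
print(f"with={acc_with:.2f} without={acc_without:.2f} gap={gap:.2f}")
```

On the constrained condition alone, the toy model looks perfect. Only the second condition shows that it never read the constraint.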

The deeper point is that a single accuracy figure is a lossy summary of what's happening inside the model. The corpus has a striking demonstration that two models can post identical scores while one has clean internal organization and the other is internally fractured: all the task-relevant features are linearly decodable, yet the structure is broken in ways that only surface under perturbation or distribution shift (see "Can models be smart without organized internal structure?"). Standard evaluation is blind to that difference by construction. Minimal-pair asymmetry is one of the few cheap, behavioral ways to probe the same hidden gap without cracking the model open.

The same blindness shows up in other failure modes the corpus catalogs. Models asked to perform iterative numerical optimization emit plausible, template-shaped answers that are simply wrong: they recognize the *shape* of the problem and pattern-match a memorized solution rather than executing the procedure (see "Do large language models actually perform iterative optimization?"). A pass/fail accuracy check on a few problems can miss this entirely; you only catch it by varying the inputs and watching the answers fail to track the changes the way real computation would. The common thread: high accuracy plus the wrong *response to controlled variation* indicates a model that's shortcutting.
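One way to picture "varying the inputs and watching whether the answers track" is the toy check below. Everything in it is an assumption made for illustration: the task (minimizing f(x) = a*x**2 + b*x with a > 0, whose minimizer is -b / (2a)), the `tracking_rate` helper, and the two stand-in solvers. It does not reproduce the cited note's experiments.

```python
from typing import Callable, Sequence, Tuple

Quadratic = Tuple[float, float]  # (a, b) for f(x) = a*x**2 + b*x, with a > 0
Solver = Callable[[Quadratic], float]


def true_minimizer(q: Quadratic) -> float:
    """Closed-form minimizer of a*x**2 + b*x."""
    a, b = q
    return -b / (2 * a)


def tracking_rate(solver: Solver, cases: Sequence[Quadratic], tol: float = 1e-6) -> float:
    """Fraction of cases where the solver's answer follows the true minimizer."""
    hits = sum(abs(solver(q) - true_minimizer(q)) <= tol for q in cases)
    return hits / len(cases)


def template_solver(q: Quadratic) -> float:
    """Stand-in for pattern matching: returns the memorized answer for (1, -2)."""
    return 1.0


def gradient_descent_solver(q: Quadratic, lr: float = 0.1, steps: int = 500) -> float:
    """Stand-in for real computation: plain gradient descent from x = 0."""
    a, b = q
    x = 0.0
    for _ in range(steps):
        x -= lr * (2 * a * x + b)  # derivative of a*x**2 + b*x
    return x


canonical = [(1.0, -2.0)]  # the one instance a spot check might use
varied = [(1.0, -2.0), (1.0, -4.0), (2.0, -2.0), (0.5, 3.0)]

spot_check = tracking_rate(template_solver, canonical)
template_rate = tracking_rate(template_solver, varied)
descent_rate = tracking_rate(gradient_descent_solver, varied)
print(spot_check, template_rate, descent_rate)
```

The template solver passes the single spot check and fails as soon as the coefficients move, while the iterative solver follows every variation.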

There's an adjacent lesson worth pulling in, because it's the same illusion from a different angle. Fixing the seed and setting temperature to zero makes outputs perfectly consistent, and people read consistency as reliability. But a deterministic output is still just one draw from a probability distribution, and repeated testing shows that consistency and correctness are not the same property (see "Does setting temperature to zero actually make LLM outputs reliable?"). Minimal-pair asymmetry and reliability testing are cousins: both refuse to take a clean-looking single number at face value and instead ask what *moves* when you change the conditions around it.
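The gap between consistency and correctness is easy to show with two separate measurements. The sketch below uses a simple modal-agreement rate as a stand-in for consistency; it is not McDonald's omega, the statistic the cited note reports, and the repeated outputs and gold answer are made-up placeholders.

```python
from collections import Counter
from typing import Sequence


def consistency(outputs: Sequence[str]) -> float:
    """Share of runs that agree with the most common output."""
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)


def correctness(outputs: Sequence[str], gold: str) -> float:
    """Share of runs that match the gold answer."""
    return sum(o == gold for o in outputs) / len(outputs)


# Hypothetical: 100 repetitions at temperature zero, all identical, all wrong.
runs = ["42"] * 100
gold = "41"

consistency_score = consistency(runs)
correctness_score = correctness(runs, gold)
print(consistency_score, correctness_score)
```

A perfect consistency score alongside a correctness score of zero is exactly the case that reading consistency as reliability would miss.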

So the distinction isn't 'minimal-pair is a harder benchmark.' It's a different *kind* of measurement. Standard accuracy asks 'is the output correct?' Minimal-pair asymmetry asks 'does the output respond to changes the way a genuinely reasoning system would?' — and that second question is the one that separates a model that understands from a model that has learned which answer usually wins.

Sources (4 notes)

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
