Why do generative and discriminative language model procedures disagree?

This explores why a model asked to *generate* an answer and a model asked to *judge* (discriminate between) candidate answers often land in different places, and what that gap reveals about how language models work.

The same underlying network, run through two different procedures, can return two different verdicts. The corpus frames this less as a bug than as a structural fact about how these systems compute. The cleanest statement of it is the generation-verification gap: a model's ability to *recognize* a good answer outruns its ability to *produce* one. That asymmetry is why self-improvement hits a ceiling: every reliable fix needs something external to validate it, because the generator and the verifier inside one model don't fully agree (see "What stops large language models from improving themselves?"). Disagreement, in other words, is baked into the model's two modes of use.

Why would the same weights disagree with themselves? One answer is that there is no single committed answer to begin with. A language model holds a *superposition* of plausible continuations and samples from it: regenerate the same prompt and you get different outputs, each internally consistent (see "Do large language models actually commit to a single character?"). The generative procedure draws one sample; the discriminative procedure scores candidates against the whole distribution. The two read off different aspects of the same cloud, so they can rank answers differently. The same split shows up below the surface: models trained with hidden chain-of-thought tokens can compute a correct answer in early layers and then suppress it to produce format-compliant output, so the generated token can disagree with the model's own internal judgment, which remains recoverable from lower-ranked predictions (see "Do transformers hide reasoning before producing filler tokens?").
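The split between drawing one sample and scoring every candidate can be made concrete with a toy example. The sketch below is illustrative only: the three candidate answers, both probability tables, and the function names are hypothetical and do not come from any model or from the notes cited here.

```python
import random

# Hypothetical numbers for one prompt with three candidate answers.
# p_generate[y]: probability that the generative procedure emits answer y.
p_generate = {"A": 0.50, "B": 0.30, "C": 0.20}
# p_correct[y]: probability the same model assigns to "y is correct"
# when it is asked to judge y instead of producing it.
p_correct = {"A": 0.35, "B": 0.80, "C": 0.40}


def generate(rng):
    """Generative procedure: draw a single answer from the distribution."""
    answers = list(p_generate)
    weights = [p_generate[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]


def most_likely_generation():
    """The answer the generator would emit under greedy decoding."""
    return max(p_generate, key=p_generate.get)


def discriminate():
    """Discriminative procedure: score every candidate and keep the best."""
    return max(p_correct, key=p_correct.get)


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [generate(rng) for _ in range(10)]
generator_pick = most_likely_generation()
judge_pick = discriminate()

print("sampled answers:", samples)
print("generator's most likely answer:", generator_pick)
print("judge's preferred answer:", judge_pick)
```

With these made-up numbers the most likely generated answer is A while the judge prefers B, which is the kind of disagreement described above. Repeated sampling also returns different answers from the same table, mirroring the point that no single answer is committed in advance.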

The most direct treatment reframes the disagreement as a *game* rather than a defect. The Consensus Game casts decoding as a signaling game in which a generator and a discriminator must converge on the same answer; finding their equilibrium (Equilibrium-Ranking) lets a 7B model match a 540B one with no fine-tuning (see "Can generative and discriminative models reach agreement?"). The premise is that each procedure carries information the other lacks, and reconciling them recovers accuracy neither had alone. That is the optimistic reading: disagreement is signal you can mine.
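One way to picture the reconciliation step is as two policies that repeatedly nudge toward each other while staying anchored to where they started. The sketch below is a simplified stand-in for that idea, not the Equilibrium-Ranking algorithm as published: the update rule, the parameters `lam` and `eta`, the candidate set, and all probabilities are assumptions chosen for illustration.

```python
import math

CANDIDATES = ["A", "B", "C"]
LABELS = ["correct", "incorrect"]

# Hypothetical initial policies (made-up numbers).
# gen0[v][y]: generator's probability of answer y when asked for a v answer.
gen0 = {
    "correct": {"A": 0.50, "B": 0.30, "C": 0.20},
    "incorrect": {"A": 0.30, "B": 0.30, "C": 0.40},
}
# disc0[y][v]: discriminator's probability that answer y carries label v.
disc0 = {
    "A": {"correct": 0.35, "incorrect": 0.65},
    "B": {"correct": 0.80, "incorrect": 0.20},
    "C": {"correct": 0.40, "incorrect": 0.60},
}


def softmax(scores):
    """Normalize a dict of scores into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def consensus(gen_init, disc_init, steps=200, lam=0.1, eta=0.1):
    """Assumed update: each player is rewarded for matching the other's
    average past policy and penalized (weight lam) for leaving its start."""
    gen = {v: dict(p) for v, p in gen_init.items()}
    disc = {y: dict(p) for y, p in disc_init.items()}
    # Running sums of the other player's policy over all past steps.
    q_gen = {v: {y: 0.0 for y in CANDIDATES} for v in LABELS}
    q_disc = {y: {v: 0.0 for v in LABELS} for y in CANDIDATES}
    for t in range(1, steps + 1):
        for v in LABELS:
            for y in CANDIDATES:
                q_gen[v][y] += disc[y][v]
                q_disc[y][v] += gen[v][y]
        temp = 1.0 / (eta * t) + lam  # shrinks toward lam as t grows
        new_gen = {
            v: softmax({y: (q_gen[v][y] / t + lam * math.log(gen_init[v][y])) / temp
                        for y in CANDIDATES})
            for v in LABELS
        }
        new_disc = {
            y: softmax({v: (q_disc[y][v] / t + lam * math.log(disc_init[y][v])) / temp
                        for v in LABELS})
            for y in CANDIDATES
        }
        gen, disc = new_gen, new_disc
    return gen, disc


gen, disc = consensus(gen0, disc0)
by_generator = sorted(CANDIDATES, key=lambda y: gen["correct"][y], reverse=True)
by_discriminator = sorted(CANDIDATES, key=lambda y: disc[y]["correct"], reverse=True)
print("ranking by generator:", by_generator)
print("ranking by discriminator:", by_discriminator)
```

The starting tables disagree on purpose: the generator puts A first and the discriminator puts B first. Printing both rankings after the updates shows whether, under these assumed dynamics, the two policies have moved toward a shared ordering. The anchoring term is the design choice that matters here, since without it the players could agree on an answer neither originally found plausible.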

There's a contrasting thread on *which* procedure to trust. When verification is reframed generatively, letting the judge reason in chain-of-thought before ruling, generative process reward models beat discriminative scorers with far less labeled data: a 1.5B GenPRM outperforms GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using only 1% of the PRM800K labels (see "Can generative reasoning beat discriminative models with less training data?"). So the gap isn't symmetric: a discriminator that merely pattern-matches a verdict can be worse than a generator that reasons its way to one. Disagreement partly tracks which procedure is allowed to *think*.

Underneath all of this is a shared root cause: both procedures are autoregressive probability machines, and their failures are predictable from response probability rather than logical difficulty (see "Can we predict where language models will fail?"). Where strong training priors dominate, a model will override its own context, and the generated output drifts from what discrimination over the evidence would favor (see "Why do language models ignore information in their context?"). Read together, the corpus suggests the two procedures disagree because they sample the same probability landscape under different constraints. The practical art is deciding whether to reconcile them (the Consensus Game), let the better one reason (generative verifiers), or treat the gap itself as the hard limit on what a model can fix about itself.
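The phrase "response probability" can be read through the chain rule: an autoregressive model scores a response as the product of its per-token probabilities, so a target made of individually unlikely tokens is unlikely overall, however simple the task looks. The sketch below uses hypothetical per-token probabilities (not measured from any model) to show the computation.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a response: sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for two four-token targets.
# A familiar ordering gets high probability at every step...
forward_alphabet = [0.9, 0.9, 0.9, 0.9]
# ...while a rarely seen ordering gets low probability at every step.
backward_alphabet = [0.4, 0.3, 0.3, 0.2]

forward_score = sequence_log_prob(forward_alphabet)
backward_score = sequence_log_prob(backward_alphabet)

print("forward log-prob:", round(forward_score, 3))
print("backward log-prob:", round(backward_score, 3))
```

Both targets are equally easy to state as tasks, yet under these assumed numbers the backward one scores far lower, which is the sense in which difficulty tracks probability rather than logic.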

Sources (7 notes)

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can generative and discriminative models reach agreement?

The Consensus Game frames decoding as a signaling game where generator and discriminator must agree on answers. Equilibrium-Ranking finds their joint policy, enabling 7B models to match 540B model performance without fine-tuning.

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
