Can generative and discriminative models reach agreement?
Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?
Language models offer two fundamentally different ways to answer questions. Generatively: produce the most probable answer directly. Discriminatively: score candidate answers and pick the best. These two procedures often disagree. Generative decoding fails when probability mass spreads across multiple contradicting answers; discriminative decoding fails due to miscalibration or sensitivity to question wording. Both are noisy, but their failure modes are largely uncorrelated, which is exactly the condition under which combining them can help.
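As a concrete illustration, here is a minimal sketch of the two procedures over a fixed candidate set. The helper signature `lm_logprob(prompt, continuation)` and the yes/no verdict prompt are illustrative assumptions, not an interface from the paper.

```python
from typing import Callable

# Assumed interface: returns the LM's total log-probability of
# `continuation` given `prompt` (e.g., summed token log-probs).
LogProbFn = Callable[[str, str], float]

def generative_score(lm_logprob: LogProbFn, question: str, answer: str) -> float:
    # Generative mode: how likely is the model to produce this answer?
    return lm_logprob(question, answer)

def discriminative_score(lm_logprob: LogProbFn, question: str, answer: str) -> float:
    # Discriminative mode: shown the answer, does the model judge it correct?
    verdict_prompt = f"Q: {question}\nProposed answer: {answer}\nIs this answer correct?"
    return lm_logprob(verdict_prompt, " Yes")

def decode_both_ways(lm_logprob: LogProbFn, question: str,
                     candidates: list[str]) -> tuple[str, str]:
    # The two procedures can, and often do, pick different candidates.
    gen_pick = max(candidates, key=lambda a: generative_score(lm_logprob, question, a))
    disc_pick = max(candidates, key=lambda a: discriminative_score(lm_logprob, question, a))
    return gen_pick, disc_pick
```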
The Consensus Game formalizes this as a regularized imperfect-information sequential signaling game. A Generator agent must communicate an abstract correct/incorrect value to a Discriminator agent, but can only do so using natural language strings from a candidate set. An effective joint policy is one where both agents agree on which strings map to "correct." The resulting decoding algorithm — Equilibrium-Ranking — finds approximate equilibria of this game.
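To make the mechanics concrete, below is a simplified numpy sketch of a KL-regularized alternating best-response loop over a candidate set, in the spirit of Equilibrium-Ranking. The paper's actual algorithm runs piKL no-regret dynamics; the exact update rule, the regularization weight `lam`, and the final score combination below are simplifying assumptions.

```python
import numpy as np

def equilibrium_ranking_sketch(p_gen: np.ndarray, p_disc: np.ndarray,
                               lam: float = 0.1, iters: int = 200) -> np.ndarray:
    """KL-regularized alternating best responses over K candidates.

    p_gen:  shape (2, K), initial generator policy pi_G(y | v),
            rows indexed by v: 0 = "correct", 1 = "incorrect".
    p_disc: shape (K, 2), initial discriminator policy pi_D(v | y).
    Returns a consensus score per candidate (higher is better).
    """
    g, d = p_gen.copy(), p_disc.copy()
    for _ in range(iters):
        # Generator: given intended value v, favor candidates the
        # discriminator currently labels v. A KL penalty toward the
        # initial LM policy yields the closed form pi* ∝ pi_init * exp(u / lam).
        g = p_gen * np.exp(d.T / lam)
        g /= g.sum(axis=1, keepdims=True)

        # Discriminator: guess the v the generator most plausibly intended,
        # i.e. the posterior over v given y (uniform prior over v),
        # with the same KL regularization toward its initial policy.
        posterior = g.T / g.T.sum(axis=1, keepdims=True)   # shape (K, 2)
        d = p_disc * np.exp(posterior / lam)
        d /= d.sum(axis=1, keepdims=True)

    # Combine both players' "correct" scores; the exact combination rule
    # is a design choice in this sketch.
    return g[0] * d[:, 0]
```

The regularization term is what makes the game nontrivial: without the pull toward the initial LM policies, the two players could agree on an arbitrary labeling of candidates; with it, the equilibrium must stay close to what the model actually believes in each mode.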
The results are striking: LLaMA-7B with Equilibrium-Ranking outperforms LLaMA-65B and PaLM-540B on multiple benchmarks spanning reading comprehension, commonsense reasoning, mathematical problem-solving, and dialogue. A 7B model matching a 540B model is a ~77x parameter efficiency gain.
The insight is that generative and discriminative procedures contain complementary information. Neither alone captures the model's "best guess at the truth." The game-theoretic framework extracts a consensus signal that is more reliable than either procedure individually — analogous to how ensemble methods combine weak learners, but operating within a single model's two modes of operation.
This is a training-free method: no fine-tuning is required. The computational overhead comes from finding the equilibrium at inference time, making it a form of test-time compute scaling. In the framing of "Can inference compute replace scaling up model size?", Equilibrium-Ranking provides a concrete mechanism: the test-time compute goes into reconciling the model's own internal disagreements rather than into generating longer reasoning chains.
The connection to multi-agent debate is suggestive. Where "Why do multi-agent LLM systems converge without real debate?" finds that agents often accommodate rather than deliberate, the Consensus Game forces genuine deliberation between two perspectives (generative and discriminative) within a single model: the equilibrium constraint prevents premature convergence because both agents must independently arrive at consistent signals. And relative to "When does debate actually improve reasoning accuracy?", the Consensus Game sidesteps the evidence-verification problem that plagues inter-model debate. Both "agents" operate within the same model's knowledge, so there is no risk of one agent persuading the other with rhetorically superior but factually wrong arguments; the equilibrium constraint forces agreement on what the model actually knows rather than on what it can argue most convincingly.
Source: Question Answer Search
Related concepts in this collection
- "Can inference compute replace scaling up model size?" Explores whether smaller models given more thinking time during inference can match larger models; matters because it reshapes deployment economics and compute allocation strategies. Connection: Equilibrium-Ranking is a specific mechanism, with test-time compute spent reconciling internal disagreements.
- "Why do multi-agent LLM systems converge without real debate?" When multiple AI agents reason together, do they genuinely deliberate or just accommodate each other's views? Research into clinical reasoning systems reveals how often agents reach agreement without substantive disagreement. Connection: the game-theoretic equilibrium prevents premature convergence.
- "Why does parallel reasoning outperform single chain thinking?" Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth. Connection: the Consensus Game implicitly parallelizes by running both the generative and discriminative procedures.
- "Can disagreement be resolved without either party fully yielding?" Explores whether dialogue can move past winner-take-all debate or forced consensus to genuine mutual adjustment; matters for AI systems that need to work through real disagreement with users. Connection: the Consensus Game is mechanistic dialectical reconciliation within a single model.
- "Can models trained on many imperfect experts outperform each one?" Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases. Connection: a training-time analog; transcendence extracts consensus from diverse human experts encoded in weights, while the Consensus Game extracts consensus between a single model's generative and discriminative modes. Both demonstrate that aggregation over diverse perspectives outperforms any single perspective.
- "When does debate actually improve reasoning accuracy?" Multi-agent debate shows promise for reasoning tasks, but under what conditions does it help versus hurt? The research explores whether debate amplifies errors when evidence verification is missing. Connection: the Consensus Game sidesteps debate's evidence-verification problem; both "agents" share the same knowledge, so the equilibrium forces agreement on actual knowledge rather than rhetorical persuasion.
Original note title: game-theoretic equilibrium between generative and discriminative LM decoding reconciles their inconsistent predictions — small models with consensus match models 100x larger