Can diverse expert demonstrations exceed the knowledge of any single expert?
This explores whether a model trained on many imperfect experts can outperform every individual it learned from — and where that 'wisdom of crowds' effect breaks down.
The corpus gives a surprisingly direct yes, with sharp caveats. The clearest case: a model trained on many experts who each carry different biases can converge toward a consensus that beats all of them. Ordinary cross-entropy training makes the model fit the pooled distribution of expert actions, which amounts to an implicit majority vote; low-temperature sampling exposes that vote by picking the modal action, denoising the uncorrelated errors individuals make at critical decision points [Can models trained on many imperfect experts outperform each one?]. The mechanism is the same one behind crowd wisdom: if experts err in different directions, averaging cancels the noise and the signal survives.
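The denoising argument can be made concrete with a toy simulation. This is a minimal sketch under loudly stated assumptions, not the setup of any cited note: every expert sees every state, each expert errs independently with the same probability, wrong actions are chosen uniformly, and the "model" is a tabular learner that fits the pooled action counts exactly (the cross-entropy optimum for pooled demonstrations). The function name `simulate` and all parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate(n_experts=15, n_states=2000, n_actions=4, error_rate=0.35, seed=0):
    """Compare the best single expert with the mode of the pooled demonstrations.

    Assumptions (illustrative only): independent expert errors, uniform choice
    among wrong actions, and a tabular model that matches the empirical action
    distribution in every state.
    """
    rng = random.Random(seed)
    expert_hits = [0] * n_experts
    consensus_hits = 0

    for _ in range(n_states):
        correct = rng.randrange(n_actions)
        votes = []
        for e in range(n_experts):
            if rng.random() < error_rate:
                # An uncorrelated mistake: pick any wrong action uniformly.
                wrong = [a for a in range(n_actions) if a != correct]
                action = rng.choice(wrong)
            else:
                action = correct
            votes.append(action)
            expert_hits[e] += action == correct

        # Cross-entropy on pooled data is minimized by the empirical action
        # distribution; the temperature -> 0 limit of sampling picks its mode.
        counts = Counter(votes)
        mode = max(range(n_actions), key=lambda a: counts[a])
        consensus_hits += mode == correct

    best_expert = max(expert_hits) / n_states
    consensus = consensus_hits / n_states
    return best_expert, consensus


if __name__ == "__main__":
    best_expert, consensus = simulate()
    print(f"best single expert accuracy: {best_expert:.3f}")
    print(f"low-temperature consensus accuracy: {consensus:.3f}")
```

Under these assumptions the consensus accuracy comes out well above that of the best individual expert. The gap depends on the errors being uncorrelated: if the experts shared a bias, the mode would inherit it rather than cancel it.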
But the gain depends entirely on diversity being *grounded*. Studies of multi-agent ideation show that cognitive diversity lifts quality only when the agents actually possess senior domain knowledge; diverse-but-shallow teams underperform even a single competent agent, because stimulation without expertise produces process loss rather than insight [Does cognitive diversity alone improve multi-agent ideation quality?]. So diversity is a multiplier, not a substitute. The same theme appears in training: agents kept diverse through role specialization (generators and critics trained on distinct data) avoid the collapse that limits single-agent finetuning to one productive round [Can multiple agents stay diverse during training together?], and critique models inside the loop actively preserve solution diversity that would otherwise narrow toward a single mode [Do critique models improve diversity during training itself?].
There's also a quieter, deeper result here: you may not even need a crowd. Critique fine-tuning on a *single* problem, using teacher critiques that contrast correct and incorrect reasoning, can activate reasoning comparable to full reinforcement learning [Can a single problem unlock reasoning through solution critique?]. And adversarial policy-critic training can recover an implicit reward signal from demonstrations alone, matching verifier-based methods and extending to domains where no automatic checker exists [Can reasoning emerge from expert demonstrations alone?]. This reframes the question: it's less that quantity of experts adds knowledge, and more that *contrast between good and bad* is the active ingredient. Sequencing matters too: imitating first to build a foundation, then refining against rewards, beats either phase alone [Does sequencing imitation then exploration training improve reasoning?].
The ceiling, though, is real. Demonstrations lock a learner into the imagination of whoever curated them: a model trained only on static expert data never meets the failures its experts didn't anticipate, so competence is capped by what curators imagined, not by what the agent could discover [Can agents learn beyond what their training data shows?]. Transcending individuals is not the same as transcending the *set*. The crowd denoises errors inside the demonstrated space; it cannot invent the parts of the space no expert ever showed.
And the corpus pushes back on the framing itself. A strand of notes argues that 'expert knowledge' was never just a stock of facts to be averaged. Expertise is role performance: knowing when to speak, when to defer, and how a claim will land with a specific audience [Is expertise really just knowing more than others?] [Can AI replicate the communicative work experts do?]. A model can pool the factual layer and even exceed any single expert on it [Can AI anticipate whether expert claims will be socially valid?], yet still miss the social calculus that gave those claims their force in the first place [Can language models distinguish expert arguments from common assumptions?]. So the honest answer is layered: yes, on the denoisable knowledge experts share, diversity genuinely exceeds the individual, but only within the curated frame, and only for the part of expertise that was ever reducible to demonstration.
Sources (12 notes)
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Training generation and critic agents on distinct role-dependent data prevents the overfitting collapse that limits single-agent finetuning to one productive iteration. Removing critics or summarization degrades performance, confirming both components are critical.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.
RARO recovers implicit reward functions from expert demonstrations through adversarial co-training between a reasoning policy and relativistic critic. This approach matches verifier-based RL performance on reasoning tasks while extending to domains lacking automated verification.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Real expertise involves situational judgment—knowing when to speak, when to defer, which knowledge applies now, and how to communicate it to a specific audience. This role-performance dimension is at least as important as the underlying knowledge stock, and it is what AI cannot structurally perform.
Expertise requires anticipating audience acceptability and social validity, not just retrieving information. AI lacks the mechanism to perform this communicative work, making its fluent output epistemically misleading despite its confident form.
Expert claims are validity claims that succeed when both factually correct and socially acceptable within a community. AI can estimate statistical correctness but cannot anticipate contextual acceptability because it lacks embedded knowledge of expert communities' evolving standards.
LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.