Can proxy evaluation of ideas accurately predict their quality without implementation?
This explores whether we can judge an idea's quality from the idea alone — proxy evaluation by humans or LLMs — or whether real quality only shows up once someone actually builds it.
This reads the question as: can a cheap stand-in for execution (an expert skim, an LLM judge, a novelty score) tell us whether an idea is actually good? The corpus's sharpest answer is a warning. When 43 expert researchers spent 100+ hours implementing randomly assigned research ideas, the AI-generated ideas, which had scored as *more* novel than human ones at the proposal stage, declined significantly more than the human ideas on every metric once executed. Execution surfaced impractical evaluation designs and missing technical groundwork that were invisible at the idea stage Do LLM research ideas actually hold up when experts try to execute them?. So the headline finding is that proxy evaluation can be confidently, systematically wrong, and wrong in the optimistic direction.
Why does the gap open? A recurring theme is that evaluators reward *form* over *substance*. Imitation-trained models fool human judges by adopting a confident, fluent style while closing no real capability gap Can imitating ChatGPT fool evaluators into thinking models improved?. Chains of thought that are logically invalid score nearly as well as valid ones, because the model — and the evaluator — latches onto the look of reasoning rather than genuine inference Does logical validity actually drive chain-of-thought gains?. Models fine-tuned on labeled quality examples learn surface patterns instead of principled criteria and fail to transfer to new argument types Can models learn argument quality from labeled examples alone?. The pattern across all three: cheap proxies measure surface signals that correlate with quality on familiar cases and break exactly where you most need them — on the genuinely novel.
But the corpus doesn't say proxy evaluation is hopeless. It says *holistic* proxy evaluation is the problem, and structure is the fix. Decompose the judgment and reliability returns. A three-stage novelty pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, beating holistic LLM baselines Can structured pipelines make LLM novelty assessment reliable?. Breaking instruction quality into verifiable sub-criteria via checklists reduces overfitting to superficial artifacts Can breaking down instructions into checklists improve AI reward signals?, and prompt quality itself turns out to have six measurable dimensions rather than one vibe Can we measure prompt quality independent of model outputs?. Agentic evaluation that actively collects evidence cut judge shift from 31% for a single LLM judge to 0.27%, a reduction of more than 100x, though its memory module cascaded errors, a reminder that the evaluator can introduce its own failure modes Can agents evaluate AI outputs more reliably than language models?.
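The three-stage decomposition described above (extract claims, retrieve related work, compare) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the pipeline from the cited note: the `PRIOR_WORK` index, the sentence-splitting claim extractor, the Jaccard token overlap used for retrieval, and the 0.5 threshold are all hypothetical stand-ins. The point is only the shape: each claim gets its own verdict with its evidence attached, instead of one holistic score.

```python
from dataclasses import dataclass

# Hypothetical toy index of prior work; a real pipeline would query a literature database.
PRIOR_WORK = {
    "paper_a": "contrastive pretraining improves retrieval for long documents",
    "paper_b": "checklist rewards reduce overfitting in instruction following",
}


@dataclass
class ClaimVerdict:
    claim: str
    closest_prior: str  # id of the most similar prior work, "" if nothing overlaps
    overlap: float      # Jaccard token overlap in [0, 1]
    novel: bool


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def extract_claims(submission: str) -> list[str]:
    """Stage 1 (toy): treat each sentence as one claim."""
    return [s.strip() for s in submission.split(".") if s.strip()]


def retrieve_related(claim: str) -> tuple[str, float]:
    """Stage 2 (toy): return the prior work with the highest Jaccard token overlap."""
    claim_tokens = _tokens(claim)
    best_id, best_score = "", 0.0
    for paper_id, abstract in PRIOR_WORK.items():
        prior_tokens = _tokens(abstract)
        score = len(claim_tokens & prior_tokens) / len(claim_tokens | prior_tokens)
        if score > best_score:
            best_id, best_score = paper_id, score
    return best_id, best_score


def compare(claim: str, threshold: float = 0.5) -> ClaimVerdict:
    """Stage 3 (toy): call a claim novel if no prior work overlaps above the threshold."""
    paper_id, score = retrieve_related(claim)
    return ClaimVerdict(claim, paper_id, score, novel=score < threshold)


def assess_novelty(submission: str, threshold: float = 0.5) -> list[ClaimVerdict]:
    """Run all three stages and return one auditable verdict per claim."""
    return [compare(claim, threshold) for claim in extract_claims(submission)]


submission = (
    "Checklist rewards reduce overfitting in instruction following. "
    "Quantum annealing schedules protein folding experiments."
)
for verdict in assess_novelty(submission):
    print(verdict)
```

Because every verdict names the prior work it was compared against, a reviewer can dispute a single stage (a missed claim, a bad retrieval, a wrong comparison) rather than arguing with an opaque overall score.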
The deeper cross-cutting insight comes from the self-improvement work: there's a structural *generation–verification gap*. Pure self-improvement stalls because a system's ability to judge an idea is fundamentally weaker than its ability to produce one, and reliable methods only work by smuggling in external anchors — past versions, third-party judges, user corrections, tool feedback Can models reliably improve themselves without external feedback?. Implementation is the ultimate external anchor. This also reframes where ideation effort should go: multi-agent ideation only beats solo work when the agents carry real senior domain expertise; diversity without grounded knowledge underperforms a single competent agent Does cognitive diversity alone improve multi-agent ideation quality?. Expertise is, in effect, an internalized proxy for what execution would reveal.
What you didn't know you wanted to know: even simulated *humans* follow this gradient. AI persona panels replicated 84 of 111 (about 76%) published main effects from Journal of Marketing experiments, but their success correlated with the strength of the original p-value: they nailed the strong, obvious effects and turned unreliable on the marginal ones, producing both false positives and false negatives Can AI personas reliably replicate human experiment results?. That's the through-line of the whole corpus: proxy evaluation is trustworthy in proportion to how obvious the answer already was. It predicts quality well where you needed it least, and fails where the idea is novel, marginal, or untested, which is precisely the territory where you were hoping the proxy would save you the cost of building.
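Two of the headline figures in this answer can be checked directly against the counts quoted in the source notes: the roughly 76% replication rate (84 of 111 main effects) and the roughly 100x reduction in judge shift (31% versus 0.27%). The variable names below are illustrative only; the numbers are the ones the notes report.

```python
# Figures as quoted in the source notes for this answer.
replicated, total = 84, 111
replication_rate = replicated / total

llm_judge_shift_pct = 31.0   # single LLM-as-a-Judge
agentic_shift_pct = 0.27     # eight-module agentic evaluation
reduction_factor = llm_judge_shift_pct / agentic_shift_pct

print(f"replication rate: {replication_rate:.1%}")        # 75.7%
print(f"judge-shift reduction: {reduction_factor:.0f}x")  # 115x
```

Both rounded claims hold up: 84 of 111 is 75.7%, and 31% against 0.27% is a factor of about 115.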
Sources (11 notes)
When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
Research identifies six evaluable dimensions—Communication, Cognition, Instruction, Logic, Hallucination, and Responsibility—with 20 sub-criteria based on Grice, cognitive load theory, and instructional design. Improvements in one dimension cascade to others, revealing prompt quality as a structured space rather than a flat checklist.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments, with replication success strongly correlated with the strength of the original p-value. Replication of marginal effects was unreliable, producing both false positives and false negatives.