INQUIRING LINE

Do novelty and feasibility always trade off in idea generation?

This explores whether the novelty-feasibility trade-off in idea generation is an iron law or an artifact of how we measure and structure ideation. The corpus suggests it is softer and more decomposable than it looks.


The question is whether novelty and feasibility *always* pull against each other when generating ideas, or whether that tension is conditional. The pattern across the corpus is real but not absolute. On one side, LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty than crowdsourced human solutions (see "Why do LLMs excel at feasible design but struggle with novelty?"). On the other, two notes report the mirror image: LLM research ideas rate as *more* novel than human experts' ideas but slightly less feasible (see "Do language models generate more novel research ideas than experts?" and "Why do LLMs generate more novel research ideas than experts?"). So the trade-off shows up in both directions depending on the task, which is the first clue that it isn't a fixed law of nature.

The more interesting move is *why* the trade-off appears. One reading is that novelty without disciplinary constraint is cheap: LLMs explore wider conceptual combinations precisely because they don't carry the expert's instinct for what won't work (see "Can LLMs generate more novel ideas than human experts?"). The cost surfaces only at execution. When 43 expert researchers implemented randomly assigned ideas over 100+ hours, the LLM-generated ideas declined significantly more than the human ideas across all metrics, revealing impractical evaluation designs and missing technical groundwork that were invisible at the ideation stage (see "Do LLM research ideas actually hold up when experts try to execute them?"). Read this way, novelty and feasibility don't trade off at the moment of generation; the bill for feasibility just comes due later. The 'trade-off' is partly a deferral.

There's also a measurement story that should make you suspicious of treating the trade-off as fundamental. In the closely related exploration-exploitation framing, hidden-state analysis found near-zero correlation between the two: the apparent trade-off emerges only when you measure at the token level, and a method that works at the hidden-state level (VERL) improved both at once (see "Is the exploration-exploitation trade-off actually fundamental?"). The same caution applies here: if novelty and feasibility look opposed, it may be because of how they're scored, not because of an underlying conflict. Tellingly, LLMs that produce individually novel ideas often cluster them in narrow regions, giving high novelty per idea but low diversity across ideas (see "Why do LLMs generate novel ideas from narrow ranges?"). That is not what you'd expect if a clean novelty-feasibility dial were the whole picture.
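The gap between per-idea novelty and across-idea diversity is easier to see with a toy calculation. The sketch below is illustrative only: it assumes each idea is already embedded as a point in some vector space, and it uses two hypothetical definitions that do not come from any of the cited notes (novelty as distance to the nearest piece of prior work, diversity as the average pairwise distance within the idea set). The coordinates are made up.

```python
import math
from itertools import combinations
from statistics import mean


def mean_novelty(ideas, prior_work):
    """Average distance from each idea to its nearest prior-work point.

    Hypothetical novelty proxy: far from all prior work = novel.
    """
    return mean(min(math.dist(idea, p) for p in prior_work) for idea in ideas)


def diversity(ideas):
    """Average pairwise distance among the ideas themselves.

    Hypothetical diversity proxy: ideas far from each other = diverse.
    """
    return mean(math.dist(a, b) for a, b in combinations(ideas, 2))


# Made-up 2-D "embeddings", purely for illustration.
prior_work = [(0.0, 0.0), (1.0, 0.0)]
clustered = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]   # novel, but all in one spot
spread = [(5.0, 5.0), (-5.0, 5.0), (0.0, -6.0)]    # novel and far apart

for name, ideas in [("clustered", clustered), ("spread", spread)]:
    print(name, round(mean_novelty(ideas, prior_work), 2), round(diversity(ideas), 2))
```

Both sets land far from prior work, so both look novel idea by idea, yet only the spread set covers much ground. A single novelty score cannot tell the two apart, which is one reason a one-dimensional novelty-feasibility dial undersells what is going on.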

What seems to actually shift the frontier is structure and expertise rather than a sacrifice of one axis for the other. Cognitive diversity in multi-agent teams improves ideation quality, but *only* when members carry genuine domain expertise; diversity without expertise underperforms even a single competent agent (see "Does cognitive diversity alone improve multi-agent ideation quality?"). Likewise, allocating compute to diverse abstractions produces structured breadth instead of either shallow novelty or narrow depth (see "Can abstractions guide exploration better than depth alone?"), and creativity researchers argue that combinational, exploratory, and transformational reasoning are distinct modes that existing LLM reasoning methods leave unaddressed (see "Can LLMs reason creatively beyond conventional problem-solving?"). These point toward expanding the possibility space, not trading along a single line.

The quiet takeaway: the most promising route past the trade-off is decoupling *generation* from *evaluation*. LLMs generate novelty well but systematically avoid the evaluative stance that feasibility requires (see "Can LLMs generate more novel ideas than human experts?"), and automated LLM evaluation overestimates idea quality by roughly 60% (see "Why do LLMs generate more novel research ideas than experts?"). But a structured pipeline that extracts claims, retrieves related work, and compares them reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on novelty assessment (see "Can structured pipelines make LLM novelty assessment reliable?"). If you let a system roam freely for novelty and then apply a separate, well-built feasibility filter, you don't have to buy feasibility with less novelty: you generate widely, then prune. The trade-off is most binding when one model is forced to do both jobs at once.
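The generate-then-prune idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a pipeline from any of the cited notes: the `Idea` type, the `generate_then_prune` function, the 0-to-1 score scale, and the threshold are all hypothetical, and the scores are assumed to come from two separate sources (a generator-side novelty judge and an independent feasibility evaluator).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Idea:
    text: str
    novelty: float      # hypothetical score from the generation side, in [0, 1]
    feasibility: float  # hypothetical score from a separate evaluator, in [0, 1]


def generate_then_prune(
    candidates: Iterable[Idea],
    feasibility_floor: float,
    keep: int,
) -> List[Idea]:
    """Drop ideas below the feasibility floor, then rank survivors by novelty."""
    feasible = [idea for idea in candidates if idea.feasibility >= feasibility_floor]
    return sorted(feasible, key=lambda idea: idea.novelty, reverse=True)[:keep]


# Made-up candidate pool, purely for illustration.
pool = [
    Idea("A", novelty=0.9, feasibility=0.2),
    Idea("B", novelty=0.8, feasibility=0.7),
    Idea("C", novelty=0.4, feasibility=0.9),
    Idea("D", novelty=0.6, feasibility=0.6),
]

shortlist = generate_then_prune(pool, feasibility_floor=0.5, keep=2)
print([idea.text for idea in shortlist])  # prints ['B', 'D']
```

The design point is that feasibility acts as a gate rather than a weight. Idea A is discarded for being infeasible, but nothing pushes the generator toward safer ideas in the first place, so novelty among the survivors is not traded away.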


Sources (11 notes)

Why do LLMs excel at feasible design but struggle with novelty?

Expert evaluation shows LLM-generated conceptual designs score higher on feasibility and usefulness but lower on novelty compared to crowdsourced human solutions. Few-shot learning further reduces diversity while improving quality alignment.

Do language models generate more novel research ideas than experts?

A statistically significant study of 100+ NLP researchers found LLM-generated ideas rated as more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Why do LLMs generate more novel research ideas than experts?

Research shows LLM-generated ideas are statistically more novel than expert-produced ideas, but LLMs struggle to evaluate quality: automated evaluation overestimates by 60%. When executed, LLM ideas drop significantly on all metrics, suggesting the novelty came without matching feasibility.

Can LLMs generate more novel ideas than human experts?

LLMs produce more novel research ideas than experts because they lack disciplinary constraints, but they systematically avoid evaluative stance-taking required to assess feasibility or validity. Generation and evaluation are dissociated capabilities.

Do LLM research ideas actually hold up when experts try to execute them?

When 43 expert researchers implemented randomly-assigned ideas over 100+ hours, LLM-generated ideas declined significantly more than human ideas across all metrics. Execution revealed systematic weaknesses invisible at ideation, including impractical evaluation designs and missing technical groundwork.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Why do LLMs generate novel ideas from narrow ranges?

LLM-generated research ideas are rated individually novel but lack diversity, clustering in narrow generative regions. Combined with LLM self-evaluation failures, this limits the possibility space explored compared to human ideation across different conceptual territories.

Does cognitive diversity alone improve multi-agent ideation quality?

Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can LLMs reason creatively beyond conventional problem-solving?

Research identifies combinational, exploratory, and transformational reasoning as distinct creative modes grounded in cognitive science. Existing LLM reasoning methods address only conventional problem-solving, leaving creative paradigms unaddressed and potentially explaining diversity collapse in ideation.

Can structured pipelines make LLM novelty assessment reliable?

A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
