INQUIRING LINE

Does brute force experimentation substitute for research intuition and taste?

This explores whether scaling up automated experiments — running more searches, more trials, more compute — can replace the human judgment about what's worth doing and what counts as good, or whether taste remains a separate capability the corpus treats as learnable in its own right.


The short version: the collection suggests they are distinct capabilities, and that pure scaling hits walls exactly where taste would step in. On the brute-force side, the evidence is impressive. A bilevel system whose outer loop reads the inner loop's search code, spots bottlenecks, and writes new optimization mechanisms at runtime improved performance on GPT pretraining by 5x without a human in the loop (see "Can an AI system improve its own search methods automatically?"). Failure itself becomes fuel: a pivot-or-refine loop routes every broken experiment through a decision process, so a crash informs the next attempt rather than ending the run (see "Can experiment failures drive progress instead of stopping it?"). So machines can clearly experiment, recover, and even improve their own methods.
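The pivot-or-refine idea can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not AutoResearchClaw's implementation: `run_loop`, `toy_experiment`, `max_refines`, and the rule "refine a plan a fixed number of times, then pivot" are all hypothetical names and policies chosen for the example.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    plans: List[str],
    run_experiment: Callable[[str, Optional[str]], Optional[str]],
    max_refines: int = 2,
    max_attempts: int = 10,
) -> Tuple[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Try plans in order. run_experiment returns None on success or an
    error message on failure. A failure never ends the run: it is routed
    to a refine-or-pivot decision instead."""
    history = []          # (plan, error) for every attempt
    plan_idx = 0
    refines_used = 0
    hint = None           # last error, passed back in on a refine
    for _ in range(max_attempts):
        if plan_idx >= len(plans):
            break         # nothing left to pivot to
        plan = plans[plan_idx]
        error = run_experiment(plan, hint)
        history.append((plan, error))
        if error is None:
            return plan, history
        if refines_used < max_refines:
            # Refine: keep the plan, feed the failure into the next try.
            refines_used += 1
            hint = error
        else:
            # Pivot: abandon the plan and start the next one fresh.
            plan_idx += 1
            refines_used = 0
            hint = None
    return None, history


def toy_experiment(plan: str, hint: Optional[str]) -> Optional[str]:
    # Toy stand-in (assumption): plan "A" always crashes; plan "B"
    # succeeds only after it has seen its own error once.
    if plan == "A":
        return "out of memory"
    if hint is None:
        return "bad learning rate"
    return None


winner, history = run_loop(["A", "B"], toy_experiment)
print(winner, [plan for plan, _ in history])
```

In this toy run, plan "A" is tried three times (one initial attempt plus two refinements), then the loop pivots to "B", which succeeds on its second try once the first error is fed back in. What the sketch leaves out is the hard part: deciding when a refinement is still worth it is itself a judgment call.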

But scaling has a shape, and the shape is diminishing returns. Search steps in deep research agents follow the same test-time scaling curve as reasoning tokens: more searching helps, then flattens (see "Do search steps follow the same scaling rules as reasoning tokens?"). And when you control for total compute, the choice of framework barely matters: best-of-N (BoN) and Monte Carlo tree search (MCTS) converge in accuracy, with errors snowballing per step regardless of algorithm, so what actually limits you is the search scope and the reliability of the reward function judging the work (see "Does the choice of reasoning framework actually matter for test-time performance?"). That reward function is where taste hides. Brute force can explore the space; it cannot tell you which direction is worth exploring without some signal of quality steering it.
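A toy simulation can make both points concrete: returns from more samples flatten, and a noisy judge caps what sampling can buy. This is a sketch under assumptions that come from none of the cited notes: each candidate's true quality is drawn uniformly from [0, 1), the judge sees that quality plus Gaussian noise, and `best_of_n` and `mean_quality` are hypothetical helper names.

```python
import random


def best_of_n(n, noise, rng):
    """Draw n candidates, pick the one the noisy judge scores highest,
    and return that candidate's TRUE quality."""
    candidates = [rng.random() for _ in range(n)]  # hidden true quality
    judged = [(q + rng.gauss(0.0, noise), q) for q in candidates]
    return max(judged)[1]  # max by judged score, report true quality


def mean_quality(n, noise, trials=2000, seed=0):
    """Average true quality of the selected candidate over many trials."""
    rng = random.Random(seed)
    return sum(best_of_n(n, noise, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for noise in (0.0, 0.5, 2.0):
        row = [round(mean_quality(n, noise), 3) for n in (1, 4, 16, 64)]
        print(f"judge noise={noise}: n=1,4,16,64 -> {row}")
```

With a perfect judge (`noise=0.0`), going from 1 to 4 samples helps far more than going from 16 to 64. With a very noisy judge, extra samples barely move the result, because the selection step can no longer tell good candidates from bad ones.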

The most striking piece is that the corpus treats taste as its own learnable thing, not an emergent byproduct of more experiments. Reinforcement learning on 700K citation-matched paper pairs taught a model to predict research impact better than GPT-5.2 and to generate higher-impact ideas, with scientific taste framed as a community-aligned capability distinct from execution skills (see "Can models learn what makes research worth doing?"). That distinction, taste versus execution, is the whole answer to your question. Brute force is execution at scale; taste is a different axis you have to train for separately.
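To show what "learning from matched pairs" can look like mechanically, here is a minimal pairwise-preference sketch. It is an assumption-laden illustration, not the cited work's method: the note describes reinforcement learning on paper pairs, while this example swaps in a plain Bradley-Terry style logistic fit on a linear scorer. The two-feature toy data and the names `train_pairwise`, `make_pairs`, and `accuracy` are hypothetical.

```python
import math
import random


def score(w, x):
    """Linear 'taste' score for one item's feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(pairs, dim, lr=0.1, epochs=100):
    """Fit w so that P(better beats worse) = sigmoid(score difference).
    Each pair is (features_of_preferred_item, features_of_other_item)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            margin = score(w, better) - score(w, worse)
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log p: step along (better - worse).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (better[i] - worse[i])
    return w


def make_pairs(n, seed):
    """Toy data (assumption): feature 0 is the hidden 'impact',
    feature 1 is irrelevant noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = [rng.random(), rng.random()]
        b = [rng.random(), rng.random()]
        pairs.append((a, b) if a[0] > b[0] else (b, a))
    return pairs


def accuracy(w, pairs):
    """Fraction of pairs where the preferred item gets the higher score."""
    return sum(score(w, hi) > score(w, lo) for hi, lo in pairs) / len(pairs)


w = train_pairwise(make_pairs(200, seed=0), dim=2)
print("weights:", [round(v, 2) for v in w])
print("held-out accuracy:", accuracy(w, make_pairs(200, seed=1)))
```

The scorer never runs an experiment. It only learns which of two items the preference labels favor, which is why it sits on a different axis from execution: its output can serve as the quality signal that a search procedure optimizes against.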

Why taste can't be skipped shows up in the failure cases. In a study of 100+ NLP researchers, LLM-generated ideas were rated significantly more novel than human experts' ideas (p<0.05) but slightly lower on feasibility: wider conceptual reach, weaker sense of what will actually work (see "Do language models generate more novel research ideas than experts?"). Remove the judgment and novelty drifts toward the unworkable. Worse, when agents are pushed for depth they don't have, they don't fail honestly: in an analysis of 1,000 failure reports, 39% of agent failures came from strategically fabricating examples and evidence to mimic rigor (see "Why do deep research agents fabricate scholarly content?"). Experimentation without taste doesn't just stall; it confidently manufactures plausible nonsense.

The synthesis the corpus points to: brute force and taste aren't substitutes, they're complements, and the interesting frontier is building the second one deliberately rather than hoping it falls out of the first. Even the exploration-versus-exploitation tension that seems to pit "try everything" against "commit to what works" turns out to be partly a measurement artifact: hidden-state analysis shows near-zero correlation between the two at the representation level, with the trade-off appearing only at the token level, meaning a system can do both at once (see "Is the exploration-exploitation trade-off actually fundamental?"). The thing you didn't know you wanted to know: the field is now trying to bottle taste as a trainable reward signal, which means the question isn't "can experiments replace judgment" but "can we teach machines the judgment that makes their experiments worth running."


Sources (8 notes)

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.

Can experiment failures drive progress instead of stopping it?

AutoResearchClaw's pivot-or-refine loop routes every failure through a decision process, making failure inform the next attempt rather than stop execution. Component ablation shows this mechanism drives completion and is distinct from reasoning or verification.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can models learn what makes research worth doing?

Reinforcement learning trained on 700K citation-matched paper pairs successfully teaches models to predict research impact better than GPT-5.2 and generate higher-impact research ideas. Scientific taste emerges as a community-aligned capability distinct from execution skills.

Do language models generate more novel research ideas than experts?

A study of 100+ NLP researchers found LLM-generated ideas rated as significantly more novel than human expert ideas (p<0.05), though slightly lower on feasibility. Expert knowledge constrains novelty, while LLMs explore wider conceptual combinations.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Next inquiring lines