AI Can Learn Scientific Taste
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show that Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preferences. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward human-level AI scientists.
Great scientists possess not only technical skill but also strong judgement and foresight, qualities closely tied to what we call scientific taste [1, 2]. We use the term to refer to the capacity to judge and propose research ideas with high potential impact. While recent progress in building AI scientists has largely focused on improving their ability to search the literature [3–6] and conduct automated experiments [7–11], enhancing an AI scientist's scientific taste remains underexplored [12, 13].
Scientific taste is not simply a matter of subjective preference. Hume argued that a standard of taste can emerge from the joint verdict of qualified judges rather than from arbitrary individual preference [14]. Kant [15] introduced taste as a kind of "sensus communis", a shared sense that considers how others could judge rather than a merely personal one. In the scientific context, such a community verdict is reflected through long-term interactions within a research community. Work that aligns with this scientific taste is more likely to be reused and extended by subsequent studies. Ultimately, community feedback is expressed through signals, most prominently citations, which are the most common way to measure the impact of scientific research [16, 17].
We propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community feedback to construct community preference signals, and formulate scientific taste learning as a preference modeling and alignment problem [18–20]. To translate raw community feedback (e.g., citations) into learnable preference signals, we convert absolute feedback into matched pairwise comparisons and build SciJudgeBench [21, 22]. SciJudgeBench contains 700K pairs of paper abstracts (higher-cited vs. lower-cited), where each pair is matched by research field and publication time, so that the resulting pairwise signal more directly reflects the community's preference for ideas with high potential impact.
For preference modeling, we train Scientific Judge, a generative reward model [23–29]: it compares two papers against its own evaluation rubric, reasons about them, and then chooses the better one. Beyond serving as a reward model, Scientific Judge can rank newly published papers before they receive any citations. We train Scientific Judge with a reinforcement learning algorithm (GRPO) [30], assigning rewards based on whether its preference judgements are correct.
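To make this reward concrete, below is a minimal sketch of a correctness-based GRPO reward for the judge. The "Preferred: A/B" verdict format and the function names are assumptions for illustration, not the paper's specification; only the group-standardized advantage follows GRPO as described.

```python
import re

def judge_reward(completion: str, gold_choice: str) -> float:
    """Binary reward for one sampled judge completion (hypothetical format).

    Assumes the judge ends its reasoning with a verdict like "Preferred: A"
    or "Preferred: B"; the actual output format is not specified here.
    """
    match = re.search(r"Preferred:\s*([AB])", completion)
    if match is None:
        return 0.0                      # unparseable verdict earns no reward
    return 1.0 if match.group(1) == gold_choice else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: standardize rewards within the
    group of completions sampled for the same paper pair."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        return [0.0 for _ in rewards]   # all completions tied; no gradient signal
    return [(r - mean) / std for r in rewards]
```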
Learning to judge is only half the picture: a scientist must also propose promising directions. Therefore, using Scientific Judge as the reward model, we train a policy model, Scientific Thinker, via reinforcement learning [23, 31]. Scientific Thinker generates scientific ideas with high academic value and potential impact, aligned with community preference. Human scientists typically develop new research ideas when inspired by a new paper. Similarly, we provide Scientific Thinker with the title and abstract of a paper and prompt it to propose a follow-up research idea with high potential impact after thinking.
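The section above does not spell out how a pairwise judge is turned into a scalar reward for each sampled idea. One plausible scheme, sketched here purely as an assumption, scores each idea by its win rate in round-robin comparisons decided by Scientific Judge; the `judge_prefers` wrapper is a hypothetical interface to the judge model.

```python
from itertools import combinations

def thinker_rewards(ideas: list[str], judge_prefers) -> list[float]:
    """Assign each sampled idea its win rate under round-robin pairwise
    comparisons by Scientific Judge.

    `judge_prefers(idea_a, idea_b)` is a hypothetical callable returning True
    if the judge prefers idea_a; the win-rate scheme itself is an assumption,
    not the paper's stated reward construction.
    """
    wins = [0] * len(ideas)
    for i, j in combinations(range(len(ideas)), 2):
        if judge_prefers(ideas[i], ideas[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    n_opponents = max(len(ideas) - 1, 1)
    return [w / n_opponents for w in wins]
```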
Current training for AI scientists mainly targets literature search [3–5, 35] and experiment execution [10, 11, 36–40]. However, these capabilities address how to carry out research rather than which research directions are worth pursuing. Human evaluations show that while LLMs can generate novel research ideas, they often struggle to reliably distinguish potentially high-impact directions from ideas that are superficially novel but trivial [41]. This gap constitutes a key difference between today's AI scientists and human experts, which we refer to as scientific taste: (1) judging the scientific value of candidate ideas, and (2) proposing research questions, hypotheses, and methods with high potential impact.
Recent studies have explored leveraging LLMs to evaluate academic manuscripts, predict review scores, and generate feedback [42–47]. However, these works primarily employ language models as components in review pipelines rather than enhancing the model's intrinsic capability for scientific judgment. Prior works [48, 49] typically use supervised fine-tuning (SFT) to train models on reviewer feedback, whereas we use community feedback through reinforcement learning to train models to judge and propose ideas with high potential impact, aligning them more closely with broader community preferences.
Reinforcement learning can be used to improve alignment [19]. Reinforcement Learning from Human Feedback (RLHF) [19, 20, 51] collects human preference annotations, trains a reward model to capture human preferences, and then optimizes a policy model with that reward, enabling better alignment to subjective preferences such as being helpful and harmless. Recent efforts further scale reward modeling and develop standardized benchmarks for evaluating reward models [18, 21, 22]. For tasks such as math and coding, Reinforcement Learning with Verifiable Reward (RLVR) [30, 52] instead leverages verifiable rewards provided by ground-truth answers, unit tests, or formal checkers, and has led to large gains in mathematical reasoning, code generation, and broader post-training pipelines [53–55].
However, RLVR is inherently tied to tasks with verifiable ground truth, making it difficult to apply to open-ended tasks such as scientific judging and idea generation [52]. RLHF, on the other hand, is limited by its reliance on costly human annotations [19, 20] and by its inability to capture community-level preferences from individual annotations alone. Our work proposes Reinforcement Learning from Community Feedback (RLCF), which leverages scalable community feedback signals that naturally emerge from community interactions, thereby inherently capturing community preferences.
Reinforcement Learning from Community Feedback
To learn scientific taste, we introduce Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision. RLCF proceeds in three stages: (1) community preference construction, where we collect community feedback signals and convert them into preference data; (2) preference modeling, where we train Scientific Judge to predict the potential impact of research ideas; and (3) preference alignment, where we use Scientific Judge as a reward model to supervise Scientific Thinker in generating scientific ideas with high potential impact.
3.1 Community Feedback as Supervision
We use citations as scientific community feedback signals, because citation count is a community verdict formed through long-term interactions within a research community, and a high citation count can indicate the high impact of a piece of scientific research [56]. To mitigate field and time biases in raw citation counts, we construct training data by pairing articles from the same field and year, where the one with significantly more citations serves as the preferred (higher-impact) item.
Each training example consists of two scientific ideas represented by their titles and abstracts [56, 57], with a binary label indicating which one received relatively more citations. We refer to the resulting dataset as SciJudgeBench, which transforms community feedback into pairwise supervision signals, enabling scalable preference learning.
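As a rough illustration of this construction, the sketch below groups papers by field and publication year and emits a preference pair whenever the citation gap is large. The record fields and the `min_gap` threshold are hypothetical placeholders, not the actual filtering criteria used to build SciJudgeBench.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_citation_pairs(papers: list[dict], min_gap: int = 20) -> list[dict]:
    """Turn raw citation counts into field- and year-matched preference pairs.

    Each paper dict is assumed to hold 'field', 'year', 'citations', 'title',
    and 'abstract'; `min_gap` is an illustrative threshold for "significantly
    more citations", not the value used in the paper.
    """
    buckets = defaultdict(list)
    for p in papers:
        buckets[(p["field"], p["year"])].append(p)   # match by field and year

    pairs = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            if abs(a["citations"] - b["citations"]) < min_gap:
                continue                              # skip ambiguous pairs
            hi, lo = (a, b) if a["citations"] > b["citations"] else (b, a)
            # Randomize presentation order so the label is not position-biased.
            first, second = (hi, lo) if random.random() < 0.5 else (lo, hi)
            pairs.append({
                "idea_a": {"title": first["title"], "abstract": first["abstract"]},
                "idea_b": {"title": second["title"], "abstract": second["abstract"]},
                "label": "a" if first is hi else "b",  # higher-cited idea preferred
            })
    return pairs
```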