Can machines learn what makes research worth doing?
Can AI systems trained on community citation patterns learn to recognize high-impact research directions the way human scientists do? The paper explores whether "scientific taste", judgment about what to pursue, is learnable from collective community signals.
Most AI-scientist research focuses on execution: literature search, experiment design, data analysis. RLCF addresses a different capability: judging which research directions are worth pursuing. This is the judgment capacity the authors call "scientific taste."
The training paradigm, Reinforcement Learning from Community Feedback (RLCF), uses citation counts as the community feedback signal. To mitigate field and time biases, the training data consists of 700K pairs of paper abstracts matched by field and publication year, with the higher-cited paper in each pair treated as the preferred (higher-impact) example. A sketch of this pair construction follows.
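A minimal sketch of the pair construction in Python, assuming a simple in-memory corpus. The `Paper` fields, the citation-gap threshold, and the per-bucket sampling cap are illustrative assumptions; the source specifies only that pairs are matched by field and publication year with the higher-cited paper preferred.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass
class Paper:
    abstract: str
    field: str
    year: int
    citations: int

def build_preference_pairs(papers, pairs_per_bucket=10, min_citation_gap=5, seed=0):
    """Emit (preferred, rejected) abstract pairs from papers grouped by
    (field, year), so citation comparisons control for field and time bias."""
    rng = random.Random(seed)
    buckets = {}
    for p in papers:
        buckets.setdefault((p.field, p.year), []).append(p)
    pairs = []
    for bucket in buckets.values():
        candidates = list(itertools.combinations(bucket, 2))
        rng.shuffle(candidates)
        kept = 0
        for a, b in candidates:
            if kept >= pairs_per_bucket:
                break
            # Near-ties make noisy preference labels; skip small gaps.
            if abs(a.citations - b.citations) < min_citation_gap:
                continue
            hi, lo = (a, b) if a.citations > b.citations else (b, a)
            pairs.append((hi.abstract, lo.abstract))
            kept += 1
    return pairs
```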
Two trained models (a sketch of the judge-as-reward loop follows this list):
- Scientific Judge — a generative reward model that compares two papers, reasons about their relative impact, and chooses the better one. Outperforms GPT-5.2, Gemini 3 Pro, and other SOTA LLMs at predicting impact. Generalizes to future-year test sets, unseen fields, and peer-review preferences.
- Scientific Thinker — a policy model trained via RL with Scientific Judge as reward model. Given a paper's title and abstract, it proposes follow-up research ideas with higher potential impact than baselines.
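A minimal sketch of how a generative judge can supply the reward for the policy model, assuming a generic `generate(prompt)` LLM call. The prompt wording, the final-line verdict format, the ±1 reward scale, and the order-swap debiasing are illustrative assumptions, not the paper's exact protocol.

```python
JUDGE_TEMPLATE = """You are comparing two research abstracts matched by field
and year. Reason step by step about which is likely to have higher impact,
then answer with exactly "A" or "B" on the final line.

Abstract A:
{a}

Abstract B:
{b}
"""

def judge_prefers_a(generate, abstract_a: str, abstract_b: str) -> bool:
    """Generative reward model: the judge reasons in free text; we read
    only its final-line verdict."""
    verdict = generate(JUDGE_TEMPLATE.format(a=abstract_a, b=abstract_b))
    return verdict.strip().splitlines()[-1].strip().upper().startswith("A")

def pairwise_reward(generate, proposal: str, baseline: str) -> float:
    """Reward for the Scientific Thinker policy: positive when the judge
    prefers its proposal over a baseline idea. Querying both orderings
    controls for the judge's position bias."""
    win_as_a = judge_prefers_a(generate, proposal, baseline)
    win_as_b = not judge_prefers_a(generate, baseline, proposal)
    return ((1.0 if win_as_a else -1.0) + (1.0 if win_as_b else -1.0)) / 2
```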
The theoretical framing is significant. The authors invoke Hume: "a standard of taste can emerge from the joint verdict of qualified judges rather than arbitrary individual preference." And Kant: taste as "sensus communis" — a shared sense that considers how others could judge. Scientific taste is not personal preference. It is alignment with community judgment. RLCF operationalizes this: the reward signal comes from community behavior (citations), not individual annotation (RLHF) or formal verification (RLVR).
Three RL paradigms are now distinguishable (illustrative reward interfaces are sketched after this list):
- RLHF — individual human preferences (costly, limited to annotator capacity)
- RLVR — verifiable ground truth (math, code — limited to tasks with objective answers)
- RLCF — community-level feedback (scales with community size, captures collective judgment)
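To make the contrast concrete, here are illustrative reward interfaces for the three paradigms; the shapes and names are assumptions for exposition, not the paper's.

```python
from typing import Callable

# RLHF: a reward model distilled from individual annotators scores one
# response; scale is bounded by annotator capacity.
RLHFReward = Callable[[str], float]  # reward(response)

# RLVR: a verifier checks the response against ground truth; defined only
# where an objective answer exists (math, code).
def rlvr_reward(response: str, expected: str) -> float:
    return 1.0 if response.strip() == expected.strip() else 0.0

# RLCF: the preference is inferred from aggregate community behavior,
# here citation counts; it scales with community size.
def rlcf_reward(citations_a: int, citations_b: int) -> float:
    return 1.0 if citations_a > citations_b else -1.0
```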
Building on Can AI predict social norms better than humans?, RLCF is the training analog: the model learns to predict community preference (which papers will be cited) without participating in the community process that produces citations. It predicts taste without having taste. This is the same prediction-without-participation pattern, now as an explicit training objective.
Building on Can AI ever gain expert community trust through participation?, RLCF trains the model to bypass the validation circle entirely, learning what the circle would approve without joining it. The epistemological implications for the Tokenization series are direct: this is a machine that learns to produce knowledge-tokens calibrated to community acceptance, without the community process that gives acceptance its warrant.
Source: Reinforcement Learning Paper: AI Can Learn Scientific Taste
Related concepts in this collection
- Can AI predict social norms better than humans?
  Explores whether language models can achieve superhuman accuracy at predicting what communities find socially appropriate, and what that capability reveals about the difference between prediction and genuine participation.
  Relation: RLCF is the training-level version, learning community preference without community participation.
- Can AI ever gain expert community trust through participation?
  Explores whether AI can accumulate the social capital and track record that human experts build within their communities. Questions whether prediction of social norms equals genuine participation in expert validation processes.
  Relation: RLCF trains a bypass of the validation circle.
- Why does RL succeed more on some tasks than others?
  Reinforcement learning shows wildly different improvement rates across conversational tasks, from near-total capability unlock to modest gains. What determines whether RL will transform performance or produce incremental progress?
  Relation: RLCF introduces a third reward type, community-level feedback, which is neither binary verification nor individual judgment.
- Do LLM research ideas actually hold up when experts try to execute them?
  Explores whether LLM-generated ideas maintain their apparent novelty advantage when expert researchers spend 100+ hours implementing them. Matters because ideation-stage evaluation may not capture real-world feasibility barriers.
  Relation: Scientific Thinker addresses ideation quality; whether execution quality follows is untested.
Original note title
reinforcement learning from community feedback trains scientific taste by using citation-based community preferences as reward signal — separating judgment from execution