AI Can Learn Scientific Taste: Summary

Summary (Overview)

  • Proposes RLCF (Reinforcement Learning from Community Feedback): A novel training paradigm that uses large-scale community feedback (e.g., citations) as supervision to teach AI models scientific taste—the capacity to judge and propose high-impact research ideas.
  • Trains Scientific Judge: A generative reward model trained on 700K field- and time-matched paper pairs from SciJudgeBench to predict which paper has higher potential impact (via citations). It significantly outperforms state-of-the-art LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes across time, fields, and peer-review preferences.
  • Trains Scientific Thinker: A policy model trained via Comparison-Based GRPO using Scientific Judge as a reward model. It learns to propose follow-up research ideas with higher potential impact than strong baselines, demonstrating improved scientific ideation.
  • Demonstrates Scalability and Generalization: Shows that learning scientific judgement scales log-linearly with data and model size. The learned "taste" transfers to future papers, unseen scientific fields, and different evaluation metrics (from citations to peer-review scores).
  • Key Finding: Scientific taste is not a mystical human trait but a learnable objective from community signals, marking a significant step toward human-level AI scientists.

Introduction and Theoretical Foundation

The paper defines scientific taste as the capacity to judge and propose research ideas with high potential impact, a hallmark of great scientists. While recent AI research has focused on improving AI scientists' executive capabilities (e.g., literature search, automated experimentation), enhancing their scientific taste remains underexplored.

The theoretical foundation draws from philosophical notions of taste (Hume, Kant) as a shared community standard rather than individual preference. In science, this community verdict is reflected through long-term interactions, primarily via citations, which serve as a proxy for a paper's impact.

The authors formalize key concepts:

  • Potential Impact $I(p)$: The cumulative expected citations of a paper $p$: $I(p) = \lim_{N \to \infty} \sum_{t=1}^{N} \mathbb{E}[c_t(p)]$, where $c_t(p)$ is the number of citations the paper receives in year $t$.
  • Judgement Capability $\text{JudgeCap}(\theta)$: A model $\theta$'s accuracy at comparing the impact of two papers $(p_a, p_b)$ drawn from a matched distribution $\mathcal{D}$: $\text{JudgeCap}(\theta) = \mathbb{E}_{(p_a, p_b) \sim \mathcal{D}}\left[\mathbb{1}\left[\text{Judge}_\theta(p_a, p_b) = y(p_a, p_b)\right]\right]$, where $y(p_a, p_b) = 1$ if $I(p_a) > I(p_b)$ and $0$ otherwise.
  • Ideation Capability $\text{ThinkerCap}(\phi)$: The expected impact of ideas generated by a model $\phi$ given a seed paper $s$: $\text{ThinkerCap}(\phi) = \mathbb{E}_{s \sim \mathcal{S}}\left[I(\text{Thinker}_\phi(s))\right]$.
  • Scientific Taste: The combination of high JudgeCap and high ThinkerCap.

The core idea is to formulate scientific taste learning as a preference modeling and alignment problem, using community feedback as the source of preference signals.
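The JudgeCap definition above can be made concrete with a short sketch. Everything here is illustrative: `judge_cap`, the toy comparator, and the sample pairs are hypothetical stand-ins, not the paper's implementation.

```python
# Hypothetical sketch: estimating JudgeCap(theta) as pairwise accuracy
# over a matched sample of paper pairs. `judge` stands in for any model
# that returns "a" or "b" for the higher-impact paper.

def judge_cap(judge, pairs):
    """Empirical JudgeCap: fraction of pairs where the model picks
    the paper with the higher potential impact I(p)."""
    correct = 0
    for paper_a, paper_b, impact_a, impact_b in pairs:
        label = "a" if impact_a > impact_b else "b"  # y(p_a, p_b)
        if judge(paper_a, paper_b) == label:
            correct += 1
    return correct / len(pairs)

# Toy usage: a "judge" that simply prefers the longer abstract.
pairs = [
    ("short abstract", "a much longer, detailed abstract", 10, 250),
    ("thorough survey of the field", "note", 400, 3),
]
longer_wins = lambda a, b: "a" if len(a) > len(b) else "b"
print(judge_cap(longer_wins, pairs))  # 1.0 on this toy sample
```

In practice the expectation is taken over the matched distribution $\mathcal{D}$; the sample pairs above merely illustrate the accuracy computation.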

Methodology

The proposed Reinforcement Learning from Community Feedback (RLCF) paradigm consists of three stages (see Figure 2):

  1. Construct Community Preference: Build SciJudgeBench, a dataset of 700K pairwise comparisons. For each pair, two paper abstracts from the same field and publication year are matched, with the higher-cited paper labeled as preferred. This controls for field and time biases in raw citation counts.

  2. Preference Modeling (Train Scientific Judge): Train a generative reward model to predict the preferred paper in a pair. Training uses Group Relative Policy Optimization (GRPO). For an input $x$ (a paper pair), the policy $\pi_\theta$ samples a group of $G$ outputs $\{o_i\}_{i=1}^G$, each containing reasoning and a prediction. The reward $r_i$ is 1 if the prediction matches the true label $y$, and 0 otherwise:

    $$r_i = \begin{cases} 1, & \text{if } \hat{y}(o_i) = y \\ 0, & \text{otherwise} \end{cases}$$

    The policy is updated to maximize a clipped surrogate objective with a KL penalty:

    $$\mathcal{J}(\theta) = \mathbb{E}_x \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, \hat{A}_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

    where $\rho_i = \pi_\theta(o_i \mid x) / \pi_{\text{old}}(o_i \mid x)$ is the importance ratio, $\hat{A}_i$ is the group-normalized advantage, and $\beta$ controls the strength of the KL penalty.

  3. Preference Alignment (Train Scientific Thinker): Use the trained Scientific Judge as a reward model to train a policy model (Scientific Thinker) to generate high-impact research ideas. Since scoring a single idea in isolation is difficult, the authors employ Comparison-Based GRPO. For a seed-paper prompt $x$, the policy samples $G$ candidate ideas $\{o_1, \ldots, o_G\}$, and Scientific Judge conducts round-robin pairwise comparisons. The reward for idea $o_i$ is its win rate within the group:

    $$r_i = \frac{1}{G-1} \sum_{j \neq i} s(o_i, o_j)$$

    where $s(o_i, o_j) = 1$ if $o_i$ wins against $o_j$, and $0$ otherwise. The policy is then updated using the same GRPO objective (Eq. 6).
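The round-robin win-rate reward and the group normalization it feeds into can be sketched in a few lines. This is a minimal illustration under assumed interfaces (`judge_wins` and the toy comparator are hypothetical), not the authors' implementation; in training, the resulting advantages enter the clipped GRPO objective above.

```python
# Minimal sketch (not the authors' code) of the Comparison-Based GRPO
# reward: each sampled idea's reward is its round-robin win rate within
# the group, and rewards are then normalized into advantages.

from statistics import mean, pstdev

def group_rewards(ideas, judge_wins):
    """judge_wins(i, j) -> True if idea i beats idea j under the
    reward model (here an arbitrary stand-in comparator)."""
    G = len(ideas)
    rewards = []
    for i in range(G):
        wins = sum(judge_wins(ideas[i], ideas[j]) for j in range(G) if j != i)
        rewards.append(wins / (G - 1))  # r_i = win rate within the group
    return rewards

def advantages(rewards):
    """Group-normalized advantages: A_i = (r_i - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# Toy usage: the "judge" prefers the idea with the higher attached score.
ideas = [("idea A", 0.9), ("idea B", 0.4), ("idea C", 0.7)]
beats = lambda x, y: x[1] > y[1]
r = group_rewards(ideas, beats)
print(r)  # [1.0, 0.0, 0.5]: A beats both, B beats none, C beats only B
print(advantages(r))
```

Because rewards are relative win rates within one group, the advantage normalization needs no absolute notion of an idea's quality, which is exactly why pairwise comparison sidesteps the difficulty of scoring a single idea.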

Empirical Validation / Results

4. AI Can Learn Scientific Judgement (Scientific Judge)

Setup: Models are trained on SciJudgeBench (696,758 pairs from arXiv through 2024). Evaluated on: 1) In-domain test set, 2) Temporal OOD (2025 papers), 3) Metric OOD (ICLR papers with peer-review scores), and 4) Field OOD (train on CS only, test on other fields).

Key Results:

  1. Scaling Trends: Performance improves log-linearly with more training data and with larger model size.

    "SciJudge-Qwen3-30B surpasses all listed proprietary baselines."

  2. In-Domain Performance: Scientific Judge models substantially outperform their base models and SOTA LLMs.

    Table 3: Main results on SciJudgeBench (in-domain test set). Pairwise accuracy (%).

    | Model | CS | Math | Physics | Others | Avg. |
    |---|---|---|---|---|---|
    | Qwen3-4B-Instruct | 66.5 | 65.6 | 54.8 | 57.1 | 60.3 |
    | SciJudge-Qwen3-4B | 78.6 (+12.1) | 74.6 (+9.0) | 71.2 (+16.4) | 79.8 (+22.7) | 75.3 (+15.0) |
    | Qwen3-30B-A3B-Instruct | 73.8 | 70.5 | 59.4 | 65.5 | 66.3 |
    | SciJudge-Qwen3-30B | 83.5 (+9.7) | 78.7 (+8.2) | 78.7 (+19.2) | 82.3 (+16.8) | 80.6 (+14.3) |
    | Gemini-3.0-Pro-Preview | 81.1 | 73.0 | 72.6 | 76.5 | 75.7 |
    | GPT-5.2-Thinking | 79.1 | 68.8 | 69.4 | 73.1 | 72.7 |
  3. Generalization:

    • Temporal: Maintains strong performance on future (2025) papers (Table 4).
    • Field: Models trained only on CS data generalize well to Math, Physics, and other fields (Table 5).
    • Metric: Transfers effectively from citation-based to peer-review-based preferences on ICLR papers (Table 6).

    Table 6: Metric OOD results on ICLR papers (peer-review scores).

    | Model | Accuracy |
    |---|---|
    | Qwen3-4B-Instruct | 65.3 |
    | SciJudge-Qwen3-4B | 79.1 (+13.8) |
    | Qwen3-30B-A3B-Instruct | 76.8 |
    | SciJudge-Qwen3-30B | 87.7 (+11.0) |

5. AI Can Learn Ideation with High Potential Impact (Scientific Thinker)

Setup: Train Scientific Thinker (4B and 30B parameters) using Scientific Judge as reward model. Evaluate by having strong LLM judges (GPT-5.2, GLM-5, Gemini 3 Pro) perform pairwise comparisons between ideas from the trained policy and a baseline, using majority vote.
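The majority-vote protocol in this setup can be sketched as follows; the judge functions and idea pairs are hypothetical placeholders for the LLM judges (GPT-5.2, GLM-5, Gemini 3 Pro), not a real evaluation harness.

```python
# Illustrative sketch of the evaluation protocol: each judge picks a
# winner per pair, majority vote decides, and the win rate is the
# fraction of pairs the trained policy's idea wins.

from collections import Counter

def majority_win_rate(pairs, judges):
    """pairs: list of (policy_idea, baseline_idea).
    judges: callables returning 'policy' or 'baseline' per pair."""
    wins = 0
    for policy_idea, baseline_idea in pairs:
        votes = Counter(j(policy_idea, baseline_idea) for j in judges)
        if votes["policy"] > votes["baseline"]:
            wins += 1
    return wins / len(pairs)

# Toy usage: stand-in judges that vote by idea length.
pairs = [("long detailed idea", "idea"), ("hm", "a richer proposal")]
by_len = lambda a, b: "policy" if len(a) > len(b) else "baseline"
always_base = lambda a, b: "baseline"
print(majority_win_rate(pairs, [by_len, by_len, always_base]))  # 0.5
```

With an odd number of judges the vote cannot tie, so the majority always yields a definite winner per pair.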

Key Results:

  1. Improved Ideation: Scientific Thinker significantly outperforms its base policy.

    • SciThinker-30B vs. base: 81.5% in-domain win rate, 83.0% out-of-domain.
    • SciThinker-4B vs. base: 76.5% in-domain win rate, 76.0% out-of-domain.
  2. Effective Reward Model: Scientific Judge is a more effective reward model than the base LLM (Qwen3-4B-Instruct), leading to higher win rates for the trained policy (Figure 4).

  3. Competitive with SOTA: After training, SciThinker-30B achieves an average win rate of 54.2% against three SOTA models (GPT-5.2, GLM-5, Gemini 3 Pro), beating GPT-5.2 and GLM-5 in head-to-head comparisons (Table 8).

    Table 8 (a): In-Domain Win Rates (%) against SOTA models.

    | Model | GPT-5.2 | GLM-5 | Gemini 3 Pro | Avg. |
    |---|---|---|---|---|
    | Qwen3-30B (Base) | 37.5 | 33.0 | 20.5 | 30.3 |
    | SciThinker-30B | 61.0 (+23.5) | 58.5 (+25.5) | 43.0 (+22.5) | 54.2 (+23.9) |

Theoretical and Practical Implications

Theoretical Implications:

  • Provides a formal, learnable definition of "scientific taste" grounded in community feedback.
  • Demonstrates that community-level preferences, as captured by citations, can be effectively modeled and aligned with, bridging RLHF (human preferences) and RLVR (verifiable rewards) paradigms through the novel RLCF framework.
  • Shows that scientific judgement and ideation capabilities can be decoupled and improved separately via scalable machine learning techniques.

Practical Implications:

  • AI-Assisted Research: Scientific Judge can help rank new papers before they accumulate citations, aiding in literature review and grant allocation. Scientific Thinker can serve as a brainstorming assistant for generating promising research directions.
  • Scalable Evaluation: The RLCF paradigm offers a blueprint for using other large-scale, naturally occurring community signals (e.g., downloads, social media mentions) to train AI systems for other open-ended judgement tasks.
  • Towards AI Scientists: Represents a concrete step towards building AI systems that possess not just execution capabilities but also the strategic foresight characteristic of human experts.

Conclusion

This work demonstrates that AI can learn scientific taste—the ability to judge and propose high-impact research ideas—from large-scale community feedback (citations). The proposed Reinforcement Learning from Community Feedback (RLCF) paradigm successfully trains:

  1. Scientific Judge: A model for scientific judgement that outperforms SOTA LLMs and generalizes robustly.
  2. Scientific Thinker: A model for scientific ideation that generates ideas with higher potential impact.

The results show that scientific taste is a learnable objective, moving beyond subjective preference to community-validated patterns. This marks a significant advancement toward developing AI scientists with human-like strategic judgement.

Limitations & Future Work: Limitations include the imperfect nature of citations as a feedback signal, the need for more granular field categorization, evaluation's reliance on LLM judges, and the use of only titles and abstracts. Future work could explore broader aspects of scientific taste, model citation dynamics, implement generated ideas, and incorporate richer paper context.