# AI Can Learn Scientific Taste

> AI can learn scientific taste by training a model to predict which research paper will have higher impact using large-scale citation data as feedback.

- **Source:** [arXiv](https://arxiv.org/abs/2603.14473)
- **Published:** 2026-03-18
- **Permalink:** https://picx.dev/p/lH0SlD
- **Whiteboard:** https://picx.dev/p/lH0SlD/image

## Summary

*   **Proposes RLCF (Reinforcement Learning from Community Feedback):** A novel training paradigm that uses large-scale community feedback (e.g., citations) as supervision to teach AI models scientific taste—the capacity to judge and propose high-impact research ideas.
*   **Trains Scientific Judge:** A generative reward model trained on 700K field- and time-matched paper pairs from **SciJudgeBench** to predict which paper has higher potential impact (via citations). It significantly outperforms state-of-the-art LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes across time, fields, and peer-review preferences.
*   **Trains Scientific Thinker:** A policy model trained via **Comparison-Based GRPO** using *Scientific Judge* as a reward model. It learns to propose follow-up research ideas with higher potential impact than strong baselines, demonstrating improved scientific ideation.
*   **Demonstrates Scalability and Generalization:** Shows that learning scientific judgement scales log-linearly with data and model size. The learned "taste" transfers to future papers, unseen scientific fields, and a different evaluation metric (from citations to peer-review scores).
*   **Key Finding:** Scientific taste is not a mystical human trait but an objective that can be learned from community signals, marking a significant step toward human-level AI scientists.

## Introduction and Theoretical Foundation
The paper defines **scientific taste** as the capacity to judge and propose research ideas with high potential impact, a hallmark of great scientists. While recent AI research has focused on improving AI scientists' *executive* capabilities (e.g., literature search, automated experimentation), enhancing their scientific taste remains underexplored.

The theoretical foundation draws from philosophical notions of taste (Hume, Kant) as a shared community standard rather than individual preference. In science, this community verdict is reflected through long-term interactions, primarily via **citations**, which serve as a proxy for a paper's impact.

The authors formalize key concepts:
*   **Potential Impact ($I(p)$):** The cumulative expected citations of a paper $p$.
    $$
    I(p) = \lim_{N \to \infty} \sum_{t=1}^{N} \mathbb{E}[c_t(p)]
    $$
    where $c_t(p)$ is citations in year $t$.
*   **Judgement Capability ($\text{JudgeCap}(\theta)$):** A model $\theta$'s accuracy at comparing the impact of two papers $(p_a, p_b)$ from a matched distribution $\mathcal{D}$.
    $$
    \text{JudgeCap}(\theta) = \mathbb{E}_{(p_a,p_b)\sim\mathcal{D}}[\mathbb{1}[\text{Judge}_\theta(p_a, p_b) = y(p_a, p_b)]]
    $$
    where $y(p_a, p_b)=1$ if $I(p_a) > I(p_b)$ and $0$ otherwise.
*   **Ideation Capability ($\text{ThinkerCap}(\phi)$):** The expected impact of ideas generated by a model $\phi$ given a seed paper $s$.
    $$
    \text{ThinkerCap}(\phi) = \mathbb{E}_{s\sim\mathcal{S}}[I(\text{Thinker}_\phi(s))]
    $$
*   **Scientific Taste:** The combination of high `JudgeCap` and high `ThinkerCap`.
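
As a rough illustration of the definitions above, the sketch below computes a finite-horizon stand-in for $I(p)$ and the pairwise accuracy that $\text{JudgeCap}$ describes. The function names, the tuple layout of `pairs`, and the toy length-based judge are hypothetical choices made for this example, and truncating the infinite sum to the observed years is an assumption.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical pair layout: (text_a, text_b, yearly_citations_a, yearly_citations_b)
Pair = Tuple[str, str, Sequence[float], Sequence[float]]


def potential_impact(yearly_citations: Sequence[float]) -> float:
    """Finite-horizon stand-in for I(p): sum of the observed per-year citations."""
    return float(sum(yearly_citations))


def judge_cap(judge: Callable[[str, str], int], pairs: Sequence[Pair]) -> float:
    """Fraction of pairs on which the judge's pick matches the citation label.

    The label is 1 when paper A has the higher impact and 0 otherwise.
    """
    correct = 0
    for text_a, text_b, cites_a, cites_b in pairs:
        label = 1 if potential_impact(cites_a) > potential_impact(cites_b) else 0
        correct += int(judge(text_a, text_b) == label)
    return correct / len(pairs)


def length_judge(text_a: str, text_b: str) -> int:
    """Toy judge (assumption): prefers the longer text. Not the paper's model."""
    return 1 if len(text_a) > len(text_b) else 0


toy_pairs = [
    ("a long abstract", "short", [3, 5], [1, 1]),
    ("x", "yyyy", [0, 1], [4, 4]),
    ("abc", "ab", [1], [2]),
]
print(judge_cap(length_judge, toy_pairs))
```

Any callable that returns 1 for "A has higher impact" and 0 otherwise can be dropped in for `length_judge`.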

The core idea is to formulate scientific taste learning as a **preference modeling and alignment problem**, using community feedback as the source of preference signals.

## Methodology
The proposed **Reinforcement Learning from Community Feedback (RLCF)** paradigm consists of three stages (see Figure 2):

1.  **Construct Community Preference:** Build **SciJudgeBench**, a dataset of 700K pairwise comparisons. For each pair, two paper abstracts from the same field and publication year are matched, with the higher-cited paper labeled as preferred. This controls for field and time biases in raw citation counts.

2.  **Preference Modeling (Train *Scientific Judge*):** Train a generative reward model to predict the preferred paper in a pair. Training uses **Group Relative Policy Optimization (GRPO)**. For an input $x$ (paper pair), the policy $\pi_\theta$ samples a group of $G$ outputs $\{o_i\}_{i=1}^G$ (each containing reasoning and a prediction). The reward $r_i$ is 1 if the prediction matches the true label $y$, else 0.
    $$
    r_i = \begin{cases} 1, & \text{if } \hat{y}(o_i) = y \\ 0, & \text{otherwise} \end{cases}
    $$
    The policy is updated to maximize a clipped surrogate objective with a KL penalty:
    $$
    \mathcal{J}(\theta) = \mathbb{E}_x \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( \rho_i \hat{A}_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) \hat{A}_i \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]
    $$
    where $\rho_i = \pi_\theta(o_i|x) / \pi_{\text{old}}(o_i|x)$, $\hat{A}_i$ is the normalized advantage, and $\beta$ controls KL penalty strength.

3.  **Preference Alignment (Train *Scientific Thinker*):** Use the trained *Scientific Judge* as a reward model to train a policy model (*Scientific Thinker*) to generate high-impact research ideas. Because scoring a single idea in isolation is difficult, the authors employ **Comparison-Based GRPO**. For a seed paper prompt $x$, the policy samples $G$ candidate ideas $\{o_1, \ldots, o_G\}$. *Scientific Judge* then compares the candidates pairwise in a round-robin. The reward for idea $o_i$ is its win rate within the group:
    $$
    r_i = \frac{1}{G-1} \sum_{j \neq i} s(o_i, o_j)
    $$
    where $s(o_i, o_j)=1$ if $o_i$ wins against $o_j$ and $0$ otherwise. The policy is then updated using the same GRPO objective as in stage 2 (Eq. 6 in the paper).
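
The reward and update terms in stages 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names and the toy judge are hypothetical, the mean/standard-deviation normalization is the common GRPO convention and is assumed here because the summary only says "normalized advantage", and the clip value `0.2` is a placeholder. The KL penalty is omitted.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def win_rate_rewards(
    ideas: Sequence[str], judge_prefers: Callable[[str, str], bool]
) -> List[float]:
    """r_i = (1 / (G - 1)) * sum over j != i of s(o_i, o_j)."""
    g = len(ideas)
    if g < 2:
        raise ValueError("need at least two candidates for a round-robin")
    rewards = []
    for i in range(g):
        # Queries the judge once per ordered pair (i, j); a real system
        # might instead query each unordered pair once.
        wins = sum(
            1 for j in range(g) if j != i and judge_prefers(ideas[i], ideas[j])
        )
        rewards.append(wins / (g - 1))
    return rewards


def group_normalized_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Assumed normalization: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def longer_wins(idea_a: str, idea_b: str) -> bool:
    """Toy stand-in for Scientific Judge (assumption): longer text wins."""
    return len(idea_a) > len(idea_b)


candidates = ["aaaa", "aa", "a", "aaa"]
rewards = win_rate_rewards(candidates, longer_wins)
advantages = group_normalized_advantages(rewards)
print(rewards)
print(advantages)
```

Because the win rate is computed within the sampled group, the reward is relative by construction: an idea scores well only if it beats its siblings from the same seed paper.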

## Empirical Validation / Results

### AI Can Learn Scientific Judgement (*Scientific Judge*)
**Setup:** Models are trained on **SciJudgeBench** (696,758 pairs from arXiv through 2024). Evaluated on: 1) **In-domain** test set, 2) **Temporal OOD** (2025 papers), 3) **Metric OOD** (ICLR papers with peer-review scores), and 4) **Field OOD** (train on CS only, test on other fields).
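
The field- and year-matched pairing behind SciJudgeBench might look roughly like the sketch below. The record schema (`id`, `field`, `year`, `citations`), exhaustive pairing within each bucket, skipping ties, and randomizing the A/B order are all assumptions made for illustration; the summary does not describe how the authors sample pairs.

```python
import random
from collections import defaultdict
from itertools import combinations
from typing import Dict, List


def build_matched_pairs(papers: List[Dict], seed: int = 0) -> List[Dict]:
    """Pair papers that share (field, year) and label the higher-cited one."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for paper in papers:
        buckets[(paper["field"], paper["year"])].append(paper)

    pairs = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            if a["citations"] == b["citations"]:
                continue  # assumption: ties carry no preference signal
            if rng.random() < 0.5:
                a, b = b, a  # shuffle order so position does not leak the label
            preferred = "A" if a["citations"] > b["citations"] else "B"
            pairs.append({"a": a["id"], "b": b["id"], "preferred": preferred})
    return pairs


# Toy records, invented for this example only.
toy_papers = [
    {"id": "p1", "field": "cs", "year": 2023, "citations": 10},
    {"id": "p2", "field": "cs", "year": 2023, "citations": 50},
    {"id": "p3", "field": "cs", "year": 2023, "citations": 50},
    {"id": "p4", "field": "math", "year": 2023, "citations": 7},
    {"id": "p5", "field": "cs", "year": 2024, "citations": 99},
]
toy_pairs = build_matched_pairs(toy_papers)
print(toy_pairs)
```

Bucketing before pairing is what keeps a 2024 paper from being compared with a 2023 one, or a math paper with a CS one, so raw citation counts are only ever compared within the same field and year.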

**Key Results:**

1.  **Scaling Trends:** Performance improves log-linearly with more training data and with larger model size.
    > *"SciJudge-Qwen3-30B surpasses all listed proprietary baselines."*

2.  **In-Domain Performance:** *Scientific Judge* models substantially outperform their base models and SOTA LLMs.

    **Table 3: Main results on SciJudgeBench (in-domain test set). Pairwise accuracy (%).**
    | Model | CS | Math | Physics | Others | **Avg.** |
    | :--- | :--- | :--- | :--- | :--- | :--- |
    | **Qwen3-4B-Instruct** | 66.5 | 65.6 | 54.8 | 57.1 | **60.3** |
    | **SciJudge-Qwen3-4B** | **78.6 (+12.1)** | **74.6 (+9.0)** | **71.2 (+16.4)** | **79.8 (+22.7)** | **75.3 (+15.0)** |
    | **Qwen3-30B-A3B-Instruct** | 73.8 | 70.5 | 59.4 | 65.5 | **66.3** |
    | **SciJudge-Qwen3-30B** | **83.5 (+9.7)** | **78.7 (+8.2)** | **78.7 (+19.2)** | **82.3 (+16.8)** | **80.6 (+14.3)** |
    | **Gemini-3.0-Pro-Preview** | 81.1 | 73.0 | 72.6 | 76.5 | **75.7** |
    | **GPT-5.2-Thinking** | 79.1 | 68.8 | 69.4 | 73.1 | **72.7** |

3.  **Generalization:**
    *   **Temporal:** Maintains strong performance on future (2025) papers (Table 4).
    *   **Field:** Models trained only on CS data generalize well to Math, Physics, and other fields (Table 5).
    *   **Metric:** Transfers effectively from citation-based to peer-review-based preferences on ICLR papers (Table 6).

    **Table 6: Metric OOD results on ICLR papers (peer-review scores).**
    | Model | Accuracy |
    | :--- | :--- |
    | Qwen3-4B-Instruct | 65.3 |
    | **SciJudge-Qwen3-4B** | **79.1 (+13.8)** |
    | Qwen3-30B-A3B-Instruct | 76.8 |
    | **SciJudge-Qwen3-30B** | **87.7 (+11.0)** |

### AI Can Learn Ideation with High Potential Impact (*Scientific Thinker*)
**Setup:** Train *Scientific Thinker* (4B and 30B parameters) using *Scientific Judge* as reward model. Evaluate by having strong LLM judges (GPT-5.2, GLM-5, Gemini 3 Pro) perform pairwise comparisons between ideas from the trained policy and a baseline, using majority vote.
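
The majority-vote evaluation described in the setup can be sketched as follows. The vote labels `"policy"` and `"baseline"` and the function names are hypothetical, and the toy votes are invented for illustration rather than taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[str]) -> str:
    """Return the label chosen by most judges.

    With three judges and two labels there is always a strict majority.
    """
    winner, _count = Counter(votes).most_common(1)[0]
    return winner


def win_rate(all_votes: List[Sequence[str]], candidate: str = "policy") -> float:
    """Percentage of comparisons in which `candidate` wins the majority vote."""
    wins = sum(1 for votes in all_votes if majority_vote(votes) == candidate)
    return 100.0 * wins / len(all_votes)


# One inner list per seed paper, one vote per LLM judge (toy data).
toy_votes = [
    ["policy", "policy", "baseline"],
    ["baseline", "baseline", "policy"],
    ["policy", "policy", "policy"],
    ["policy", "baseline", "policy"],
]
print(win_rate(toy_votes))
```

Under this reading, a win rate above 50% means the trained policy's idea was preferred by the judge panel more often than the baseline's.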

**Key Results:**

1.  **Improved Ideation:** *Scientific Thinker* significantly outperforms its base policy.
    *   **SciThinker-30B** vs. base: **81.5%** in-domain win rate, **83.0%** out-of-domain.
    *   **SciThinker-4B** vs. base: **76.5%** in-domain win rate, **76.0%** out-of-domain.

2.  **Effective Reward Model:** *Scientific Judge* is a more effective reward model than the base LLM (Qwen3-4B-Instruct), leading to higher win rates for the trained policy (Figure 4).

3.  **Competitive with SOTA:** After training, *SciThinker-30B* achieves an average in-domain win rate of **54.2%** against three SOTA models (Table 8). It wins the majority of head-to-head comparisons against GPT-5.2 (61.0%) and GLM-5 (58.5%) but not against Gemini 3 Pro (43.0%).

    **Table 8 (a): In-Domain Win Rates (%) against SOTA models.**
    | Model | GPT-5.2 | GLM-5 | Gemini 3 Pro | **Avg.** |
    | :--- | :--- | :--- | :--- | :--- |
    | Qwen3-30B (Base) | 37.5 | 33.0 | 20.5 | **30.3** |
    | **SciThinker-30B** | **61.0 (+23.5)** | **58.5 (+25.5)** | **43.0 (+22.5)** | **54.2 (+23.9)** |
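
The averages and gains in Table 8 (a) can be reproduced directly from the per-opponent win rates. The short check below uses only the numbers in the table; the variable and function names are chosen for this example.

```python
from typing import Dict, List

# Win rates (%) copied from Table 8 (a).
TABLE_8A: Dict[str, Dict[str, float]] = {
    "Qwen3-30B (Base)": {"GPT-5.2": 37.5, "GLM-5": 33.0, "Gemini 3 Pro": 20.5},
    "SciThinker-30B": {"GPT-5.2": 61.0, "GLM-5": 58.5, "Gemini 3 Pro": 43.0},
}


def row_average(row: Dict[str, float]) -> float:
    """Unweighted mean across opponents, rounded to one decimal place."""
    return round(sum(row.values()) / len(row), 1)


def opponents_beaten(row: Dict[str, float], threshold: float = 50.0) -> List[str]:
    """Opponents against which the model wins more than half of comparisons."""
    return [name for name, rate in row.items() if rate > threshold]


base = TABLE_8A["Qwen3-30B (Base)"]
trained = TABLE_8A["SciThinker-30B"]
gains = {name: round(trained[name] - base[name], 1) for name in trained}

print(row_average(base), row_average(trained))  # 30.3 54.2
print(gains)
print(opponents_beaten(trained))  # ['GPT-5.2', 'GLM-5']
```

The 54.2% average sits above 50% even though the win rate against Gemini 3 Pro alone is below it, so the average and the per-opponent results are best read together.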

## Theoretical and Practical Implications
**Theoretical Implications:**
*   Provides a formal, learnable definition of "scientific taste" grounded in community feedback.
*   Demonstrates that community-level preferences, as captured by citations, can be effectively modeled and aligned with, bridging RLHF (human preferences) and RLVR (verifiable rewards) paradigms through the novel **RLCF** framework.
*   Shows that scientific judgement and ideation capabilities can be decoupled and improved separately via scalable machine learning techniques.

**Practical Implications:**
*   **AI-Assisted Research:** *Scientific Judge* can help rank new papers before they accumulate citations, aiding in literature review and grant allocation. *Scientific Thinker* can serve as a brainstorming assistant for generating promising research directions.
*   **Scalable Evaluation:** The RLCF paradigm offers a blueprint for using other large-scale, naturally occurring community signals (e.g., downloads, social media mentions) to train AI systems for other open-ended judgement tasks.
*   **Towards AI Scientists:** Represents a concrete step towards building AI systems that possess not just execution capabilities but also the strategic foresight characteristic of human experts.

## Conclusion
This work demonstrates that **AI can learn scientific taste**—the ability to judge and propose high-impact research ideas—from large-scale community feedback (citations). The proposed **Reinforcement Learning from Community Feedback (RLCF)** paradigm successfully trains:
1.  **Scientific Judge:** A model for scientific judgement that outperforms SOTA LLMs and generalizes robustly.
2.  **Scientific Thinker:** A model for scientific ideation that generates ideas with higher potential impact.

The results show that scientific taste is a learnable objective, moving beyond subjective preference to community-validated patterns. This marks a significant advancement toward developing AI scientists with human-like strategic judgement.

**Limitations & Future Work:** Noted limitations include the imperfect nature of citations as a feedback signal, the need for more granular field categorization, the reliance on LLM judges for evaluation, and the use of only titles and abstracts as input. Future work could explore broader aspects of scientific taste, model citation dynamics, implement generated ideas, and incorporate richer paper context.

---

_Markdown view of https://picx.dev/p/lH0SlD, served by PicX — AI-generated visual whiteboard summaries of research papers._