# RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

> RubricEM trains research agents by using rubrics to structure policy execution, judge feedback, and memory, achieving state-of-the-art performance among comparable open models on long-form research benchmarks.

- **Source:** [arXiv](https://arxiv.org/abs/2605.10899)
- **Published:** 2026-05-14
- **Permalink:** https://picx.dev/p/UZqgjW
- **Whiteboard:** https://picx.dev/p/UZqgjW/image

## Summary

## Summary (Overview)
*   **Core Contribution:** Introduces **RubricEM**, a reinforcement learning framework for training deep research agents that produce long-form reports in domains lacking verifiable ground-truth rewards. It uses rubrics as a shared interface for structuring policy execution, judge feedback, and agent memory.
*   **Key Methodology:** Combines three main components:
    1.  **Rubric-guided Structured Scaffold:** Imposes explicit stage structure (Plan → Research → Review → Answer) on agent trajectories, with self-generated rubrics guiding decisions.
    2.  **Stage-Structured GRPO (SS-GRPO):** Provides denser credit assignment by scoring each stage separately with stage-specific, evolving judge rubrics, rather than broadcasting a single terminal reward.
    3.  **Reflection Meta-Policy Training:** Jointly trains a shared-backbone meta-policy to distill judged trajectories into reusable, rubric-grounded reflections stored in a "rubric bank" for future task attempts.
*   **Empirical Results:** The resulting **RubricEM-8B** model achieves state-of-the-art performance among comparable open models on four long-form research benchmarks (average score 55.5), outperforms prior RL systems with fewer training steps, and approaches proprietary systems.
*   **Theoretical Underpinning:** Formal analyses show the value of explicit stage information for policy value, the conditions under which stagewise credit assignment improves gradient approximation, and how judge-gated reflection training can co-evolve with the task policy.
*   **Efficient Training:** An asynchronous pipeline design allows meta-policy training to run concurrently with task-policy rollouts, avoiding the sequential bottlenecks common in prior meta-RL work.

## Introduction and Theoretical Foundation
Training deep research agents—systems that autonomously plan, search, evaluate evidence, and synthesize long-form reports—pushes reinforcement learning (RL) beyond the regime of **verifiable rewards**. The outputs lack ground-truth answers, trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for converting past attempts into reusable experience.

The central question addressed is: **How can RL train deep research agents beyond verifiable rewards, while enabling long-horizon credit assignment and learning from experience?**

**Theoretical Foundation:** The paper argues that **rubrics** (evaluation criteria) should serve not merely as final-answer evaluators, but as the **shared interface** that structures policy execution, judge feedback, and agent memory. This view is inspired by an **Expectation–Maximization (EM)** perspective: the latent structure of an open-ended task (what matters, where credit belongs, what should be remembered) is *estimated* through rubrics, which then *condition* policy reasoning, judge scoring, and memory evolution.

**Formal Value of Stage Information (Theorem 1):** A key theoretical insight formalizes the benefit of explicit stage decomposition. Let $h$ denote a random decision point, $c = \phi(h)$ a compressed state representation, $z$ the current stage label, and $U(h, a)$ the expected downstream value of action $a$. Define the value functions for a flat policy and a stage-aware policy:
$$V_{\text{flat}} := \mathbb{E}\left[\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) | c]\right], \quad V_{\text{stage}} := \mathbb{E}\left[\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) | c, z]\right].$$
If there exists a context set $\mathcal{C}_0$ with positive probability and two distinct task-relevant stages $z \neq z'$ such that for every $c \in \mathcal{C}_0$, $p(z|c) > 0$, $p(z'|c) > 0$, and $\arg\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) | c, z] \cap \arg\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) | c, z'] = \emptyset$, then **$V_{\text{stage}} > V_{\text{flat}}$**. In other words, explicit stage information strictly improves policy value whenever two stages that share the same compressed context have no optimal action in common, because a flat policy must commit to a single action that is suboptimal for at least one of them.
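
The condition in Theorem 1 can be checked on a toy instance. The sketch below is illustrative only: the single shared context, the two stage labels, the two actions, and every number in `U` are assumptions chosen so that the stage-optimal actions disagree; none of them come from the paper.

```python
# Toy instance of Theorem 1. All values are illustrative assumptions.
ACTIONS = ["search", "write"]

# p(z | c) for one compressed context c that two stages share.
STAGE_PROBS = {"research": 0.5, "answer": 0.5}

# U[z][a]: expected downstream value of action a when the true stage is z.
U = {
    "research": {"search": 1.0, "write": 0.0},
    "answer": {"search": 0.0, "write": 1.0},
}


def flat_value():
    """max_a E[U | c]: one action is chosen without observing the stage."""
    return max(
        sum(STAGE_PROBS[z] * U[z][a] for z in STAGE_PROBS) for a in ACTIONS
    )


def stage_value():
    """E_z[max_a E[U | c, z]]: the best action is chosen separately per stage."""
    return sum(
        STAGE_PROBS[z] * max(U[z][a] for a in ACTIONS) for z in STAGE_PROBS
    )


v_flat = flat_value()
v_stage = stage_value()
print(f"V_flat={v_flat:.2f}  V_stage={v_stage:.2f}")
```

Here the optimal action sets are `{search}` and `{write}`, which are disjoint, so the stage-aware value is strictly larger than the flat value.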

## Methodology
RubricEM consists of three integrated components.

### 1. Structured Reasoning Scaffold
The agent's trajectory is decomposed into four rubric-guided stages, marked by XML tags:
1.  **Plan:** Within `<structured_plan>`, the agent performs `<deep_analysis>`, generates prospective `<rubrics>` (knowledge checklist, analytical criteria, negative constraints), and creates a `<research_plan>`.
2.  **Research:** An iterative loop of `<call_tool>` actions and `<state_evaluation>`, comparing evidence against the rubrics and plan, deciding whether to continue search or proceed.
3.  **Review:** Within `<review>`, the agent performs `<rubric_review>` to map evidence back to the rubrics and creates a `<writing_plan>` for the final answer.
4.  **Answer:** Synthesizes the final long-form response within `<answer>` tags, grounded with citations.

This scaffold is instilled into the base model (Qwen3-8B) via **teacher-student distillation** from Gemini-3.1-Pro, with aggressive rejection sampling to ensure structural compliance.
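
A structural-compliance filter for rejection sampling could look roughly like the following. This is a minimal sketch, not the paper's filter: the tag names are taken from the stage list above, while `find_block`, `is_structurally_compliant`, and the exact acceptance rules (stage order, required nested tags, at least one tool call and one state evaluation between Plan and Review) are hypothetical.

```python
import re

# Nested tags required inside each stage block, per the stage description.
PLAN_CHILDREN = ["deep_analysis", "rubrics", "research_plan"]
REVIEW_CHILDREN = ["rubric_review", "writing_plan"]


def find_block(text, tag):
    """Return the regex match for the first <tag>...</tag> block, or None."""
    return re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)


def is_structurally_compliant(trajectory):
    """Hypothetical check: Plan -> Research -> Review -> Answer, with nested tags."""
    plan = find_block(trajectory, "structured_plan")
    review = find_block(trajectory, "review")
    answer = find_block(trajectory, "answer")
    if not (plan and review and answer):
        return False
    # Stage blocks must appear in order and must not overlap.
    if not (plan.end() <= review.start() and review.end() <= answer.start()):
        return False
    for block, children in ((plan, PLAN_CHILDREN), (review, REVIEW_CHILDREN)):
        if any(find_block(block.group(1), child) is None for child in children):
            return False
    # The Research stage sits between the Plan and Review blocks.
    research = trajectory[plan.end():review.start()]
    return (
        find_block(research, "call_tool") is not None
        and find_block(research, "state_evaluation") is not None
    )


good = (
    "<structured_plan><deep_analysis>a</deep_analysis><rubrics>r</rubrics>"
    "<research_plan>p</research_plan></structured_plan>"
    "<call_tool>query</call_tool><state_evaluation>e</state_evaluation>"
    "<review><rubric_review>rr</rubric_review>"
    "<writing_plan>w</writing_plan></review>"
    "<answer>final report</answer>"
)
bad = good.replace("<rubrics>r</rubrics>", "")  # Plan stage lacks rubrics

# Rejection sampling keeps only the compliant teacher samples.
kept = [t for t in (good, bad) if is_structurally_compliant(t)]
```

A filter of this kind would only verify structure. Any content-quality filtering applied to the teacher outputs would be a separate step.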

### 2. Stage-Structured GRPO (SS-GRPO)
Building on the scaffold, SS-GRPO provides finer-grained credit assignment. For a query $q$, $n$ rollouts $\{\tau_i\}_{i=1}^n \sim \pi_\theta(\cdot|q)$ are sampled and partitioned into $K=4$ stages. Let $B_{i,k}$ be the tokens in stage $k$ of rollout $\tau_i$, and $R_{i,k} \in [0,1]$ be the LLM-judge score under the corresponding stage rubric.

*   **Stagewise Returns:** Rather than assign the same final score to all tokens, SS-GRPO uses a **causal stage-dependence matrix** $\Lambda = (\lambda_{k,j})$, with $\lambda_{k,j}=0$ for $j < k$ and $\lambda_{k,k}=1$, and defines the return for stage $k$ as:
    $$G^{\Lambda}_{i,k} = \sum_{j=k}^{K} \lambda_{k,j} R_{i,j}.$$
    Each stage keeps its own score while receiving credit from downstream stages it enables.
*   **Stagewise Evolving-Rubric Judge:** The judge maintains a separate **rubric buffer** for each stage (Plan, Research, Review, Answer). It contrasts multiple rollouts for the same query to propose new, discriminative rubrics for each stage, reuses high-discrimination rubrics, and removes items that no longer separate trajectory quality. The judge can reference the agent's self-generated rubrics but scores against its own buffer.
*   **Stagewise Normalization and Objective:** Advantages are computed by normalizing returns separately within each stage across the rollout group:
    $$A_{i,k} = \frac{G^{\Lambda}_{i,k} - \frac{1}{n}\sum_{i'=1}^n G^{\Lambda}_{i',k}}{\text{Std}_{i'}[G^{\Lambda}_{i',k}] + \epsilon}.$$
    All tokens in the same stage block $B_{i,k}$ share the advantage $A_{i,k}$. The SS-GRPO objective is:
    $$\mathcal{L}_{\text{SS-GRPO}} = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K \sum_{t \in B_{i,k}} \min\left( \rho_{i,t} A_{i,k}, \text{clip}(\rho_{i,t}, 1-\eta, 1+\eta) A_{i,k} \right) + \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}),$$
    where $\rho_{i,t} = \pi_\theta(a_{i,t}|h_{i,t}) / \pi_{\theta_{\text{old}}}(a_{i,t}|h_{i,t})$.
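
The stagewise return, group normalization, and clipped per-token term can be sketched in plain Python. The sketch rests on several assumptions not stated in the summary: the off-diagonal entries of $\Lambda$ are set to an illustrative 0.5, the standard deviation is the population one, $\eta = 0.2$ and $\epsilon = 10^{-6}$ are placeholder values, and the judge scores are made up.

```python
from statistics import mean, pstdev

K = 4  # stages: Plan, Research, Review, Answer

# Assumed causal stage-dependence matrix: 1 on the diagonal, 0 below it,
# and an illustrative downstream weight of 0.5 above it.
LAMBDA = [
    [1.0 if j == k else (0.5 if j > k else 0.0) for j in range(K)]
    for k in range(K)
]


def stage_returns(stage_scores, lam=LAMBDA):
    """G_k = sum over j >= k of lam[k][j] * R_j for one rollout."""
    return [
        sum(lam[k][j] * stage_scores[j] for j in range(k, K)) for k in range(K)
    ]


def stage_advantages(group_returns, eps=1e-6):
    """Normalize returns within each stage across the rollout group."""
    advantages = [[0.0] * K for _ in group_returns]
    for k in range(K):
        column = [returns[k] for returns in group_returns]
        mu, sigma = mean(column), pstdev(column)  # population std (assumed)
        for i, returns in enumerate(group_returns):
            advantages[i][k] = (returns[k] - mu) / (sigma + eps)
    return advantages


def clipped_token_term(ratio, advantage, eta=0.2):
    """min(rho * A, clip(rho, 1 - eta, 1 + eta) * A) for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - eta), 1.0 + eta)
    return min(ratio * advantage, clipped_ratio * advantage)


# Made-up judge scores R[i][k] for two rollouts of the same query.
judge_scores = [
    [0.8, 0.6, 0.4, 0.2],  # strong plan, weak answer
    [0.2, 0.4, 0.6, 0.8],  # weak plan, strong answer
]
group_returns = [stage_returns(scores) for scores in judge_scores]
advantages = stage_advantages(group_returns)
```

In this example the first rollout receives a positive advantage on its Plan tokens and negative advantages on the later stages. A terminal-only reward would instead assign every token of that rollout the same negative signal.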

**Theorem 3 (Judge-Aligned Stage-Weighted Credit):** The benefit of stage returns depends on a trade-off: intermediate judging recovers process information omitted by terminal-only rewards but introduces judge noise. Stage-weighted credit improves the gradient approximation when the recovered intermediate signal outweighs the cumulative judge misalignment.

### 3. Meta-Policy Training with Reinforcement Learning
A **shared backbone** serves as both the task policy and a **reflection meta-policy**.
*   **Joint Training:** After task-policy rollouts are judged, a query-trajectory pair is sampled. The backbone generates multiple **reflection candidates** conditioned on the fixed trajectory. A privileged LLM judge scores each candidate based on its usefulness for **within-episode refinement** (same query) and **cross-episode transfer** (related queries). These scores provide auxiliary RL rewards for updating the shared parameters.
*   **Rubric Bank:** The highest-scored accepted reflection is written into an **agent rubric bank** as natural-language memory. The bank supports two adaptation modes during training via a **windowed curriculum**:
    *   **Cross-episode transfer:** Retrieves reflections from related past questions for a new query.
    *   **Within-episode refinement:** Retrieves the query's own prior reflection on a repeated attempt.
*   **Efficient Asynchronous Execution:** A synchronous implementation would block the next task rollout. RubricEM uses a **one-step deferred** design: during step $N$, the inference engine runs task rollouts while the training engine updates the meta-policy using the reflection batch prepared from step $N-1$. Reflection generation and judging for step $N$ run asynchronously to prepare the batch for step $N+1$. This adds effectively no extra wall-clock overhead.
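
The one-step deferred schedule can be illustrated with a small simulation. This is a sketch under stated assumptions: a single background thread stands in for asynchronous reflection generation and judging, the three helper functions are hypothetical placeholders that return labels instead of doing real work, and the actual system uses separate inference and training engines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task_rollouts(step):
    """Placeholder for the inference engine's task rollouts at this step."""
    return f"rollouts[{step}]"


def prepare_reflection_batch(step, rollouts):
    """Placeholder for reflection generation plus judging of one step's rollouts."""
    return f"reflections[{step}]"


def update_meta_policy(step, batch, log):
    """Placeholder for the meta-policy update; records which batch was used."""
    log.append((step, batch))


def train(num_steps):
    log = []
    pending = None  # future for the reflection batch of the previous step
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(num_steps):
            # Rollouts for this step run while the previous step's
            # reflection batch is still being prepared in the background.
            rollouts = run_task_rollouts(step)
            if pending is not None:
                # The meta-policy update at step N consumes the batch from N-1.
                update_meta_policy(step, pending.result(), log)
            # Start preparing this step's batch for use at step N+1.
            pending = pool.submit(prepare_reflection_batch, step, rollouts)
    return log


schedule = train(3)
print(schedule)  # [(1, 'reflections[0]'), (2, 'reflections[1]')]
```

The first step performs no meta-policy update because no reflection batch exists yet, which is the price of deferring by one step.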

**Theorem 5 (Judge-Gated Co-Evolution):** Under a judge-gated local positive-transfer condition, a task-improving policy update also improves the reflection utility, and a reflection-improving update also improves the task performance. The joint update yields a strictly larger gain in task value than task-only training.

## Empirical Validation / Results

### Main Results on Long-Form Benchmarks
RubricEM-8B was evaluated on four representative long-form benchmarks: HealthBench, ResearchQA, DeepResearchBench (DRB), and ResearchRubrics.

**Table 1: Performance Comparison on Long-Form Benchmarks**
| Model | HealthBench | ResearchQA | DRB | ResearchRubrics | **Average** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Closed Deep Research** | | | | | |
| Gemini 3.1 Pro + Search | 47.5 | 74.5 | 44.4 | 49.1 | 53.9 |
| GPT-5 + Search | 59.5 | 78.2 | 50.7 | 60.5 | 62.2 |
| OpenAI Deep Research | 53.8 | 79.2 | 46.9 | 59.7 | 59.9 |
| **Open Deep Research Models** | | | | | |
| WebExplorer-8B | 33.7 | 64.8 | 36.7 | 33.4 | 42.2 |
| Tongyi DeepResearch-30B-A3B | 46.2 | 66.7 | 40.6 | 49.5 | 50.8 |
| DR Tulu-8B (SFT) | 38.1 | 68.5 | 39.0 | 38.4 | 46.0 |
| DR Tulu-8B (RL, 1900 steps) | **50.2** | 74.3 | 43.4 | 46.4 | 53.6 |
| **Ours** | | | | | |
| RubricEM-8B (SFT) | 39.0 | 71.8 | 43.0 | 42.8 | 49.2 |
| RubricEM-8B (RL, 1400 steps) | 49.3 | **74.5** | **47.8** | **50.3** | **55.5** |

*   **RubricEM-8B-RL** achieves the highest average score (**55.5**) among non-proprietary systems.
*   It surpasses strong baselines like DR Tulu-8B-RL (53.6) and Tongyi DeepResearch-30B-A3B (50.8).
*   It approaches proprietary systems, outperforming Perplexity Deep Research on average (not listed in Table 1) and remaining within 4.4 points of OpenAI Deep Research on average (55.5 vs. 59.9) while outperforming it on DRB (47.8 vs. 46.9).
*   The RL recipe is **effective and efficient**: Starting from a structured SFT checkpoint (avg. 49.2), RL improves to 55.5 in **1400 steps**, fewer than DR Tulu's 1900 steps.
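
The averages and gaps quoted above can be recomputed from the rows of Table 1. The snippet below uses `Decimal` to avoid floating-point rounding surprises; rounding half up to one decimal place is an assumption about how the table was rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from Table 1: HealthBench, ResearchQA, DRB, ResearchRubrics.
TABLE_1 = {
    "OpenAI Deep Research": ["53.8", "79.2", "46.9", "59.7"],
    "DR Tulu-8B (RL)": ["50.2", "74.3", "43.4", "46.4"],
    "RubricEM-8B (SFT)": ["39.0", "71.8", "43.0", "42.8"],
    "RubricEM-8B (RL)": ["49.3", "74.5", "47.8", "50.3"],
}


def average(scores):
    """Mean of the four benchmark scores, rounded half up to one decimal."""
    total = sum(Decimal(score) for score in scores)
    return (total / len(scores)).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )


averages = {model: average(scores) for model, scores in TABLE_1.items()}

gap_to_openai = averages["OpenAI Deep Research"] - averages["RubricEM-8B (RL)"]
rl_gain = averages["RubricEM-8B (RL)"] - averages["RubricEM-8B (SFT)"]
lead_over_dr_tulu = averages["RubricEM-8B (RL)"] - averages["DR Tulu-8B (RL)"]

print(averages["RubricEM-8B (RL)"], gap_to_openai, rl_gain, lead_over_dr_tulu)
# 55.5 4.4 6.3 1.9
```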

### Ablation Studies and Analysis
**Ablation of RL Components (600-step budget):** Figure 5 compares four configurations under a matched 600-step budget:
*   **Baseline-RL** (standard answer-only GRPO)
*   **SS-GRPO** (adds stagewise rubric credit)
*   **Meta-Policy** (adds reflection training & rubric-bank retrieval)
*   **RubricEM (Full)** (combines SS-GRPO and Meta-Policy)

Each proposed component contributes to performance gains, and the full recipe performs best across benchmarks, showing that stagewise credit assignment and reusable-experience learning provide **complementary gains**.

**Structured Scaffolding and Inference-Time Reuse:** Figure 6 shows:
*   The rubric-guided scaffold improves both SFT distillation quality and subsequent RL gains.
*   Isolating the prompt-level effect, Gemini-3.1-Pro with the scaffold outperforms the same model with a standard ReAct prompt on DRB.
*   The learned meta-policy enables beneficial **cross-episode transfer** and **within-episode refinement** at inference time, whereas Baseline-RL does not benefit from the same reuse.

### Short-Form Benchmark Performance (Out-of-Domain Transfer)
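Despite being trained primarily on long-form data, RubricEM generalizes to short-form search benchmarks: RubricEM-8B (RL) averages 73.5 across the four benchmarks in Table 2, compared with 49.0 for DR Tulu-8B (RL). This suggests that it learns transferable tool-use and evidence-grounding skills.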

**Table 2: Short-Form Model Performance**
| Model | SimpleQA | 2Wiki | WebWalker | DSQA | **Avg.** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| DR Tulu-8B (SFT) | 75.5 | 66.5 | 31.9 | 5.3 | 44.8 |
| DR Tulu-8B (RL, 1900 steps) | 80.1 | 68.0 | 39.1 | 8.3 | 49.0 |
| RubricEM-8B (SFT) | 92.1 | 77.5 | 64.7 | 37.0 | 67.8 |
| RubricEM-8B (RL, 1400 steps) | **92.3** | **78.8** | **70.0** | **53.0** | **73.5** |

## Theoretical and Practical Implications
*   **Theoretical Implications:** Provides formal justification for explicit stage decomposition, stagewise credit assignment with imperfect judges, and the co-evolution of task and meta-policies via parameter sharing.
*   **Practical Implications:** Offers a concrete and effective recipe for RL in open-ended, long-horizon tasks beyond verifiable rewards: **expose task structure** (via rubric-guided scaffold), **assign credit to that structure** (via SS-GRPO), and **convert judged attempts into reusable experience** (via reflection meta-policy).
*   **Broader Impact:** Suggests rubrics should be a central, shared interface throughout the RL loop, not just a final evaluator. The meta-policy training approach makes experience reuse an explicit RL objective rather than an inference-time trick.

## Conclusion
RubricEM combines rubric-guided policy decomposition, stage-structured credit assignment, and reflection-based meta-policy training to train effective deep research agents beyond verifiable rewards. The resulting RubricEM-8B model demonstrates strong performance, efficiency, and generalization. The work supports a broader recipe for long-horizon RL in open-ended domains and opens directions for improving judge quality, scaling the approach, and applying it to other complex agentic tasks.

---

_Markdown view of https://picx.dev/p/UZqgjW, served by PicX — AI-generated visual whiteboard summaries of research papers._
