RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Summary (Overview)

  • Core Contribution: Introduces RubricEM, a reinforcement learning framework for training deep research agents that produce long-form reports in domains lacking verifiable ground-truth rewards. It uses rubrics as a shared interface for structuring policy execution, judge feedback, and agent memory.
  • Key Methodology: Combines three main components:
    1. Rubric-guided Structured Scaffold: Imposes explicit stage structure (Plan → Research → Review → Answer) on agent trajectories, with self-generated rubrics guiding decisions.
    2. Stage-Structured GRPO (SS-GRPO): Provides denser credit assignment by scoring each stage separately with stage-specific, evolving judge rubrics, rather than broadcasting a single terminal reward.
    3. Reflection Meta-Policy Training: Jointly trains a shared-backbone meta-policy to distill judged trajectories into reusable, rubric-grounded reflections stored in a "rubric bank" for future task attempts.
  • Empirical Results: The resulting RubricEM-8B model achieves state-of-the-art performance among comparable open models on four long-form research benchmarks (average score 55.5), outperforms prior RL systems with fewer training steps, and approaches proprietary systems.
  • Theoretical Underpinning: Formal analyses show the value of explicit stage information for policy value, the conditions under which stagewise credit assignment improves gradient approximation, and how judge-gated reflection training can co-evolve with the task policy.
  • Efficient Training: An asynchronous pipeline design allows meta-policy training to run concurrently with task-policy rollouts, avoiding the sequential bottlenecks common in prior meta-RL work.

Introduction and Theoretical Foundation

Training deep research agents—systems that autonomously plan, search, evaluate evidence, and synthesize long-form reports—pushes reinforcement learning (RL) beyond the regime of verifiable rewards. The outputs lack ground-truth answers, trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for converting past attempts into reusable experience.

The central question addressed is: How can RL train deep research agents beyond verifiable rewards, while enabling long-horizon credit assignment and learning from experience?

Theoretical Foundation: The paper argues that rubrics (evaluation criteria) should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. This view is inspired by an Expectation–Maximization (EM) perspective: the latent structure of an open-ended task (what matters, where credit belongs, what should be remembered) is estimated through rubrics, which then condition policy reasoning, judge scoring, and memory evolution.

Formal Value of Stage Information (Theorem 1): A key theoretical insight formalizes the benefit of explicit stage decomposition. Let $h$ denote a random decision point, $c = \phi(h)$ a compressed state representation, $z$ the current stage label, and $U(h, a)$ the expected downstream value of action $a$. Define the value functions for a flat policy and a stage-aware policy:

$$V_{\text{flat}} := \mathbb{E}\left[\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) \mid c]\right], \qquad V_{\text{stage}} := \mathbb{E}\left[\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) \mid c, z]\right].$$

If there exists a context set $\mathcal{C}_0$ with positive probability and two task-relevant stages $z \neq z'$ such that for every $c \in \mathcal{C}_0$, $p(z \mid c) > 0$, $p(z' \mid c) > 0$, and $\arg\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) \mid c, z] \cap \arg\max_{a \in \mathcal{A}} \mathbb{E}[U(h, a) \mid c, z'] = \emptyset$, then $V_{\text{stage}} > V_{\text{flat}}$. This shows that explicit stage information strictly improves policy value when the optimal actions for different stages disagree given the same compressed context.
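The mechanism behind Theorem 1 can be checked with a toy calculation. The sketch below uses made-up utilities $U$ and stage probabilities $p(z \mid c)$ (none of these numbers come from the paper) for a single compressed context with two stages whose argmax sets are disjoint:

```python
# Toy check of Theorem 1: two stages z in {0, 1}, two actions a in {0, 1},
# one compressed context c (so conditioning on c alone mixes the stages).
# U[z][a]: expected downstream value of action a at stage z (made-up numbers).
U = {0: {0: 1.0, 1: 0.0},   # in stage 0, action 0 is optimal
     1: {0: 0.0, 1: 1.0}}   # in stage 1, action 1 is optimal (argmax sets disjoint)
p_z = {0: 0.5, 1: 0.5}      # p(z | c): both stages occur under the same context

# Flat policy: choose the action that maximizes E[U | c], with z marginalized out.
V_flat = max(sum(p_z[z] * U[z][a] for z in p_z) for a in (0, 1))

# Stage-aware policy: choose the best action per stage, then average over stages.
V_stage = sum(p_z[z] * max(U[z][a] for a in (0, 1)) for z in p_z)

print(V_flat, V_stage)  # 0.5 1.0 -> V_stage > V_flat, as the theorem predicts
```

Because the flat policy must commit to one action for a context that mixes two stages with disjoint optimal actions, it forfeits value in at least one stage, which is exactly the strict-gap condition of the theorem.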

Methodology

RubricEM consists of three integrated components.

1. Structured Reasoning Scaffold

The agent's trajectory is decomposed into four rubric-guided stages, marked by XML tags:

  1. Plan: Within <structured_plan>, the agent performs <deep_analysis>, generates prospective <rubrics> (knowledge checklist, analytical criteria, negative constraints), and creates a <research_plan>.
  2. Research: An iterative loop of <call_tool> actions and <state_evaluation>, comparing evidence against the rubrics and plan, deciding whether to continue search or proceed.
  3. Review: Within <review>, the agent performs <rubric_review> to map evidence back to the rubrics and creates a <writing_plan> for the final answer.
  4. Answer: Synthesizes the final long-form response within <answer> tags, grounded with citations.

This scaffold is instilled into the base model (Qwen3-8B) via teacher-student distillation from Gemini-3.1-Pro, with aggressive rejection sampling to ensure structural compliance.
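The paper does not specify how structural compliance is checked during rejection sampling; a minimal sketch might verify that each distilled trajectory contains the four top-level stage blocks, properly closed and in scaffold order (the tag names come from the scaffold above, but this checking logic is an assumption):

```python
import re

# Required top-level stage tags, in scaffold order (Plan -> Research -> Review -> Answer).
STAGE_TAGS = ["structured_plan", "call_tool", "review", "answer"]

def is_structurally_compliant(trajectory: str) -> bool:
    """Hypothetical rejection-sampling filter: every stage tag must appear,
    be properly closed, and occur in scaffold order."""
    last_end = 0
    for tag in STAGE_TAGS:
        m = re.search(rf"<{tag}>.*?</{tag}>", trajectory[last_end:], re.DOTALL)
        if m is None:
            return False  # missing, unclosed, or out-of-order stage block -> reject
        last_end += m.end()
    return True

good = ("<structured_plan>...</structured_plan><call_tool>q</call_tool>"
        "<review>...</review><answer>report</answer>")
bad = "<answer>report</answer>"  # skips the Plan/Research/Review stages
print(is_structurally_compliant(good), is_structurally_compliant(bad))  # True False
```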

2. Stage-Structured GRPO (SS-GRPO)

Building on the scaffold, SS-GRPO provides finer-grained credit assignment. For a query $q$, $n$ rollouts $\{\tau_i\}_{i=1}^n \sim \pi_\theta(\cdot \mid q)$ are sampled and partitioned into $K = 4$ stages. Let $B_{i,k}$ be the tokens in stage $k$ of rollout $\tau_i$, and $R_{i,k} \in [0,1]$ be the LLM-judge score under the corresponding stage rubric.

  • Stagewise Returns: Rather than assign the same final score to all tokens, SS-GRPO uses a causal stage-dependence matrix $\Lambda = (\lambda_{k,j})$, with $\lambda_{k,j} = 0$ for $j < k$ and $\lambda_{k,k} = 1$, and defines the return for stage $k$ as $G^{\Lambda}_{i,k} = \sum_{j=k}^{K} \lambda_{k,j} R_{i,j}$. Each stage keeps its own score while receiving credit from the downstream stages it enables.
  • Stagewise Evolving-Rubric Judge: The judge maintains a separate rubric buffer for each stage (Plan, Research, Review, Answer). It contrasts multiple rollouts for the same query to propose new, discriminative rubrics for each stage, reuses high-discrimination rubrics, and removes items that no longer separate trajectory quality. The judge can reference the agent's self-generated rubrics but scores against its own buffer.
  • Stagewise Normalization and Objective: Advantages are computed by normalizing returns separately within each stage across the rollout group: $$A_{i,k} = \frac{G^{\Lambda}_{i,k} - \frac{1}{n}\sum_{i'=1}^n G^{\Lambda}_{i',k}}{\operatorname{Std}_{i'}[G^{\Lambda}_{i',k}] + \epsilon}.$$ All tokens in the same stage block $B_{i,k}$ share the advantage $A_{i,k}$. The SS-GRPO objective is $$\mathcal{L}_{\text{SS-GRPO}} = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K \sum_{t \in B_{i,k}} \min\left( \rho_{i,t} A_{i,k},\ \operatorname{clip}(\rho_{i,t}, 1-\eta, 1+\eta)\, A_{i,k} \right) + \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}),$$ where $\rho_{i,t} = \pi_\theta(a_{i,t} \mid h_{i,t}) / \pi_{\theta_{\text{old}}}(a_{i,t} \mid h_{i,t})$ is the token-level importance ratio.
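The stagewise return and normalization steps can be sketched in a few lines of NumPy. The $\Lambda$ matrix and judge scores below are made-up illustrative values (the paper's actual $\lambda_{k,j}$ weights are not given here); the matrix product and per-stage normalization mirror the formulas above:

```python
import numpy as np

def stagewise_advantages(R, Lam, eps=1e-6):
    """Compute SS-GRPO stage returns and advantages.
    R:   (n, K) judge scores R[i, k] in [0, 1] for stage k of rollout i.
    Lam: (K, K) causal stage-dependence matrix, Lam[k, j] = 0 for j < k
         and Lam[k, k] = 1, so stage k only collects credit from stages j >= k.
    Returns (G, A): stage returns G[i, k] = sum_j Lam[k, j] * R[i, j], and
    advantages normalized per stage across the rollout group."""
    G = R @ Lam.T                                    # G[i, k] = sum_j Lam[k, j] R[i, j]
    A = (G - G.mean(axis=0)) / (G.std(axis=0) + eps)
    return G, A

# Example: n = 2 rollouts, K = 3 stages, uniform downstream weights (assumed).
Lam = np.triu(np.ones((3, 3)))          # lambda_{k,j} = 1 for j >= k, else 0
R = np.array([[0.8, 0.6, 0.9],          # a strong rollout
              [0.2, 0.6, 0.3]])         # a weak rollout
G, A = stagewise_advantages(R, Lam)
# Every token in stage block B[i, k] would then share advantage A[i, k]
# inside the clipped PPO-style objective.
```

With these numbers, the strong rollout's Plan stage receives return $G_{0,0} = 0.8 + 0.6 + 0.9 = 2.3$ and a positive advantage, while the weak rollout's Plan stage receives a negative one, illustrating how credit flows backward from downstream stages.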

Theorem 3 (Judge-Aligned Stage-Weighted Credit): The benefit of stage returns depends on a trade-off: intermediate judging recovers process information omitted by terminal-only rewards but introduces judge noise. Stage-weighted credit improves the gradient approximation when the recovered intermediate signal outweighs the cumulative judge misalignment.

3. Meta-Policy Training with Reinforcement Learning

A shared backbone serves as both the task policy and a reflection meta-policy.

  • Joint Training: After task-policy rollouts are judged, a query-trajectory pair is sampled. The backbone generates multiple reflection candidates conditioned on the fixed trajectory. A privileged LLM judge scores each candidate based on its usefulness for within-episode refinement (same query) and cross-episode transfer (related queries). These scores provide auxiliary RL rewards for updating the shared parameters.
  • Rubric Bank: The highest-scored accepted reflection is written into an agent rubric bank as natural-language memory. The bank supports two adaptation modes during training via a windowed curriculum:
    • Cross-episode transfer: Retrieves reflections from related past questions for a new query.
    • Within-episode refinement: Retrieves the query's own prior reflection on a repeated attempt.
  • Efficient Asynchronous Execution: A synchronous implementation would block the next task rollout. RubricEM uses a one-step deferred design: during step $N$, the inference engine runs task rollouts while the training engine updates the meta-policy using the reflection batch prepared from step $N-1$. Reflection generation and judging for step $N$ run asynchronously to prepare the batch for step $N+1$. This adds effectively no extra wall-clock overhead.
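The one-step deferred schedule can be sketched as follows. The rollout, update, and reflection functions are hypothetical placeholders standing in for the real inference and training engines; only the scheduling pattern reflects the description above:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stand-ins for the real engines (hypothetical names).
def run_task_rollouts(step):            return f"rollouts[{step}]"
def update_meta_policy(batch):          return f"meta-update({batch})"
def prepare_reflection_batch(rollouts): return f"reflections({rollouts})"

def training_loop(num_steps):
    log = []
    pending_batch = None                # reflection batch prepared from step N-1
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(num_steps):
            # Inference engine: task rollouts for step N run in a worker thread.
            rollouts = pool.submit(run_task_rollouts, step)
            # Training engine: the meta-policy update consumes the batch from
            # step N-1, overlapping with the rollouts instead of blocking them.
            if pending_batch is not None:
                log.append(update_meta_policy(pending_batch))
            # Reflection generation + judging for step N prepare the batch
            # that will be consumed at step N+1.
            pending_batch = prepare_reflection_batch(rollouts.result())
    return log

print(training_loop(3))  # two meta-updates: batches from steps 0 and 1
```

The key design point is that the meta-policy update never waits on the current step's judging; it always trails the task policy by exactly one step, which is why the overhead stays near zero in wall-clock terms.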

Theorem 5 (Judge-Gated Co-Evolution): Under a judge-gated local positive-transfer condition, a task-improving policy update also improves the reflection utility, and a reflection-improving update also improves the task performance. The joint update yields a strictly larger gain in task value than task-only training.

Empirical Validation / Results

Main Results on Long-Form Benchmarks

RubricEM-8B was evaluated on four representative long-form benchmarks: HealthBench, ResearchQA, DeepResearchBench (DRB), and ResearchRubrics.

Table 1: Performance Comparison on Long-Form Benchmarks

| Model | HealthBench | ResearchQA | DRB | ResearchRubrics | Average |
|---|---|---|---|---|---|
| **Closed Deep Research** | | | | | |
| Gemini 3.1 Pro + Search | 47.5 | 74.5 | 44.4 | 49.1 | 53.9 |
| GPT-5 + Search | 59.5 | 78.2 | 50.7 | 60.5 | 62.2 |
| OpenAI Deep Research | 53.8 | 79.2 | 46.9 | 59.7 | 59.9 |
| **Open Deep Research Models** | | | | | |
| WebExplorer-8B | 33.7 | 64.8 | 36.7 | 33.4 | 42.2 |
| Tongyi DeepResearch-30B-A3B | 46.2 | 66.7 | 40.6 | 49.5 | 50.8 |
| DR Tulu-8B (SFT) | 38.1 | 68.5 | 39.0 | 38.4 | 46.0 |
| DR Tulu-8B (RL, 1900 steps) | 50.2 | 74.3 | 43.4 | 46.4 | 53.6 |
| **Ours** | | | | | |
| RubricEM-8B (SFT) | 39.0 | 71.8 | 43.0 | 42.8 | 49.2 |
| RubricEM-8B (RL, 1400 steps) | 49.3 | 74.5 | 47.8 | 50.3 | 55.5 |
  • RubricEM-8B-RL achieves the highest average score (55.5) among non-proprietary systems.
  • It surpasses strong baselines like DR Tulu-8B-RL (53.6) and Tongyi DeepResearch-30B-A3B (50.8).
  • It approaches proprietary systems, outperforming Perplexity Deep Research on average and remaining within 4.4 points of OpenAI Deep Research while outperforming it on DRB.
  • The RL recipe is effective and efficient: Starting from a structured SFT checkpoint (avg. 49.2), RL improves to 55.5 in 1400 steps, fewer than DR Tulu's 1900 steps.

Ablation Studies and Analysis

Ablation of RL Components (600-step budget): Figure 5 shows that under a matched 600-step budget, each proposed component contributes to performance gains:

  • Baseline-RL (standard answer-only GRPO)
  • SS-GRPO (adds stagewise rubric credit)
  • Meta-Policy (adds reflection training & rubric-bank retrieval)
  • RubricEM (Full) (combines SS-GRPO and Meta-Policy)

The full recipe performs best across benchmarks, showing that stagewise credit assignment and reusable-experience learning provide complementary gains.

Structured Scaffolding and Inference-Time Reuse: Figure 6 shows:

  • The rubric-guided scaffold improves both SFT distillation quality and subsequent RL gains.
  • Isolating the prompt-level effect, Gemini-3.1-Pro with the scaffold outperforms the same model with a standard ReAct prompt on DRB.
  • The learned meta-policy enables beneficial cross-episode transfer and within-episode refinement at inference time, whereas Baseline-RL does not benefit from the same reuse.

Short-Form Benchmark Performance (Out-of-Domain Transfer)

Despite being trained primarily on long-form data, RubricEM shows strong generalization to short-form search benchmarks, indicating it learns transferable tool-use and evidence-grounding skills.

Table 2: Short-Form Model Performance

| Model | SimpleQA | 2Wiki | WebWalker | DSQA | Avg. |
|---|---|---|---|---|---|
| DR Tulu-8B (SFT) | 75.5 | 66.5 | 31.9 | 5.3 | 44.8 |
| DR Tulu-8B (RL, 1900 steps) | 80.1 | 68.0 | 39.1 | 8.3 | 49.0 |
| RubricEM-8B (SFT) | 92.1 | 77.5 | 64.7 | 37.0 | 67.8 |
| RubricEM-8B (RL, 1400 steps) | 92.3 | 78.8 | 70.0 | 53.0 | 73.5 |

Theoretical and Practical Implications

  • Theoretical Implications: Provides formal justification for explicit stage decomposition, stagewise credit assignment with imperfect judges, and the co-evolution of task and meta-policies via parameter sharing.
  • Practical Implications: Offers a concrete and effective recipe for RL in open-ended, long-horizon tasks beyond verifiable rewards: expose task structure (via rubric-guided scaffold), assign credit to that structure (via SS-GRPO), and convert judged attempts into reusable experience (via reflection meta-policy).
  • Broader Impact: Suggests rubrics should be a central, shared interface throughout the RL loop, not just a final evaluator. The meta-policy training approach makes experience reuse an explicit RL objective rather than an inference-time trick.

Conclusion

RubricEM combines rubric-guided policy decomposition, stage-structured credit assignment, and reflection-based meta-policy training to train effective deep research agents beyond verifiable rewards. The resulting RubricEM-8B model demonstrates strong performance, efficiency, and generalization. The work supports a broader recipe for long-horizon RL in open-ended domains and opens directions for improving judge quality, scaling the approach, and applying it to other complex agentic tasks.