# Flow-OPD: On-Policy Distillation for Flow Matching Models

> Flow-OPD introduces a two-stage on-policy distillation framework for flow matching models that consolidates multiple specialized teacher models into a single unified student, achieving a roughly 10-point overall improvement over vanilla GRPO and matching or surpassing teacher performance.

- **Source:** [arXiv](https://arxiv.org/abs/2605.08063)
- **Published:** 2026-05-12
- **Permalink:** https://picx.dev/p/1cRHex
- **Whiteboard:** https://picx.dev/p/1cRHex/image

## Summary

*   **Key Problem:** Flow Matching (FM) text-to-image models suffer from **reward sparsity** and **gradient interference** when aligned with multiple heterogeneous objectives (e.g., text rendering, aesthetics), leading to a "seesaw effect" where metrics compete and reward hacking occurs.
*   **Core Solution:** Introduces **Flow-OPD**, the first framework to integrate **On-Policy Distillation (OPD)** into FM models. It uses a **two-stage alignment strategy**: 1) training domain-specialized teacher models via single-reward GRPO, and 2) consolidating their expertise into a single student via on-policy sampling, task-routing labeling, and dense trajectory-level supervision.
*   **Novel Regularization:** Proposes **Manifold Anchor Regularization (MAR)** to mitigate aesthetic degradation by anchoring the student's generation to a high-quality manifold using a task-agnostic teacher.
*   **Key Results:** On Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from **63 to 92** and OCR accuracy from **59 to 94**, achieving an overall improvement of roughly **10 points** over vanilla GRPO while preserving image fidelity and human preference.
*   **Emergent Effect:** The unified student model exhibits a **"teacher-surpassing" effect**, matching or exceeding the performance of specialized teachers and demonstrating robust out-of-distribution generalization.

## Introduction and Theoretical Foundation
Flow Matching (FM) has emerged as a superior paradigm for generative modeling, offering efficiency and high-fidelity synthesis. However, aligning FM models with multiple, heterogeneous tasks (text rendering, compositional reasoning, aesthetics) presents significant challenges. Existing methods like **Group Relative Policy Optimization (GRPO)** ported to FM show promise in single-reward scenarios but fail in multi-task settings due to:
1.  **Reward Sparsity:** Scalar-valued rewards lack granularity to harmonize conflicting objectives.
2.  **Gradient Interference:** Joint optimization of heterogeneous objectives leads to conflicting gradients, causing a "seesaw effect" and catastrophic forgetting.

The paper draws inspiration from the success of **On-Policy Distillation (OPD)** in Large Language Models (LLMs), where specialized expert knowledge is distilled into a unified student. The core question addressed is: **Can FM models similarly leverage OPD to integrate diverse teacher strengths?**

**Preliminaries:**
*   **Flow Matching (FM)** maps a noise distribution $p_0$ to data $p_{\text{data}}$ via an ODE $d x_t = v_t ( x_t , t ) d t$. Under the Optimal Transport (OT) formulation, the path is $x_t = (1 - t) x_0 + t x_1$, and the model $v_\theta$ learns the constant velocity $( x_1 - x_0 )$ via:
    $$L_{\text{FM}} ( \theta ) = E_{t, x_0 , x_1} \left[ \| v_\theta ( x_t , t ) - ( x_1 - x_0 ) \|^2 \right]$$
    (Equation 1)
*   **On-Policy Distillation (OPD)** for autoregressive models minimizes the Reverse KL divergence between student and teacher distributions over student-generated trajectories:
    $$L_{\text{OPD}} = - E_{y \sim \pi_\theta} \left[ \log \frac{\pi_{\text{teacher}} ( y | x )}{\pi_\theta ( y | x )} \right] = D_{\text{KL}} ( \pi_\theta \| \pi_{\text{teacher}} )$$
    (Equation 2)
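
As a minimal sketch of Equation 1, the snippet below estimates the FM loss on a toy batch with NumPy. The velocity models `perfect` and `zero`, the two-dimensional data, and the deterministic pairing of each noise sample with a shifted copy of itself are all illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def fm_loss(v_theta, x0, x1, t):
    """Monte Carlo estimate of Equation 1 for one batch."""
    t = t[:, None]                      # broadcast time over the feature axis
    x_t = (1.0 - t) * x0 + t * x1       # OT interpolation path
    target = x1 - x0                    # constant velocity along the path
    residual = v_theta(x_t, t) - target
    return np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])
x0 = rng.standard_normal((256, 2))      # noise samples
x1 = x0 + shift                         # toy "data": shifted noise (assumption)
t = rng.uniform(size=256)

# Hypothetical velocity models, for illustration only.
perfect = lambda x_t, t: np.broadcast_to(shift, x_t.shape)
zero = lambda x_t, t: np.zeros_like(x_t)

loss_perfect = fm_loss(perfect, x0, x1, t)
loss_zero = fm_loss(zero, x0, x1, t)
```

Because every pair in this toy batch has the same displacement, the model that predicts `shift` drives the loss to zero, while the all-zero model pays the squared norm of `shift`.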

## Methodology
Flow-OPD is a two-stage framework: **1) Cultivate specialized teachers**, and **2) Distill expertise into a unified student**.

### Stage 1: Teacher Training
Domain-expert teachers are trained using **single-reward GRPO fine-tuning**, allowing each to reach its performance ceiling in isolation (e.g., a teacher for OCR, another for PickScore).

### Stage 2: Student Training via Multi-Teacher On-Policy Distillation
This stage involves four key components:

**1. Cold Start:** To establish a robust initial policy for the student, two strategies are explored:
*   **SFT-based initialization:** Uses trajectories sampled from specialized teachers.
*   **Model-merging initialization:** Superposes the parameters of divergent teachers into a unified state.
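
The summary does not say how the teacher parameters are combined, so the following is only a sketch of one common choice: a weighted average of parameter tensors, uniform by default. The function name `merge_teachers` and the toy parameter dictionaries are hypothetical.

```python
import numpy as np

def merge_teachers(teacher_params, weights=None):
    """Weighted average of same-shaped parameter dicts (uniform by default).

    Averaging is an assumption here; the source only says the parameters
    are "superposed" into a unified state.
    """
    n = len(teacher_params)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("merge weights should sum to 1")
    keys = teacher_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, teacher_params)) for k in keys}

# Toy stand-ins for two specialized teachers.
ocr_teacher = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
geneval_teacher = {"w": np.array([3.0, 0.0]), "b": np.array([2.0])}

student_init = merge_teachers([ocr_teacher, geneval_teacher])
```

With uniform weights the student starts at the midpoint of the teachers in parameter space, which is the sense in which merging gives a cold start that already carries part of each teacher's behavior.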

**2. On-Policy Sampling & Task-Routing Labeling:** To facilitate exploration, the deterministic ODE is converted to a Stochastic Differential Equation (SDE):
$$d x_t = \left[ v_\theta ( x_t , t ) + \frac{\sigma_t^2}{2 t} ( x_t + (1 - t) v_\theta ( x_t , t )) \right] d t + \sigma_t d w$$
(Equation 5)

This yields a stochastic behavioral policy $\pi_\theta ( x_{t-\Delta t} | x_t , c ) = N ( \mu_\theta ( x_t , t ) , \sigma_t^2 \Delta t \, I )$ (Equation 6). For each explored state $x_t$, a **hard routing mechanism** $\mathbb{1}_{R(c)=k}$ maps the textual condition $c$ to a single domain expert $k$, which provides the reference velocity field $v_{\phi_k} ( x_t , t, c )$. The target flow is:
$$v_{\text{target}} ( x_t , t, c ) = v_{\phi_k} ( x_t , t, c ), \quad \text{where } k = R(c)$$
(Equation 7)
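
The sampling and labeling steps can be sketched as one Euler-Maruyama step of Equation 5 followed by a lookup that plays the role of $R(c)$. Several things here are assumptions: the toy teacher and student fields, the idea that each prompt arrives with a task label, and the sign convention for `dt`, which is left to the caller.

```python
import numpy as np

def sde_step(x_t, v, t, sigma_t, dt, rng):
    """One Euler-Maruyama step of Equation 5; returns (next state, Gaussian mean)."""
    drift = v + (sigma_t ** 2) / (2.0 * t) * (x_t + (1.0 - t) * v)
    mean = x_t + drift * dt
    noise = rng.standard_normal(x_t.shape)
    return mean + sigma_t * np.sqrt(abs(dt)) * noise, mean

# Hypothetical stand-ins for the teacher velocity fields v_{phi_k}(x_t, t, c).
TEACHERS = {
    "ocr": lambda x_t, t, c: -x_t,
    "geneval": lambda x_t, t, c: 0.5 * x_t,
}

def target_velocity(x_t, t, c, task):
    """Equation 7, with R(c) replaced by a task label supplied with the prompt."""
    return TEACHERS[task](x_t, t, c)

rng = np.random.default_rng(0)
student = lambda x_t, t, c: -0.8 * x_t   # hypothetical student field
x_t, t, dt, sigma_t = np.ones(4), 0.5, 0.1, 0.7
c, task = "a sign that says OPEN", "ocr"

# The student explores; the routed teacher labels the state the student visited.
x_next, mean = sde_step(x_t, student(x_t, t, c), t, sigma_t, dt, rng)
v_target = target_velocity(x_t, t, c, task)
```

The point of the sketch is the ordering: states come from the student's own stochastic rollout, and the teacher is only queried at those states.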

**3. Dense KL Reward & Policy Update:** The student and target transition policies are both Gaussian with the same covariance $\sigma_t^2 \Delta t \, I$, so their KL divergence reduces to a weighted squared L2 distance between their velocity fields:
$$D_{\text{KL}} ( \pi_\theta \| \pi_{\text{target}} ) = \frac{\Delta t}{2} \left( \frac{\sigma_t (1 - t)}{2 t} + \frac{1}{\sigma_t} \right)^2 \| v_\theta ( x_t , t, c ) - v_{\text{target}} ( x_t , t, c ) \|^2$$
(Equation 9)
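
Equation 9 can be checked numerically. For two Gaussians with shared covariance $\sigma_t^2 \Delta t \, I$, the KL divergence is $\| \mu_\theta - \mu_{\text{target}} \|^2 / (2 \sigma_t^2 \Delta t)$, and the means implied by Equation 5 differ by $\Delta t \left( 1 + \frac{\sigma_t^2 (1 - t)}{2 t} \right) ( v_\theta - v_{\text{target}} )$. The sketch below compares both routes on random vectors; the helper names and test values are illustrative.

```python
import numpy as np

def transition_mean(x_t, v, t, sigma_t, dt):
    """Mean of the Gaussian transition implied by one step of Equation 5."""
    drift = v + (sigma_t ** 2) / (2.0 * t) * (x_t + (1.0 - t) * v)
    return x_t + drift * dt

def gaussian_kl_shared_cov(mu_p, mu_q, var):
    """KL(N(mu_p, var I) || N(mu_q, var I)) = ||mu_p - mu_q||^2 / (2 var)."""
    return np.sum((mu_p - mu_q) ** 2) / (2.0 * var)

def kl_closed_form(v_student, v_target, t, sigma_t, dt):
    """Right-hand side of Equation 9."""
    coeff = 0.5 * dt * (sigma_t * (1.0 - t) / (2.0 * t) + 1.0 / sigma_t) ** 2
    return coeff * np.sum((v_student - v_target) ** 2)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(8)
v_student = rng.standard_normal(8)
v_target = rng.standard_normal(8)
t, sigma_t, dt = 0.4, 0.7, 0.05

kl_from_means = gaussian_kl_shared_cov(
    transition_mean(x_t, v_student, t, sigma_t, dt),
    transition_mean(x_t, v_target, t, sigma_t, dt),
    sigma_t ** 2 * dt,
)
kl_from_eq9 = kl_closed_form(v_student, v_target, t, sigma_t, dt)
```

The two values agree up to floating-point error, and the state $x_t$ cancels out of the difference of means, which is why only the velocity discrepancy survives.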

The **immediate dense reward** $r_t^{(i)}$ for the $i$-th trajectory is defined using the *detached* student vector field $\bar{v}_\theta$:
$$r_t^{(i)} = - w(t) \| \bar{v}_\theta ( x_t^{(i)} , t, c ) - v_{\text{target}} ( x_t^{(i)} , t, c ) \|^2$$
(Equation 10)
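
A small sketch of Equation 10 over a group of trajectories follows. The summary does not define $w(t)$, so the weights below are placeholder values; the array shapes `(G, T, D)` are likewise an assumption made for illustration.

```python
import numpy as np

def dense_rewards(v_student, v_target, w):
    """Equation 10 for arrays of shape (G, T, D); returns rewards of shape (G, T).

    v_student is treated as a constant (the "detached" field): the reward is
    a number fed to the policy update, not a term to differentiate through.
    """
    sq_dist = np.sum((v_student - v_target) ** 2, axis=-1)
    return -w[None, :] * sq_dist

G, T, D = 2, 3, 2
v_student = np.zeros((G, T, D))
v_target = np.zeros((G, T, D))
v_target[0] = 1.0                       # trajectory 0 disagrees with its teacher
w = np.array([1.0, 0.5, 0.25])          # placeholder time weights (assumption)

rewards = dense_rewards(v_student, v_target, w)
```

Every timestep of every trajectory gets its own reward, in contrast to a single scalar score per finished image.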

A **clipped policy gradient update** (PPO-style) is applied using this dense reward. For a batch of $B$ prompts, each generating $G$ trajectories, the surrogate objective is:
$$J ( \theta ) \approx \frac{1}{B \times G} \sum_{j=1}^{B} \sum_{i=1}^{G} \sum_{t=0}^{T} \min \left( \rho_{t,i,j} ( \theta ) r_{t,i,j}^{\text{OPD}}, \text{clip} \left( \rho_{t,i,j} ( \theta ), 1 - \epsilon, 1 + \epsilon \right) r_{t,i,j}^{\text{OPD}} \right)$$
(Equation 11)
where $\rho_{t,i,j} ( \theta ) = \frac{\pi_\theta ( a_{t,i,j} | s_{t,i,j} )}{\pi_{\theta_{\text{old}}} ( a_{t,i,j} | s_{t,i,j} )}$ is the policy ratio.
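
The surrogate in Equation 11 can be sketched directly from log-probabilities and dense rewards. The array layout `(B, G, T)` and the default `eps=0.2` are assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, rewards, eps=0.2):
    """Equation 11 for arrays of shape (B, G, T): sum over t, mean over B*G."""
    ratio = np.exp(log_prob_new - log_prob_old)           # policy ratio rho
    unclipped = ratio * rewards
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * rewards
    return np.mean(np.sum(np.minimum(unclipped, clipped), axis=-1))

# One prompt (B=1), two trajectories (G=2), two timesteps (T=2).
log_prob_old = np.zeros((1, 2, 2))
log_prob_new = np.log(np.array([[[1.5, 1.0], [0.5, 1.0]]]))
rewards = np.array([[[-1.0, -2.0], [-1.0, -2.0]]])

objective = clipped_surrogate(log_prob_new, log_prob_old, rewards)
on_policy = clipped_surrogate(log_prob_old, log_prob_old, rewards)
```

Since the dense rewards are non-positive, the `min` keeps the unclipped term when the ratio exceeds $1 + \epsilon$ and the clipped term when it falls below $1 - \epsilon$, so the objective stops rewarding further decreases in the probability of poorly matched steps.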

**4. Manifold Anchor Regularization (MAR):** To prevent aesthetic degradation, a frozen **aesthetic teacher** (e.g., optimized via DeQA) provides a regularizing vector field $v_{\text{aesthetic}}$. The total loss combines the policy loss $L_{\text{Policy}}(\theta)$ (negative of $J(\theta)$) and a dense KL penalty:
$$L_{\text{Total}} ( \theta ) = L_{\text{Policy}} ( \theta ) + \lambda E_{c,t,x_t \sim \rho_t^\theta} \left[ w(t) \| v_\theta ( x_t , t, c ) - v_{\text{aesthetic}} ( x_t , t, c ) \|^2 \right]$$
(Equation 12)
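
Equation 12 can be sketched as the negated policy objective plus a weighted squared distance to the aesthetic teacher. The value of `lam`, the weights `w`, and the toy vector fields are hypothetical; in practice the student field would carry gradients, which plain NumPy does not model.

```python
import numpy as np

def mar_penalty(v_student, v_aesthetic, w):
    """Monte Carlo estimate of the MAR term; fields have shape (N, D), w has shape (N,)."""
    return np.mean(w * np.sum((v_student - v_aesthetic) ** 2, axis=-1))

def total_loss(policy_objective, v_student, v_aesthetic, w, lam):
    """Equation 12 with L_Policy = -J(theta)."""
    return -policy_objective + lam * mar_penalty(v_student, v_aesthetic, w)

v_student = np.array([[1.0, 0.0], [0.0, 1.0]])
v_aesthetic = np.zeros((2, 2))          # toy aesthetic-teacher field (assumption)
w = np.ones(2)                          # placeholder time weights (assumption)

penalty = mar_penalty(v_student, v_aesthetic, w)
loss = total_loss(-3.15, v_student, v_aesthetic, w, lam=0.1)
```

Unlike the routed teacher term, this penalty is applied for every prompt regardless of task, which is what makes the aesthetic teacher act as a shared anchor.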

## Empirical Validation / Results
Experiments are conducted on Stable Diffusion 3.5 Medium across four tasks: **GenEval** (composition), **OCR** (text rendering), **PickScore** (human preference), and **DeQA** (image quality).

**Baselines:** Monolithic-Reward GRPO (GRPO-[reward], trained on a single reward) and Hybrid-Reward GRPO (GRPO-Mix, trained on a fixed-ratio mixture of rewards).

### Quantitative Results

**Table 2: Model Performance Comparison**
| Model | GenEval | OCR Acc. | DeQA | PickScore | Avg |
|---|---|---|---|---|---|
| SD-3.5-M | 0.63 | 0.59 | 4.07 | 21.64 | 0.7166 |
| +GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.8050 |
| +GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.8016 |
| +GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.7578 |
| +GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.7340 |
| GRPO-Mix | 0.73 | 0.83 | 4.33 | 21.84 | 0.8165 |
| SFT+GRPO-Mix | 0.85 | 0.86 | 4.29 | 21.79 | 0.8515 |
| Merge+GRPO-Mix | 0.84 | 0.86 | 4.18 | 21.87 | 0.8443 |
| **Ours (SFT)** | **0.91** | **0.92** | **4.29** | **21.83** | **0.8819** |
| **Ours (Merge)** | **0.92** | **0.94** | **4.35** | **23.08** | **0.9044** |

*Flow-OPD (Merge) achieves the best overall performance, significantly outperforming all baselines and matching/surpassing specialized teachers.*
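
The summary does not state how the Avg column is normalized. One scaling that reproduces the listed averages for the rows used below is to divide DeQA by 5 and PickScore by 26 before taking the mean of the four metrics. The sketch treats that scaling as an inference from the table, not as the paper's stated procedure.

```python
def normalized_average(geneval, ocr, deqa, pickscore, deqa_max=5.0, pick_max=26.0):
    """Mean of the four metrics after rescaling DeQA and PickScore.

    deqa_max and pick_max are inferred from the table, not given in the source.
    """
    return (geneval + ocr + deqa / deqa_max + pickscore / pick_max) / 4.0

# (GenEval, OCR Acc., DeQA, PickScore) copied from Table 2.
rows = {
    "SD-3.5-M": (0.63, 0.59, 4.07, 21.64),
    "GRPO-Mix": (0.73, 0.83, 4.33, 21.84),
    "Ours (Merge)": (0.92, 0.94, 4.35, 23.08),
}

averages = {name: round(normalized_average(*vals), 4) for name, vals in rows.items()}
gain_over_mix = averages["Ours (Merge)"] - averages["GRPO-Mix"]
```

Under this scaling the gap between Flow-OPD (Merge) and GRPO-Mix is about 0.088, in line with the reported improvement of roughly 10 points.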

### Key Findings:
1.  **Flow-OPD outperforms vanilla GRPO:** Achieves a roughly 10-point improvement on the averaged normalized metrics (the Avg column of Table 2).
2.  **Resolves gradient interference:** Avoids the catastrophic forgetting seen in multi-reward GRPO (Table 1).
3.  **Cold-start is crucial:** Both SFT and merging initialization provide a strong foundation, with merging yielding slightly better results (Fig. 4).
4.  **Teacher-surpassing effect:** The student model synthesizes knowledge from multiple teachers, leading to emergent superiority in some cases.
5.  **Robust OOD generalization:** Flow-OPD shows superior performance on the T2I-CompBench benchmark (Table 3).
6.  **MAR preserves aesthetics:** Quantitative (Table 4) and qualitative (Fig. 5) results show MAR effectively prevents background mode collapse and semantic redundancy, maintaining high visual quality.

**Table 4: Performance on General Image Quality and Alignment Metrics**
| Model | ImageReward | Aesthetic | UnifiedReward | HPS-v2.1 | QwenVL Score |
|---|---|---|---|---|---|
| SD-3.5-M | 1.02 | 5.87 | 3.339 | 0.2982 | 3.45 |
| GRPO-DeQA | 1.33 | 5.97 | 3.456 | 0.2846 | 3.68 |
| GRPO-Mix | 1.23 | 5.93 | 3.501 | 0.3101 | 3.88 |
| w/o MAR | 1.26 | 5.89 | 3.518 | 0.2998 | 3.82 |
| **Ours (Merge)** | **1.36** | **6.23** | **3.659** | **0.3302** | **4.05** |

## Theoretical and Practical Implications
*   **Theoretical:** Provides a formal framework for translating OPD (a policy-gradient method for discrete sequences) to the continuous-time FM domain by deriving a dense KL reward from vector field discrepancies. It analytically addresses the problems of reward sparsity and gradient interference in multi-task learning, the latter characterized by conflicting task gradients $\left( \langle \nabla_\theta J_k, \nabla_\theta J_1 \rangle < 0 \right)$.
*   **Practical:** Offers a **scalable alignment paradigm** for building **generalist text-to-image models**. By decoupling expertise acquisition (teacher training) from model unification (distillation), it enables the consolidation of diverse capabilities without degradation. The MAR component ensures that functional alignment does not compromise aesthetic quality, making the framework suitable for real-world applications demanding both precision and visual appeal.

## Conclusion
Flow-OPD successfully integrates on-policy distillation into Flow Matching models, overcoming the fundamental limitations of sparse-reward multi-task alignment. By replacing scalar rewards with **dense, trajectory-level supervision** from multiple teachers and anchoring generation to a high-quality manifold via **MAR**, it breaks the "seesaw effect" and achieves superior performance across composition, text rendering, and aesthetics. The framework demonstrates **scalability**, **robust generalization**, and an **emergent teacher-surpassing effect**, establishing a new paradigm for developing high-capability, generalist text-to-image models. Future work may explore extending this approach to other generative domains and more complex teacher ensembles.
