Flow-OPD: On-Policy Distillation for Flow Matching Models

Summary (Overview)

  • Key Problem: Flow Matching (FM) models suffer from reward sparsity and gradient interference when aligned with multiple heterogeneous objectives (e.g., text rendering, aesthetics), leading to a "seesaw effect" where metrics compete and reward hacking occurs.
  • Core Solution: Flow-OPD, a novel post-training framework that integrates On-Policy Distillation (OPD) into FM models, replacing sparse scalar rewards with dense, trajectory-level supervision from domain-specialized teachers.
  • Main Contributions: 1) A two-stage alignment strategy (train teachers, distill student); 2) A Flow-based Cold-Start initialization; 3) Manifold Anchor Regularization (MAR) to preserve aesthetic quality.
  • Key Results: Flow-OPD achieves a ~10-point average improvement over vanilla GRPO, raising GenEval from 0.63 to 0.92 and OCR accuracy from 0.59 to 0.94. The unified student model matches or surpasses its specialized teachers and shows robust out-of-distribution generalization.
  • Emergent Effect: The student exhibits a "teacher-surpassing" capability, outperforming individual teachers in some cases due to knowledge cross-pollination from dense multi-expert supervision.

Introduction and Theoretical Foundation

Flow Matching (FM) has emerged as a superior paradigm for generative modeling, learning continuous-time velocity fields for efficient and high-fidelity synthesis. However, aligning FM models for multi-dimensional tasks (text rendering, compositional reasoning, human aesthetics) using Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) faces critical bottlenecks:

  • Reward Sparsity: Scalar-valued rewards lack granularity to coordinate heterogeneous objectives.
  • Gradient Interference: Joint optimization of conflicting tasks within a shared parameter space leads to a "seesaw effect"—improving one metric degrades another.

Inspired by the success of On-Policy Distillation (OPD) in Large Language Models (LLMs) for harmonizing multi-domain capabilities, this paper proposes Flow-OPD, the first framework to integrate OPD into FM post-training. The goal is to decouple expertise acquisition (training specialized teachers) from model unification (distilling into a single student) using dense supervision on the student's own generated trajectories, thereby overcoming the limitations of sparse-reward RL.

Methodology

Flow-OPD employs a two-stage alignment strategy.

Stage 1: Cultivating Domain-Specialized Teachers

Each expert teacher model is trained via single-reward GRPO fine-tuning on a specific task (e.g., GenEval, OCR, PickScore, DeQA), allowing it to reach its performance ceiling in isolation.

Stage 2: Unified Student Training via On-Policy Distillation

1. Flow-based Cold-Start

To establish a robust initial policy for the student, two variants are proposed:

  • SFT-based Initialization: Uses trajectories sampled from specialized teachers for supervised fine-tuning.
  • Model Merging Initialization: Superposes the parameters of divergent teachers into a unified state, placing the student in a high-competence region of the loss landscape (a minimal sketch follows below).
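
The paper does not spell out the merge operator; a minimal sketch of the merging variant, assuming simple weighted parameter averaging over teacher checkpoints with identical architectures (`merge_teachers` is an illustrative name, not the paper's API), might look like this:

```python
import torch

def merge_teachers(state_dicts, weights=None):
    """Superpose teacher checkpoints by weighted parameter averaging."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)  # uniform merge by default
    merged = {}
    for name in state_dicts[0]:
        # Assumes floating-point parameters shared across all teachers.
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Usage: initialize the student from the merged weights.
# student.load_state_dict(merge_teachers([t.state_dict() for t in teachers]))
```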

2. On-Policy Sampling

To expose the student's own distribution shift and enable exploration, the deterministic probability-flow ODE is converted into a stochastic differential equation (SDE):

$$ dx_t = \left[ v_\theta(x_t, t) + \frac{\sigma_t^2}{2t}\left( x_t + (1 - t)\, v_\theta(x_t, t) \right) \right] dt + \sigma_t\, dw $$

Applying Euler-Maruyama discretization yields a local isotropic Gaussian policy:

$$ \pi_\theta(x_{t-\Delta t} \mid x_t, c) = \mathcal{N}\left( \mu_\theta(x_t, t),\ \sigma_t^2 \Delta t\, I \right) $$

The student samples $G$ independent trajectories per prompt, generating an on-policy marginal distribution $x_t \sim \rho_\theta^t(\cdot \mid c)$.
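
As a concrete illustration, a single Euler-Maruyama step of this SDE might be implemented as below; `v_theta` (the student velocity network) and `sigma_t` (the noise scale at time `t`) are assumed interfaces, and the backward-in-time sign convention (stepping from $t$ to $t - \Delta t$, matching the transition $\pi_\theta(x_{t-\Delta t} \mid x_t, c)$) is an assumption:

```python
import torch

def sde_step(v_theta, x_t, t, c, dt, sigma_t):
    """One stochastic sampling step; returns the next latent and the policy mean.

    The transition is the local Gaussian policy N(mu, sigma_t^2 * dt * I).
    """
    v = v_theta(x_t, t, c)                                    # student velocity field
    drift = v + (sigma_t ** 2 / (2 * t)) * (x_t + (1 - t) * v)
    mu = x_t - dt * drift                                     # mean of the transition
    x_next = mu + sigma_t * (dt ** 0.5) * torch.randn_like(x_t)
    return x_next, mu
```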

3. Task-Routing Labeling

A hard routing mechanism $\mathbb{1}_{R(c)=k}$ maps the textual condition $c$ to a specific domain expert $k$. Only that teacher provides the reference velocity field:

$$ v_{\text{target}}(x_t, t, c) = v_{\phi_k}(x_t, t, c), \quad \text{where } k = R(c) $$

This defines a task-specific target transition policy $\pi_{\text{target}} = \mathcal{N}\left( \mu_{\text{target}}(x_t, t),\ \sigma_t^2 \Delta t\, I \right)$.
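
How $R(c)$ is realized is not detailed here; a minimal sketch, assuming each prompt carries an explicit task tag that indexes a registry of the GRPO-trained experts (all names illustrative):

```python
import torch
from typing import Callable, Dict

# An expert maps (x_t, t, c) to a velocity tensor.
VelocityField = Callable[[torch.Tensor, float, object], torch.Tensor]

def make_router(teachers: Dict[str, VelocityField]):
    """Hard routing: the tag k = R(c) selects exactly one expert's velocity field."""
    def v_target(x_t: torch.Tensor, t: float, c, task_tag: str) -> torch.Tensor:
        return teachers[task_tag](x_t, t, c)  # 1_{R(c)=k}: only teacher k contributes
    return v_target

# Usage with illustrative experts:
# v_target = make_router({"geneval": geneval_teacher, "ocr": ocr_teacher})
```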

4. Deriving the Dense KL Reward

The reverse KL divergence between the student and target policies, which share the same isotropic covariance, is derived analytically as an $L_2$ distance between their means:

$$ D_{\text{KL}}(\pi_\theta \parallel \pi_{\text{target}}) = \frac{\left\| \mu_\theta(x_t, t) - \mu_{\text{target}}(x_t, t) \right\|^2}{2\sigma_t^2 \Delta t} $$

Substituting the Euler-Maruyama means $\mu = x_t - \Delta t \left[ v + \frac{\sigma_t^2}{2t}\left( x_t + (1-t)\, v \right) \right]$ for both policies, the shared $x_t$ terms cancel, leaving $\mu_\theta - \mu_{\text{target}} = -\Delta t \left[ 1 + \frac{\sigma_t^2 (1-t)}{2t} \right] \left( v_\theta - v_{\text{target}} \right)$, so the divergence simplifies to:

$$ D_{\text{KL}}(\pi_\theta \parallel \pi_{\text{target}}) = \frac{\Delta t}{2} \left[ \frac{\sigma_t (1 - t)}{2t} + \frac{1}{\sigma_t} \right]^2 \left\| v_\theta(x_t, t, c) - v_{\text{target}}(x_t, t, c) \right\|^2 $$

The detached immediate dense reward for the $i$-th trajectory is then:

$$ r_t^{(i)} = -\, w(t)\, \left\| \bar{v}_\theta(x_t^{(i)}, t, c) - v_{\text{target}}(x_t^{(i)}, t, c) \right\|^2 $$

where $w(t)$ is the time-adaptive scaling factor and $\bar{v}_\theta$ is the detached student velocity field.
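
A sketch of the dense reward computation under these formulas; the velocity tensors, `sigma_t`, and `dt` are assumed inputs, with `w_t` taken to be the KL coefficient derived above:

```python
import torch

def w_t(t: float, sigma_t: float, dt: float) -> float:
    """Time-adaptive scaling: w(t) = (dt / 2) * (sigma_t * (1 - t) / (2 t) + 1 / sigma_t)^2."""
    return 0.5 * dt * (sigma_t * (1 - t) / (2 * t) + 1.0 / sigma_t) ** 2

def dense_reward(v_student, v_target, t, sigma_t, dt):
    """r_t^(i) = -w(t) * ||v_theta - v_target||^2, with the student branch detached."""
    sq_dist = (v_student.detach() - v_target).flatten(1).pow(2).sum(dim=1)  # per trajectory
    return -w_t(t, sigma_t, dt) * sq_dist
```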

5. Clipped Policy Gradient Update

Using the dense reward $r_t^{\text{OPD}}$ directly, a PPO-clipped surrogate objective is constructed:

$$ J(\theta) \approx \frac{1}{B \times G} \sum_{j=1}^{B} \sum_{i=1}^{G} \sum_{t=0}^{T} \min\left( \rho_{t,i,j}(\theta)\, r_{t,i,j}^{\text{OPD}},\ \text{clip}\left( \rho_{t,i,j}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) r_{t,i,j}^{\text{OPD}} \right) $$

where $\rho_{t,i,j}(\theta) = \frac{\pi_\theta(a_{t,i,j} \mid s_{t,i,j})}{\pi_{\theta_{\text{old}}}(a_{t,i,j} \mid s_{t,i,j})}$ is the policy ratio. Parameters are updated via gradient ascent, $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. Gradients flow only through the policy ratio, as $r^{\text{OPD}}$ is detached.
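
A minimal sketch of this update, assuming per-step Gaussian log-probabilities of the sampled transitions under the current and behavior policies (tensor shapes and names are illustrative):

```python
import math
import torch

def gaussian_logprob(a, mu, sigma_t, dt):
    """log N(a; mu, sigma_t^2 * dt * I), summed over latent dimensions."""
    var = sigma_t ** 2 * dt
    d = a.flatten(1).shape[1]
    return -0.5 * ((a - mu).flatten(1).pow(2).sum(dim=1) / var + d * math.log(2 * math.pi * var))

def opd_objective(logp_new, logp_old, r_opd, eps=0.2):
    """PPO-clipped surrogate; r_opd is detached, so gradients flow only via the ratio."""
    ratio = torch.exp(logp_new - logp_old.detach())           # rho(theta)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    return torch.min(ratio * r_opd, clipped * r_opd).mean()   # mean over steps and samples

# Gradient ascent on J(theta) is run as descent on -J(theta):
# loss = -opd_objective(logp_new, logp_old, r_opd); loss.backward(); optimizer.step()
```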

6. Manifold Anchor Regularization (MAR)

To prevent reward hacking and aesthetic degradation, a frozen aesthetic teacher (e.g., one optimized via DeQA) provides a regularizing velocity field $v_{\text{aesthetic}}$. The total loss combines the policy loss $L_{\text{Policy}}(\theta)$ (the negative of $J(\theta)$) with a dense KL penalty toward this anchor:

$$ L_{\text{Total}}(\theta) = L_{\text{Policy}}(\theta) + \lambda\, \mathbb{E}_{c,\, t,\, x_t \sim \rho_\theta^t} \left[ w(t)\, \left\| v_\theta(x_t, t, c) - v_{\text{aesthetic}}(x_t, t, c) \right\|^2 \right] $$

where $\lambda$ is a weighting coefficient. MAR anchors the student to a high-quality visual manifold.
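
Putting it together, the MAR term can be added as a dense penalty on the same on-policy samples; a sketch reusing `w_t` from above, assuming a frozen aesthetic teacher `v_aesthetic` and a placeholder weight `lam=0.1` (the paper's $\lambda$ is not given here):

```python
import torch

def total_loss(policy_objective, v_student, x_t, t, c, v_aesthetic, sigma_t, dt, lam=0.1):
    """L_Total = -J(theta) + lambda * E[ w(t) * ||v_theta - v_aesthetic||^2 ]."""
    with torch.no_grad():
        v_anchor = v_aesthetic(x_t, t, c)   # frozen teacher: no gradient through the anchor
    mar = (v_student - v_anchor).flatten(1).pow(2).sum(dim=1)
    return -policy_objective + lam * (w_t(t, sigma_t, dt) * mar).mean()
```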

Empirical Validation / Results

Experiments are conducted on Stable Diffusion 3.5 Medium (SD-3.5-M) across four tasks: GenEval (compositional image generation), OCR (visual text rendering), PickScore (human preference), and DeQA (image quality).

Quantitative Performance

Table 2: Model Performance Comparison

| Model | GenEval | OCR Acc. | DeQA | PickScore | Avg |
|---|---|---|---|---|---|
| SD-3.5-M | 0.63 | 0.59 | 4.07 | 21.64 | 0.7166 |
| +GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.8050 |
| +GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.8016 |
| +GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.7578 |
| +GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.7340 |
| GRPO-Mix | 0.73 | 0.83 | 4.33 | 21.84 | 0.8165 |
| SFT+GRPO-Mix | 0.85 | 0.86 | 4.29 | 21.79 | 0.7166 |
| Merge+GRPO-Mix | 0.84 | 0.86 | 4.18 | 21.87 | 0.7166 |
| Ours (SFT) | 0.91 | 0.92 | 4.29 | 21.83 | 0.8819 |
| Ours (Merge) | 0.92 | 0.94 | 4.35 | 23.08 | 0.9044 |
  • Key Findings: Flow-OPD (Merge) achieves the best overall performance, matching or surpassing the specialized teachers on their home tasks. It significantly outperforms the GRPO-Mix baseline (scalar reward mixing), which suffers from capability degradation.
  • Improvement: Flow-OPD raises GenEval from 0.63 to 0.92 and OCR accuracy from 0.59 to 0.94, an overall improvement of roughly 10 points in average score over vanilla GRPO.

Qualitative Results & Teacher-Surpassing Effect

Figure 3 shows Flow-OPD achieves superior instruction-following, high-fidelity synthesis, and structural coherence. Notably, the student model sometimes succeeds in edge cases where all individual teachers fail—an emergent "teacher-surpassing" effect attributed to knowledge cross-pollination from dense multi-expert supervision.

Ablation Studies

Cold-Start Impact

Figure 4 shows both SFT and Merge cold-start strategies establish a robust foundation, with Merge initialization leading to the highest scores. Flow-OPD consistently outperforms from-scratch and cold-started multi-task GRPO.

Out-of-Distribution (OOD) Generalization

Table 3: T2I-CompBench++ Results

| Model | Color | Shape | Texture | Complex | 3D-Spatial | Numeracy | Non-Spatial |
|---|---|---|---|---|---|---|---|
| SD-3.5-M | 0.7994 | 0.5669 | 0.7338 | 0.3800 | 0.3739 | 0.5927 | 0.3146 |
| GRPO-Mix | 0.7966 | 0.5803 | 0.7392 | 0.3677 | 0.3681 | 0.6388 | 0.3130 |
| Cold Start | 0.8173 | 0.6126 | 0.7342 | 0.3870 | 0.4249 | 0.6458 | 0.3145 |
| Cold Start+GRPO | 0.8031 | 0.5985 | 0.7409 | 0.3842 | 0.4017 | 0.6269 | 0.3136 |
| Ours (Merge) | 0.8298 | 0.6292 | 0.7446 | 0.3943 | 0.4565 | 0.6837 | 0.3163 |

Flow-OPD demonstrates superior OOD generalization, achieving the best scores among the compared models across all compositional metrics and mitigating the catastrophic forgetting seen with GRPO.

Manifold Anchor Regularization (MAR)

Figure 5 qualitatively shows MAR prevents background mode collapse and semantic redundancy induced by vanilla GRPO optimization.

Table 4: Performance on General Image Quality and Alignment Metrics

| Model | ImageReward | Aesthetic | UnifiedReward | HPS-v2.1 | QwenVL Score |
|---|---|---|---|---|---|
| SD-3.5-M | 1.02 | 5.87 | 3.339 | 0.2982 | 3.45 |
| GRPO-DeQA | 1.33 | 5.97 | 3.456 | 0.2846 | 3.68 |
| GRPO-Mix | 1.23 | 5.93 | 3.501 | 0.3101 | 3.88 |
| w/o MAR | 1.26 | 5.89 | 3.518 | 0.2998 | 3.82 |
| Ours (Merge) | 1.36 | 6.23 | 3.659 | 0.3302 | 4.05 |

MAR leverages full-data supervision from a task-agnostic teacher, significantly enhancing visual quality and human preference alignment.

Theoretical and Practical Implications

  • Theoretical: Flow-OPD provides a scalable solution to the fundamental problems of reward sparsity and gradient interference in multi-task FM alignment. By replacing scalar rewards with dense, trajectory-level supervision, it breaks the "seesaw effect" and enables harmonious integration of heterogeneous expertise.
  • Practical: The framework establishes a new paradigm for building generalist text-to-image models that master diverse tasks without degrading core capabilities. The teacher-surpassing effect suggests that collective dense supervision can lead to emergent superior performance.
  • Methodological: The introduction of On-Policy Distillation to the vision community bridges a successful LLM technique with FM models. Manifold Anchor Regularization offers a principled way to decouple functional alignment from aesthetic preservation, crucial for real-world applications.

Conclusion

Flow-OPD successfully integrates On-Policy Distillation into Flow Matching models, resolving the critical bottlenecks of reward sparsity and gradient interference in multi-task alignment. Through a two-stage strategy (specialized teacher training + unified student distillation), Flow-based Cold-Start, and Manifold Anchor Regularization, it achieves:

  • Significant performance gains (~10 points over GRPO) across key benchmarks.
  • Consolidation of diverse expertise (composition, typography, aesthetics) into a single model.
  • Emergent "teacher-surpassing" capabilities and robust out-of-distribution generalization.
  • Preservation of high visual fidelity and human-preference alignment.

Flow-OPD provides a scalable alignment paradigm for developing generalist text-to-image models with superior generative integrity. Future work may explore extending this framework to other generative model families and more complex multi-modal tasks.