Flow-OPD: On-Policy Distillation for Flow Matching Models

Summary (Overview)

  • Key Problem: Flow Matching (FM) text-to-image models suffer from reward sparsity and gradient interference when aligned with multiple heterogeneous objectives (e.g., text rendering, aesthetics), leading to a "seesaw effect" where metrics compete and reward hacking occurs.
  • Core Solution: Introduces Flow-OPD, the first framework to integrate On-Policy Distillation (OPD) into FM models. It uses a two-stage alignment strategy: 1) training domain-specialized teacher models via single-reward GRPO, and 2) consolidating their expertise into a single student via on-policy sampling, task-routing labeling, and dense trajectory-level supervision.
  • Novel Regularization: Proposes Manifold Anchor Regularization (MAR) to mitigate aesthetic degradation by anchoring the student's generation to a high-quality manifold using a task-agnostic teacher.
  • Key Results: On Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94, achieving an overall improvement of roughly 10 points over vanilla GRPO while preserving image fidelity and human preference.
  • Emergent Effect: The unified student model exhibits a "teacher-surpassing" effect, matching or exceeding the performance of specialized teachers and demonstrating robust out-of-distribution generalization.

Introduction and Theoretical Foundation

Flow Matching (FM) has emerged as a leading paradigm for generative modeling, offering efficient, high-fidelity synthesis. However, aligning FM models with multiple heterogeneous tasks (text rendering, compositional reasoning, aesthetics) presents significant challenges. Existing methods such as Group Relative Policy Optimization (GRPO), ported to FM, show promise in single-reward scenarios but fail in multi-task settings due to:

  1. Reward Sparsity: Scalar-valued rewards lack granularity to harmonize conflicting objectives.
  2. Gradient Interference: Joint optimization of heterogeneous objectives leads to conflicting gradients, causing a "seesaw effect" and catastrophic forgetting.

The paper draws inspiration from the success of On-Policy Distillation (OPD) in Large Language Models (LLMs), where specialized expert knowledge is distilled into a unified student. The core question addressed is: Can FM models similarly leverage OPD to integrate diverse teacher strengths?

Preliminaries:

  • Flow Matching (FM) maps a noise distribution $p_0$ to data $p_{\text{data}}$ via an ODE $dx_t = v_t(x_t, t)\,dt$. Under the Optimal Transport (OT) formulation, the path is $x_t = (1 - t) x_0 + t x_1$, and the model $v_\theta$ learns the constant velocity $(x_1 - x_0)$ via: $L_{\text{FM}}(\theta) = E_{t, x_0, x_1} \left[ \| v_\theta(x_t, t) - (x_1 - x_0) \|^2 \right]$ (Equation 1)
  • On-Policy Distillation (OPD) for autoregressive models minimizes the Reverse KL divergence between student and teacher distributions over student-generated trajectories: $L_{\text{OPD}} = - E_{y \sim \pi_\theta} \left[ \log \frac{\pi_{\text{teacher}}(y \mid x)}{\pi_\theta(y \mid x)} \right] = D_{\text{KL}}(\pi_\theta \| \pi_{\text{teacher}})$ (Equation 2)
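To make Equations 1 and 2 concrete, the following is a minimal sketch on toy values, not the paper's implementation. The function names and the sample numbers are assumptions chosen for illustration; the reverse KL is shown for small discrete distributions rather than full token sequences.

```python
import math

def ot_path(x0, x1, t):
    """Linear OT interpolation x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def fm_loss(v_pred, x0, x1):
    """Squared error between a predicted velocity and the target x1 - x0 (Equation 1, one sample)."""
    return sum((v - (b - a)) ** 2 for v, a, b in zip(v_pred, x0, x1))

def reverse_kl(student, teacher):
    """D_KL(student || teacher) for two discrete distributions (Equation 2, toy case)."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)

x0 = [0.0, 0.0]  # hypothetical noise sample
x1 = [1.0, 2.0]  # hypothetical data sample
x_half = ot_path(x0, x1, 0.5)           # midpoint of the straight path
perfect = fm_loss([1.0, 2.0], x0, x1)   # a velocity equal to x1 - x0 gives zero loss
kl_same = reverse_kl([0.5, 0.5], [0.5, 0.5])
kl_diff = reverse_kl([0.9, 0.1], [0.5, 0.5])
```

The reverse KL is zero only when the student matches the teacher, which is why it can serve as a distillation objective evaluated on the student's own samples.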

Methodology

Flow-OPD is a two-stage framework: 1) Cultivate specialized teachers, and 2) Distill expertise into a unified student.

Stage 1: Teacher Training

Domain-expert teachers are trained using single-reward GRPO fine-tuning, allowing each to reach its performance ceiling in isolation (e.g., a teacher for OCR, another for PickScore).
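As a reminder of what single-reward GRPO computes, the sketch below normalizes rewards within a group of samples drawn for the same prompt. It is an illustration under assumptions: the function name, the use of the population standard deviation, and the reward values are not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical OCR rewards for four images sampled from the same prompt.
adv = group_relative_advantages([0.2, 0.4, 0.9, 0.5])
```

Samples scoring above the group mean receive positive advantages and are reinforced; those below are suppressed.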

Stage 2: Student Training via Multi-Teacher On-Policy Distillation

This stage involves the following components:

1. Cold Start: To establish a robust initial policy for the student, two strategies are explored:

  • SFT-based initialization: Uses trajectories sampled from specialized teachers.
  • Model-merging initialization: Superposes the parameters of divergent teachers into a unified state.
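The summary does not specify the merging rule, so the sketch below assumes the simplest one, a weighted average of teacher parameters, purely for illustration. Parameter dictionaries are modeled as plain Python dicts of float lists; the names `merge_parameters`, `ocr_teacher`, and `geneval_teacher` are hypothetical.

```python
def merge_parameters(teachers, weights=None):
    """Weighted average of parameter dicts that share the same keys and lengths."""
    if weights is None:
        weights = [1.0 / len(teachers)] * len(teachers)  # uniform merge by default
    merged = {}
    for name in teachers[0]:
        size = len(teachers[0][name])
        merged[name] = [
            sum(w * teacher[name][i] for w, teacher in zip(weights, teachers))
            for i in range(size)
        ]
    return merged

# Toy "checkpoints" with a single two-element parameter each.
ocr_teacher = {"layer.weight": [1.0, 3.0]}
geneval_teacher = {"layer.weight": [3.0, 5.0]}
student_init = merge_parameters([ocr_teacher, geneval_teacher])
```

With uniform weights the student starts at the midpoint of the teachers in parameter space, which is one plausible reading of a "unified state".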

2. On-Policy Sampling & Task-Routing Labeling: To facilitate exploration, the deterministic ODE is converted to a Stochastic Differential Equation (SDE):

$$dx_t = \left[ v_\theta(x_t, t) + \frac{\sigma_t^2}{2t} \left( x_t + (1 - t)\, v_\theta(x_t, t) \right) \right] dt + \sigma_t \, dw$$

(Equation 5)

This yields a stochastic behavioral policy $\pi_\theta(x_{t-\Delta t} \mid x_t, c) = \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 \Delta t \, I)$ (Equation 6). For each explored state $x_t$, a hard routing mechanism $\mathbb{1}_{R(c)=k}$ maps the textual condition $c$ to a single domain expert $k$, which provides the reference velocity field $v_{\phi_k}(x_t, t, c)$. The target flow is:

$$v_{\text{target}}(x_t, t, c) = v_{\phi_k}(x_t, t, c), \quad \text{where } k = R(c)$$

(Equation 7)
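The sampling and routing steps can be sketched as follows. This is an illustration under stated assumptions: the step runs from $t$ to $t - \Delta t$ with an Euler-Maruyama discretization of Equation 5, the keyword-based router stands in for whatever task labels the prompts actually carry, and the two toy "teachers" are placeholder functions rather than real velocity networks.

```python
import math
import random

def sde_step(x, v, t, dt, sigma, rng):
    """One assumed Euler-Maruyama step from t to t - dt for the SDE in Equation 5."""
    sqrt_dt = math.sqrt(dt)
    x_next = []
    for xi, vi in zip(x, v):
        drift = vi + (sigma ** 2) / (2.0 * t) * (xi + (1.0 - t) * vi)
        mean = xi - drift * dt  # plays the role of mu_theta in Equation 6
        x_next.append(mean + sigma * sqrt_dt * rng.gauss(0.0, 1.0))
    return x_next

def route(prompt):
    """Hypothetical hard router R(c): quoted text in the prompt selects the OCR expert."""
    return "ocr" if '"' in prompt else "geneval"

# Placeholder teacher velocity fields v_phi_k(x, t, c); real teachers are FM networks.
teachers = {
    "ocr": lambda x, t, c: [-xi for xi in x],
    "geneval": lambda x, t, c: [0.5 * xi for xi in x],
}

def target_velocity(x, t, prompt):
    """Equation 7: query only the expert selected by the router."""
    return teachers[route(prompt)](x, t, prompt)

rng = random.Random(0)
prompt = 'a sign that says "OPEN"'
x_next = sde_step([0.3, -0.7], [0.1, 0.2], t=0.8, dt=0.1, sigma=0.5, rng=rng)
v_tgt = target_velocity(x_next, 0.7, prompt)
```

Because the teacher is queried at states the student itself visited, the supervision stays on-policy: the student is corrected where it actually goes, not where a teacher rollout would have gone.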

3. Dense KL Reward & Policy Update: The KL divergence between the student and target transition policies (both Gaussian with shared covariance) reduces to an L2 distance between their means/vector fields:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{target}}) = \frac{\Delta t}{2} \left( \frac{\sigma_t (1 - t)}{2t} + \frac{1}{\sigma_t} \right)^2 \| v_\theta(x_t, t, c) - v_{\text{target}}(x_t, t, c) \|^2$$

(Equation 9)
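The closed form in Equation 9 can be checked numerically. The sketch below assumes the transition mean comes from an Euler step of Equation 5 taken from $t$ to $t - \Delta t$, computes the KL between two Gaussians with shared covariance $\sigma_t^2 \Delta t \, I$ directly, and compares it with the formula. All numeric values are arbitrary test inputs.

```python
def transition_mean(x, v, t, dt, sigma):
    """Mean of the Gaussian step (assumed Euler discretization of Equation 5)."""
    return [
        xi - (vi + sigma ** 2 / (2.0 * t) * (xi + (1.0 - t) * vi)) * dt
        for xi, vi in zip(x, v)
    ]

def gaussian_kl_shared_cov(mu_a, mu_b, variance):
    """KL between N(mu_a, variance * I) and N(mu_b, variance * I)."""
    return sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)) / (2.0 * variance)

def kl_equation_9(v_student, v_target, t, dt, sigma):
    """Right-hand side of Equation 9."""
    coeff = sigma * (1.0 - t) / (2.0 * t) + 1.0 / sigma
    sq_dist = sum((a - b) ** 2 for a, b in zip(v_student, v_target))
    return 0.5 * dt * coeff ** 2 * sq_dist

x_t = [0.3, -0.7]
v_student = [0.1, 0.2]
v_target = [0.4, -0.1]
t, dt, sigma = 0.6, 0.05, 0.7

kl_direct = gaussian_kl_shared_cov(
    transition_mean(x_t, v_student, t, dt, sigma),
    transition_mean(x_t, v_target, t, dt, sigma),
    sigma ** 2 * dt,
)
kl_formula = kl_equation_9(v_student, v_target, t, dt, sigma)
```

The state $x_t$ cancels in the difference of means, which is why the divergence depends only on the gap between the two velocity fields.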

The immediate dense reward $r_t^{(i)}$ for the $i$-th trajectory is defined using the detached student vector field $\bar{v}_\theta$:

$$r_t^{(i)} = - w(t) \, \| \bar{v}_\theta(x_t^{(i)}, t, c) - v_{\text{target}}(x_t^{(i)}, t, c) \|^2$$

(Equation 10)

A clipped policy gradient update (PPO-style) is applied using this dense reward. For a batch of $B$ prompts, each generating $G$ trajectories, the surrogate objective is:

$$J(\theta) \approx \frac{1}{B \times G} \sum_{j=1}^{B} \sum_{i=1}^{G} \sum_{t=0}^{T} \min \left( \rho_{t,i,j}(\theta) \, r_{t,i,j}^{\text{OPD}}, \; \text{clip} \left( \rho_{t,i,j}(\theta), 1 - \epsilon, 1 + \epsilon \right) r_{t,i,j}^{\text{OPD}} \right)$$

(Equation 11) where $\rho_{t,i,j}(\theta) = \frac{\pi_\theta(a_{t,i,j} \mid s_{t,i,j})}{\pi_{\theta_{\text{old}}}(a_{t,i,j} \mid s_{t,i,j})}$ is the policy ratio.
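A minimal sketch of Equations 10 and 11 on nested Python lists follows. It evaluates the objective only; a real implementation would compute the ratios from log-probabilities in an autodiff framework. The function names, the clipping value `eps=0.2`, and the toy ratios and rewards are assumptions.

```python
def dense_reward(v_student, v_target, w_t):
    """Equation 10: negative weighted squared distance (student field treated as detached)."""
    return -w_t * sum((a - b) ** 2 for a, b in zip(v_student, v_target))

def clipped_term(ratio, reward, eps=0.2):
    """One summand of Equation 11: min(rho * r, clip(rho, 1 - eps, 1 + eps) * r)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * reward, clipped * reward)

def surrogate(ratios, rewards, eps=0.2):
    """Equation 11 with ratios[j][i][t] and rewards[j][i][t] for prompt j, trajectory i, step t."""
    num_prompts = len(ratios)
    group_size = len(ratios[0])
    total = 0.0
    for prompt_ratios, prompt_rewards in zip(ratios, rewards):
        for traj_ratios, traj_rewards in zip(prompt_ratios, prompt_rewards):
            for rho, r in zip(traj_ratios, traj_rewards):
                total += clipped_term(rho, r, eps)
    return total / (num_prompts * group_size)

# Toy batch: B = 1 prompt, G = 2 trajectories, 2 steps each.
ratios = [[[1.0, 1.5], [0.5, 1.0]]]
rewards = [[[-1.0, -1.0], [-1.0, 2.0]]]
objective = surrogate(ratios, rewards)
reward_example = dense_reward([1.0, 0.0], [0.0, 0.0], 2.0)
```

Taking the minimum of the clipped and unclipped terms keeps the pessimistic estimate, so a large policy ratio cannot inflate the objective.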

4. Manifold Anchor Regularization (MAR): To prevent aesthetic degradation, a frozen aesthetic teacher (e.g., optimized via DeQA) provides a regularizing vector field $v_{\text{aesthetic}}$. The total loss combines the policy loss $L_{\text{Policy}}(\theta)$ (the negative of $J(\theta)$) and a dense KL penalty:

$$L_{\text{Total}}(\theta) = L_{\text{Policy}}(\theta) + \lambda \, E_{c, t, x_t \sim \rho_t^\theta} \left[ w(t) \, \| v_\theta(x_t, t, c) - v_{\text{aesthetic}}(x_t, t, c) \|^2 \right]$$

(Equation 12)
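Equation 12 can be sketched as a Monte Carlo average over sampled states. The sketch assumes the expectation is estimated from a list of `(v_student, v_aesthetic, w_t)` triples collected along student trajectories; the function names, the value of `lam`, and the toy samples are hypothetical.

```python
def mar_penalty(samples):
    """Average of w(t) * ||v_student - v_aesthetic||^2 over sampled states."""
    total = 0.0
    for v_student, v_aesthetic, w_t in samples:
        total += w_t * sum((a - b) ** 2 for a, b in zip(v_student, v_aesthetic))
    return total / len(samples)

def total_loss(policy_loss, samples, lam):
    """Equation 12: policy loss plus lambda times the anchor penalty."""
    return policy_loss + lam * mar_penalty(samples)

# Two hypothetical states: one where the student drifts from the aesthetic teacher, one where it agrees.
samples = [
    ([1.0, 1.0], [0.0, 1.0], 2.0),
    ([0.5, 0.5], [0.5, 0.5], 1.0),
]
loss = total_loss(policy_loss=0.5, samples=samples, lam=0.1)
```

The penalty vanishes wherever the student already agrees with the aesthetic teacher, so it only pulls back on states where task-specific distillation has moved the student away from it.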

Empirical Validation / Results

Experiments are conducted on Stable Diffusion 3.5 Medium across four tasks: GenEval (composition), OCR (text rendering), PickScore (human preference), and DeQA (image quality).

Baselines: monolithic-reward GRPO (GRPO-[reward]) and hybrid-reward GRPO (GRPO-Mix) with a fixed reward ratio.

Quantitative Results

Table 2: Model Performance Comparison

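| Model | GenEval | OCR Acc. | DeQA | PickScore | Avg |
| --- | --- | --- | --- | --- | --- |
| SD-3.5-M | 0.63 | 0.59 | 4.07 | 21.64 | 0.7166 |
| +GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.8050 |
| +GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.8016 |
| +GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.7578 |
| +GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.7340 |
| GRPO-Mix | 0.73 | 0.83 | 4.33 | 21.84 | 0.8165 |
| SFT+GRPO-Mix | 0.85 | 0.86 | 4.29 | 21.79 | 0.7166 |
| Merge+GRPO-Mix | 0.84 | 0.86 | 4.18 | 21.87 | 0.7166 |
| Ours (SFT) | 0.91 | 0.92 | 4.29 | 21.83 | 0.8819 |
| Ours (Merge) | 0.92 | 0.94 | 4.35 | 23.08 | 0.9044 |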

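Flow-OPD (Merge) achieves the best average score (0.9044), outperforming all baselines. It surpasses the specialized teachers on OCR (0.94 vs. 0.92) and DeQA (4.35 vs. 4.23) and comes close to them on GenEval (0.92 vs. 0.94) and PickScore (23.08 vs. 23.19).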

Key Findings:

  1. Flow-OPD outperforms vanilla GRPO: Achieves ~10-point improvement on average normalized metrics.
  2. Resolves gradient interference: Avoids the catastrophic forgetting seen in multi-reward GRPO (Table 1).
  3. Cold-start is crucial: Both SFT and merging initialization provide a strong foundation, with merging yielding slightly better results (Fig. 4).
  4. Teacher-surpassing effect: The student model synthesizes knowledge from multiple teachers, leading to emergent superiority in some cases.
  5. Robust OOD generalization: Flow-OPD shows superior performance on the T2I-CompBench benchmark (Table 3).
  6. MAR preserves aesthetics: Quantitative (Table 4) and qualitative (Fig. 5) results show MAR effectively prevents background mode collapse and semantic redundancy, maintaining high visual quality.

Table 4: Performance on General Image Quality and Alignment Metrics

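| Model | ImageReward | Aesthetic | UnifiedReward | HPS-v2.1 | QwenVL Score |
| --- | --- | --- | --- | --- | --- |
| SD-3.5-M | 1.02 | 5.87 | 3.339 | 0.2982 | 3.45 |
| GRPO-DeQA | 1.33 | 5.97 | 3.456 | 0.2846 | 3.68 |
| GRPO-Mix | 1.23 | 5.93 | 3.501 | 0.3101 | 3.88 |
| w/o MAR | 1.26 | 5.89 | 3.518 | 0.2998 | 3.82 |
| Ours (Merge) | 1.36 | 6.23 | 3.659 | 0.3302 | 4.05 |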

Theoretical and Practical Implications

  • Theoretical: Provides a formal framework for translating OPD (a policy-gradient method for discrete sequences) to the continuous-time FM domain by deriving a dense KL reward from vector-field discrepancies. It analytically addresses the problems of reward sparsity and gradient interference $\left( \langle \nabla_\theta J_k, \nabla_\theta J_1 \rangle < 0 \right)$ in multi-task learning.
  • Practical: Offers a scalable alignment paradigm for building generalist text-to-image models. By decoupling expertise acquisition (teacher training) from model unification (distillation), it enables the consolidation of diverse capabilities without degradation. The MAR component ensures that functional alignment does not compromise aesthetic quality, making the framework suitable for real-world applications demanding both precision and visual appeal.
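The negative inner-product condition on task gradients can be illustrated with a toy example. The two gradient vectors below are invented for illustration and do not come from the paper. The sketch shows that when two task gradients conflict, a step along their sum can improve one objective while degrading the other, which is the "seesaw effect" in first-order terms.

```python
def dot(a, b):
    """Inner product of two gradient vectors."""
    return sum(x * y for x, y in zip(a, b))

def joint_update_effect(g_a, g_b, lr):
    """First-order change in each objective after one ascent step along g_a + g_b."""
    g_sum = [x + y for x, y in zip(g_a, g_b)]
    return lr * dot(g_a, g_sum), lr * dot(g_b, g_sum)

g_ocr = [1.0, 0.0]         # hypothetical gradient of a text-rendering objective
g_aesthetic = [-2.0, 0.5]  # hypothetical gradient of an aesthetic objective

conflict = dot(g_ocr, g_aesthetic)  # negative: the two objectives interfere
delta_ocr, delta_aesthetic = joint_update_effect(g_ocr, g_aesthetic, lr=0.1)
```

Here the joint step raises the aesthetic objective and lowers the OCR objective, even though each gradient alone would improve its own task.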

Conclusion

Flow-OPD successfully integrates on-policy distillation into Flow Matching models, overcoming the fundamental limitations of sparse-reward multi-task alignment. By replacing scalar rewards with dense, trajectory-level supervision from multiple teachers and anchoring generation to a high-quality manifold via MAR, it breaks the "seesaw effect" and achieves superior performance across composition, text rendering, and aesthetics. The framework demonstrates scalability, robust generalization, and an emergent teacher-surpassing effect, establishing a new paradigm for developing high-capability, generalist text-to-image models. Future work may explore extending this approach to other generative domains and more complex teacher ensembles.
