Flow-OPD: On-Policy Distillation for Flow Matching Models
Summary (Overview)
- Key Problem: Flow Matching (FM) text-to-image models suffer from reward sparsity and gradient interference when aligned with multiple heterogeneous objectives (e.g., text rendering, aesthetics), leading to a "seesaw effect" where metrics compete and reward hacking occurs.
- Core Solution: Introduces Flow-OPD, the first framework to integrate On-Policy Distillation (OPD) into FM models. It uses a two-stage alignment strategy: 1) training domain-specialized teacher models via single-reward GRPO, and 2) consolidating their expertise into a single student via on-policy sampling, task-routing labeling, and dense trajectory-level supervision.
- Novel Regularization: Proposes Manifold Anchor Regularization (MAR) to mitigate aesthetic degradation by anchoring the student's generation to a high-quality manifold using a task-agnostic teacher.
- Key Results: On Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94, achieving an overall improvement of roughly 10 points over vanilla GRPO while preserving image fidelity and human preference.
- Emergent Effect: The unified student model exhibits a "teacher-surpassing" effect, matching or exceeding the performance of specialized teachers and demonstrating robust out-of-distribution generalization.
Introduction and Theoretical Foundation
Flow Matching (FM) has emerged as a superior paradigm for generative modeling, offering efficiency and high-fidelity synthesis. However, aligning FM models with multiple, heterogeneous tasks (text rendering, compositional reasoning, aesthetics) presents significant challenges. Existing methods like Group Relative Policy Optimization (GRPO) ported to FM show promise in single-reward scenarios but fail in multi-task settings due to:
- Reward Sparsity: Scalar-valued rewards lack granularity to harmonize conflicting objectives.
- Gradient Interference: Joint optimization of heterogeneous objectives leads to conflicting gradients, causing a "seesaw effect" and catastrophic forgetting.
The paper draws inspiration from the success of On-Policy Distillation (OPD) in Large Language Models (LLMs), where specialized expert knowledge is distilled into a unified student. The core question addressed is: Can FM models similarly leverage OPD to integrate diverse teacher strengths?
Preliminaries:
- Flow Matching (FM) maps a noise distribution to data by integrating an ODE driven by a learned velocity field. Under the Optimal Transport (OT) formulation, the path is a straight-line interpolation between a noise sample and a data sample, and the model regresses the constant velocity along that path (Equation 1).
- On-Policy Distillation (OPD) for autoregressive models minimizes the reverse KL divergence between the student and teacher distributions over student-generated trajectories (Equation 2).
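For reference, both objectives have well-known standard forms. The block below is a sketch in commonly used notation, which is an assumption here and may differ from the paper's Equations 1 and 2: $x_0$ is a noise sample, $x_1$ a data sample, $v_\theta$ the learned velocity field, $c$ the condition, $\pi_\theta$ the student, and $\pi_T$ the teacher.

```latex
% Assumed convention: x_0 ~ noise, x_1 ~ data, t in [0, 1]
\begin{align}
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\theta(x_t, t), \qquad x_t = (1 - t)\,x_0 + t\,x_1, \\
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|_2^2, \\
\mathcal{L}_{\mathrm{OPD}}(\theta) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid c)}
  \sum_{n} D_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid c, y_{<n}) \,\|\, \pi_{T}(\cdot \mid c, y_{<n}) \bigr).
\end{align}
```

The expectation in the last line is taken over sequences sampled from the student itself, which is what makes the distillation on-policy.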
Methodology
Flow-OPD is a two-stage framework: 1) Cultivate specialized teachers, and 2) Distill expertise into a unified student.
Stage 1: Teacher Training
Domain-expert teachers are trained using single-reward GRPO fine-tuning, allowing each to reach its performance ceiling in isolation (e.g., a teacher for OCR, another for PickScore).
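The core of GRPO is a group-relative advantage: several samples are drawn per prompt, each is scored by the single reward, and the scores are standardized within the group. A minimal NumPy sketch of that step follows; the function name, the epsilon, and the reward values are illustrative assumptions, not details from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within each prompt's group of samples.

    rewards: array of shape (num_prompts, group_size).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Hypothetical OCR-accuracy rewards: 2 prompts, 4 samples each.
ocr_rewards = np.array([[0.2, 0.4, 0.6, 0.8],
                        [1.0, 1.0, 1.0, 1.0]])
advantages = group_relative_advantages(ocr_rewards)
print(advantages.round(3))
```

In the second group every sample receives the same reward, so all advantages are zero and that prompt contributes no learning signal.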
Stage 2: Student Training via Multi-Teacher On-Policy Distillation
This stage involves four key components:
1. Cold Start: To establish a robust initial policy for the student, two strategies are explored:
- SFT-based initialization: Uses trajectories sampled from specialized teachers.
- Model-merging initialization: Superposes the parameters of divergent teachers into a unified state.
2. On-Policy Sampling & Task-Routing Labeling: To facilitate exploration, the deterministic ODE is converted to a Stochastic Differential Equation (SDE) (Equation 5).
This yields a stochastic behavioral policy (Equation 6). For each explored state, a hard routing mechanism maps the textual condition to a single domain expert, which provides the reference velocity field. The target flow is given in Equation 7.
3. Dense KL Reward & Policy Update: The KL divergence between the student and target transition policies (both Gaussian with shared covariance) reduces to an L2 distance between their means, and hence between their vector fields (Equation 9).
The immediate dense reward for each trajectory is defined using the detached student vector field (Equation 10).
A clipped policy gradient update (PPO-style) is applied using this dense reward. For a batch of prompts, each generating a group of trajectories, the surrogate objective (Equation 11) is written in terms of the policy ratio.
4. Manifold Anchor Regularization (MAR): To prevent aesthetic degradation, a frozen aesthetic teacher (e.g., optimized via DeQA) provides a regularizing vector field. The total loss combines the policy loss (the negative of the surrogate objective) and a dense KL penalty (Equation 12).
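The routing, dense reward, clipped update, and MAR penalty described above can be sketched on toy velocity fields. This is an illustration under stated assumptions, not the paper's implementation: the linear toy fields, the task labels, `beta`, `clip_eps`, dropping the constant factors of the KL, and using the dense reward directly as the advantage are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def make_field(scale):
    """Toy stand-in for a velocity network: v(x, t) = scale * x * (1 - t)."""
    return lambda x, t: scale * x * (1.0 - t)

# Hypothetical frozen teachers (one per task) and a frozen aesthetic anchor.
teachers = {"ocr": make_field(1.0), "geneval": make_field(-0.5)}
anchor = make_field(0.2)
student = make_field(0.4)

def route(task):
    """Hard routing: a prompt's task label selects exactly one teacher."""
    return teachers[task]

def dense_reward(v_student, v_target):
    """Negative squared L2 gap between velocities, one value per sample.

    For two Gaussians with the same covariance sigma^2 * I, the KL divergence
    is ||mu_a - mu_b||^2 / (2 * sigma^2), so it is proportional to this gap
    (constant factors are dropped in this sketch).
    """
    return -np.sum((v_student - v_target) ** 2, axis=-1)

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective, averaged over samples."""
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

def total_loss(x, t, task, ratio, beta=0.1):
    v_s = student(x, t)  # treated as detached when building the reward
    reward = dense_reward(v_s, route(task)(x, t))
    advantage = reward   # assumption: reward used directly as the advantage
    policy_loss = -clipped_surrogate(ratio, advantage)
    mar_penalty = np.mean(np.sum((v_s - anchor(x, t)) ** 2, axis=-1))
    return policy_loss + beta * mar_penalty, reward

x = rng.normal(size=(4, DIM))  # four explored states for one prompt
ratio = np.ones(4)             # on-policy: current and sampling policy coincide
loss, reward = total_loss(x, t=0.5, task="ocr", ratio=ratio)
print(loss, reward)
```

Because the reward is defined at every explored state rather than once per finished image, each step of the trajectory receives its own supervision signal.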
Empirical Validation / Results
Experiments are conducted on Stable Diffusion 3.5 Medium across four tasks: GenEval (composition), OCR (text rendering), PickScore (human preference), and DeQA (image quality).
Baselines: monolithic-reward GRPO (GRPO-[reward]) and hybrid-reward GRPO (GRPO-Mix) with a fixed reward ratio.
Quantitative Results
Table 2: Model Performance Comparison
| Model | GenEval | OCR Acc. | DeQA | PickScore | Avg |
|---|---|---|---|---|---|
| SD-3.5-M | 0.63 | 0.59 | 4.07 | 21.64 | 0.7166 |
| +GRPO-GenEval | 0.94 | 0.65 | 4.01 | 21.53 | 0.8050 |
| +GRPO-OCR | 0.64 | 0.92 | 4.06 | 21.69 | 0.8016 |
| +GRPO-DeQA | 0.64 | 0.66 | 4.23 | 23.02 | 0.7578 |
| +GRPO-PickScore | 0.51 | 0.69 | 4.22 | 23.19 | 0.7340 |
| GRPO-Mix | 0.73 | 0.83 | 4.33 | 21.84 | 0.8165 |
| SFT+GRPO-Mix | 0.85 | 0.86 | 4.29 | 21.79 | 0.8515 |
| Merge+GRPO-Mix | 0.84 | 0.86 | 4.18 | 21.87 | 0.8443 |
| Ours (SFT) | 0.91 | 0.92 | 4.29 | 21.83 | 0.8819 |
| Ours (Merge) | 0.92 | 0.94 | 4.35 | 23.08 | 0.9044 |
Flow-OPD (Merge) achieves the best average score (0.9044), outperforming all baselines. It surpasses the specialized teachers on OCR and DeQA and stays close to them on GenEval and PickScore.
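The Avg column is consistent with an unweighted mean of the four metrics after dividing DeQA by 5 and PickScore by 26. These normalizers are inferred from the reported numbers and are not stated in the source, so the check below is a sketch under that assumption.

```python
# Rows copied from Table 2: (GenEval, OCR Acc., DeQA, PickScore, reported Avg)
rows = {
    "SD-3.5-M": (0.63, 0.59, 4.07, 21.64, 0.7166),
    "GRPO-Mix": (0.73, 0.83, 4.33, 21.84, 0.8165),
    "Ours (SFT)": (0.91, 0.92, 4.29, 21.83, 0.8819),
    "Ours (Merge)": (0.92, 0.94, 4.35, 23.08, 0.9044),
}

DEQA_SCALE = 5.0        # inferred normalizer, not stated in the source
PICKSCORE_SCALE = 26.0  # inferred normalizer, not stated in the source

def normalized_avg(geneval, ocr, deqa, pickscore):
    """Unweighted mean of the four metrics after rescaling DeQA and PickScore."""
    return (geneval + ocr + deqa / DEQA_SCALE + pickscore / PICKSCORE_SCALE) / 4

for name, (g, o, d, p, reported) in rows.items():
    computed = normalized_avg(g, o, d, p)
    print(f"{name:14s} computed={computed:.4f} reported={reported:.4f}")
```

Under this reading, the gap between Ours (Merge) and GRPO-Mix is 0.9044 - 0.8165, or about 0.088 on the normalized average.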
Key Findings:
- Flow-OPD outperforms vanilla GRPO: Achieves ~10-point improvement on average normalized metrics.
- Resolves gradient interference: Avoids the catastrophic forgetting seen in multi-reward GRPO (Table 1).
- Cold-start is crucial: Both SFT and merging initialization provide a strong foundation, with merging yielding slightly better results (Fig. 4).
- Teacher-surpassing effect: The student model synthesizes knowledge from multiple teachers, leading to emergent superiority in some cases.
- Robust OOD generalization: Flow-OPD shows superior performance on the T2I-CompBench benchmark (Table 3).
- MAR preserves aesthetics: Quantitative (Table 4) and qualitative (Fig. 5) results show MAR effectively prevents background mode collapse and semantic redundancy, maintaining high visual quality.
Table 4: Performance on General Image Quality and Alignment Metrics
| Model | ImageReward | Aesthetic | UnifiedReward | HPS-v2.1 | QwenVL Score |
|---|---|---|---|---|---|
| SD-3.5-M | 1.02 | 5.87 | 3.339 | 0.2982 | 3.45 |
| GRPO-DeQA | 1.33 | 5.97 | 3.456 | 0.2846 | 3.68 |
| GRPO-Mix | 1.23 | 5.93 | 3.501 | 0.3101 | 3.88 |
| w/o MAR | 1.26 | 5.89 | 3.518 | 0.2998 | 3.82 |
| Ours (Merge) | 1.36 | 6.23 | 3.659 | 0.3302 | 4.05 |
Theoretical and Practical Implications
- Theoretical: Provides a formal framework for translating OPD (a policy-gradient method for discrete sequences) to the continuous-time FM domain by deriving a dense KL reward from vector field discrepancies. It analytically addresses the problems of reward sparsity and gradient interference in multi-task learning.
- Practical: Offers a scalable alignment paradigm for building generalist text-to-image models. By decoupling expertise acquisition (teacher training) from model unification (distillation), it enables the consolidation of diverse capabilities without degradation. The MAR component ensures that functional alignment does not compromise aesthetic quality, making the framework suitable for real-world applications demanding both precision and visual appeal.
Conclusion
Flow-OPD successfully integrates on-policy distillation into Flow Matching models, overcoming the fundamental limitations of sparse-reward multi-task alignment. By replacing scalar rewards with dense, trajectory-level supervision from multiple teachers and anchoring generation to a high-quality manifold via MAR, it breaks the "seesaw effect" and achieves superior performance across composition, text rendering, and aesthetics. The framework demonstrates scalability, robust generalization, and an emergent teacher-surpassing effect, establishing a new paradigm for developing high-capability, generalist text-to-image models. Future work may explore extending this approach to other generative domains and more complex teacher ensembles.