# DanceOPD: On-Policy Generative Field Distillation

> DanceOPD composes multiple generative capabilities into one flow-matching model via on-policy field distillation, improving edit composition by 8-16%.

- **Source:** [arXiv](https://arxiv.org/abs/2606.27377)
- **Published:** 2026-06-27
- **Permalink:** https://picx.dev/p/gq84gZ
- **Whiteboard:** https://picx.dev/p/gq84gZ/image

## Summary (Overview)

- **Novel framework**: DanceOPD is an on-policy generative field distillation framework for flow-matching models that composes multiple heterogeneous generative capabilities (text-to-image, local editing, global editing, realism, classifier-free guidance) into a single student model.
- **Three key design choices**: Hard-routed sample-wise field matching (preserving semantic identity of each capability), on-policy field querying on student-visited states (resolving state-distribution mismatch), and a single semantic-side low-noise query per sample (avoiding trajectory-query correlation).
- **Simple objective**: Plain velocity MSE loss on the routed teacher field queried at student-rolled states.
- **Strong empirical results**: Improves the GEditBench average over the best competing baseline by 8.1% (T2I+Edit composition) and 16.1% (Local+Global Edit composition); closes 85.3% of the student-to-teacher realism reward gap; successfully absorbs classifier-free guidance into the student.
- **Ablations validate design**: Hard routing outperforms soft teacher mixing by 15.2% (MSE), semantic-side low-noise queries improve over median/high-noise by 23.7%/19.5%, and single-query outperforms dense trajectory queries by up to 22.8%.

## Introduction and Theoretical Foundation

Modern image generation increasingly demands a single model that unifies diverse capabilities: text-to-image (T2I), local editing (preserving source while applying precise changes), and global editing (changing broad appearance like style/color/layout). These capabilities are **naturally incompatible**:
- T2I rewards open-ended visual quality and prompt following.
- Local editing requires preservation with precise changes.
- Global editing requires broad transformation.

Naively optimizing them together leads to **capability interference**. Existing combination paradigms only partially address this:
- **Data mixing/joint training** dilutes capability-specific supervision and suffers from gradient conflict.
- **Parameter-space merging/adapter composition** yields compromise solutions.
- **Inference-time score composition** leaves composition external to the deployed student.

The paper adopts a **field-based perspective**: each frozen capability source defines a **velocity field** $v_m(z_t, t, c)$ over a shared generative state space. Capability composition becomes a **field-query problem** with three coupled choices:
1. Which capability field should supervise a given sample?
2. Where in the state space should the field be queried?
3. How many states from the student rollout should be used for supervision?

These choices address three alignment challenges:
- **Target-field ambiguity**: linearly combining fields within one sample produces a target that does not correspond to any well-defined capability query.
- **State-distribution mismatch**: evaluating fields on off-policy (data/teacher) states leaves the student under-supervised on its own visitation distribution.
- **Trajectory-query correlation**: dense supervision from the same rollout shares noise seed, prompt, and path history, causing correlated gradients.

## Methodology

### 3.1 Preliminary
Given $M \geq 2$ frozen capability sources $\{v_m\}_{m=1}^M$ defined over the same generative state space, each source defines a capability-specific velocity field:
$$
v_m(z_t, t, c), \quad m \in \{1, \dots, M\}
$$
where $z_t$ is a flow state at time $t$, $c$ is conditioning (text prompt, source image, edit instruction, etc.).

### 3.2 Hard-Routed Sample-Wise Field Matching
Each training sample is **routed to exactly one** capability field:
$$
m \sim \pi(m), \quad (x, c) \sim D_m
$$
where $\pi(m)$ is the route probability (uniform over active buckets by default). The routed target field is:
$$
u_m(z, t, c) = v_m(z, t, c)
$$
This preserves the semantic identity of each capability query, avoiding target-field ambiguity from soft multi-teacher mixing.
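
To make the contrast with soft mixing concrete, here is a minimal Python sketch. The two one-dimensional teacher fields (`v_local`, `v_global`), the `TEACHERS` table, and both helper functions are hypothetical stand-ins rather than the paper's models, and sampling $(x, c) \sim D_m$ is omitted for brevity.

```python
import random

# Hypothetical 1-D stand-ins for two frozen capability fields v_m(z, t, c).
def v_local(z, t, c):
    return -1.0 * z   # illustrative "local edit" field

def v_global(z, t, c):
    return 2.0 * z    # illustrative "global edit" field

TEACHERS = {"local": v_local, "global": v_global}

def hard_routed_target(z, t, c, rng):
    """Pick exactly one capability m ~ pi(m) (uniform) and return (m, v_m(z, t, c))."""
    m = rng.choice(sorted(TEACHERS))
    return m, TEACHERS[m](z, t, c)

def soft_mixed_target(z, t, c, weights):
    """Baseline for contrast: a weighted combination of all teacher fields."""
    return sum(w * TEACHERS[m](z, t, c) for m, w in weights.items())

rng = random.Random(0)
m, u = hard_routed_target(1.0, 0.2, None, rng)
mixed = soft_mixed_target(1.0, 0.2, None, {"local": 0.5, "global": 0.5})
```

With these toy fields, the soft-mixed target at `z = 1.0` is `0.5`, a value that neither teacher would produce on its own. The hard-routed target is always the output of exactly one teacher.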

### 3.3 On-Policy Field Querying
The student first generates its own trajectory:
$$
z^\theta_{0:T} = \text{Rollout}(v^\theta; z_T, c), \quad z_T \sim p_T
$$
Then the routed field is queried on a **stop-gradient student state** $\bar{z}_t = \text{sg}(z^\theta_t)$, aligning supervision with the student's own visitation distribution. This addresses state-distribution mismatch.
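
The rollout-then-query pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the Euler integrator, the convention that $t=1$ is pure noise and $t=0$ is clean, and the scalar fields `v_student` and `v_teacher` are all hypothetical. Plain floats carry no computation graph, so `stop_gradient` is only a marker here; in an autodiff framework it would correspond to a detach operation.

```python
import random

def rollout(v_student, z_T, c, num_steps):
    """Euler rollout from t=1 (noise) down to t=0; returns [(t, z_t), ...]."""
    dt = 1.0 / num_steps
    z, states = z_T, [(1.0, z_T)]
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * v_student(z, t, c)   # one step toward lower noise
        states.append((t - dt, z))
    return states

def stop_gradient(z):
    """Marker only: a float has no graph to cut."""
    return float(z)

v_student = lambda z, t, c: 0.5 * z    # hypothetical student field
v_teacher = lambda z, t, c: 1.0 * z    # hypothetical frozen routed field

rng = random.Random(0)
z_T = rng.gauss(0.0, 1.0)
states = rollout(v_student, z_T, c=None, num_steps=4)
t, z_t = states[2]                     # a state the student actually visited
z_bar = stop_gradient(z_t)
u = v_teacher(z_bar, t, None)          # teacher queried on the student's own state
```

The point of the sketch is that `u` is evaluated at `z_bar`, a state produced by the student's own dynamics, rather than at a state drawn from data or from a teacher trajectory.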

### 3.4 Semantic-Side Single Query
Only **one** low-noise (semantic-side) state is queried per sample:
$$
K = 1, \quad s \sim q_{\text{sem}}(s), \quad t = t(s)
$$
where $q_{\text{sem}}$ is biased toward low-noise states (e.g., $\text{Beta}(5,2)$). Low-noise states concentrate capability-specific information (style, aesthetics, edit details), while using a single query avoids the correlated gradient signals that arise when several states from the same trajectory are supervised.
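
A small sketch of the query-time sampler follows. The summary does not specify the map $t(s)$, so the code assumes $t = 1 - s$ with $t=0$ meaning clean and $t=1$ meaning pure noise; the function names and the uniform step grid are likewise illustrative.

```python
import random

def sample_semantic_time(rng, a=5.0, b=2.0):
    """Draw s ~ Beta(a, b) and map it to a noise level t = 1 - s (assumed mapping)."""
    s = rng.betavariate(a, b)
    return s, 1.0 - s

def nearest_step_index(t, num_steps):
    """Index k on a uniform grid t_k = k / num_steps that is closest to t."""
    return min(num_steps, max(0, round(t * num_steps)))

rng = random.Random(0)
draws = [sample_semantic_time(rng) for _ in range(10_000)]
mean_t = sum(t for _, t in draws) / len(draws)       # about 2/7 in expectation
k = nearest_step_index(draws[0][1], num_steps=20)    # K = 1: a single queried state
```

Because $\text{Beta}(5,2)$ has mean $5/7$, the assumed mapping places the average query at $t \approx 0.29$, well inside the low-noise half of the trajectory.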

### 3.5 Objective Design
The default objective is plain **velocity MSE** on the routed, on-policy query:
$$
\mathcal{L}_{\text{DanceOPD}} = \mathbb{E}_{m \sim \pi, (x,c) \sim D_m, z_T \sim p_T, s \sim q_{\text{sem}}} \left[ \| v^\theta(\bar{z}_t, t, c) - v_m(\bar{z}_t, t, c) \|_2^2 \right], \quad t = t(s)
$$
This is the natural regression objective for deterministic velocity fields. Under a local Gaussian transition view, KL-style field matching reduces to a weighted MSE (derived in Appendix Sec. 7.1):
$$
D_{\text{KL}}(p_m \| p^\theta) = \frac{\Delta t^2}{2\sigma_t^2} \| v^\theta(z_t, t, c) - v_m(z_t, t, c) \|_2^2
$$
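
The reduction can be reconstructed in two lines under the assumption (made here for illustration; the paper's appendix may state it differently) that teacher and student induce one-step Gaussian transitions with a shared isotropic covariance $\sigma_t^2 I$ and step size $\Delta t$:
$$
p_m(\cdot \mid z_t) = \mathcal{N}\!\left(z_t + \Delta t \, v_m(z_t, t, c), \ \sigma_t^2 I\right), \qquad p^\theta(\cdot \mid z_t) = \mathcal{N}\!\left(z_t + \Delta t \, v^\theta(z_t, t, c), \ \sigma_t^2 I\right)
$$
The KL divergence between two Gaussians with equal covariance depends only on the distance between their means, so
$$
D_{\text{KL}}(p_m \| p^\theta) = \frac{\left\| \Delta t \, v_m(z_t, t, c) - \Delta t \, v^\theta(z_t, t, c) \right\|_2^2}{2\sigma_t^2} = \frac{\Delta t^2}{2\sigma_t^2} \| v^\theta(z_t, t, c) - v_m(z_t, t, c) \|_2^2
$$
The sign convention of the step does not affect the result, since only the squared difference of the means appears.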

The framework also subsumes **operator-defined fields** like classifier-free guidance:
$$
v_\alpha(z_t, t, c) = v_\emptyset(z_t, t) + \alpha \left[ v_{\text{cond}}(z_t, t, c) - v_\emptyset(z_t, t) \right]
$$
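
As a sketch of how such an operator-defined field could be wrapped as a teacher, the helper below builds $v_\alpha$ from a conditional and an unconditional field. The name `cfg_field` and the scalar toy fields are hypothetical.

```python
def cfg_field(v_cond, v_uncond, alpha):
    """Build v_alpha(z, t, c) = v_uncond + alpha * (v_cond - v_uncond)."""
    def v_alpha(z, t, c):
        base = v_uncond(z, t)
        return base + alpha * (v_cond(z, t, c) - base)
    return v_alpha

v_uncond = lambda z, t: 0.5 * z            # hypothetical unconditional field
v_cond = lambda z, t, c: 0.5 * z + c       # hypothetical conditional field
teacher = cfg_field(v_cond, v_uncond, alpha=4.0)
value = teacher(2.0, 0.3, 1.0)             # 1.0 + 4.0 * (2.0 - 1.0)
```

Setting `alpha=1.0` recovers the conditional field and `alpha=0.0` the unconditional one. A student distilled against `teacher` needs one forward pass at inference, whereas evaluating `teacher` directly needs two.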

### Algorithm 1
```
A. Route One Capability Query
   m ~ π(m), (x, c) ~ D_m
B. Query On The Student Trajectory
   z_T ~ p_T, z^θ_{0:T} ← Rollout(v^θ; z_T, c)
   s ~ q_sem(s), t ← t(s), z̄_t ← sg(z^θ_t)
   u ← v_m(z̄_t, t, c)                      # frozen routed field
C. Match The Local Velocity Field
   L ← ‖v^θ(z̄_t, t, c) - u‖²_2
   θ ← OptStep(θ, ∇_θ L)
```
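
The three stages of Algorithm 1 can be exercised end to end on a toy problem. The sketch below is not the paper's implementation. It assumes scalar states, linear teacher fields $v_m(z) = a_m z$ with made-up slopes, a student with one scalar weight per condition, an Euler rollout from $t=1$ to $t=0$, the mapping $t = 1 - s$, and a hand-derived gradient in place of autodiff.

```python
import random

TEACHER_SLOPE = {"t2i": 0.5, "edit": 1.5}   # hypothetical frozen fields a_m * z

def teacher_field(m, z, t, c):
    return TEACHER_SLOPE[m] * z

def student_field(theta, z, t, c):
    return theta[c] * z                     # one scalar weight per condition c

def rollout(theta, z_T, c, num_steps):
    """Euler rollout under the current student; states[k] sits at t = 1 - k/num_steps."""
    dt, z, states = 1.0 / num_steps, z_T, [z_T]
    for k in range(num_steps):
        z = z - dt * student_field(theta, z, 1.0 - k * dt, c)
        states.append(z)
    return states

def train(num_iters=6000, num_steps=10, lr=0.02, seed=0):
    rng = random.Random(seed)
    theta = {m: 0.0 for m in TEACHER_SLOPE}
    for _ in range(num_iters):
        # A. Route one capability query (the condition is just the bucket label here).
        m = rng.choice(sorted(TEACHER_SLOPE))
        c = m
        # B. Query on the student trajectory.
        states = rollout(theta, rng.gauss(0.0, 1.0), c, num_steps)
        s = rng.betavariate(5.0, 2.0)       # biased toward low noise
        k = round(s * num_steps)
        t = 1.0 - k / num_steps
        z_bar = states[k]                   # plain float: no gradient flows through it
        u = teacher_field(m, z_bar, t, c)
        # C. Match the local velocity field.
        # d/dtheta (theta * z - u)^2 = 2 * (theta * z - u) * z
        residual = student_field(theta, z_bar, t, c) - u
        theta[c] -= lr * 2.0 * residual * z_bar
    return theta

theta = train()
```

In this toy setting the student weights approach the two teacher slopes, one per routed condition, even though every supervised state comes from the student's own rollout.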

## Empirical Validation / Results

### Main Results (Table 2)
DanceOPD is evaluated on two composition settings:

**A. T2I and Edit Composition** (T2I teacher + Edit teacher):
- DanceOPD achieves **GEditBench average 5.347**, improving over best OPD baseline (DiffusionOPD: 4.947) by **8.1%** and over the edit source (4.930) by **8.5%**.
- GenEval overall **0.849**, improving over T2I source (0.832) by **2.0%** and over strongest composition baseline (0.833) by **1.9%**.

**B. Local and Global Edit Composition** (Local Edit teacher + Global Edit teacher):
- DanceOPD achieves **GEditBench average 5.498**, improving over the best competing baseline (Off-Policy Distill.: 4.736) by **16.1%** and over the local edit source (5.095) by **7.9%**.
- GenEval overall **0.848**, above all composition baselines.

| Method | Subj-Add | Subj-Rep | Bg-Chg | Style-Chg | Color-Alt | Subj-Rem | GEditBench Avg | GenEval Overall |
|--------|----------|----------|--------|-----------|-----------|----------|-----|---------|
| **A. T2I and Edit Composition** | | | | | | | | |
| Joint Training | 5.386 | 5.627 | 4.283 | 3.688 | 3.624 | 5.093 | 4.617 | 0.808 |
| Weight Merge | 0.573 | 0.315 | 0.277 | 0.557 | 0.221 | 0.123 | 0.344 | 0.836 |
| Off-Policy Distill. | 4.882 | 5.026 | 4.289 | 4.497 | 4.679 | 3.797 | 4.528 | 0.818 |
| DiffusionOPD | 5.488 | 5.850 | 4.242 | 4.303 | 4.588 | 5.211 | 4.947 | 0.833 |
| Flow-OPD | 6.014 | 5.214 | 4.467 | 3.957 | 4.793 | 4.681 | 4.854 | 0.814 |
| **DanceOPD (Ours)** | **5.681** | **5.857** | **5.173** | **5.218** | **4.840** | **5.310** | **5.347** | **0.849** |
| **B. Local and Global Edit Composition** | | | | | | | | |
| Joint Training | 4.632 | 5.393 | 4.128 | 3.941 | 4.093 | 5.086 | 4.546 | 0.821 |
| Weight Merge | 4.434 | 4.776 | 4.380 | 5.263 | 5.206 | 4.229 | 4.715 | 0.811 |
| Off-Policy Distill. | 5.008 | 4.683 | 4.543 | 4.772 | 5.075 | 4.336 | 4.736 | 0.798 |
| DiffusionOPD | 4.704 | 5.310 | 4.502 | 4.012 | 4.977 | 4.462 | 4.661 | 0.822 |
| Flow-OPD | 4.524 | 4.647 | 4.610 | 5.232 | 5.037 | 4.025 | 4.679 | 0.827 |
| **DanceOPD (Ours)** | **5.178** | **5.549** | **6.153** | **5.944** | **5.812** | **4.348** | **5.498** | **0.848** |
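
The percentages quoted for setting A can be reproduced from the table with simple arithmetic. The snippet below uses only numbers that appear above; the helper name `relative_gain` is illustrative.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

# GEditBench sub-scores for DanceOPD in setting A, copied from the table.
dance_a = [5.681, 5.857, 5.173, 5.218, 4.840, 5.310]
avg_a = sum(dance_a) / len(dance_a)           # unweighted mean of the six categories

gain_vs_opd = relative_gain(5.347, 4.947)     # vs. DiffusionOPD average
gain_vs_edit = relative_gain(5.347, 4.930)    # vs. edit source average
gain_geneval = relative_gain(0.849, 0.832)    # vs. T2I source GenEval overall
```

The average column is consistent with an unweighted mean of the six category scores, and the three gains round to 8.1%, 8.5%, and 2.0%.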

### Realism-Field Absorption
- Improves realism reward over off-policy distillation by **9.9%**.
- Closes **85.3%** of the student-to-teacher reward gap.
- Maintains T2I score within **0.1%** of off-policy distillation, above student anchor by **7.6%**.
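
The summary does not state how the closed gap is computed. One common reading, given here as an assumption rather than the paper's definition, normalizes the student's reward gain by the initial distance to the teacher:
$$
\text{gap closed} = \frac{R_{\text{DanceOPD}} - R_{\text{student}}}{R_{\text{teacher}} - R_{\text{student}}}
$$
where $R_{\text{student}}$ is the realism reward of the student anchor before distillation. Under this reading, 85.3% means the distilled student sits at $R_{\text{student}} + 0.853 \, (R_{\text{teacher}} - R_{\text{student}})$.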

### CFG Absorption
- Best composition improves GEditBench over train-only absorption by **7.6%** and over eval-only CFG by **1.4%**.
- Overcomposition (absorbed + external CFG) reduces score by **31.2%** relative to best.

### Ablation Studies
Key findings (at 2k steps unless noted):
- **Hard routing vs. soft mixing**: Hard routing improves average by **15.2%** (MSE) and **10.6%** (KL).
- **Semantic-side (low-$t$) queries**: Improve over median-$t$ by **23.7%** and high-$t$ by **19.5%**.
- **Number of trajectory queries**: Single query ($K=1$) outperforms weighted dense variants ($K=2,4,8,16$) by **16.6%, 7.9%, 10.2%, 12.2%** respectively.
- **Objective**: Plain MSE improves over KL-$\bar{\sigma}^2$ by **4.5%** and over DMD2 variants by **15.6–21.1%**.
- **Initialization**: Initializing the student from the local edit model improves over merged-weight initialization by **37.2%**, over global edit initialization by **112.8%**, and over T2I initialization by **204.4%**.

## Theoretical and Practical Implications

- **Theoretical**: The paper formalizes multi-capability composition as a field-query problem with three alignment challenges. It provides a KL-MSE equivalence for velocity-field distillation (Appendix Sec. 7.1), an on-policy mismatch bound (Sec. 7.2), a smoothness bound for capability fields on student-rolled states (Sec. 7.3), and analysis of target-field bias (Sec. 7.4) and dense-query correlation (Sec. 7.8–7.9).

- **Practical**: DanceOPD provides a scalable, computationally efficient approach for post-training unification of generative capabilities. Key practical advantages:
  - **Computational cost**: Single low-noise query per sample; per-step cost is $N C_{\text{roll}} + 1 \cdot C_{\text{grad}}$, cheaper than dense-query OPD variants.
  - **No inference overhead**: Capability fields are absorbed into the student, eliminating need for external score composition.
  - **CFG absorption**: Allows internalizing guidance into a single forward pass, reducing inference cost.
  - **Realism absorption**: Can improve visual quality without sacrificing base generation capability.

- **Limitations**: Assumes frozen capability sources with compatible velocity fields (same backbone, latent space, scheduler). Predefined routing may not handle ambiguous task boundaries; extension with verifier/reward-based routing is suggested.
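
The per-step cost expression listed under the practical advantages can be illustrated with a toy cost model. The summary does not define $N$; the sketch assumes it is the number of rollout steps and generalizes the single-query term to $K$ queries. The unit costs are made-up values chosen only to show the scaling.

```python
def per_step_cost(num_rollout_steps, num_queries, c_roll=1.0, c_grad=3.0):
    """N * C_roll + K * C_grad, in arbitrary units (unit costs are hypothetical)."""
    return num_rollout_steps * c_roll + num_queries * c_grad

single = per_step_cost(20, 1)    # K = 1, the DanceOPD default
dense = per_step_cost(20, 16)    # K = 16, a dense-query variant
```

The rollout term is shared by both variants, so the saving comes entirely from the gradient-bearing queries, and it grows with the ratio of `c_grad` to `c_roll`.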

## Conclusion

DanceOPD is an on-policy generative field distillation framework for composing heterogeneous generative capabilities (T2I, local/global editing, realism, CFG) into a single flow-matching student model. The key contributions are:
1. Treating each frozen capability as a **velocity field** over a shared state space.
2. **Hard-routed sample-wise field matching** to preserve semantic identity.
3. **On-policy querying** on student-visited states to resolve distribution mismatch.
4. **Single semantic-side low-noise query** to avoid correlated trajectory supervision.
5. **Plain velocity MSE** as the natural matching objective.

Experiments demonstrate consistent improvements over joint training, weight merging, off-policy distillation, and prior OPD methods across four settings (T2I+Edit, Local+Global Edit, realism absorption, CFG absorption). Ablations validate each design choice.

**Future directions**: Extending to ambiguous task boundaries with verifier-based routing; supporting diverse backbones beyond shared architectures; scaling to more capability fields.

---

_Markdown view of https://picx.dev/p/gq84gZ, served by PicX — AI-generated visual whiteboard summaries of research papers._
