AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation - Summary

Summary (Overview)

  • Key Contribution: Introduces AnyFlow, the first any-step video diffusion distillation framework based on a two-time flow map formulation. It enables a single model to support arbitrary inference budgets, trading latency for quality at test time without retraining.
  • Core Problem Solved: Addresses a limitation of consistency-distilled models, whose performance often degrades as more sampling steps are used. Consistency sampling repeatedly re-noises intermediate states, so its trajectory departs from the original Probability-Flow ODE (PF-ODE) path, which weakens test-time scaling.
  • Novel Methodology: Proposes Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions (z_t → z_r). This enables efficient on-policy distillation to mitigate test-time errors (discretization error and exposure bias).
  • Empirical Validation: Demonstrates strong performance across bidirectional and causal video diffusion architectures, from 1.3B to 14B parameters. AnyFlow matches or surpasses consistency-based methods in few-step regimes and continues to improve with more sampling steps.
  • Practical Advantage: The distilled model preserves the pretrained instantaneous flow field, enabling continued training on downstream datasets for domain adaptation, a capability challenging for consistency-based distilled models.

Introduction and Theoretical Foundation

The goal is to develop video diffusion models that support flexible generation, allowing users to trade latency (few steps) for higher quality (more steps) at inference time. Existing few-step methods, predominantly based on Consistency Models (CMs), learn a fixed-point mapping from a noisy state z_t to the clean data z_0. While effective for few steps, their performance degrades with more steps because their sampling trajectory (which involves repeated re-noising of intermediate states) drifts away from the target PF-ODE path.

AnyFlow addresses this by shifting to a flow map formulation. Instead of learning only z_t → z_0, it learns transitions between arbitrary time pairs z_t → z_r. This formulation generalizes consistency modeling (when r = 0) and standard flow matching (when t = r), naturally supporting variable step sizes and inference budgets.

The theoretical foundation is the Probability-Flow ODE (PF-ODE):

$$\frac{d\mathbf{z}_t}{dt} = \mathbf{v}(\mathbf{z}_t, t) \tag{1}$$

The exact flow map Φ_{r←t} is defined as the operator that transports states from time t to time r: Φ_{r←t}(z_t) = z_r for 1 ≥ t ≥ r ≥ 0. A neural flow map model learns an approximation:

$$\mathbf{f}_\theta(\mathbf{z}_t, t, r) \approx \mathbf{z}_r, \quad 1 \geq t > r \geq 0 \tag{2}$$

with the boundary condition f_θ(z_t, t, t) = z_t.
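
To make the any-step idea concrete, the following is a minimal sketch of how a flow map can be rolled out with different step budgets. It is not the paper's sampler: the toy ODE dz/dt = a·z, the uniform time grid, and the names `exact_flow_map` and `any_step_sample` are assumptions chosen because this ODE has a closed-form flow map.

```python
import math

def exact_flow_map(z, t, r, a=1.5):
    """Exact flow map of the toy ODE dz/dt = a*z (an illustrative stand-in
    for a learned f_theta): transports a state from time t to time r."""
    return z * math.exp(a * (r - t))

def any_step_sample(flow_map, z_start, num_steps, t_start=1.0, t_end=0.0):
    """Roll out from t_start to t_end using num_steps flow-map transitions
    on a uniform time grid (the grid choice is an assumption)."""
    times = [t_start + (t_end - t_start) * i / num_steps
             for i in range(num_steps + 1)]
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map(z, t, r)  # one shortcut transition z_t -> z_r
    return z

# The same map serves 1-step, 4-step and 32-step budgets.
samples = {n: any_step_sample(exact_flow_map, 2.0, n) for n in (1, 4, 32)}
```

With an exact flow map every budget lands on the same endpoint. A learned f_θ is only approximate, so in practice the budgets differ in quality, which is the latency-quality trade-off described above.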

Methodology

The AnyFlow pipeline consists of two complementary stages: Forward Flow Map Training and On-Policy Flow Map Distillation.

1. Forward Flow Map Training

This stage converts a pretrained video diffusion model into a flow map model using an improved version of the MeanFlow objective. Key design modifications include:

  • Interpolated Timestep Conditioning: Uses g · emb(t) + (1-g) · emb'(r) (with g=0.25) instead of zero-initialized conditioning to prevent embedding norm explosion and over-saturated generation.
  • Guidance-Fused Training: Incorporates Classifier-Free Guidance (CFG) into the prediction to align with the pretrained model's guidance scale, allowing CFG to be omitted at inference:
    $$\mathbf{u} = \frac{1}{g}\left(\mathbf{u}_c - (1-g)\,\text{sg}(\mathbf{u}_\varnothing)\right) \tag{6}$$
  • Differential Derivation Equation: Uses a finite-difference approximation to compute the Jacobian-vector product term, compatible with FSDP training:
    $$\frac{d}{dt} \mathbf{u}(\mathbf{z}_t, r, t) \approx \frac{\mathbf{u}(\mathbf{z}_{t+\Delta t}, r, t+\Delta t) - \mathbf{u}(\mathbf{z}_{t-\Delta t}, r, t-\Delta t)}{2\Delta t} \tag{4}$$
  • Adaptive Loss Reweighting: Dynamically scales the loss for timesteps t ≠ r using the well-optimized loss at the boundary t = r as a baseline.
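
The finite-difference rule in Eq. (4) can be illustrated with a small NumPy sketch. Two things here are assumptions rather than details from the source: the neighbouring states are formed as z_{t±Δt} ≈ z_t ± Δt·v (a first-order step along the velocity), and `toy_u` is a hand-picked function whose total derivative is known in closed form so the estimate can be checked.

```python
import numpy as np

def total_derivative_fd(u, z_t, v_t, r, t, dt=1e-3):
    """Central finite-difference estimate of d/dt u(z_t, r, t) along the
    trajectory. Assumes z_{t +/- dt} = z_t +/- dt * v_t; needs only two
    forward evaluations of u and no Jacobian-vector product."""
    z_plus = z_t + dt * v_t
    z_minus = z_t - dt * v_t
    return (u(z_plus, r, t + dt) - u(z_minus, r, t - dt)) / (2.0 * dt)

def toy_u(z, r, t):
    """Hypothetical stand-in for u_theta with a known total derivative."""
    return z * (t - r)

z_t = np.array([0.5, -1.0, 2.0])
v_t = np.array([1.0, 0.25, -0.5])
r, t = 0.2, 0.7

estimate = total_derivative_fd(toy_u, z_t, v_t, r, t)
# Closed form for toy_u: d/dt [z_t * (t - r)] = v_t * (t - r) + z_t
exact = v_t * (t - r) + z_t
```

Because the estimate uses only ordinary forward passes, it avoids the dedicated JVP machinery that the bullet above describes as the compatibility concern.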

The training objective is:

$$\mathcal{L}(\theta) = \mathbb{E}\left[ \left\| \mathbf{u}_\theta(\mathbf{z}_t, r, t) - \text{sg}(\mathbf{u}_{\text{tgt}}) \right\|_2^2 \right] \tag{3}$$

where u_tgt = v(z_t, t) - (t - r) (d u_θ(z_t, r, t) / dt) and sg(·) denotes the stop-gradient operator.
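
For readers unfamiliar with MeanFlow, the form of u_tgt can be motivated by a short derivation. This sketch assumes, as in the MeanFlow formulation, that u(z_t, r, t) is the average velocity over the interval [r, t]. The last line, relating u to the flow map f_θ, is likewise the standard MeanFlow convention and an assumption about how the two are linked here.

```latex
\begin{aligned}
(t - r)\,\mathbf{u}(\mathbf{z}_t, r, t)
  &= \int_r^t \mathbf{v}(\mathbf{z}_\tau, \tau)\, d\tau
  && \text{(definition of average velocity)} \\
\mathbf{u}(\mathbf{z}_t, r, t) + (t - r)\,\frac{d}{dt}\mathbf{u}(\mathbf{z}_t, r, t)
  &= \mathbf{v}(\mathbf{z}_t, t)
  && \text{(differentiate both sides in } t\text{)} \\
\mathbf{u}(\mathbf{z}_t, r, t)
  &= \mathbf{v}(\mathbf{z}_t, t) - (t - r)\,\frac{d}{dt}\mathbf{u}(\mathbf{z}_t, r, t)
  && \text{(rearrange)} \\
\mathbf{f}_\theta(\mathbf{z}_t, t, r)
  &= \mathbf{z}_t - (t - r)\,\mathbf{u}_\theta(\mathbf{z}_t, r, t)
  && \text{(assumed link to the flow map)}
\end{aligned}
```

At t = r the correction term vanishes and the target reduces to v(z_t, t), which is the flow matching boundary case used as the baseline in Adaptive Loss Reweighting.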

2. On-Policy Flow Map Distillation

To mitigate remaining test-time errors (discretization error, exposure bias), this stage performs Distribution Matching Distillation (DMD) on the student's own rollouts.

The core innovation is Flow Map Backward Simulation. Unlike consistency backward simulation, which must simulate every intermediate step, this method exploits the composition property of flow maps:

$$\mathbf{f}_\theta(\mathbf{z}_t, t, q) \approx \mathbf{f}_\theta\left( \mathbf{f}_\theta(\mathbf{z}_t, t, r), r, q \right), \quad t > r > q \tag{8}$$

For a target N-step budget, it decomposes a rollout T → 0 into three shortcut segments: T → t, t → r, and r → 0, where t - r = T/N. This allows efficient simulation of different inference budgets with fixed computation cost. The DMD gradient is:

$$\nabla_\theta \mathcal{L}_{\text{DMD}} = -\mathbb{E}_{t,\mathbf{z}} \left[ \left( s_{\text{real}}(\mathbf{z}_t, t) - s_{\text{fake}}(\mathbf{z}_t, t) \right) \frac{\partial \mathbf{f}_\theta(\mathbf{z})}{\partial \theta} \right] \tag{5}$$
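
The three-segment decomposition T → t, t → r, r → 0 described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: drawing t from the uniform grid of an N-step rollout is an assumed sampling rule, and `exact_flow_map` is a toy closed-form map (for dz/dt = a·z) standing in for the student f_θ.

```python
import math
import random

def exact_flow_map(z, t, r, a=1.5):
    """Toy closed-form flow map for dz/dt = a*z (stand-in for f_theta)."""
    return z * math.exp(a * (r - t))

def backward_sim_segments(num_steps, rng, T=1.0):
    """Return the segments (T, t), (t, r), (r, 0) with t - r = T / num_steps.
    Assumption: t lies on the uniform grid of an N-step rollout."""
    step = T / num_steps
    k = rng.randrange(num_steps)   # which grid interval plays the role of t -> r
    t = T - k * step
    r = max(t - step, 0.0)         # clamp tiny negative round-off at the end
    return [(T, t), (t, r), (r, 0.0)]

def rollout(flow_map, z_T, segments):
    """Apply one flow-map call per segment; a zero-length segment is the
    identity by the boundary condition f(z, t, t) = z."""
    z = z_T
    for t, r in segments:
        z = flow_map(z, t, r)
    return z

segments = backward_sim_segments(16, random.Random(0))
z0 = rollout(exact_flow_map, 2.0, segments)
```

Whatever the target budget N, the rollout uses three flow-map calls, which is the sense in which the simulation cost stays fixed as N grows.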

Application to Architectures

  • Bidirectional Video Diffusion: Follows the standard AnyFlow pipeline.
  • Causal Video Diffusion: Adopts the FAR (Frame Autoregressive) training pipeline with context compression and a non-uniform chunk partition (first chunk size 1 for I2V conditioning, subsequent chunks size 3) to jointly support T2V and I2V generation.
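
The non-uniform chunk partition for the causal model can be illustrated with a small helper. The function name `chunk_partition`, the example frame counts, and the handling of a leftover shorter final chunk are all assumptions; the source only states a first chunk of size 1 followed by chunks of size 3.

```python
def chunk_partition(num_frames, first=1, rest=3):
    """Split num_frames into chunk sizes: one leading chunk of size `first`
    (the I2V conditioning frame), then chunks of size `rest`.
    Assumption: a final remainder smaller than `rest` becomes its own chunk."""
    if num_frames < first:
        raise ValueError("num_frames must be at least the first chunk size")
    sizes = [first]
    remaining = num_frames - first
    while remaining > 0:
        size = min(rest, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

even_split = chunk_partition(13)   # 1 + 4 chunks of 3
with_tail = chunk_partition(9)     # 1 + 3 + 3 + a shorter tail of 2
```

Keeping the first chunk at a single frame lets the same layout serve both tasks: for I2V that chunk holds the conditioning image, and for T2V it is generated like any other chunk.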

Empirical Validation / Results

AnyFlow is evaluated on VBench for Text-to-Video (T2V) and VBench-I2V for Image-to-Video (I2V) tasks, using Wan2.1 backbones at 1.3B and 14B scales.

Quantitative Results

Table 3: Text-to-Video Evaluation on VBench

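| Model | #Params | NFEs | Quality | Semantic | Total |
|---|---|---|---|---|---|
| Bidirectional (14B) | | | | | |
| Wan2.1-T2V-14B [1] | 14B | 50 × 2 | 85.77 | 75.58 | 83.74 |
| rCM-Wan2.1-T2V-14B [9] | 14B | 4 | 85.47 | 76.72 | 83.73 |
| AnyFlow-Wan2.1-T2V-14B | 14B | 4 | 85.70 | 77.38 | 84.04 |
| AnyFlow-Wan2.1-T2V-14B | 14B | 32 | 85.76 | 77.44 | 84.10 |
| Causal (14B) | | | | | |
| Krea-Realtime-Wan2.1-14B [43] | 14B | 4 | 84.80 | 77.07 | 83.25 |
| AnyFlow-FAR-Wan2.1-14B | 14B | 4 | 85.82 | 76.97 | 84.05 |
| AnyFlow-FAR-Wan2.1-14B | 14B | 32 | 86.12 | 77.55 | 84.41 |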

Table 4: Image-to-Video Evaluation on VBench-I2V

| Model | #Params | NFEs | Quality | I2V | Total |
|---|---|---|---|---|---|
| Wan2.1-I2V-14B [1] | 14B | 50 × 2 | 80.30 | 95.12 | 87.71 |
| AnyFlow-FAR-Wan2.1-14B | 14B | 4 | 80.39 | 95.35 | 87.87 |

Key Findings:

  1. Any-Step Scaling: AnyFlow performance improves or remains stable as NFEs increase (e.g., 84.05 → 84.41 for causal 14B), while consistency-based methods (rCM, Self-Forcing) degrade.
  2. Few-Step Competitiveness: At 4 NFEs, AnyFlow outperforms strong consistency-based counterparts (rCM, Self-Forcing) and other community models (Krea-Realtime, FastVideo).
  3. Efficiency: AnyFlow matches the quality of the 50×2-step teacher model using only 4 or 32 NFEs (e.g., VBench totals of 84.04 and 84.10 versus 83.74 for the bidirectional 14B teacher).

Ablation Studies

Table 2: Quantitative Ablation of Key Designs

| Method | NFEs | Bidirectional (Overall) | Causal (Overall) |
|---|---|---|---|
| Forward Training | | | |
| Flow Map Training | 4 | 81.75 | 80.48 |
| Flow Map Training | 32 | 83.40 | 83.13 |
| Forward + On-Policy Distillation | | | |
| Flow Map Training + Consistency Backward Sim. | 4 | 83.55 | 82.99 |
| Flow Map Training + Consistency Backward Sim. | 32 | 82.96 | 83.49 |
| Flow Map Training + Flow Map Backward Sim. (Ours) | 4 | 83.48 | 83.54 |
| Flow Map Training + Flow Map Backward Sim. (Ours) | 32 | 83.96 | 83.96 |

Key Ablation Insights:

  • Flow Map Training provides a stronger initialization than Flow Matching or Consistency ODE-Init.
  • On-policy distillation is crucial for mitigating test-time errors.
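  • Flow Map Backward Simulation outperforms Consistency Backward Simulation at higher NFEs: at 32 NFEs it scores 83.96 in both settings, versus 82.96 (bidirectional) and 83.49 (causal). At 4 NFEs the two are close (83.48 vs. 83.55 bidirectional; 83.54 vs. 82.99 causal). With Consistency Backward Simulation, the bidirectional score drops from 83.55 to 82.96 as NFEs increase from 4 to 32.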

Training Cost Analysis (Table 5): Flow Map Backward Simulation has a slightly higher cost than consistency simulation at 4 steps (+15.7% causal, +22.5% bidirectional) but becomes significantly more efficient at simulating larger step counts (e.g., -47.0% cost at 16 steps for bidirectional).

Theoretical and Practical Implications

  • Theoretical: Provides a unified framework that bridges consistency models and flow matching via the generalized flow map formulation. The composition property of flow maps is leveraged for efficient trajectory simulation and any-step inference.
  • Practical:
    1. Flexible Inference: Enables a single model to serve diverse latency-quality requirements, from quick previews to high-fidelity final outputs.
    2. Scalable Training: The method works across model scales (1.3B to 14B) and architectures (bidirectional & causal).
    3. Domain Adaptation: The preserved instantaneous flow field allows for continued fine-tuning on downstream datasets (e.g., robotics, driving), bypassing the need for costly retraining of the full causal generator. This is a significant advantage over methods like Self-Forcing.

Conclusion

AnyFlow establishes a new paradigm for any-step video diffusion distillation based on flow maps. By learning transitions between arbitrary time pairs and employing efficient on-policy distillation via flow map backward simulation, it achieves strong few-step performance while preserving the desirable test-time scaling behavior of ODE sampling. Extensive validation demonstrates its effectiveness and scalability. A key practical benefit is the model's compatibility with continued training for domain-specific adaptation.

Limitations & Future Work: The method relies on synthetic data for flow map training, which may introduce a mild distribution shift. Future work could focus on developing more robust forward training strategies and extending the framework to autoregressive long-video generation.