KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Summary (Overview)

  • KVPO is a novel online Group Relative Policy Optimization (GRPO) framework designed to align streaming autoregressive (AR) video generators with human preferences while respecting their native deterministic ODE dynamics.
  • It introduces Causal-Semantic Exploration via Causal History Routing (CHR), which generates diverse candidate branches by stochastically routing historical Key-Value (KV) cache entries, ensuring on-manifold and semantically meaningful variation.
  • It defines an ODE-native surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in the flow-matching velocity-field space, leading to a reward-weighted contrastive flow-matching objective.
  • Experiments on distilled AR models (LongLive, MemFlow) show consistent improvements in Visual Quality (VQ), Motion Quality (MQ), and Text-Video Alignment (TA) for both short and long video generation, outperforming prior methods like Astrolabe.
  • The method avoids the pitfalls of noise-based SDE exploration (off-manifold distortion, low-level perturbation) and geometric distance-based surrogate policies, offering a principled alternative for ODE-based preference alignment.

Introduction and Theoretical Foundation

Recent advances have distilled pretrained video diffusion models into efficient, few-step autoregressive (AR) video generators that enable streaming inference via causal attention and KV caching. However, aligning these models with human preferences—which extend beyond frame fidelity to long-horizon coherence and semantic progression—remains challenging.

Existing alignment methods are inadequate:

  1. Reward-weighted distillation lacks active exploration.
  2. Noise-injection/SDE-based methods (e.g., Flow-GRPO) are ill-suited for AR generators. They break the native ODE formulation, primarily perturb low-level appearance, and induce off-manifold structural interference.
  3. ODE-based methods like NeighborGRPO/AR-CoPO use Euclidean distances in latent space to approximate surrogate policies, which may not faithfully capture the model's intrinsic preferences due to the anisotropic geometry of the generation space.

KVPO addresses these limitations by performing causal-semantic exploration and modeling the surrogate policy within the flow-matching velocity-field space under a pure ODE paradigm.

Methodology

3.1 Preliminaries: Block-wise Autoregressive Video Generation

Streaming AR video generators synthesize videos block by block. The generation of block $b$ is conditioned on the text prompt $C$ and the historical context $v_{<b}$, materialized as a compressed KV cache $K_{<b}$.

Under the flow matching framework, the model learns a conditional velocity field $v_\theta(x_t, t, K_{<b})$ along the linear interpolation path:

$$x_t = t x_0 + (1 - t) x_T, \quad x_T \sim \mathcal{N}(0, I), \quad t \in [0, 1]$$

At inference, the clean block $x_0^b$ is obtained by integrating the probability flow ODE:

$$\frac{dx_t}{dt} = v_\theta(x_t, t, K_{<b}), \quad x_{t=0} = x_T \sim \mathcal{N}(0, I)$$
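As a rough illustration of the ODE integration above, the sketch below runs an explicit Euler solver from $t = 0$ (noise $x_T$) to $t = 1$ (clean sample $x_0$). The solver choice, the function names, and the toy "oracle" velocity field are assumptions made for illustration; a real generator would call its learned $v_\theta$ with the KV cache as conditioning.

```python
def euler_integrate(velocity, x_start, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for s in range(num_steps):
        t = s * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data (hypothetical): a "clean" target and a fixed noise sample.
x_clean = [1.0, -2.0, 0.5]
x_noise = [0.3, 0.1, -0.7]


def oracle_velocity(x, t):
    # On the linear path x_t = t*x_0 + (1-t)*x_T the velocity is the
    # constant x_0 - x_T, independent of x and t.
    return [c - n for c, n in zip(x_clean, x_noise)]


# A few-step solver, mirroring the few-step distilled setting.
x_final = euler_integrate(oracle_velocity, x_noise, num_steps=4)
```

With this constant field, Euler integration lands on `x_clean` up to floating-point error regardless of the step count, which is why few-step solvers are viable when the learned field is close to straight.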

3.2 Causal-Semantic Exploration via Causal History Routing (CHR)

KVPO redirects diversity exploration from noise to the historical KV cache. Since future content is causally conditioned on history, perturbing the composition of the local memory induces semantically diverse generation branches.

CHR Mechanism: At a pivot block $b^*$, the sink KV (earliest three frames) remains unchanged. The local memory has a fixed 9-slot layout:

  • Last 3 slots: Always store the most recent frames ($K_{near}$).
  • First 6 slots: Branch-specific. Stochastically refilled from the older non-sink history.

Let $\Omega_L = \{4, 5, \dots, L-3\}$ be the routable index set. For each branch $g \in \{1, \dots, G\}$, CHR samples six indices $r_1^g, \dots, r_6^g \in \Omega_L$ and constructs the branch-specific local cache:

$$\tilde{K}_{<b^*}^{g,\text{local}} = \big[ \underbrace{(K_{r_1^g}, V_{r_1^g}), \dots, (K_{r_6^g}, V_{r_6^g})}_{\text{branch-specific routed 6 slots}} \; ; \; \underbrace{K_{near}}_{\text{shared recent 3 slots}} \big] \tag{3}$$
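The routing step in Eq. 3 can be sketched at the index level as follows. The text does not say whether the six indices are drawn with or without replacement, or in what order they are placed; this sketch assumes distinct indices kept in temporal order. The function names and the seed argument are hypothetical.

```python
import random

SINK_FRAMES = 3    # earliest frames, never routed
NEAR_SLOTS = 3     # most recent frames, shared by all branches
ROUTED_SLOTS = 6   # branch-specific slots


def routable_indices(history_len):
    """Omega_L = {4, ..., L-3}, using 1-based frame indices."""
    return list(range(SINK_FRAMES + 1, history_len - NEAR_SLOTS + 1))


def sample_chr_branches(history_len, num_branches, seed=0):
    """Return one 9-slot index layout (6 routed + 3 recent) per branch."""
    rng = random.Random(seed)
    omega = routable_indices(history_len)
    if len(omega) < ROUTED_SLOTS:
        raise ValueError("history too short to route six distinct slots")
    near = list(range(history_len - NEAR_SLOTS + 1, history_len + 1))
    layouts = []
    for _ in range(num_branches):
        # Assumption: distinct indices, sorted to preserve temporal order.
        routed = sorted(rng.sample(omega, ROUTED_SLOTS))
        layouts.append(routed + near)
    return layouts


layouts = sample_chr_branches(history_len=20, num_branches=4)
```

Each layout lists the history frames whose (K, V) entries would fill the local cache for that branch; the final three entries are identical across branches.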

The attention output for branch $g$ is:

$$\text{Attn}^g_{b^*} = \text{Softmax}\left( \frac{Q^g_{b^*} \big[K_{\text{sink}} ; \tilde{K}_{<b^*}^{g,\text{local}} ; K^g_{b^*}\big]^\top}{\sqrt{d_k}} \right) \big[V_{\text{sink}} ; \tilde{V}_{<b^*}^{g,\text{local}} ; V^g_{b^*}\big] \tag{4}$$
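To make the concatenation in Eq. 4 concrete, here is a single-query, single-head toy version in plain Python. It is only a sketch: real implementations are batched and multi-head, and the helper names and the toy tensors are assumptions.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(dim_v)]
    return out, weights


def branch_attention(query, sink_kv, local_kv, current_kv):
    """Concatenate [sink ; routed local ; current block] along the key axis."""
    keys = sink_kv[0] + local_kv[0] + current_kv[0]
    values = sink_kv[1] + local_kv[1] + current_kv[1]
    return attend(query, keys, values)


# Toy tensors: every key is orthogonal to the query, so weights are uniform.
query = [1.0, 0.0]
sink_kv = ([[0.0, 1.0]], [[1.0, 0.0]])
local_kv = ([[0.0, 1.0], [0.0, 1.0]], [[0.0, 1.0], [2.0, 0.0]])
current_kv = ([[0.0, 1.0]], [[1.0, 3.0]])
out, weights = branch_attention(query, sink_kv, local_kv, current_kv)
```

Only `local_kv` differs between branches; the sink and current-block entries are built the same way for every branch, so any difference in the attention output comes from the routed history.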

Rollout and Replay: Exploration branches out over a contiguous window of blocks $\mathcal{B} = [b^*, b^* + W)$. CHR is applied only to the first half of the ODE steps, which govern coarse semantics. The rollout produces $G$ branch trajectories $\{X^g\}$ with rewards $\{r^g\}$, and an anchor trajectory $X^0$. For replay, the cached intermediate states $z^g_{b,s}$ are reused under the unperturbed context $K_{<b}$ to predict replayed velocities $v_\theta(z^g_{b,s}, t_s, K_{<b})$.
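The rollout and replay schedule can be sketched for a single block with scalar states. Several details here are assumptions rather than statements from the source: an Euler solver, the later solver steps falling back to the unperturbed context, "first half" meaning the first `num_steps // 2` steps, and the velocity recorded during rollout standing in for the cached target. `toy_velocity` is a hypothetical stand-in for $v_\theta$.

```python
def rollout_block(velocity, z_start, routed_ctx, base_ctx, num_steps):
    """Euler rollout of one block; the routed context is used early only."""
    chr_steps = num_steps // 2   # assumption: first half of the solver steps
    dt = 1.0 / num_steps
    z = z_start
    cache = []                   # (state, time, rollout velocity) per step
    for s in range(num_steps):
        t = s * dt
        ctx = routed_ctx if s < chr_steps else base_ctx
        v = velocity(z, t, ctx)
        cache.append((z, t, v))
        z = z + dt * v
    return z, cache


def replay_velocities(velocity, cache, base_ctx):
    """Re-evaluate the cached states under the unperturbed context."""
    return [velocity(z, t, base_ctx) for z, t, _ in cache]


def toy_velocity(z, t, ctx):
    # Hypothetical field that pulls the state toward the context value.
    return ctx - z


z_end, cache = rollout_block(toy_velocity, 0.0, routed_ctx=2.0,
                             base_ctx=1.0, num_steps=4)
rollout_v = [v for _, _, v in cache]
replayed_v = replay_velocities(toy_velocity, cache, base_ctx=1.0)
```

In this toy run the replayed and rollout velocities differ only on the two routed steps, which is where a residual between them would come from.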

3.3 Velocity-Field Surrogate Policy Modeling and Optimization

Trajectory Velocity Energy (TVE): Defined as the aggregated squared residual between the cached rollout velocity target $\hat{u}^g_{b,s}$ and the replayed velocity for branch $X^g$:

$$\mathcal{E}_\theta(X^g) = \sum_{b \in \mathcal{B}} \sum_{s=1}^S \frac{1}{d} \left\| v_\theta\left( z^g_{b,s}, t_s, K_{<b} \right) - \hat{u}^g_{b,s} \right\|_F^2 \tag{5}$$

A lower TVE indicates a stronger generative tendency towards that branch under the unperturbed context.
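A direct transcription of Eq. 5 for one branch might look like the sketch below. It assumes velocities are stored as nested lists indexed `[block][step][element]` and that $d$ is the number of elements in each velocity tensor; both are illustrative choices, and the toy numbers are made up.

```python
def trajectory_velocity_energy(replayed, targets):
    """Sum over blocks and steps of ||v - u_hat||^2 / d (Eq. 5)."""
    energy = 0.0
    for block_v, block_u in zip(replayed, targets):
        for v, u in zip(block_v, block_u):
            d = len(v)  # assumption: d = number of elements per velocity
            energy += sum((vi - ui) ** 2 for vi, ui in zip(v, u)) / d
    return energy


# Two blocks, two solver steps each, two elements per velocity (toy values).
replayed = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.0, 0.0]]]
targets = [[[1.0, 0.0], [0.0, 0.0]],
           [[0.0, 1.0], [0.0, 0.0]]]
tve = trajectory_velocity_energy(replayed, targets)
```

Two of the four steps match their targets exactly and contribute nothing; the other two each contribute 0.5.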

Surrogate Policy and Optimization: The Gibbs-form surrogate policy converts TVE into normalized branch probabilities. Let $\ell^g_\theta = -\mathcal{E}_\theta(X^g)/\tau$, where $\tau$ is a temperature. The current and old policies are:

$$\pi_\theta(g) = \frac{\exp(\ell^g_\theta)}{\sum_{h=1}^G \exp(\ell^h_\theta)}, \quad \pi_{\text{old}}(g) = \frac{\exp(\ell^g_{\text{old}})}{\sum_{h=1}^G \exp(\ell^h_{\text{old}})} \tag{6}$$
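Eq. 6 is a softmax over negative energies. A minimal sketch, with made-up energies and a hypothetical temperature value:

```python
import math


def gibbs_policy(energies, tau):
    """Softmax over logits -E/tau, one probability per branch (Eq. 6)."""
    logits = [-e / tau for e in energies]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy TVE values for G = 4 branches; tau = 0.1 is an illustrative choice.
policy = gibbs_policy([0.2, 0.5, 0.9, 0.4], tau=0.1)
uniform = gibbs_policy([1.0, 1.0], tau=0.5)
```

The branch with the lowest energy receives the highest probability, and a smaller `tau` sharpens the distribution. Equal energies give a uniform policy.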

The generator is updated via the clipped PPO objective:

$$\mathcal{L}_{\text{PPO}}(\theta) = -\frac{1}{G} \sum_{g=1}^G \min\left( \rho^g A^g, \; \text{clip}(\rho^g, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}) A^g \right) \tag{8}$$

where $\rho^g = \pi_\theta(g)/\pi_{\text{old}}(g)$, and the normalized branch advantage $A^g$ is:

$$A^g = \frac{r^g - \bar{r}}{\sqrt{\frac{1}{G}\sum_{k=1}^G (r^k - \bar{r})^2 + \epsilon}}, \quad \bar{r} = \frac{1}{G}\sum_{k=1}^G r^k, \quad \epsilon = 10^{-8} \tag{9}$$

Asymmetric clipping ($\epsilon_{\text{low}} = 0.1$, $\epsilon_{\text{high}} = 0.2$) aggressively promotes high-reward branches while conservatively suppressing low-reward ones.
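Eqs. 8 and 9 can be sketched in a few lines of plain Python. The rewards and ratios below are toy values chosen so that both clipping bounds are active; the function names are hypothetical.

```python
import math


def normalized_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (Eq. 9)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    return [(r - mean) / math.sqrt(var + eps) for r in rewards]


def clipped_ppo_loss(ratios, advantages, eps_low=0.1, eps_high=0.2):
    """Clipped surrogate loss with asymmetric bounds (Eq. 8)."""
    total = 0.0
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        total += min(rho * adv, clipped * adv)
    return -total / len(ratios)


rewards = [1.0, 3.0]   # branch 0 is below average, branch 1 above
ratios = [0.5, 1.5]    # pi_theta / pi_old per branch
advantages = normalized_advantages(rewards)
loss = clipped_ppo_loss(ratios, advantages)
```

Here the high-reward branch is credited only up to a ratio of 1.2, while the low-reward branch stops contributing once its ratio falls below 0.9, so the loss is approximately -(1.2 - 0.9) / 2 = -0.15.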

Gradient Derivation: The policy gradient derived from TVE reduces to a reward-weighted contrastive flow-matching objective (Eq. 14 in paper), steering the ODE dynamics towards high-reward trajectories and away from low-reward ones.
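The structure of that gradient can be recovered from Eqs. 5, 6, and 8 alone. The following is a sketch for the unclipped region and is not a reproduction of the paper's Eq. 14; the weight symbol $w^g$ is introduced here for illustration.

```latex
% Assumption: the clip in Eq. 8 is inactive, and \pi_{\text{old}} is constant in \theta.
% Step 1: log-derivative of the Gibbs policy, using \ell^g_\theta = -\mathcal{E}_\theta(X^g)/\tau.
\nabla_\theta \log \pi_\theta(g)
  = -\frac{1}{\tau} \Big( \nabla_\theta \mathcal{E}_\theta(X^g)
      - \sum_{h=1}^{G} \pi_\theta(h) \, \nabla_\theta \mathcal{E}_\theta(X^h) \Big)

% Step 2: substitute into the unclipped objective, using
% \nabla_\theta \rho^g = \rho^g \nabla_\theta \log \pi_\theta(g).
\nabla_\theta \mathcal{L}_{\text{PPO}}
  = -\frac{1}{G} \sum_{g=1}^{G} \rho^g A^g \, \nabla_\theta \log \pi_\theta(g)
  = \frac{1}{G \tau} \sum_{g=1}^{G} w^g \, \nabla_\theta \mathcal{E}_\theta(X^g),
\qquad
w^g = \rho^g A^g - \pi_\theta(g) \sum_{k=1}^{G} \rho^k A^k
```

At $\theta = \theta_{\text{old}}$ the ratios equal 1 and the advantages in Eq. 9 sum to zero, so $w^g = A^g$. A gradient-descent step then lowers the energy of branches with positive advantage, pulling the replayed velocity toward the cached target as in ordinary flow matching, and raises the energy of branches with negative advantage, which is the contrastive part.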

3.4 Reward Design and Regularization

  • Multi-reward formulation: $R = \text{VQ} + \text{MQ} + \text{TA}$, using HPSv3 and VideoAlign rewards to mitigate reward hacking.
  • KL Regularization: The total objective includes a KL divergence penalty to prevent excessive drift from the pretrained distribution: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{PPO}} + \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ (Eq. 16), where $\pi_{\text{ref}}$ is a frozen reference policy.
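The summary does not state over which space the KL term in Eq. 16 is evaluated. One plausible reading, assumed here, is the categorical distribution over the $G$ branches from Eq. 6, with $\pi_{\text{ref}}$ computed by the frozen model. The function names and the value of `beta` are hypothetical.

```python
import math


def categorical_kl(p, q):
    """D_KL(p || q) for two categorical distributions over the same branches."""
    # Gibbs policies are strictly positive, so q never contains a zero here.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_loss(ppo_loss, pi_theta, pi_ref, beta):
    """L_total = L_PPO + beta * D_KL(pi_theta || pi_ref)."""
    return ppo_loss + beta * categorical_kl(pi_theta, pi_ref)


pi_theta = [0.5, 0.3, 0.2]   # toy current branch policy
pi_ref = [0.4, 0.4, 0.2]     # toy frozen reference policy
kl = categorical_kl(pi_theta, pi_ref)
loss = total_loss(-0.15, pi_theta, pi_ref, beta=0.01)
```

The penalty is zero when the two policies coincide and grows as the current policy concentrates mass on branches the reference model considers unlikely.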

Empirical Validation / Results

Experimental Setup: Evaluated on LongLive and MemFlow AR generators. Compared against Astrolabe. Used multi-prompt VidProM dataset. Applied LoRA fine-tuning (rank 256). Training: 32 H200 GPUs, ~30 hours wall-clock time.

Key Quantitative Results:

Table 1: Comparison of single-prompt short-video and multi-prompt long-video generation.

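| Method | VQ ↑ | MQ ↑ | TA ↑ | Quality ↑ | Semantic ↑ | Consistency Score ↑ | CLIP Score ↑ |
|---|---|---|---|---|---|---|---|
| Single-prompt short-video generation | | | | | | | |
| LongLive [23] | 8.86 | 1.80 | 0.02 | 81.89 | 70.10 | 89.12 | 32.01 |
| + Astrolabe | 9.98 | 1.87 | 0.03 | 81.41 | 70.61 | 89.14 | 32.31 |
| + KVPO | 10.21 (⇑15.2%) | 1.89 (⇑5.0%) | 0.06 (⇑200.0%) | 81.44 | 71.45 | 89.56 | 32.29 |
| MemFlow [6] | 8.83 | 1.82 | 0.02 | 80.72 | 71.31 | 88.74 | 31.96 |
| + Astrolabe | 9.47 | 1.85 | 0.03 | 80.13 | 71.42 | 88.87 | 32.04 |
| + KVPO | 9.71 (⇑9.1%) | 1.87 (⇑2.7%) | 0.03 (⇑50.0%) | 80.91 | 71.65 | 89.08 | 32.17 |
| Multi-prompt long-video generation | | | | | | | |
| LongLive [23] | 6.34 | 1.41 | -0.19 | 78.42 | 67.88 | 88.37 | 31.90 |
| + Astrolabe | 7.26 | 1.44 | -0.18 | 78.46 | 68.36 | 88.30 | 32.18 |
| + KVPO | 8.14 (⇑28.4%) | 1.50 (⇑6.4%) | -0.14 (⇑26.3%) | 79.31 | 69.02 | 88.62 | 32.29 |
| MemFlow [6] | 6.30 | 1.39 | -0.20 | 77.95 | 68.11 | 87.34 | 31.80 |
| + Astrolabe | 6.52 | 1.35 | -0.23 | 78.02 | 67.94 | 87.35 | 31.86 |
| + KVPO | 6.96 (⇑10.5%) | 1.44 (⇑3.6%) | -0.17 (⇑15.0%) | 78.36 | 68.74 | 87.52 | 32.34 |
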
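  • KVPO improves all three primary metrics (VQ, MQ, TA) over both base models in both settings, and nearly all auxiliary metrics as well. The one exception is Quality for LongLive in the short-video setting (81.89 → 81.44).
  • Improvements are particularly strong in the long-video setting, where semantic coherence is more critical: on LongLive, the VQ gain is 28.4% for long videos versus 15.2% for short ones.
  • KVPO matches or exceeds Astrolabe on VQ, MQ, and TA in every configuration, and the VQ margin widens for long videos (+0.23 → +0.88 on LongLive, +0.24 → +0.44 on MemFlow).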
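The percentages in parentheses in Table 1 appear to be relative gains over the corresponding base model, normalized by the absolute base value (which matters for the negative TA scores). This reading is inferred from the numbers rather than stated in the text; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(base, new):
    """Percentage improvement of `new` over `base`, relative to |base|."""
    return 100.0 * (new - base) / abs(base)


# LongLive rows of Table 1 (base model vs. + KVPO).
vq_short = round(relative_gain(8.86, 10.21), 1)   # short-video VQ
vq_long = round(relative_gain(6.34, 8.14), 1)     # long-video VQ
ta_long = round(relative_gain(-0.19, -0.14), 1)   # long-video TA (negative base)
```

These reproduce the 15.2%, 28.4%, and 26.3% entries. The same formula gives about 10.0% for the MemFlow short-video VQ entry (8.83 → 9.71), where the table reports 9.1%; the table values are kept as given.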

Qualitative Results & Human Study:

  • Visual comparisons (Figures 3, 4, and Appendix) show that KVPO yields more faithful prompt grounding, cleaner object interactions, smoother motion, and better cross-segment consistency.
  • A human study (Figure 5) with 32 participants shows KVPO secures a clear majority preference over the baseline and Astrolabe across VQ, MQ, and TA metrics.

Ablation Studies:

Table 2: Ablation of CHR and surrogate policy on LongLive in the multi-prompt long-video setting.

| Factor | Variant | VQ ↑ | MQ ↑ | TA ↑ |
|---|---|---|---|---|
| Perturbed blocks | 3 | 6.92 | 1.43 | -0.18 |
| Perturbed blocks | 5 | 8.14 | 1.50 | -0.14 |
| Perturbed blocks | 7 | 8.10 | 1.53 | -0.16 |
| Perturbed local KV slots | 3/9 | 6.22 | 1.36 | -0.20 |
| Perturbed local KV slots | 6/9 | 8.14 | 1.50 | -0.14 |
| Perturbed local KV slots | 9/9 | 6.97 | 1.43 | -0.17 |
| Local KV length | Fixed 9 | 8.14 | 1.50 | -0.14 |
| Local KV length | Random {6,9,12} | 8.11 | 1.48 | -0.15 |
| Perturbed solver steps | 1 | 7.12 | 1.43 | -0.18 |
| Perturbed solver steps | 2 | 8.14 | 1.50 | -0.14 |
| Perturbed solver steps | 3 | 7.65 | 1.51 | -0.12 |
| Perturbed solver steps | 4 | 7.41 | 1.46 | -0.17 |
| Surrogate policy | Geometric latent $\ell_2$ | 6.02 | 1.43 | -0.21 |
| Surrogate policy | TVE | 8.14 | 1.50 | -0.14 |

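  • Best CHR configuration: Perturbing 5 blocks, 6 out of 9 local KV slots, and the first 2 denoising steps gives the highest VQ in every ablation. Perturbing 7 blocks or 3 solver steps yields slightly higher MQ (and, for 3 steps, higher TA) but lower VQ.
  • TVE is critical: Replacing the TVE-based surrogate policy with a geometric latent $\ell_2$ distance (as in NeighborGRPO) drops VQ from 8.14 to 6.02, MQ from 1.50 to 1.43, and TA from -0.14 to -0.21.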

Theoretical and Practical Implications

  • Theoretical: KVPO demonstrates that semantic-space exploration via historical context manipulation is a principled, on-manifold alternative to noise-based perturbation for inducing diversity in ODE-based generators. The TVE-based surrogate policy provides a natural bridge between reinforcement learning and the generator's intrinsic flow-matching dynamics, avoiding geometric distortions.
  • Practical: The framework enables effective online preference alignment for state-of-the-art streaming AR video generators, leading to measurable improvements in visual quality, motion realism, and narrative coherence—qualities essential for real-world applications like interactive media and long-form content creation.

Conclusion

KVPO addresses the mismatch between existing RL methods and ODE-based AR video generation by combining Causal History Routing for semantic exploration with a Trajectory Velocity Energy-based surrogate policy. This keeps exploration and optimization within the model's native ODE dynamics. Experiments confirm consistent gains in human-preference alignment.

Future Directions: Extending KVPO to models with different memory mechanisms (e.g., state-space models), reducing computational overhead, and developing stronger reward models for long-horizon consistency.

Key Formulas Preserved:

  1. Flow Matching Path: $x_t = t x_0 + (1 - t) x_T$
  2. ODE Integration: $dx_t/dt = v_\theta(x_t, t, K_{<b})$
  3. CHR Local Cache Construction: $\tilde{K}_{<b^*}^{g,\text{local}} = [ (K_{r_1^g}, V_{r_1^g}), \dots, (K_{r_6^g}, V_{r_6^g}) ; K_{near} ]$
  4. TVE Definition: $\mathcal{E}_\theta(X^g) = \sum_{b \in \mathcal{B}} \sum_{s=1}^S \frac{1}{d} \| v_\theta( z^g_{b,s}, t_s, K_{<b} ) - \hat{u}^g_{b,s} \|_F^2$
  5. Gibbs Policy: $\pi_\theta(g) = \exp(\ell^g_\theta) / \sum_{h=1}^G \exp(\ell^h_\theta)$