KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Summary (Overview)

KVPO is a novel online Group Relative Policy Optimization (GRPO) framework designed to align streaming autoregressive (AR) video generators with human preferences while respecting their native deterministic ODE dynamics.
It introduces Causal-Semantic Exploration via Causal History Routing (CHR), which generates diverse candidate branches by stochastically routing historical Key-Value (KV) cache entries, ensuring on-manifold and semantically meaningful variation.
It defines an ODE-native surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in the flow-matching velocity-field space, leading to a reward-weighted contrastive flow-matching objective.
Experiments on distilled AR models (LongLive, MemFlow) show consistent improvements in Visual Quality (VQ), Motion Quality (MQ), and Text-Video Alignment (TA) for both short and long video generation, outperforming prior methods like Astrolabe.
The method avoids the pitfalls of noise-based SDE exploration (off-manifold distortion, low-level perturbation) and geometric distance-based surrogate policies, offering a principled alternative for ODE-based preference alignment.

Introduction and Theoretical Foundation

Recent advances have distilled pretrained video diffusion models into efficient, few-step autoregressive (AR) video generators that enable streaming inference via causal attention and KV caching. However, aligning these models with human preferences—which extend beyond frame fidelity to long-horizon coherence and semantic progression—remains challenging.

Existing alignment methods are inadequate:

Reward-weighted distillation lacks active exploration.
Noise-injection/SDE-based methods (e.g., Flow-GRPO) are ill-suited for AR generators. They break the native ODE formulation, primarily perturb low-level appearance, and induce off-manifold structural interference.
ODE-based methods like NeighborGRPO/AR-CoPO use Euclidean distances in latent space to approximate surrogate policies, which may not faithfully capture the model's intrinsic preferences due to the anisotropic geometry of the generation space.

KVPO addresses these limitations by performing causal-semantic exploration and modeling the surrogate policy within the flow-matching velocity-field space under a pure ODE paradigm.

Methodology

3.1 Preliminaries: Block-wise Autoregressive Video Generation

Streaming AR video generators synthesize videos block-by-block. The generation of block $b$ is conditioned on the text prompt $C$ and historical context $v_{<b}$, materialized as a compressed KV cache $K_{<b}$.

Under the flow matching framework, the model learns a conditional velocity field $v_\theta(x_t, t, K_{<b})$ along the linear interpolation path:

x_t = t x_0 + (1 - t) x_T, \quad x_T \sim \mathcal{N}(0, I), \quad t \in [0, 1]

At inference, the clean block $x_0^b$ is obtained by integrating the probability flow ODE from noise at $t = 0$ to $t = 1$:

\frac{dx_t}{dt} = v_\theta(x_t, t, K_{<b}), \quad x_{t=0} = x_T \sim \mathcal{N}(0, I)
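
As a minimal sketch of the integration above, the following pure-Python loop takes a few Euler steps of $dx_t/dt = v(x_t, t)$ from $t = 0$ (noise) to $t = 1$. The function `toy_velocity` is a hypothetical stand-in for $v_\theta$ that ignores the KV cache, and the solver (Euler with uniform steps) is an assumption; the source does not specify either.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for v_theta: straight-line velocity toward `target`."""
    return [(x0 - xi) / (1.0 - t) for xi, x0 in zip(x, target)]

def euler_integrate(x_noise, target, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with uniform Euler steps."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for s in range(num_steps):
        t = s * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x_T = [rng.gauss(0.0, 1.0) for _ in range(4)]  # x_T ~ N(0, I)
x_0 = [0.5, -1.0, 2.0, 0.0]                    # toy "clean" block
x_hat = euler_integrate(x_T, x_0)
```

Because the toy field is exactly the straight-line velocity of the interpolation path, `x_hat` matches `x_0` up to floating-point error after four steps; a learned field would only approximate this.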

3.2 Causal-Semantic Exploration via Causal History Routing (CHR)

KVPO redirects diversity exploration from noise to the historical KV cache. Since future content is causally conditioned on history, perturbing the composition of the local memory induces semantically diverse generation branches.

CHR Mechanism: At a pivot block $b^*$, the sink KV (earliest three frames) remains unchanged. The local memory has a fixed 9-slot layout:

Last 3 slots: Always store the most recent frames ($K_{near}$).
First 6 slots: Branch-specific. Stochastically refilled from the older non-sink history.

Let $\Omega_L = \{4, 5, ..., L-3\}$ be the routable index set. For each branch $g \in \{1, ..., G\}$, CHR samples six indices $r_1^g, ..., r_6^g \in \Omega_L$ and constructs the branch-specific local cache:

\tilde{K}_{<b^*}^{g,\text{local}} = \big[ \underbrace{(K_{r_1^g}, V_{r_1^g}), ..., (K_{r_6^g}, V_{r_6^g})}_{\text{branch-specific routed 6 slots}} \; ; \; \underbrace{K_{near}}_{\text{shared recent 3 slots}} \big] \tag{3}
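
The slot layout in Eq. (3) can be illustrated at the level of frame indices. This is a sketch under assumptions: the function name `chr_route` is hypothetical, indices are 1-based frame positions, and sampling without replacement followed by chronological sorting is a choice made here, since the source only says the six slots are refilled stochastically.

```python
import random

def chr_route(num_frames, num_branches, rng, n_sink=3, n_near=3, n_routed=6):
    """Return, per branch, the 9 frame indices that fill the local memory."""
    # Routable set Omega_L = {4, ..., L-3}: excludes sink and recent frames.
    routable = list(range(n_sink + 1, num_frames - n_near + 1))
    # Shared recent slots: the last three frames, identical for all branches.
    near = list(range(num_frames - n_near + 1, num_frames + 1))
    if len(routable) < n_routed:
        raise ValueError("history too short to route six distinct frames")
    layouts = []
    for _ in range(num_branches):
        # Assumption: distinct indices, kept in chronological order.
        routed = sorted(rng.sample(routable, n_routed))
        layouts.append(routed + near)
    return layouts

layouts = chr_route(num_frames=20, num_branches=4, rng=random.Random(0))
```

Each entry of `layouts` lists six branch-specific indices followed by the three shared recent indices; a real implementation would gather the corresponding key and value tensors from the cache.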

The attention output for branch $g$ is:

\text{Attn}^g_{b^*} = \text{Softmax}\left( \frac{Q^g_{b^*} [K_{\text{sink}} ; \tilde{K}_{<b^*}^{g,\text{local}} ; K^g_{b^*}]^\top}{\sqrt{d_k}} \right) [V_{\text{sink}} ; \tilde{V}_{<b^*}^{g,\text{local}} ; V^g_{b^*}] \tag{4}
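
To make the concatenation in Eq. (4) concrete, here is a single-query, single-head sketch in pure Python. The function names and the toy vectors are hypothetical; a real generator would apply this per head over batched tensors.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w / z * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def branch_attention(query, sink, local, current):
    """Attend over [sink ; branch-specific local ; current block] (key, value) pairs."""
    pairs = sink + local + current
    return attend(query, [k for k, _ in pairs], [v for _, v in pairs])

sink = [([1.0, 0.0], [1.0])]     # shared across branches
local = [([0.0, 1.0], [3.0])]    # the only part that CHR changes per branch
current = [([1.0, 1.0], [5.0])]  # keys and values of the block being generated
out = branch_attention([0.0, 0.0], sink, local, current)
```

With a zero query all scores tie, so `out` is the plain average of the three values; swapping the `local` entries changes the output while the sink and current-block entries stay fixed.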

Rollout and Replay: Exploration branches are generated within a contiguous window $\mathcal{B} = [b^*, b^* + W)$. CHR is applied only to the first half of the ODE steps, which govern coarse semantics. The rollout produces $G$ branch trajectories $\{X^g\}$ with rewards $\{r^g\}$, and an anchor trajectory $X^0$. For replay, cached intermediate states $z^g_{b,s}$ are reused under the unperturbed context $K_{<b}$ to predict replayed velocities $v_\theta(z^g_{b,s}, t_s, K_{<b})$.

3.3 Velocity-Field Surrogate Policy Modeling and Optimization

Trajectory Velocity Energy (TVE): Defined as the aggregated squared residual between the cached rollout velocity target $\hat{u}^g_{b,s}$ and the replayed velocity for branch $X^g$:

\mathcal{E}_\theta(X^g) = \sum_{b \in \mathcal{B}} \sum_{s=1}^S \frac{1}{d} \left\| v_\theta\left( z^g_{b,s}, t_s, K_{<b} \right) - \hat{u}^g_{b,s} \right\|_F^2 \tag{5}

A lower TVE indicates a stronger generative tendency towards that branch under the unperturbed context.
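
Eq. (5) can be sketched as a double loop over blocks and solver steps. The function name `trajectory_velocity_energy` is hypothetical, velocity tensors are flattened to plain lists, and $d$ is taken here to be the number of elements per tensor, which is an assumption because the source does not define $d$.

```python
def trajectory_velocity_energy(replayed, targets):
    """Sum over blocks b and steps s of the mean squared velocity residual.

    replayed[b][s] and targets[b][s] are flat lists of d floats holding the
    replayed velocity and the cached rollout target, respectively.
    """
    energy = 0.0
    for replayed_block, target_block in zip(replayed, targets):
        for v, u in zip(replayed_block, target_block):
            d = len(v)  # assumption: d = number of elements in the tensor
            energy += sum((a - b) ** 2 for a, b in zip(v, u)) / d
    return energy

# One block, two solver steps, d = 2.
replayed = [[[1.0, 2.0], [0.0, 0.0]]]
targets = [[[1.0, 0.0], [1.0, 1.0]]]
energy = trajectory_velocity_energy(replayed, targets)
```

Here the first step contributes (0 + 4) / 2 and the second (1 + 1) / 2, so `energy` is 3.0.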

Surrogate Policy and Optimization: The Gibbs-form surrogate policy converts TVE into normalized branch probabilities. Let $\ell^g_\theta = -\mathcal{E}_\theta(X^g)/\tau$, where $\tau > 0$ is a temperature. The current and old policies are:

\pi_\theta(g) = \frac{\exp(\ell^g_\theta)}{\sum_{h=1}^G \exp(\ell^h_\theta)}, \quad \pi_{\text{old}}(g) = \frac{\exp(\ell^g_{\text{old}})}{\sum_{h=1}^G \exp(\ell^h_{\text{old}})} \tag{6}
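
A minimal sketch of Eq. (6): the branch probabilities are a softmax over negative energies scaled by the temperature. The function name and the example energies are hypothetical.

```python
import math

def gibbs_policy(energies, tau):
    """Softmax over logits l^g = -E^g / tau, one probability per branch."""
    logits = [-e / tau for e in energies]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [x / z for x in exps]

probs = gibbs_policy([0.5, 1.0, 2.0], tau=0.5)
uniform = gibbs_policy([1.0, 1.0, 1.0, 1.0], tau=0.5)
```

The branch with the lowest energy receives the highest probability, and equal energies give a uniform distribution over the $G$ branches.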

The generator is updated via the clipped PPO objective:

\mathcal{L}_{\text{PPO}}(\theta) = -\frac{1}{G} \sum_{g=1}^G \min\left( \rho^g A^g, \text{clip}(\rho^g, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}) A^g \right) \tag{8}

where $\rho^g = \pi_\theta(g)/\pi_{\text{old}}(g)$, and the normalized branch advantage $A^g$ is:

A^g = \frac{r^g - \bar{r}}{\sqrt{\frac{1}{G}\sum_{k=1}^G (r^k - \bar{r})^2 + \epsilon}}, \quad \bar{r} = \frac{1}{G}\sum_{k=1}^G r^k, \quad \epsilon=10^{-8} \tag{9}

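Asymmetric clipping ($\epsilon_{\text{low}}=0.1, \epsilon_{\text{high}}=0.2$) lets the ratio of a positive-advantage branch rise to 1.2 before it is clipped, but lets the ratio of a negative-advantage branch fall only to 0.9. High-reward branches are therefore promoted more aggressively than low-reward branches are suppressed.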
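
The advantage normalization in Eq. (9) and the clipped objective in Eq. (8) can be sketched as follows. Function names are hypothetical, and the example probabilities are toy values chosen to hit each clipping bound.

```python
import math

def normalized_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / sqrt(var + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    return [(r - mean) / math.sqrt(var + eps) for r in rewards]

def clipped_loss(pi_new, pi_old, advantages, eps_low=0.1, eps_high=0.2):
    """Negative mean of min(rho * A, clip(rho, 1 - eps_low, 1 + eps_high) * A)."""
    total = 0.0
    for p_new, p_old, adv in zip(pi_new, pi_old, advantages):
        rho = p_new / p_old
        clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
        total += min(rho * adv, clipped * adv)
    return -total / len(advantages)

adv = normalized_advantages([1.0, 2.0, 3.0])
loss_up = clipped_loss([0.6], [0.4], [1.0])     # rho = 1.5, clipped at 1.2
loss_down = clipped_loss([0.2], [0.4], [-1.0])  # rho = 0.5, clipped at 0.9
```

In the two single-branch examples, the positive-advantage term saturates at a ratio of 1.2 and the negative-advantage term at 0.9, which is the asymmetry described above.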

Gradient Derivation: The policy gradient derived from TVE reduces to a reward-weighted contrastive flow-matching objective (Eq. 14 in paper), steering the ODE dynamics towards high-reward trajectories and away from low-reward ones.
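
For intuition, the link between the clipped objective and a contrastive flow-matching gradient can be sketched directly from Eqs. (5), (6), and (8). This is a derivation under those definitions for the unclipped terms only, not a reproduction of the paper's Eq. 14. Since $\ell^g_\theta = -\mathcal{E}_\theta(X^g)/\tau$, the log-probability gradient of the Gibbs policy is:

\nabla_\theta \log \pi_\theta(g) = -\frac{1}{\tau} \Big( \nabla_\theta \mathcal{E}_\theta(X^g) - \sum_{h=1}^G \pi_\theta(h) \, \nabla_\theta \mathcal{E}_\theta(X^h) \Big)

Using $\nabla_\theta \rho^g = \rho^g \nabla_\theta \log \pi_\theta(g)$, the unclipped part of the loss has gradient:

\nabla_\theta \mathcal{L}_{\text{PPO}} = \frac{1}{G \tau} \sum_{g=1}^G A^g \rho^g \Big( \nabla_\theta \mathcal{E}_\theta(X^g) - \sum_{h=1}^G \pi_\theta(h) \, \nabla_\theta \mathcal{E}_\theta(X^h) \Big)

At the first update, $\rho^g = 1$ and $\sum_g A^g = 0$, so the shared baseline term cancels and the gradient reduces to $\frac{1}{G\tau} \sum_g A^g \nabla_\theta \mathcal{E}_\theta(X^g)$. A descent step therefore lowers the velocity residual of branches with $A^g > 0$ and raises it for branches with $A^g < 0$.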

3.4 Reward Design and Regularization

Multi-reward formulation: $R = \text{VQ} + \text{MQ} + \text{TA}$ , using HPSv3 and VideoAlign rewards to mitigate reward hacking.
KL Regularization: The total objective includes a KL divergence penalty to prevent excessive drift from the pretrained distribution: $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{PPO}} + \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ (Eq. 16), where $\pi_{\text{ref}}$ is a frozen reference policy.
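
A minimal sketch of the regularized objective, assuming the KL term is taken between the categorical branch distributions of Eq. (6); the source does not state the exact form of the divergence. The function names, the value of `beta`, and the example probabilities are hypothetical.

```python
import math

def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions over the same G branches."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_loss(ppo_loss, pi_theta, pi_ref, beta):
    """L_total = L_PPO + beta * KL(pi_theta || pi_ref)."""
    return ppo_loss + beta * categorical_kl(pi_theta, pi_ref)

pi_theta = [0.5, 0.5]
pi_ref = [0.25, 0.75]
kl = categorical_kl(pi_theta, pi_ref)
loss = total_loss(ppo_loss=1.0, pi_theta=pi_theta, pi_ref=pi_ref, beta=0.1)
```

The penalty is zero when the current policy equals the reference and grows as the branch probabilities drift apart.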

Empirical Validation / Results

Experimental Setup: KVPO is evaluated on the LongLive and MemFlow AR generators and compared against Astrolabe, using the multi-prompt VidProM dataset. Fine-tuning uses LoRA (rank 256). Training takes about 30 hours of wall-clock time on 32 H200 GPUs.

Key Quantitative Results:

Table 1: Comparison of single-prompt short-video and multi-prompt long-video generation.

Method	VQ ↑	MQ ↑	TA ↑	Quality ↑	Semantic ↑	Consistency Score ↑	CLIP Score ↑
Single-prompt short-video generation
LongLive [23]	8.86	1.80	0.02	81.89	70.10	89.12	32.01
+ Astrolabe	9.98	1.87	0.03	81.41	70.61	89.14	32.31
+ KVPO	10.21 (⇑15.2%)	1.89 (⇑5.0%)	0.06 (⇑200.0%)	81.44	71.45	89.56	32.29
MemFlow [6]	8.83	1.82	0.02	80.72	71.31	88.74	31.96
+ Astrolabe	9.47	1.85	0.03	80.13	71.42	88.87	32.04
+ KVPO	9.71 (⇑9.1%)	1.87 (⇑2.7%)	0.03 (⇑50.0%)	80.91	71.65	89.08	32.17
Multi-prompt long-video generation
LongLive [23]	6.34	1.41	-0.19	78.42	67.88	88.37	31.90
+ Astrolabe	7.26	1.44	-0.18	78.46	68.36	88.30	32.18
+ KVPO	8.14 (⇑28.4%)	1.50 (⇑6.4%)	-0.14 (⇑26.3%)	79.31	69.02	88.62	32.29
MemFlow [6]	6.30	1.39	-0.20	77.95	68.11	87.34	31.80
+ Astrolabe	6.52	1.35	-0.23	78.02	67.94	87.35	31.86
+ KVPO	6.96 (⇑10.5%)	1.44 (⇑3.6%)	-0.17 (⇑15.0%)	78.36	68.74	87.52	32.34

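KVPO improves all three primary metrics (VQ, MQ, TA) over both base models in both settings. It also improves the auxiliary metrics in nearly every case; the one exception is Quality for LongLive in the short-video setting (81.44 vs. 81.89).
Relative VQ and MQ gains are larger in the long-video setting, where semantic coherence is more critical (e.g., VQ +28.4% vs. +15.2% on LongLive).
KVPO matches or exceeds Astrolabe on every primary metric, and its VQ margin widens for long videos (LongLive: +0.88 vs. +0.23; MemFlow: +0.44 vs. +0.24).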
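
The parenthesized percentages in Table 1 appear to be relative gains over the corresponding base model. The sketch below recomputes them for the LongLive long-video rows; dividing by the absolute value of the baseline (so that negative TA scores give a positive gain) is an assumption about how the table was produced, and the function name is hypothetical.

```python
def relative_gain(base, new):
    """Percentage change from `base` to `new`, relative to |base|."""
    return 100.0 * (new - base) / abs(base)

# (base, KVPO) pairs for LongLive, multi-prompt long-video setting (Table 1).
longlive_long = {"VQ": (6.34, 8.14), "MQ": (1.41, 1.50), "TA": (-0.19, -0.14)}
gains = {name: round(relative_gain(base, new), 1)
         for name, (base, new) in longlive_long.items()}
```

Under this convention, `gains` comes out as 28.4, 6.4, and 26.3 percent for VQ, MQ, and TA, matching the table entries for that block.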

Qualitative Results & Human Study:

Visual comparisons (Figures 3, 4, and Appendix) show that KVPO yields more faithful prompt grounding, cleaner object interactions, smoother motion, and better cross-segment consistency.
A human study (Figure 5) with 32 participants shows KVPO secures a clear majority preference over the baseline and Astrolabe across VQ, MQ, and TA metrics.

Ablation Studies:

Table 2: Ablation of CHR and surrogate policy on LongLive in the multi-prompt long-video setting.

Factor	Variant	VQ ↑	MQ ↑	TA ↑
Perturbed blocks	3	6.92	1.43	-0.18
Perturbed blocks	5	8.14	1.50	-0.14
Perturbed blocks	7	8.10	1.53	-0.16
Perturbed local KV slots	3/9	6.22	1.36	-0.20
Perturbed local KV slots	6/9	8.14	1.50	-0.14
Perturbed local KV slots	9/9	6.97	1.43	-0.17
Local KV length	Fixed 9	8.14	1.50	-0.14
Local KV length	Random {6,9,12}	8.11	1.48	-0.15
Perturbed solver steps	1	7.12	1.43	-0.18
Perturbed solver steps	2	8.14	1.50	-0.14
Perturbed solver steps	3	7.65	1.51	-0.12
Perturbed solver steps	4	7.41	1.46	-0.17
Surrogate policy	Geometric latent $\ell_2$	6.02	1.43	-0.21
Surrogate policy	TVE	8.14	1.50	-0.14

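Best overall CHR configuration: Perturbing 5 blocks, 6 out of 9 local KV slots, and the first 2 denoising steps gives the highest VQ in every ablation group. Two alternatives score marginally higher on a single metric (7 blocks: MQ 1.53; 3 solver steps: MQ 1.51 and TA -0.12) but at lower VQ.
TVE is critical: Replacing the TVE-based surrogate policy with a geometric latent $\ell_2$ distance (like NeighborGRPO) drops VQ from 8.14 to 6.02, below the unaligned LongLive baseline of 6.34 in Table 1.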

Theoretical and Practical Implications

Theoretical: KVPO demonstrates that semantic-space exploration via historical context manipulation is a principled, on-manifold alternative to noise-based perturbation for inducing diversity in ODE-based generators. The TVE-based surrogate policy provides a natural bridge between reinforcement learning and the generator's intrinsic flow-matching dynamics, avoiding geometric distortions.
Practical: The framework enables effective online preference alignment for state-of-the-art streaming AR video generators, leading to measurable improvements in visual quality, motion realism, and narrative coherence—qualities essential for real-world applications like interactive media and long-form content creation.

Conclusion

KVPO addresses the mismatch between existing RL methods and ODE-based AR video generation by combining Causal History Routing for semantic exploration with a Trajectory Velocity Energy-based surrogate policy. This keeps exploration and optimization within the model's native ODE dynamics. Experiments confirm consistent gains in human-preference alignment.

Future Directions: Extending KVPO to models with different memory mechanisms (e.g., state-space models), reducing computational overhead, and developing stronger reward models for long-horizon consistency.

Key Formulas:

Flow Matching Path: $x_t = t x_0 + (1 - t) x_T$
ODE Integration: $dx_t/dt = v_\theta(x_t, t, K_{<b})$
CHR Local Cache Construction: $\tilde{K}_{<b^*}^{g,\text{local}} = [ (K_{r_1^g}, V_{r_1^g}), ..., (K_{r_6^g}, V_{r_6^g}) ; K_{near} ]$
TVE Definition: $\mathcal{E}_\theta(X^g) = \sum_{b \in \mathcal{B}} \sum_{s=1}^S \frac{1}{d} \| v_\theta( z^g_{b,s}, t_s, K_{<b} ) - \hat{u}^g_{b,s} \|_F^2$
Gibbs Policy: $\pi_\theta(g) = \exp(\ell^g_\theta) / \sum_{h=1}^G \exp(\ell^h_\theta)$