VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Summary (Overview)

  • Latent Geometry Model (LGM): Introduces a lightweight connector that stitches video diffusion latents to geometry foundation models, enabling direct prediction of 4D scene geometry (camera pose, depth, point maps, scene flow) from the latent space without costly VAE decoding.
  • Latent-Space GRPO: Performs Group Relative Policy Optimization (GRPO) directly in the VAE latent space using geometry-aware rewards, eliminating the computational overhead of repeated RGB decoding required by prior methods.
  • Dual Reward Design: Proposes two complementary latent-space rewards: a camera motion smoothness reward to penalize jittery trajectories and a geometry reprojection consistency reward to enforce cross-view geometric coherence.
  • Dynamic Scene Support: By constructing the LGM with a 4D-aware geometry foundation model (e.g., Any4D), VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior alignment methods.
  • Empirical Superiority: Extensive experiments on static and dynamic benchmarks show VGGRPO consistently improves camera stability, geometric consistency, and overall video quality over baselines, while being more computationally efficient.

Introduction and Theoretical Foundation

Recent large-scale video diffusion models achieve high visual fidelity but often lack 3D and motion consistency, exhibiting geometric drift, unstable camera trajectories, and inconsistent scene structure. Such inconsistency is a critical obstacle for downstream applications like embodied AI and physics-aware simulation.

Existing approaches to improve consistency follow two paradigms:

  1. Architecture-level integration: Injecting geometric structure via additional conditioning modules or auxiliary losses. This increases complexity and can compromise the generalization of pretrained models.
  2. Post-training alignment: Using reinforcement learning (RL) or Direct Preference Optimization (DPO) with geometry-based rewards. However, these methods typically:
    • Rely on RGB-space rewards, requiring repeated VAE decoding which incurs substantial compute/memory overhead.
    • Are limited to static scenes due to their underlying geometric assumptions (e.g., epipolar constraints).
    • Use offline preference data, leading to off-policy optimization.

Parallel advancements in geometry foundation models (e.g., VGGT, Any4D) demonstrate that feed-forward networks can recover dense geometry and camera motion from image sequences, encoding strong geometric priors.

VGGRPO addresses these limitations by proposing a latent geometry-guided, group-based RL framework for video post-training. The core idea is to leverage geometric priors from foundation models directly in the latent space, avoiding the RGB decoding bottleneck and enabling support for dynamic 4D scenes.

Methodology

VGGRPO comprises two tightly coupled components: (1) the Latent Geometry Model (LGM) and (2) Latent-space GRPO training with geometry-aware rewards.

Preliminaries

  • Flow-Based GRPO: Formulates the denoising process of rectified flow models as a multi-step Markov Decision Process (MDP). The goal is to maximize expected reward with KL regularization toward a reference policy $\pi_{\text{ref}}$: $$\max_{\theta} \mathbb{E}_{p \sim \mathcal{P},\, \mathbf{x}_0 \sim \pi_\theta(\cdot \mid p)}\left[r(\mathbf{x}_0, p)\right] - D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$$ GRPO samples $K$ trajectories from the current policy, computes a group-relative advantage $A^k$ from the final rewards, and updates the policy using a clipped surrogate objective.
  • Geometry Foundation Models: A model $\Phi$ takes RGB frames $\{I_i\}_{i=1}^N$ and predicts per-frame geometric outputs $\mathcal{O}_i = \{\mathbf{C}_i, \mathbf{D}_i, \mathbf{P}_i\}$ (camera pose, depth map, 3D point map). Advanced models (e.g., Any4D) also predict scene flow $\mathbf{F}_i$ for 4D dynamic reconstruction.
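The two core GRPO ingredients above, the group-relative advantage and the clipped surrogate, can be sketched in a few lines of NumPy. This is a generic illustration of the mechanism, not the paper's exact objective; the function names and example reward values are our own:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A^k: normalize the K final rewards sampled for one prompt's group.

    GRPO uses these group statistics in place of a learned value baseline.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate term for one denoising trajectory."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage)

# K = 4 sampled videos for one prompt; advantages are zero-mean z-scores.
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

The clipping keeps the update conservative: a large policy ratio cannot amplify a positive advantage beyond `1 + clip_eps`, while a negative advantage is never made less punishing by the clip.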

Latent Geometry Model (LGM)

To bypass RGB decoding, a lightweight connector $S_\psi$ is learned to map VAE latents $\mathbf{z} = E(\mathbf{x})$ into the intermediate feature space of a pretrained geometry model $\Phi$.

Let $\Phi$ be composed of $L$ transformer layers: $\Phi = T_L \circ T_{L-1} \circ \cdots \circ T_1$. The LGM is constructed by replacing the first $\hat{\ell}$ layers with the connector:

$$\hat{\Phi}_\psi = \Phi_{\hat{\ell}+1:L} \circ S_\psi$$

The stitching layer $\hat{\ell}$ and connector parameters $\psi$ are found by minimizing the feature alignment error on a calibration dataset of $M$ clips:

$$\hat{\ell}, \psi = \arg\min_{\ell \in \{1, \dots, L\},\, \psi} \frac{1}{M} \sum_{m=1}^M \left\| S_\psi(E(\mathbf{x}_m)) - \Phi_{1:\ell}(\mathbf{x}_m) \right\|_2^2$$
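The stitching search above can be illustrated with a toy NumPy sketch: $\Phi$ is stood in by a stack of tanh layers, the VAE encoder by a random linear map, and $S_\psi$ by a linear connector fitted in closed form with least squares for each candidate depth $\ell$. All names and sizes are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, M = 4, 8, 64
# Stand-in geometry model: L small tanh layers.
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]

def phi_prefix(x, l):
    """Features after the first l layers, i.e. Phi_{1:l}(x)."""
    h = x
    for W in layers[:l]:
        h = np.tanh(h @ W)
    return h

E = rng.normal(size=(d, d)) / np.sqrt(d)   # stand-in VAE encoder
X = rng.normal(size=(M, d))                # calibration clips x_m
Z = X @ E                                  # latents z_m = E(x_m)

# For each candidate stitch depth l, fit the linear connector S_psi by
# least squares and keep the depth with the smallest alignment error.
best_err, best_l = np.inf, None
for l in range(1, L + 1):
    target = phi_prefix(X, l)
    W_psi, *_ = np.linalg.lstsq(Z, target, rcond=None)
    err = np.mean((Z @ W_psi - target) ** 2)
    if err < best_err:
        best_err, best_l = err, l
```

In the real system $S_\psi$ is a trained network rather than a closed-form linear map, but the search over $\ell$ follows the same pattern: fit, score the feature alignment error, keep the best stitch point.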

The model is then fine-tuned with an alignment loss on the geometric predictions:

$$\mathcal{L}_{\text{align}}(\psi) = \sum_j \lambda_j \left\| \hat{\Phi}_{\psi,j}(E(\mathbf{x})) - \Phi_j(\mathbf{x}) \right\|_1$$

where $j$ indexes the predicted modalities (pose, depth, etc.). The final LGM outputs geometry directly from latents:

$$\{\mathbf{C}_i, \mathbf{D}_i, \mathbf{P}_i, \mathbf{F}_i\}_{i=1}^N = \hat{\Phi}_\psi(\mathbf{z})$$
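A toy subgradient-descent sketch of the weighted L1 alignment objective: the connector is reduced to a single linear map `W` (standing in for $\psi$) and the teacher outputs $\Phi_j(\mathbf{x})$ to stand-in linear targets. The two "modalities", all names, and sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 64, 8
Z = rng.normal(size=(n, d))                        # latents z = E(x)
# Stand-in teacher predictions Phi_j(x) for two modalities.
targets = {j: Z @ (0.3 * rng.normal(size=(d, d)))
           for j in ("pose", "depth")}
lam = {"pose": 0.5, "depth": 1.0}                  # weights lambda_j

def l_align(W):
    """Weighted L1 alignment loss summed over modalities j."""
    return sum(lam[j] * np.abs(Z @ W - T).mean() for j, T in targets.items())

W = np.zeros((d, d))
loss_before = l_align(W)
for _ in range(200):
    # Subgradient of each mean-L1 term with respect to W.
    g = sum(lam[j] * (Z.T @ np.sign(Z @ W - T)) / (n * d)
            for j, T in targets.items())
    W -= 0.05 * g
loss_after = l_align(W)
```

The L1 form makes the fine-tuning robust to outlier pixels in any single modality, while the $\lambda_j$ weights balance modalities with different numeric scales.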

VGGRPO Training

Using the LGM $\hat{\Phi}_\psi$, latent-space GRPO is performed with two designed rewards computed from denoised latents $\mathbf{z}_0$.

  1. Camera Motion Smoothness Reward ($r_{\text{motion}}$): Encourages stable camera trajectories. From the predicted poses $\mathbf{C}_i$, extract camera centers $\mathbf{c}_i$, velocities $\mathbf{v}_i = \mathbf{c}_{i+1} - \mathbf{c}_i$, and accelerations $\mathbf{a}_i = \mathbf{v}_i - \mathbf{v}_{i-1}$.

    • Translational smoothness error: $e_{\text{trans}}(\mathbf{z}_0) = \frac{1}{T-2} \sum_{i=2}^{T-1} \frac{\|\mathbf{a}_i\|_2}{\|\mathbf{v}_i\|_2 + \|\mathbf{v}_{i-1}\|_2}$
    • Rotational smoothness error $e_{\text{rot}}$: Computed analogously using angular velocities $\boldsymbol{\omega}_i$ and angular accelerations $\boldsymbol{\alpha}_i$.
    • Combined reward, mapping both errors to $[0, 1]$: $r_{\text{motion}}(\mathbf{z}_0) = \frac{1}{2} \left( \frac{1}{1 + e_{\text{trans}}(\mathbf{z}_0)} + \frac{1}{1 + e_{\text{rot}}(\mathbf{z}_0)} \right)$
  2. Geometry Reprojection Consistency Reward ($r_{\text{geo}}$): Enforces cross-view geometric coherence.

    • Construct a scene point cloud from the predicted point maps $\{\mathbf{P}_i\}$ (for dynamic scenes, the scene flow $\mathbf{F}_i$ is used to filter out dynamic regions).
    • Reproject the point cloud into each view $i$ using the predicted camera $\mathbf{C}_i$ to obtain a rendered depth map $\hat{\mathbf{D}}_i$.
    • Compute the per-view error against the predicted depth $\mathbf{D}_i$: $e^{(i)}_{\text{geo}}(\mathbf{z}_0) = \frac{1}{|\Omega_i|} \sum_{\mathbf{p} \in \Omega_i} \left| \hat{\mathbf{D}}_i(\mathbf{p}) - \mathbf{D}_i(\mathbf{p}) \right|$
    • The reward is the negated average error over the three worst views: $r_{\text{geo}}(\mathbf{z}_0) = -\frac{1}{3} \sum_{i \in \text{top-3}} e^{(i)}_{\text{geo}}(\mathbf{z}_0)$
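A minimal NumPy sketch of both rewards, assuming the camera centers and per-view depth maps have already been extracted from the LGM outputs. The point-cloud reprojection that produces the rendered depths $\hat{\mathbf{D}}_i$ is not shown, all pixels are treated as valid ($\Omega_i$ is the whole image), and only the translational half of $r_{\text{motion}}$ is implemented:

```python
import numpy as np

def motion_smoothness_reward(centers, eps=1e-8):
    """Translational part of r_motion from camera centers c_i, shape (T, 3).

    The rotational term is computed analogously from angular velocities
    and accelerations and is omitted here for brevity.
    """
    v = np.diff(centers, axis=0)        # v_i = c_{i+1} - c_i
    a = np.diff(v, axis=0)              # a_i = v_i - v_{i-1}
    e_trans = np.mean(np.linalg.norm(a, axis=1)
                      / (np.linalg.norm(v[1:], axis=1)
                         + np.linalg.norm(v[:-1], axis=1) + eps))
    return 1.0 / (1.0 + e_trans)        # maps the error into (0, 1]

def geometry_reward(rendered_depths, predicted_depths, k=3):
    """r_geo: negated mean depth error over the k worst views."""
    errs = np.array([np.abs(dr - dp).mean()
                     for dr, dp in zip(rendered_depths, predicted_depths)])
    return -float(np.mean(np.sort(errs)[-k:]))

# A constant-velocity trajectory has zero acceleration, so the
# translational error is 0 and the reward is exactly 1.
line = np.linspace(0.0, 1.0, 10)[:, None] * np.ones(3)
rng = np.random.default_rng(0)
shaky = line + rng.normal(scale=0.05, size=line.shape)
```

Normalizing acceleration by the neighboring speeds makes the motion error scale-invariant: a fast but steady dolly shot is not penalized, while jitter relative to the current speed is.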

Alignment Policy Update: For each prompt, sample $K$ latent videos $\{\mathbf{z}_0^k\}_{k=1}^K$. The group-relative advantage is the average of the two normalized rewards:

$$A^k = \frac{1}{2} \left( \frac{r_{\text{motion}}(\mathbf{z}_0^k) - \mu_{\text{motion}}}{\sigma_{\text{motion}}} + \frac{r_{\text{geo}}(\mathbf{z}_0^k) - \mu_{\text{geo}}}{\sigma_{\text{geo}}} \right)$$

This advantage is substituted into the GRPO objective (Equation (17) in the paper) to update the policy, with all computations performed in latent space.
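The dual-reward advantage above normalizes each reward within the group before averaging, so neither reward's scale dominates. A small NumPy sketch with hypothetical reward values for $K = 4$ samples:

```python
import numpy as np

def dual_reward_advantage(r_motion, r_geo, eps=1e-8):
    """A^k: average of the per-reward group-normalized scores."""
    def znorm(r):
        r = np.asarray(r, dtype=np.float64)
        return (r - r.mean()) / (r.std() + eps)
    return 0.5 * (znorm(r_motion) + znorm(r_geo))

# K = 4 sampled latent videos for one prompt; sample 0 scores best on
# both rewards, so it receives the largest advantage.
A = dual_reward_advantage([0.90, 0.70, 0.80, 0.60],
                          [-0.05, -0.12, -0.08, -0.20])
```

Because each reward is z-scored separately, $r_{\text{motion}}$ (bounded in $[0, 1]$) and $r_{\text{geo}}$ (a negated depth error in scene units) contribute equally to the advantage despite living on very different scales.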

Empirical Validation / Results

Experimental Setup

  • LGM: Constructed by stitching to Any4D (a 4D-aware geometry model). Trained on a mixture of generated and real videos.
  • Base Models: Two text-to-video diffusion backbones are fine-tuned: Wan2.1-1B and Wan2.2-5B.
  • Benchmarks: 190 static-scene and 200 dynamic-scene captions from DL3DV, RealEstate10K, and MiraData.
  • Baselines: Base Model, Supervised Fine-Tuning (SFT), Epipolar-DPO, VideoGPA.

Main Results

Quantitative Evaluation (Key Table)

| Method | VQ ↑ (Static) | MQ ↑ (Static) | Epi. ↓ | VQ ↑ (Dynamic) | MQ ↑ (Dynamic) | Img. Qual. ↑ | Mot. Smooth. ↑ | Dyn. Deg. ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Base Model: Wan2.1-1B* | | | | | | | | |
| Base | - | - | 0.133 | - | - | 0.7941 | 0.8930 | 0.5233 |
| SFT | 45.26 | 46.84 | 0.137 | 40.00 | 39.00 | 0.8032 | 0.8896 | 0.5472 |
| Epipolar-DPO | 54.21 | 55.79 | 0.098 | 45.50 | 43.00 | 0.8125 | 0.8916 | 0.5578 |
| VideoGPA | 53.68 | 56.32 | 0.105 | 42.50 | 41.00 | 0.8068 | 0.8931 | 0.5562 |
| VGGRPO (Ours) | 59.47 | 66.84 | 0.102 | 57.00 | 63.00 | 0.8255 | 0.8974 | 0.5623 |
| *Base Model: Wan2.2-5B* | | | | | | | | |
| Base | - | - | 0.142 | - | - | 0.8151 | 0.8958 | 0.4837 |
| SFT | 46.32 | 52.63 | 0.129 | 33.00 | 51.00 | 0.8323 | 0.8925 | 0.4886 |
| Epipolar-DPO | 52.11 | 58.95 | 0.101 | 38.00 | 54.50 | 0.8407 | 0.9054 | 0.4945 |
| VideoGPA | 54.74 | 60.53 | 0.098 | 40.00 | 54.00 | 0.8511 | 0.9048 | 0.4920 |
| VGGRPO (Ours) | 62.63 | 68.42 | 0.093 | 56.50 | 66.00 | 0.8672 | 0.9056 | 0.5094 |

Table 1: Quantitative Comparison. VGGRPO consistently outperforms baselines on geometry-related metrics (VideoReward win rates for Visual Quality (VQ) and Motion Quality (MQ) on static/dynamic splits) and general VBench metrics across different base models.

Key Findings:

  • VGGRPO achieves higher motion quality and geometric consistency across both static and dynamic splits.
  • Prior geometry-aligned baselines degrade on the dynamic benchmark with complex non-rigid motion, while VGGRPO remains robust.
  • VGGRPO also improves general VBench metrics, indicating improved geometric consistency does not come at the expense of perceptual quality.

Qualitative Results: Visual comparisons show baselines exhibit geometric drift, temporal flicker, and unstable camera motion. VGGRPO produces more coherent scene structure and smoother camera trajectories in both static and dynamic settings.

Additional Studies

Impact of Reward Components

| $r_{\text{motion}}$ | $r_{\text{geo}}$ | VQ ↑ | MQ ↑ | Epi. ↓ |
| --- | --- | --- | --- | --- |
| ✓ | ✗ | 55.60 | 63.40 | 0.104 |
| ✓ | ✓ | 59.57 | 67.21 | 0.093 |

Table 2b: $r_{\text{motion}}$ stabilizes camera motion, while adding $r_{\text{geo}}$ further improves scene geometry, confirming their complementarity.

Efficiency Study

| Reward | Time (s) ↓ | Mem (GB) ↓ |
| --- | --- | --- |
| RGB-based | 54.73 | 76.80 |
| Ours (latent) | 41.33 | 68.57 |

Table 2e: Latent-space reward computation reduces runtime by 24.5% and peak GPU memory by 10.7% compared to RGB-based rewarding.

Other Findings:

  • Generalization: VGGRPO improves performance on standard VBench captions, indicating preserved general-purpose generation quality.
  • Test-Time Guidance: The differentiable LGM enables gradient-based test-time guidance in latent space to improve geometry without training.
  • LGM Robustness: The latent geometry model is more robust to perturbations in the latent space compared to RGB-based geometry models applied to decoded frames, avoiding a distribution gap.

Theoretical and Practical Implications

  • Theoretical: Demonstrates that reliable geometry-driven rewards can be computed directly in latent space, enabling efficient on-policy RL for video generation. The LGM provides a principled method to bridge the representation gap between generative and discriminative (geometry) models.
  • Practical: Provides an efficient and flexible post-training framework that:
    • Preserves pretrained capacity by using lightweight LoRA adaptation and KL regularization.
    • Eliminates the RGB decoding bottleneck, significantly reducing compute and memory overhead.
    • Supports dynamic 4D scenes, expanding the applicability of geometry-aware alignment to real-world videos with motion.
    • Serves as a broadly applicable regularizer that improves world consistency while maintaining or enhancing general video quality.

Conclusion

VGGRPO introduces a latent geometry-guided framework for world-consistent video generation. By constructing a Latent Geometry Model and performing latent-space GRPO with complementary camera motion and geometry reprojection rewards, it aligns pretrained video diffusion models toward 4D world consistency for both static and dynamic scenes.

Key takeaways:

  • Latent-space geometry rewards enable efficient post-training without repeated VAE decoding.
  • The dual reward design effectively improves camera smoothness and cross-view geometric coherence.
  • The method generalizes across different base models and geometry foundations, and maintains strong visual fidelity.

Future directions may include extending the framework to other generative modalities, incorporating more complex physical constraints, and exploring the use of the LGM for other tasks like 4D reconstruction from generated videos.