Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Summary (Overview)
- Core Problem: Group Relative Policy Optimization (GRPO) for video generation is less reliable than for images/language due to complex solution spaces. Converting deterministic ODE samplers to SDE for exploration injects excess noise, degrading rollout quality and destabilizing reward-guided alignment.
- Key Solution: Formulate GRPO as a manifold-constrained exploration problem. The pre-trained model defines a valid video data manifold $\mathcal{M}$; the goal is to constrain exploration to its vicinity to preserve rollout validity and reward reliability.
- Micro-Level Innovation: Propose a Precise Manifold-Aware SDE with a logarithmic curvature correction for accurate noise variance, coupled with a Gradient Norm Equalizer to balance optimization pressure across timesteps.
- Macro-Level Innovation: Introduce a Dual Trust Region mechanism combining a periodic moving anchor (position control) and stepwise constraints (velocity control) to prevent long-horizon policy drift while maintaining plasticity.
- Empirical Result: SAGE-GRPO demonstrates consistent gains over baselines (DanceGRPO, FlowGRPO, CPS) on HunyuanVideo 1.5 using the VideoAlign reward model, in both reward maximization (VQ, MQ, TA) and visual metrics (CLIPScore, PickScore).
Introduction and Theoretical Foundation
Group Relative Policy Optimization (GRPO) has become a standard method for aligning generative models (diffusion/flow matching) with reward signals. However, its application to video generation remains significantly less reliable than for language models and images. This performance gap stems from the inherently large and structured solution space of videos.
The standard GRPO training procedure requires converting a deterministic ODE sampler into a stochastic SDE sampler to enable exploration through diverse samples. Current video GRPO baselines (e.g., DanceGRPO, FlowGRPO) derive the SDE noise standard deviation using Euler-style discretization and first-order approximations. This introduces first-order truncation error, injecting excess noise energy during sampling (see Figure 1a.1). This excess noise:
- Lowers the quality of generated rollouts, especially in high-noise steps.
- Makes reward evaluations less reliable, destabilizing the post-training alignment process.
The authors reframe the core problem. A pre-trained video generation model parameterized by $\theta_0$ defines a valid data manifold $\mathcal{M}$. While the initial parameters are insufficient for the target reward, GRPO must update $\theta$ through exploration while keeping trajectories within the vicinity of $\mathcal{M}$ to ensure rollouts remain valid. Conventional SDE exploration can overestimate noise variance, pushing states away from $\mathcal{M}$ and producing temporal artifacts (see Figure 2, red trajectory).
Thus, the central challenge is: how can exploration be constrained to the vicinity of the data manifold $\mathcal{M}$ so that rollouts improve while reward evaluation remains reliable?
Methodology
SAGE-GRPO (Stable Alignment via Exploration) addresses this challenge with a unified strategy operating at both micro (sampling) and macro (policy optimization) levels.
3.1 Preliminaries: Flow Matching and GRPO
- Flow Matching / Rectified Flow: Generation is modeled as transport along a probability path via an ODE, $\frac{dx_t}{dt} = v_\theta(x_t, t)$. Using the linear interpolation path $x_t = (1 - t)x_0 + t\epsilon$ (with data $x_0$ and noise $\epsilon \sim \mathcal{N}(0, I)$), the target velocity field is $v_t = \epsilon - x_0$ (see the sketch after this list).
- Group Relative Policy Optimization (GRPO): For a prompt $c$, GRPO samples a group of $G$ rollouts with rewards $\{r_i\}_{i=1}^{G}$ and optimizes the policy using a group-normalized advantage $\hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\})}{\mathrm{std}(\{r_j\})}$, where $\hat{A}_i$ is the normalized advantage for rollout $i$.
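A minimal sketch of the two building blocks above, assuming the standard rectified flow convention ($x_0$ = data, $\epsilon$ = noise) and per-group reward normalization. The function names `rectified_flow_target` and `group_advantages` are illustrative, not taken from the paper.

```python
import torch

def rectified_flow_target(x0: torch.Tensor, eps: torch.Tensor, t: torch.Tensor):
    """Linear interpolation path x_t = (1 - t) * x0 + t * eps and its target velocity.

    x0: clean data sample, eps: Gaussian noise, t: scalar time in [0, 1].
    Returns (x_t, v_t), where v_t = eps - x0 is the regression target for v_theta.
    """
    x_t = (1.0 - t) * x0 + t * eps
    v_t = eps - x0
    return x_t, v_t

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-normalized advantages used by GRPO.

    rewards: shape (G,), rewards of the G rollouts generated for one prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Tiny usage example with made-up shapes.
x0, eps_noise = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
x_t, v_t = rectified_flow_target(x0, eps_noise, t=torch.tensor(0.7))
adv = group_advantages(torch.tensor([0.2, 0.5, -0.1, 0.9]))
```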
3.2 SAGE-GRPO Framework
3.2.1 Micro-Level Exploration: Precise SDE and Gradient Equalization
The goal is to perturb the Rectified Flow ODE with a marginal-preserving SDE whose noise stays aligned with the manifold $\mathcal{M}$. The critical step is computing the correct noise standard deviation $\delta_t$ for each sampling interval.
For a marginal-preserving SDE with diffusion coefficient $g_t$ scaled by an exploration factor $a$, the noise variance accumulated over a sampling interval $[t, t + \Delta t]$ is obtained by exact integration rather than a first-order (Euler-style) expansion.
The resulting expression contains a logarithmic correction term that accounts for the geometric contraction of the signal coefficient $(1 - t)$ over the interval, which linear approximations fail to capture. The noise standard deviation $\delta_t$ is the square root of this integrated variance.
Applying Euler-Maruyama discretization with timestep $\Delta t$, each sampling step combines the deterministic velocity update, an Itô (score-based) drift correction, and Gaussian noise with standard deviation $\delta_t$.
The score function estimate enters through the Itô correction term, which ensures consistency with the Rectified Flow marginals.
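A schematic Euler-Maruyama step showing where the corrected noise standard deviation enters the sampler. This is a generic sketch of a marginal-preserving SDE step, not the paper's exact update: `score`, `diffusion_sq`, and `precise_noise_std` are placeholders for the paper's derived quantities, and sign conventions depend on the time parameterization.

```python
import torch

def euler_maruyama_step(x_t, t, dt, velocity, score, diffusion_sq, noise_std):
    """One stochastic sampling step of a marginal-preserving SDE (schematic sketch).

    velocity(x, t)    -> v_theta(x, t), the Rectified Flow velocity field.
    score(x, t)       -> estimate of grad_x log p_t(x) (placeholder).
    diffusion_sq(t)   -> squared diffusion coefficient g_t^2 (placeholder).
    noise_std(t, dt)  -> noise std delta_t from the integrated variance (placeholder
                         for the paper's log-corrected expression; a first-order
                         scheme would instead use sqrt(diffusion_sq(t) * dt)).
    """
    # Deterministic velocity update plus Ito (score-based) drift correction,
    # the standard conversion of an ODE into an SDE with the same marginals.
    drift = velocity(x_t, t) + 0.5 * diffusion_sq(t) * score(x_t, t)
    # Gaussian exploration noise with the corrected standard deviation.
    noise = noise_std(t, dt) * torch.randn_like(x_t)
    return x_t + drift * dt + noise
```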
Gradient Norm Equalizer: Even with the corrected SDE, an inherent signal-to-noise imbalance exists across timesteps. For a Gaussian transition $p_\theta(x_{t-\Delta t} \mid x_t) = \mathcal{N}(\mu_\theta, \delta_t^2 I)$, the policy-gradient norm scales inversely with the noise level, $\|\nabla_\theta \log p_\theta\| \propto 1/\delta_t$.
This causes gradients to vanish at high-noise timesteps (large $\delta_t$) and explode at low-noise timesteps (small $\delta_t$), biasing learning. To counteract this, a per-timestep gradient scale is estimated from the SDE parameters, and the per-timestep objective is divided by this estimate plus a small constant $\varepsilon$ as a robust normalization (see the sketch below).
This equalization normalizes optimization pressure across timesteps.
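A minimal sketch of how such an equalizer could be applied, assuming the per-timestep gradient scale is estimated as proportional to $1/\delta_t$ from the SDE noise levels. The function name and the exact scale estimate are illustrative, not taken from the paper.

```python
import torch

def equalized_timestep_loss(per_step_losses: torch.Tensor,
                            noise_stds: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    """Rescale per-timestep losses so every timestep exerts similar gradient pressure.

    per_step_losses: tensor (T,) of per-timestep policy-gradient losses.
    noise_stds:      tensor (T,) of SDE noise stds delta_t at each timestep.
    Assumed scale estimate: ||grad|| ~ 1 / delta_t; dividing by (estimate + eps)
    equalizes optimization pressure across timesteps.
    """
    grad_scale = 1.0 / (noise_stds + eps)          # estimated per-timestep gradient norm
    equalized = per_step_losses / (grad_scale + eps)
    return equalized.mean()
```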
3.2.2 Macro-Level Exploration: Dual Trust Region Optimization
To prevent long-horizon policy drift from the manifold, KL divergence is used as a dynamic anchoring mechanism.
- KL Divergence as Anchor: For a Gaussian policy with shared variance, $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) = \frac{\|\mu_\theta - \mu_{\mathrm{ref}}\|^2}{2\delta_t^2}$. The choice of reference policy $\pi_{\mathrm{ref}}$ determines the nature of the constraint.
- Fixed KL: Uses $\pi_{\theta_0}$ (the initial model) as the reference. This is a hard constraint that can prevent reaching the optimal policy if it lies far from $\theta_0$, leading to underfitting.
- Step-wise KL: Uses $\pi_{\theta_{k-1}}$ (the previous step's policy) as the reference. This acts as a velocity limit, restricting the update magnitude per step, but does not bound the cumulative displacement from the pre-trained model, allowing unbounded drift over many steps.
- Periodical Moving KL: Uses a periodically refreshed anchor policy, updated every fixed number of training steps. This provides position control by creating a dynamic trust region that periodically resets the safe zone to a more manifold-consistent policy, enabling staged exploration.
- Dual KL (Position-Velocity Controller): Combines both mechanisms by adding the moving-anchor KL and the step-wise KL to the objective (see the sketch after this list). The position term prevents long-term drift; the velocity term smooths instantaneous updates.
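A minimal sketch of the dual trust-region idea, assuming Gaussian step distributions with a shared standard deviation so each KL reduces to a scaled squared mean difference. The weight names (`beta_pos`, `beta_vel`) and the anchor-refresh interval are illustrative placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

def gaussian_kl(mu_p: torch.Tensor, mu_q: torch.Tensor, noise_std: torch.Tensor):
    """KL between two Gaussian step distributions with shared std (closed form)."""
    return ((mu_p - mu_q) ** 2).sum(dim=-1) / (2.0 * noise_std ** 2)

def dual_kl_penalty(mu_cur, mu_anchor, mu_prev, noise_std,
                    beta_pos: float = 0.1, beta_vel: float = 0.1):
    """Position term (vs. periodically refreshed anchor) + velocity term (vs. previous policy)."""
    position = gaussian_kl(mu_cur, mu_anchor, noise_std)  # bounds cumulative drift
    velocity = gaussian_kl(mu_cur, mu_prev, noise_std)    # bounds per-step update size
    return beta_pos * position.mean() + beta_vel * velocity.mean()

def maybe_refresh_anchor(step: int, refresh_every: int, policy: nn.Module, anchor: nn.Module):
    """Every `refresh_every` optimizer steps, copy the current weights into the anchor."""
    if step % refresh_every == 0:
        anchor.load_state_dict(policy.state_dict())
```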
The full SAGE-GRPO objective combines the group-normalized GRPO loss, the temporal gradient equalization, and the Dual KL regularizer.
An adaptive schedule is used for the KL weight: it is warmed up over the first 100 steps and subsequently adjusted by a conservative feedback controller.
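A sketch of what such a schedule could look like: a linear warm-up followed by a conservative multiplicative feedback rule that nudges the KL weight toward a target KL level. All constants (`beta_start`, `beta_end`, `target_kl`, `adjust_rate`) are hypothetical placeholders, since the paper's exact values are not reproduced here.

```python
def kl_weight_schedule(step: int, measured_kl: float, beta: float, *,
                       warmup_steps: int = 100,
                       beta_start: float = 1e-4, beta_end: float = 1e-2,
                       target_kl: float = 0.05, adjust_rate: float = 1.05) -> float:
    """Warm-up followed by conservative feedback control of the KL weight (illustrative).

    During warm-up the weight rises linearly from beta_start to beta_end.
    Afterwards it is gently increased when the measured KL exceeds the target
    and gently decreased otherwise, keeping updates conservative.
    """
    if step < warmup_steps:
        frac = step / warmup_steps
        return beta_start + frac * (beta_end - beta_start)
    if measured_kl > target_kl:
        return beta * adjust_rate      # policy drifting too fast: strengthen the anchor
    return beta / adjust_rate          # plenty of slack: relax the constraint slightly
```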
Empirical Validation / Results
Experimental Setup: All experiments are conducted on HunyuanVideo 1.5. The VideoAlign evaluator (Liu et al., 2025c) is used as the frozen reward oracle, providing scores for Visual Quality (VQ), Motion Quality (MQ), and Text Alignment (TA). The overall reward is a weighted combination of these three component scores. Baselines include DanceGRPO, FlowGRPO, and CPS.
Main Results (Table 2): SAGE-GRPO is evaluated under two reward configurations:
- Setting A (Averaged): the three component rewards (VQ, MQ, TA) are weighted equally.
- Setting B (Alignment-Focused): the weighting places greater emphasis on Text Alignment (TA).
Key Findings:
- Under the alignment-focused setting (B), SAGE-GRPO with Dual Moving KL achieves the best performance in Overall reward, VQ, MQ, and CLIPScore, while remaining competitive in TA and PickScore.
- Emphasizing text alignment (Setting B) provides a more reliable optimization target and yields more stable gains across both reward and visual metrics compared to the averaged setting.
- The Dual Moving KL mechanism is crucial for achieving high and stable rewards while maintaining exploration (see Figure 8).
Table 2: Main Comparison on Video Generation Benchmarks
| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| HunyuanVideo 1.5 (Original) | - | 0.0654 | -0.7539 | -0.5870 | 1.4063 | 0.5409 | 0.7397 |
| **Setting A: Averaged Rewards** | | | | | | | |
| DanceGRPO | w/o KL | 0.2768 | -0.7589 | -0.3852 | 1.4209 | 0.5386 | 0.7378 |
| DanceGRPO | w/ Fixed KL | 0.0979 | -0.8077 | -0.5091 | 1.4147 | 0.5403 | 0.7355 |
| FlowGRPO | w/o KL | 0.2733 | -0.7151 | -0.5286 | 1.5170 | 0.5443 | 0.7394 |
| FlowGRPO | w/ Fixed KL | 0.1880 | -0.6771 | -0.5912 | 1.4563 | 0.5431 | 0.7407 |
| CPS | w/o KL | 0.6343 | -0.4855 | -0.4021 | 1.5219 | 0.5479 | 0.7412 |
| CPS | w/ Fixed KL | 0.0928 | -0.7156 | -0.5825 | 1.3908 | 0.5479 | 0.7369 |
| SAGE-GRPO | w/o KL | 0.4859 | -0.6104 | -0.4141 | 1.5104 | 0.5423 | 0.7360 |
| SAGE-GRPO | w/ Fixed KL | 0.2244 | -0.7438 | -0.5320 | 1.5001 | 0.5446 | 0.7382 |
| SAGE-GRPO | w/ Dual Mov KL | 0.2173 | -0.7881 | -0.4249 | 1.4303 | 0.5430 | 0.7452 |
| **Setting B: Alignment-Focused** | | | | | | | |
| DanceGRPO | w/o KL | -0.2172 | -0.8854 | -0.6218 | 1.2901 | 0.5439 | 0.7352 |
| DanceGRPO | w/ Fixed KL | 0.1290 | -0.7739 | -0.5083 | 1.4112 | 0.5452 | 0.7276 |
| FlowGRPO | w/o KL | 0.4773 | -0.5671 | -0.4731 | 1.5175 | 0.5403 | 0.7349 |
| FlowGRPO | w/ Fixed KL | 0.2103 | -0.6654 | -0.5506 | 1.4263 | 0.5427 | 0.7408 |
| CPS | w/o KL | 0.3694 | -0.6650 | -0.5325 | 1.5669 | 0.5479 | 0.7311 |
| CPS | w/ Fixed KL | 0.3705 | -0.6121 | -0.4787 | 1.4613 | 0.5458 | 0.7364 |
| SAGE-GRPO | w/o KL | -0.1222 | -0.8720 | -0.6046 | 1.3544 | 0.5404 | 0.7357 |
| SAGE-GRPO | w/ Fixed KL | 0.2857 | -0.7062 | -0.4425 | 1.4344 | 0.5414 | 0.7377 |
| SAGE-GRPO | w/ Dual Mov KL | 0.8066 | -0.4765 | -0.2384 | 1.5216 | 0.5484 | 0.7420 |

Bold, underline, and gray shading in the source table mark the best, second-best, and third-best results across both settings (A+B).
User Study (Table 3): A pairwise preference study with 29 evaluators on 32 prompts shows strong human preference for SAGE-GRPO over all baselines, especially in Motion Quality.
| SAGE-GRPO vs. | Visual Quality | Motion Quality | Semantic Alignment |
|---|---|---|---|
| DanceGRPO | 85.9% | 75.8% | 79.2% |
| FlowGRPO | 83.8% | 79.2% | 71.9% |
| CPS | 80.2% | 70.8% | 67.9% |
Ablation Studies:
- Temporal Gradient Equalizer (Figure 3): Applying the equalizer leads to smoother reward curves with consistent improvement, reducing gradient scale variation by over an order of magnitude.
- KL Strategy (Figure 8): Dual Moving KL achieves the highest and most stable final reward while maintaining a stable exploration level, validating the position-velocity controller design.
- KL Weight Sensitivity (Figure 7): A two-stage schedule that increases the KL weight from a small warm-up value to a larger final value yields the strongest and most consistent gains across VQ, MQ, and TA.
- Qualitative Analysis (Figures 6, 10, 11, 12): SAGE-GRPO generates videos with reduced temporal jitter, enhanced photorealism and alignment under complex conditions (occlusion, lighting), stronger semantic alignment, and better capture of subtle emotional cues described in prompts.
Theoretical and Practical Implications
Theoretical Implications:
- Manifold-Constrained RL: Provides a novel geometric perspective for RL in high-dimensional generative tasks, framing exploration as a problem of staying near a data manifold defined by a pre-trained model.
- Precise Stochastic Discretization: Derives a corrected SDE variance via integration and a logarithmic term, moving beyond first-order approximations common in the field.
- Dual Trust Regions: Introduces a principled combination of position and velocity control via KL divergence, offering a solution to the stability-plasticity dilemma in generative model alignment.
Practical Implications:
- Stable Video Alignment: SAGE-GRPO enables more reliable and effective application of GRPO to large-scale video generation models.