Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Summary (Overview)

  • Core Problem: Group Relative Policy Optimization (GRPO) for video generation is less reliable than for images/language due to complex solution spaces. Converting deterministic ODE samplers to SDE for exploration injects excess noise, degrading rollout quality and destabilizing reward-guided alignment.
  • Key Solution: Formulate GRPO as a manifold-constrained exploration problem. The pre-trained model defines a valid video data manifold M\mathcal{M}; the goal is to constrain exploration to its vicinity to preserve rollout validity and reward reliability.
  • Micro-Level Innovation: Propose a Precise Manifold-Aware SDE with a logarithmic curvature correction for accurate noise variance, coupled with a Gradient Norm Equalizer to balance optimization pressure across timesteps.
  • Macro-Level Innovation: Introduce a Dual Trust Region mechanism combining a periodic moving anchor (position control) and stepwise constraints (velocity control) to prevent long-horizon policy drift while maintaining plasticity.
  • Empirical Result: SAGE-GRPO demonstrates consistent gains over baselines (DanceGRPO, FlowGRPO, CPS) on HunyuanVideo1.5 using the VideoAlign reward model, in both reward maximization (VQ, MQ, TA) and visual metrics (CLIPScore, PickScore).

Introduction and Theoretical Foundation

Group Relative Policy Optimization (GRPO) has become a standard method for aligning generative models (diffusion/flow matching) with reward signals. However, its application to video generation remains significantly less reliable than for language models and images. This performance gap stems from the inherently large and structured solution space of videos.

The standard GRPO training procedure requires converting a deterministic ODE sampler into a stochastic SDE sampler to enable exploration through diverse samples. Current video GRPO baselines (e.g., DanceGRPO, FlowGRPO) derive the SDE noise standard deviation using Euler-style discretization and first-order approximations. This introduces first-order truncation error, injecting excess noise energy during sampling (see Figure 1a.1). This excess noise:

  1. Lowers the quality of generated rollouts, especially in high-noise steps.
  2. Makes reward evaluations less reliable, destabilizing the post-training alignment process.

The authors reframe the core problem. A pre-trained video generation model parameterized by θ\theta defines a valid data manifold MRD\mathcal{M} \subset \mathbb{R}^D. While the initial parameters θ0\theta_0 are insufficient for the target reward, GRPO must update θ\theta through exploration while keeping trajectories within the vicinity of M\mathcal{M} to ensure rollouts remain valid. Conventional SDE exploration can overestimate noise variance, pushing states ztz_t away from M\mathcal{M} and producing temporal artifacts (see Figure 2, red trajectory).

Thus, the central challenge is: How to constrain exploration within the vicinity of the data manifold to improve rollouts while keeping reward evaluation reliable?

Methodology

SAGE-GRPO (Stable Alignment via Exploration) addresses this challenge with a unified strategy operating at both micro (sampling) and macro (policy optimization) levels.

3.1 Preliminaries: Flow Matching and GRPO

  • Flow Matching / Rectified Flow: Generation is modeled as transport along a probability path pt(x)p_t(x) via an ODE: dxtdt=vθ(xt,t)\frac{dx_t}{dt} = v_\theta(x_t, t) Using the linear interpolation path xt=(1σt)x0+σtz1x_t = (1 - \sigma_t)x_0 + \sigma_t z_1, the velocity field is: vθ(xt,t)=dxtdt=dσtdt(x0z1)=11σt(xtx0)v_\theta(x_t, t) = \frac{dx_t}{dt} = -\frac{d\sigma_t}{dt}(x_0 - z_1) = \frac{1}{1-\sigma_t}(x_t - x_0)
  • Group Relative Policy Optimization (GRPO): For a prompt cc, GRPO samples a group of GG rollouts and optimizes the policy using a group-normalized advantage: LGRPO(θ)=1Gi=1GAit=1Tlogπθ(xt1(i)xt(i),c)\mathcal{L}_{GRPO}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} A_i \cdot \sum_{t=1}^{T} \log \pi_\theta(x^{(i)}_{t-1} | x^{(i)}_t, c) where AiA_i is the normalized advantage for rollout ii.

3.2 SAGE-GRPO Framework

3.2.1 Micro-Level Exploration: Precise SDE and Gradient Equalization

The goal is to perturb the Rectified Flow ODE with a marginal-preserving SDE whose noise stays aligned with the manifold M\mathcal{M}. The critical step is computing the correct noise standard deviation Σt1/2\Sigma^{1/2}_t.

For a marginal-preserving SDE with diffusion coefficient εt=ησt/(1σt)\varepsilon_t = \eta \sqrt{\sigma_t / (1-\sigma_t)} (where η\eta is an exploration scaling factor), the integrated variance over the interval [σt+1,σt][\sigma_{t+1}, \sigma_t] is derived as:

Σt=σtσt+1εs2ds=η2[(σtσt+1)+log(1σt+11σt)]\Sigma_t = \int_{\sigma_t}^{\sigma_{t+1}} \varepsilon^2_s ds = \eta^2 \left[ -(\sigma_t - \sigma_{t+1}) + \log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) \right]

The logarithmic term log(1σt+11σt)\log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) accounts for the geometric contraction of the signal coefficient (1σt)(1-\sigma_t), which linear approximations fail to capture. The noise standard deviation is:

Σt1/2=η(σtσt+1)+log(1σt+11σt)\Sigma^{1/2}_t = \eta \sqrt{ -(\sigma_t - \sigma_{t+1}) + \log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) }

Applying Euler-Maruyama discretization with timestep Δt=σtσt+1\Delta t = \sigma_t - \sigma_{t+1}:

xt+Δt=xt+vθ(xt,t)Δt+Σt2sθ(xt)+Σt1/2ϵ,ϵN(0,I)x_{t+\Delta t} = x_t + v_\theta(x_t, t)\Delta t + \frac{\Sigma_t}{2} s_\theta(x_t) + \Sigma^{1/2}_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where sθ(xt)(xtx^0)/σt2s_\theta(x_t) \approx -(x_t - \hat{x}_0)/\sigma^2_t is the score function estimate. The Itô correction term Σt2sθ(xt)\frac{\Sigma_t}{2} s_\theta(x_t) ensures consistency with Rectified Flow marginals.

Gradient Norm Equalizer: Even with the corrected SDE, an inherent signal-to-noise imbalance exists across timesteps. For a Gaussian transition π(xt1xt)=N(μθ,ΣtI)\pi(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \Sigma_t I), the gradient norm scales as:

μlogπ1Σt1/2\|\nabla_\mu \log \pi\| \propto \frac{1}{\Sigma^{1/2}_t}

This causes gradients to vanish at high noise (t1t \to 1) and explode at low noise (t0t \to 0), biasing learning. To counteract this, a per-timestep gradient scale NtN_t is estimated from SDE parameters, and a robust normalization is applied:

St=Median({Nτ}τ=1T)Nt+ϵS_t = \frac{\text{Median}(\{N_\tau\}_{\tau=1}^T)}{N_t + \epsilon}

where ϵ\epsilon is a small constant. This equalization normalizes optimization pressure across timesteps.

3.2.2 Macro-Level Exploration: Dual Trust Region Optimization

To prevent long-horizon policy drift from the manifold, KL divergence is used as a dynamic anchoring mechanism.

  • KL Divergence as Anchor: For a Gaussian policy, DKL(πθπref)(μθμref)22Σt2D_{KL}(\pi_\theta \| \pi_{ref}) \approx \frac{(\mu_\theta - \mu_{ref})^2}{2\Sigma^2_t}. The choice of reference policy πref\pi_{ref} determines the constraint nature.
  • Fixed KL: Uses πref=π0\pi_{ref} = \pi_0 (initial model). This is a hard constraint that can prevent reaching the optimal policy π\pi^* if it is far from π0\pi_0, leading to underfitting.
  • Step-wise KL: Uses πref=πk1\pi_{ref} = \pi_{k-1} (previous step's policy). This acts as a velocity limit, restricting update magnitude per step but does not bound cumulative displacement θkθ0\|\theta_k - \theta_0\|, allowing unbounded drift over many steps.
  • Periodical Moving KL: Uses πref=πkN\pi_{ref} = \pi_{k-N}, updated every NN steps. This provides position control by creating a dynamic trust region that periodically resets the safe zone to a more manifold-consistent policy, enabling staged exploration.
  • Dual KL - Position-Velocity Controller: Combines both mechanisms: LKL=βposDKL(πθπrefN)+βvelDKL(πθπk1)\mathcal{L}_{KL} = \beta_{pos} \cdot D_{KL}(\pi_\theta \| \pi_{ref_N}) + \beta_{vel} \cdot D_{KL}(\pi_\theta \| \pi_{k-1}) The position term prevents long-term drift; the velocity term smooths instantaneous updates.

The full SAGE-GRPO objective combines GRPO, temporal equalization, and the Dual KL regularizer:

LSAGE-GRPO(θ)=1Gi=1GAit=1TStlogπθ(xt1(i)xt(i),c)λKLLKL\mathcal{L}_{SAGE\text{-}GRPO}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} A_i \cdot \sum_{t=1}^{T} S_t \cdot \log \pi_\theta(x^{(i)}_{t-1} | x^{(i)}_t, c) - \lambda_{KL} \cdot \mathcal{L}_{KL}

An adaptive schedule for λKL\lambda_{KL} is used, warming up from 10710^{-7} to 10510^{-5} over 100 steps, then applying conservative feedback control.

Empirical Validation / Results

Experimental Setup: All experiments are conducted on HunyuanVideo 1.5. The VideoAlign evaluator (Liu et al., 2025c) is used as the frozen reward oracle, providing scores for Visual Quality (VQ), Motion Quality (MQ), and Text Alignment (TA). The overall reward is R=wvqSvq+wmqSmq+wtaStaR = w_{vq}S_{vq} + w_{mq}S_{mq} + w_{ta}S_{ta}. Baselines include DanceGRPO, FlowGRPO, and CPS.

Main Results (Table 2): SAGE-GRPO is evaluated under two reward configurations:

  • Setting A (Averaged): wvq=1.0,wmq=1.0,wta=1.0w_{vq}=1.0, w_{mq}=1.0, w_{ta}=1.0
  • Setting B (Alignment-Focused): wvq=0.5,wmq=0.5,wta=1.0w_{vq}=0.5, w_{mq}=0.5, w_{ta}=1.0

Key Findings:

  1. Under the alignment-focused setting (B), SAGE-GRPO with Dual Moving KL achieves the best performance in Overall reward, VQ, MQ, and CLIPScore, while remaining competitive in TA and PickScore.
  2. Emphasizing text alignment (Setting B) provides a more reliable optimization target and yields more stable gains across both reward and visual metrics compared to the averaged setting.
  3. The Dual Moving KL mechanism is crucial for achieving high and stable rewards while maintaining exploration (see Figure 8).

Table 2: Main Comparison on Video Generation Benchmarks

| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| HunyuanVideo 1.5 (Original) | - | 0.0654 | -0.7539 | -0.5870 | 1.4063 | 0.5409 | 0.7397 |
| **Setting A: Averaged Rewards** ($w_{vq}=1.0, w_{mq}=1.0, w_{ta}=1.0$) | | | | | | | |
| DanceGRPO | w/o KL | 0.2768 | -0.7589 | -0.3852 | 1.4209 | 0.5386 | 0.7378 |
| DanceGRPO | w/ Fixed KL | 0.0979 | -0.8077 | -0.5091 | 1.4147 | 0.5403 | 0.7355 |
| FlowGRPO | w/o KL | 0.2733 | -0.7151 | -0.5286 | 1.5170 | 0.5443 | 0.7394 |
| FlowGRPO | w/ Fixed KL | 0.1880 | -0.6771 | -0.5912 | 1.4563 | 0.5431 | 0.7407 |
| CPS | w/o KL | 0.6343 | -0.4855 | -0.4021 | 1.5219 | 0.5479 | 0.7412 |
| CPS | w/ Fixed KL | 0.0928 | -0.7156 | -0.5825 | 1.3908 | 0.5479 | 0.7369 |
| SAGE-GRPO | w/o KL | 0.4859 | -0.6104 | -0.4141 | 1.5104 | 0.5423 | 0.7360 |
| SAGE-GRPO | w/ Fixed KL | 0.2244 | -0.7438 | -0.5320 | 1.5001 | 0.5446 | 0.7382 |
| SAGE-GRPO | w/ Dual Mov KL | 0.2173 | -0.7881 | -0.4249 | 1.4303 | 0.5430 | 0.7452 |
| **Setting B: Alignment-Focused** ($w_{vq}=0.5, w_{mq}=0.5, w_{ta}=1.0$) | | | | | | | |
| DanceGRPO | w/o KL | -0.2172 | -0.8854 | -0.6218 | 1.2901 | 0.5439 | 0.7352 |
| DanceGRPO | w/ Fixed KL | 0.1290 | -0.7739 | -0.5083 | 1.4112 | 0.5452 | 0.7276 |
| FlowGRPO | w/o KL | 0.4773 | -0.5671 | -0.4731 | 1.5175 | 0.5403 | 0.7349 |
| FlowGRPO | w/ Fixed KL | 0.2103 | -0.6654 | -0.5506 | 1.4263 | 0.5427 | 0.7408 |
| CPS | w/o KL | 0.3694 | -0.6650 | -0.5325 | 1.5669 | 0.5479 | 0.7311 |
| CPS | w/ Fixed KL | 0.3705 | -0.6121 | -0.4787 | 1.4613 | 0.5458 | 0.7364 |
| SAGE-GRPO | w/o KL | -0.1222 | -0.8720 | -0.6046 | 1.3544 | 0.5404 | 0.7357 |
| SAGE-GRPO | w/ Fixed KL | 0.2857 | -0.7062 | -0.4425 | 1.4344 | 0.5414 | 0.7377 |
| SAGE-GRPO | w/ Dual Mov KL | 0.8066 | -0.4765 | -0.2384 | 1.5216 | 0.5484 | 0.7420 |

In the original table, bold, underline, and gray indicate the best, second best, and third best results across both settings (A+B).

User Study (Table 3): A pairwise preference study with 29 evaluators on 32 prompts shows strong human preference for SAGE-GRPO over all baselines, especially in Motion Quality.

| SAGE-GRPO vs. | Visual Quality | Motion Quality | Semantic Alignment |
|---|---|---|---|
| DanceGRPO | 85.9% | 75.8% | 79.2% |
| FlowGRPO | 83.8% | 79.2% | 71.9% |
| CPS | 80.2% | 70.8% | 67.9% |

Ablation Studies:

  • Temporal Gradient Equalizer (Figure 3): Applying the equalizer leads to smoother reward curves with consistent improvement, reducing gradient scale variation by over an order of magnitude.
  • KL Strategy (Figure 8): Dual Moving KL achieves the highest and most stable final reward while maintaining a stable exploration level, validating the position-velocity controller design.
  • KL Weight Sensitivity (Figure 7): A two-stage schedule increasing $\lambda_{KL}$ from $10^{-7}$ to $10^{-5}$ yields the strongest and most consistent gains across VQ, MQ, and TA.
  • Qualitative Analysis (Figures 6, 10, 11, 12): SAGE-GRPO generates videos with reduced temporal jitter, enhanced photorealism and alignment under complex conditions (occlusion, lighting), stronger semantic alignment, and better capture of subtle emotional cues described in prompts.

Theoretical and Practical Implications

Theoretical Implications:

  1. Manifold-Constrained RL: Provides a novel geometric perspective for RL in high-dimensional generative tasks, framing exploration as a problem of staying near a data manifold defined by a pre-trained model.
  2. Precise Stochastic Discretization: Derives a corrected SDE variance via integration and a logarithmic term, moving beyond first-order approximations common in the field.
  3. Dual Trust Regions: Introduces a principled combination of position and velocity control via KL divergence, offering a solution to the stability-plasticity dilemma in generative model alignment.

Practical Implications:

  1. Stable Video Alignment: SAGE-GRPO enables more reliable and effective application of GRPO to large-scale video generation models,