Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Summary (Overview)

  • Core Problem: Group Relative Policy Optimization (GRPO) for video generation is less reliable than for images/language due to complex solution spaces. Converting deterministic ODE samplers to SDE for exploration injects excess noise, degrading rollout quality and destabilizing reward-guided alignment.
  • Key Solution: Formulate GRPO as a manifold-constrained exploration problem. The pre-trained model defines a valid video data manifold M\mathcal{M}; the goal is to constrain exploration to its vicinity to preserve rollout validity and reward reliability.
  • Micro-Level Innovation: Propose a Precise Manifold-Aware SDE with a logarithmic curvature correction for accurate noise variance, coupled with a Gradient Norm Equalizer to balance optimization pressure across timesteps.
  • Macro-Level Innovation: Introduce a Dual Trust Region mechanism combining a periodic moving anchor (position control) and stepwise constraints (velocity control) to prevent long-horizon policy drift while maintaining plasticity.
  • Empirical Result: SAGE-GRPO demonstrates consistent gains over baselines (DanceGRPO, FlowGRPO, CPS) on HunyuanVideo1.5 using the VideoAlign reward model, in both reward maximization (VQ, MQ, TA) and visual metrics (CLIPScore, PickScore).

Introduction and Theoretical Foundation

Group Relative Policy Optimization (GRPO) has become a standard method for aligning generative models (diffusion/flow matching) with reward signals. However, its application to video generation remains significantly less reliable than for language models and images. This performance gap stems from the inherently large and structured solution space of videos.

The standard GRPO training procedure requires converting a deterministic ODE sampler into a stochastic SDE sampler to enable exploration through diverse samples. Current video GRPO baselines (e.g., DanceGRPO, FlowGRPO) derive the SDE noise standard deviation using Euler-style discretization and first-order approximations. This introduces first-order truncation error, injecting excess noise energy during sampling (see Figure 1a.1). This excess noise:

  1. Lowers the quality of generated rollouts, especially in high-noise steps.
  2. Makes reward evaluations less reliable, destabilizing the post-training alignment process.

The authors reframe the core problem. A pre-trained video generation model parameterized by θ\theta defines a valid data manifold MRD\mathcal{M} \subset \mathbb{R}^D. While the initial parameters θ0\theta_0 are insufficient for the target reward, GRPO must update θ\theta through exploration while keeping trajectories within the vicinity of M\mathcal{M} to ensure rollouts remain valid. Conventional SDE exploration can overestimate noise variance, pushing states ztz_t away from M\mathcal{M} and producing temporal artifacts (see Figure 2, red trajectory).

Thus, the central challenge is: How to constrain exploration within the vicinity of the data manifold to improve rollouts while keeping reward evaluation reliable?

Methodology

SAGE-GRPO (Stable Alignment via Exploration) addresses this challenge with a unified strategy operating at both micro (sampling) and macro (policy optimization) levels.

3.1 Preliminaries: Flow Matching and GRPO

  • Flow Matching / Rectified Flow: Generation is modeled as transport along a probability path pt(x)p_t(x) via an ODE: dxtdt=vθ(xt,t)\frac{dx_t}{dt} = v_\theta(x_t, t) Using the linear interpolation path xt=(1σt)x0+σtz1x_t = (1 - \sigma_t)x_0 + \sigma_t z_1, the velocity field is: vθ(xt,t)=dxtdt=dσtdt(x0z1)=11σt(xtx0)v_\theta(x_t, t) = \frac{dx_t}{dt} = -\frac{d\sigma_t}{dt}(x_0 - z_1) = \frac{1}{1-\sigma_t}(x_t - x_0)
  • Group Relative Policy Optimization (GRPO): For a prompt cc, GRPO samples a group of GG rollouts and optimizes the policy using a group-normalized advantage: LGRPO(θ)=1Gi=1GAit=1Tlogπθ(xt1(i)xt(i),c)\mathcal{L}_{GRPO}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} A_i \cdot \sum_{t=1}^{T} \log \pi_\theta(x^{(i)}_{t-1} | x^{(i)}_t, c) where AiA_i is the normalized advantage for rollout ii.

3.2 SAGE-GRPO Framework

3.2.1 Micro-Level Exploration: Precise SDE and Gradient Equalization

The goal is to perturb the Rectified Flow ODE with a marginal-preserving SDE whose noise stays aligned with the manifold M\mathcal{M}. The critical step is computing the correct noise standard deviation Σt1/2\Sigma^{1/2}_t.

For a marginal-preserving SDE with diffusion coefficient εt=ησt/(1σt)\varepsilon_t = \eta \sqrt{\sigma_t / (1-\sigma_t)} (where η\eta is an exploration scaling factor), the integrated variance over the interval [σt+1,σt][\sigma_{t+1}, \sigma_t] is derived as:

Σt=σtσt+1εs2ds=η2[(σtσt+1)+log(1σt+11σt)]\Sigma_t = \int_{\sigma_t}^{\sigma_{t+1}} \varepsilon^2_s ds = \eta^2 \left[ -(\sigma_t - \sigma_{t+1}) + \log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) \right]

The logarithmic term log(1σt+11σt)\log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) accounts for the geometric contraction of the signal coefficient (1σt)(1-\sigma_t), which linear approximations fail to capture. The noise standard deviation is:

Σt1/2=η(σtσt+1)+log(1σt+11σt)\Sigma^{1/2}_t = \eta \sqrt{ -(\sigma_t - \sigma_{t+1}) + \log\left( \frac{1-\sigma_{t+1}}{1-\sigma_t} \right) }

Applying Euler-Maruyama discretization with timestep Δt=σtσt+1\Delta t = \sigma_t - \sigma_{t+1}:

xt+Δt=xt+vθ(xt,t)Δt+Σt2sθ(xt)+Σt1/2ϵ,ϵN(0,I)x_{t+\Delta t} = x_t + v_\theta(x_t, t)\Delta t + \frac{\Sigma_t}{2} s_\theta(x_t) + \Sigma^{1/2}_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

where sθ(xt)(xtx^0)/σt2s_\theta(x_t) \approx -(x_t - \hat{x}_0)/\sigma^2_t is the score function estimate. The Itô correction term Σt2sθ(xt)\frac{\Sigma_t}{2} s_\theta(x_t) ensures consistency with Rectified Flow marginals.

Gradient Norm Equalizer: Even with the corrected SDE, an inherent signal-to-noise imbalance exists across timesteps. For a Gaussian transition π(xt1xt)=N(μθ,ΣtI)\pi(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta, \Sigma_t I), the gradient norm scales as:

μlogπ1Σt1/2\|\nabla_\mu \log \pi\| \propto \frac{1}{\Sigma^{1/2}_t}

This causes gradients to vanish at high noise (t1t \to 1) and explode at low noise (t0t \to 0), biasing learning. To counteract this, a per-timestep gradient scale NtN_t is estimated from SDE parameters, and a robust normalization is applied:

St=Median({Nτ}τ=1T)Nt+ϵS_t = \frac{\text{Median}(\{N_\tau\}_{\tau=1}^T)}{N_t + \epsilon}

where ϵ\epsilon is a small constant. This equalization normalizes optimization pressure across timesteps.

3.2.2 Macro-Level Exploration: Dual Trust Region Optimization

To prevent long-horizon policy drift from the manifold, KL divergence is used as a dynamic anchoring mechanism.

  • KL Divergence as Anchor: For a Gaussian policy, DKL(πθπref)(μθμref)22Σt2D_{KL}(\pi_\theta \| \pi_{ref}) \approx \frac{(\mu_\theta - \mu_{ref})^2}{2\Sigma^2_t}. The choice of reference policy πref\pi_{ref} determines the constraint nature.
  • Fixed KL: Uses πref=π0\pi_{ref} = \pi_0 (initial model). This is a hard constraint that can prevent reaching the optimal policy π\pi^* if it is far from π0\pi_0, leading to underfitting.
  • Step-wise KL: Uses πref=πk1\pi_{ref} = \pi_{k-1} (previous step's policy). This acts as a velocity limit, restricting update magnitude per step but does not bound cumulative displacement θkθ0\|\theta_k - \theta_0\|, allowing unbounded drift over many steps.
  • Periodical Moving KL: Uses πref=πkN\pi_{ref} = \pi_{k-N}, updated every NN steps. This provides position control by creating a dynamic trust region that periodically resets the safe zone to a more manifold-consistent policy, enabling staged exploration.
  • Dual KL - Position-Velocity Controller: Combines both mechanisms: LKL=βposDKL(πθπrefN)+βvelDKL(πθπk1)\mathcal{L}_{KL} = \beta_{pos} \cdot D_{KL}(\pi_\theta \| \pi_{ref_N}) + \beta_{vel} \cdot D_{KL}(\pi_\theta \| \pi_{k-1}) The position term prevents long-term drift; the velocity term smooths instantaneous updates.

The full SAGE-GRPO objective combines GRPO, temporal equalization, and the Dual KL regularizer:

LSAGE-GRPO(θ)=1Gi=1GAit=1TStlogπθ(xt1(i)xt(i),c)λKLLKL\mathcal{L}_{SAGE\text{-}GRPO}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} A_i \cdot \sum_{t=1}^{T} S_t \cdot \log \pi_\theta(x^{(i)}_{t-1} | x^{(i)}_t, c) - \lambda_{KL} \cdot \mathcal{L}_{KL}

An adaptive schedule for λKL\lambda_{KL} is used, warming up from 10710^{-7} to 10510^{-5} over 100 steps, then applying conservative feedback control.

Empirical Validation / Results

Experimental Setup: All experiments are conducted on HunyuanVideo 1.5. The VideoAlign evaluator (Liu et al., 2025c) is used as the frozen reward oracle, providing scores for Visual Quality (VQ), Motion Quality (MQ), and Text Alignment (TA). The overall reward is R=wvqSvq+wmqSmq+wtaStaR = w_{vq}S_{vq} + w_{mq}S_{mq} + w_{ta}S_{ta}. Baselines include DanceGRPO, FlowGRPO, and CPS.

Main Results (Table 2): SAGE-GRPO is evaluated under two reward configurations:

  • Setting A (Averaged): wvq=1.0,wmq=1.0,wta=1.0w_{vq}=1.0, w_{mq}=1.0, w_{ta}=1.0
  • Setting B (Alignment-Focused): wvq=0.5,wmq=0.5,wta=1.0w_{vq}=0.5, w_{mq}=0.5, w_{ta}=1.0

Key Findings:

  1. Under the alignment-focused setting (B), SAGE-GRPO with Dual Moving KL achieves the best performance in Overall reward, VQ, MQ, and CLIPScore, while remaining competitive in TA and PickScore.
  2. Emphasizing text alignment (Setting B) provides a more reliable optimization target and yields more stable gains across both reward and visual metrics compared to the averaged setting.
  3. The Dual Moving KL mechanism is crucial for achieving high and stable rewards while maintaining exploration (see Figure 8).

Table 2: Main Comparison on Video Generation Benchmarks

| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| HunyuanVideo 1.5 (Original) | - | 0.0654 | -0.7539 | -0.5870 | 1.4063 | 0.5409 | 0.7397 |
| **Setting A: Averaged Rewards** ($w_{vq}=1.0, w_{mq}=1.0, w_{ta}=1.0$) | | | | | | | |
| DanceGRPO | w/o KL | 0.2768 | -0.7589 | -0.3852 | 1.4209 | 0.5386 | 0.7378 |
| DanceGRPO | w/ Fixed KL | 0.0979 | -0.8077 | -0.5091 | 1.4147 | 0.5403 | 0.7355 |
| FlowGRPO | w/o KL | 0.2733 | -0.7151 | -0.5286 | 1.5170 | 0.5443 | 0.7394 |
| FlowGRPO | w/ Fixed KL | 0.1880 | -0.6771 | -0.5912 | 1.4563 | 0.5431 | 0.7407 |
| CPS | w/o KL | 0.6343 | -0.4855 | -0.4021 | 1.5219 | 0.5479 | 0.7412 |
| CPS | w/ Fixed KL | 0.0928 | -0.7156 | -0.5825 | 1.3908 | 0.5479 | 0.7369 |
| SAGE-GRPO | w/o KL | 0.4859 | -0.6104 | -0.4141 | 1.5104 | 0.5423 | 0.7360 |
| SAGE-GRPO | w/ Fixed KL | 0.2244 | -0.7438 | -0.5320 | 1.5001 | 0.5446 | 0.7382 |
| SAGE-GRPO | w/ Dual Mov KL | 0.2173 | -0.7881 | -0.4249 | 1.4303 | 0.5430 | 0.7452 |
| **Setting B: Alignment-Focused** ($w_{vq}=0.5, w_{mq}=0.5, w_{ta}=1.0$) | | | | | | | |
| DanceGRPO | w/o KL | -0.2172 | -0.8854 | -0.6218 | 1.2901 | 0.5439 | 0.7352 |
| DanceGRPO | w/ Fixed KL | 0.1290 | -0.7739 | -0.5083 | 1.4112 | 0.5452 | 0.7276 |
| FlowGRPO | w/o KL | 0.4773 | -0.5671 | -0.4731 | 1.5175 | 0.5403 | 0.7349 |
| FlowGRPO | w/ Fixed KL | 0.2103 | -0.6654 | -0.5506 | 1.4263 | 0.5427 | 0.7408 |
| CPS | w/o KL | 0.3694 | -0.6650 | -0.5325 | 1.5669 | 0.5479 | 0.7311 |
| CPS | w/ Fixed KL | 0.3705 | -0.6121 | -0.4787 | 1.4613 | 0.5458 | 0.7364 |
| SAGE-GRPO | w/o KL | -0.1222 | -0.8720 | -0.6046 | 1.3544 | 0.5404 | 0.7357 |
| SAGE-GRPO | w/ Fixed KL | 0.2857 | -0.7062 | -0.4425 | 1.4344 | 0.5414 | 0.7377 |
| SAGE-GRPO | w/ Dual Mov KL | 0.8066 | -0.4765 | -0.2384 | 1.5216 | 0.5484 | 0.7420 |

In the original table, bold, underline, and gray indicate the best, second best, and third best results across both settings (A+B).

User Study (Table 3): A pairwise preference study with 29 evaluators on 32 prompts shows strong human preference for SAGE-GRPO over all baselines, especially in Motion Quality.

| SAGE-GRPO vs. | Visual Quality | Motion Quality | Semantic Alignment |
|---|---|---|---|
| DanceGRPO | 85.9% | 75.8% | 79.2% |
| FlowGRPO | 83.8% | 79.2% | 71.9% |
| CPS | 80.2% | 70.8% | 67.9% |

Ablation Studies:

  • Temporal Gradient Equalizer (Figure 3): Applying the equalizer leads to smoother reward curves with consistent improvement, reducing gradient scale variation by over an order of magnitude.
  • KL Strategy (Figure 8): Dual Moving KL achieves the highest and most stable final reward while maintaining a stable exploration level, validating the position-velocity controller design.
  • KL Weight Sensitivity (Figure 7): A two-stage schedule increasing $\lambda_{KL}$ from $10^{-7}$ to $10^{-5}$ yields the strongest and most consistent gains across VQ, MQ, and TA.
  • Qualitative Analysis (Figures 6, 10, 11, 12): SAGE-GRPO generates videos with reduced temporal jitter, enhanced photorealism and alignment under complex conditions (occlusion, lighting), stronger semantic alignment, and better capture of subtle emotional cues described in prompts.

Theoretical and Practical Implications

Theoretical Implications:

  1. Manifold-Constrained RL: Provides a novel geometric perspective for RL in high-dimensional generative tasks, framing exploration as a problem of staying near a data manifold defined by a pre-trained model.
  2. Precise Stochastic Discretization: Derives a corrected SDE variance via integration and a logarithmic term, moving beyond first-order approximations common in the field.
  3. Dual Trust Regions: Introduces a principled combination of position and velocity control via KL divergence, offering a solution to the stability-plasticity dilemma in generative model alignment.

Practical Implications:

  1. Stable Video Alignment: SAGE-GRPO enables more reliable and effective application of GRPO to large-scale video generation models,