Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Summary (Overview)

  • Core Contribution: Astrolabe is an efficient online reinforcement learning (RL) framework designed to align distilled autoregressive (AR) video models with human visual preferences without requiring expensive re-distillation or solver-coupled reverse-process optimization.
  • Key Innovation: Introduces a forward-process RL formulation based on negative-aware fine-tuning, which establishes an implicit policy improvement direction by contrasting positive and negative samples at inference endpoints, avoiding the need for reverse-process unrolling and trajectory storage.
  • Scalability Solution: Proposes a streaming training scheme with a rolling KV cache and clip-level group-wise sampling, enabling RL alignment for long videos by applying updates only to local windows while conditioning on detached historical context, ensuring long-range coherence with constant memory overhead.
  • Robustness Enhancement: Integrates a multi-reward objective (Visual Quality, Motion Quality, Text Alignment) stabilized by uncertainty-aware selective regularization and dynamic reference updates to mitigate reward hacking and balance optimization across different quality dimensions.
  • Empirical Validation: Demonstrates consistent improvements in generation quality (visual aesthetics, motion coherence, text alignment) across multiple distilled AR video models (Self-Forcing, Causal Forcing, LongLive) and various generation settings (short/long, single/multi-prompt) while preserving original inference speed.

Introduction and Theoretical Foundation

Background & Motivation: Recent diffusion models achieve high-quality video synthesis but suffer from prohibitive latency due to multi-step denoising and bidirectional attention, preventing real-time streaming generation. Distilled autoregressive (AR) video models, obtained via Distribution Matching Distillation (DMD), enable efficient streaming inference through KV caching. However, distillation only mimics the teacher's distribution and lacks optimization for human preference, leading to artifacts and unnatural motion.

Problem: Applying online RL to align these models is challenging. Reward-guided distillation lacks active exploration. Reverse-process RL methods (e.g., DanceGRPO, Flow-GRPO) require log-probability estimation along sampling trajectories, coupling to specific solvers and storing intermediate states, which introduces substantial memory/computation overhead and erodes streaming efficiency.

Goal: Develop an efficient, stable online RL framework for distilled AR video models that avoids these bottlenecks and scales to long videos.

Theoretical Foundation: The work builds upon:

  1. Autoregressive Video Diffusion Models: Factorize the joint distribution as $p(x_{1:N}) = \prod_{i=1}^{N} p(x_i \mid x_{<i})$. Each conditional is modeled via flow matching: $x_t^i = (1 - t)x_i + t\epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, I)$ and $t \in [0,1]$. The model predicts the velocity field $v_\theta$.
  2. Forward-Process Reinforcement Learning (DiffusionNFT): Avoids reverse-process likelihood estimation by applying rewards directly to the forward process. Uses implicit positive and negative policies defined via interpolation of current and old velocity predictors.
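As a concrete illustration of the flow-matching forward process above, here is a minimal NumPy sketch (shapes and function names are illustrative only); under the linear interpolation $x_t = (1-t)x + t\epsilon$, the regression target for the velocity field is $\epsilon - x$:

```python
import numpy as np

def forward_interpolate(x, eps, t):
    """Flow-matching forward process: x_t = (1 - t) * x + t * eps."""
    return (1.0 - t) * x + t * eps

def target_velocity(x, eps):
    """Velocity of the linear path, d x_t / d t = eps - x, which the
    velocity predictor v_theta is trained to regress."""
    return eps - x

rng = np.random.default_rng(0)
x_clean = rng.normal(size=(4, 8))   # toy latent frame
noise = rng.normal(size=(4, 8))     # eps ~ N(0, I)
x_half = forward_interpolate(x_clean, noise, 0.5)  # halfway to pure noise
```

At $t=0$ the interpolation recovers the clean latent and at $t=1$ pure noise, so a single scalar $t$ smoothly indexes the forward corruption level.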

Methodology

Astrolabe combines memory-efficient streaming rollout, online RL optimization (clip-level forward-process RL & streaming long tuning), and reward design/regularization.

1. Memory-Efficient Streaming Rollout

To overcome bottlenecks of temporal credit assignment and memory overhead for long sequences:

  • Rolling KV Cache with Frame Sinks: Maintains a restricted visual context window $C_n$ comprising:
    • A frame sink of $S$ permanently retained frames for global semantic context.
    • A rolling window of the $L$ most recent frames for local conditioning.
    • KV memory remains constant (independent of video length $N$).
  • Clip-level Group-wise Sampling: At step $n$, using the frozen KV cache of $C_n$, the model decodes $G$ independent candidate clips in parallel: $x_n^{(i)} \sim \pi_\theta(\cdot \mid C_n, c)$ for $i \in \{1, \dots, G\}$. Sharing the context prefix across candidates reduces rollout cost.
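The sink-plus-rolling-window cache can be sketched with a toy container in pure Python (here `frame_kv` stands in for a frame's key/value tensors; the class and its parameters are illustrative):

```python
from collections import deque

class RollingKVCache:
    """Toy context manager: keep S sink frames forever plus the L most recent."""

    def __init__(self, num_sinks, window):
        self.num_sinks = num_sinks
        self.sinks = []                      # first S frames, kept permanently
        self.recent = deque(maxlen=window)   # rolling window of last L frames

    def append(self, frame_kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(frame_kv)      # fill the frame sink first
        else:
            self.recent.append(frame_kv)     # deque evicts the oldest frame

    def context(self):
        """Context C_n: sinks + rolling window, size S + L regardless of N."""
        return self.sinks + list(self.recent)

cache = RollingKVCache(num_sinks=2, window=3)
for frame_id in range(10):
    cache.append(frame_id)
# Context stays at S + L = 5 entries no matter how many frames arrive.
```

The `deque(maxlen=...)` eviction is what keeps the KV footprint constant as the video grows; only the $S$ sink frames escape it.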

2. Online RL Optimization

  • Clip-level Forward-Process RL:
    • For each candidate $x_n^{(i)}$, evaluate the composite reward $R(x_n^{(i)}, c)$.
    • Compute the group-relative advantage $A^{(i)} = R(x_n^{(i)}, c) - \frac{1}{G} \sum_{j=1}^{G} R(x_n^{(j)}, c)$.
    • Normalize: $\tilde{r}_i = \text{clip}(A^{(i)} / A_{\max}) / 2 + 0.5$.
    • For the distilled model with $T=4$, the timestep $t$ is sampled from $T_{\text{distill}}$.
    • Construct the noised sample $x_n^{t,(i)}$ to predict the velocities $v_\theta$ and $v_{\theta_{\text{old}}}$.
    • Implicit Policy Loss: Optimize $L_{\text{policy}}$ from DiffusionNFT (Eq. 2), discarding DiffusionNFT's adaptive loss weighting, which triggers gradient explosions under the large discretization gaps of distilled AR settings:
    $$L_{\text{policy}} = \tilde{r} \, \| v^+ - v_{\text{target}} \|_2^2 + (1 - \tilde{r}) \, \| v^- - v_{\text{target}} \|_2^2$$
    where the implicit policies are $v^+ = (1 - \beta) v_{\theta_{\text{old}}} + \beta v_\theta$ and $v^- = (1 + \beta) v_{\theta_{\text{old}}} - \beta v_\theta$, with $\beta$ controlling the interpolation strength.
  • Streaming Long Tuning: Simulates long-sequence inference while decoupling gradient computation.
    • Perform full forward pass to accumulate KV cache up to target step.
    • At the active training window $x_n$, detach the KV cache of preceding frames $x_{<n}$ from the computation graph (it serves as historical context only).
    • Backpropagate gradients only through the active window. Bounds training memory.
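The clip-level update above can be sketched in NumPy; names are illustrative, and the `clip` range $[-1, 1]$ (so that $\tilde{r} \in [0, 1]$) is an assumption, since the text does not state the bounds:

```python
import numpy as np

def group_advantage(rewards):
    """A^(i) = R^(i) - mean_j R^(j): advantage relative to the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def normalize_reward(adv, a_max=1.0):
    """r~ = clip(A / A_max) / 2 + 0.5, assumed to clip into [-1, 1]."""
    return np.clip(adv / a_max, -1.0, 1.0) / 2.0 + 0.5

def nft_policy_loss(v_theta, v_old, v_target, r_tilde, beta=1.0):
    """Implicit positive/negative policies and the reward-weighted loss."""
    v_pos = (1.0 - beta) * v_old + beta * v_theta   # v^+
    v_neg = (1.0 + beta) * v_old - beta * v_theta   # v^-
    pos = np.sum((v_pos - v_target) ** 2)
    neg = np.sum((v_neg - v_target) ** 2)
    return r_tilde * pos + (1.0 - r_tilde) * neg

rewards = [0.9, 0.5, 0.7]                            # one group of G = 3 clips
r_tilde = normalize_reward(group_advantage(rewards))  # per-candidate weights
```

Note that at $\beta = 1$ the positive policy $v^+$ collapses to $v_\theta$ itself, so high-reward candidates directly pull the current predictor toward the target velocity.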

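The streaming long-tuning recipe — accumulate a detached rolling history, backpropagate only through the active window — can be sketched in PyTorch with a toy linear layer standing in for one AR decoding step (all shapes and the window size are illustrative):

```python
import torch

torch.manual_seed(0)
step = torch.nn.Linear(4, 4)  # stand-in for one AR decoding step

history = None                # detached KV-style context
outputs = []
for frame in torch.randn(6, 4):
    ctx = history.mean(dim=0) if history is not None else torch.zeros(4)
    out = step(frame + ctx)
    outputs.append(out)
    # Detach before caching: gradients never flow into past windows,
    # so training memory is bounded by the active window alone.
    h = out.detach().unsqueeze(0)
    history = h if history is None else torch.cat([history, h])[-3:]  # rolling window

# Backpropagate only through the active (final) window.
loss = outputs[-1].pow(2).mean()
loss.backward()
```

Because the cached history is detached, its tensors carry no autograd graph, yet the layer still receives a gradient from the active window's loss.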
3. Reward Design and Regularization

Multi-reward Formulation: Composite reward integrates:

  1. Visual Quality (VQ): Mean HPSv3 score over top 30% of frames.
  2. Motion Quality (MQ): VideoAlign on grayscale inputs (focuses on motion dynamics).
  3. Text Alignment (TA): VideoAlign on RGB inputs for semantic correspondence.
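The VQ term is simple enough to state directly in code; the use of a ceiling for the 30% cutoff is an assumption:

```python
import numpy as np

def visual_quality(frame_scores, top_frac=0.3):
    """VQ reward: mean per-frame HPSv3 score over the top 30% of frames."""
    s = np.sort(np.asarray(frame_scores, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(top_frac * s.size)))               # top-k cutoff
    return float(s[:k].mean())
```

Averaging only the best frames rewards peak visual fidelity without penalizing a clip for a few weaker frames.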

Uncertainty-Aware Selective KL Penalty:

  • Quantifies sample uncertainty as the rank discrepancy between the primary reward model $p$ and $M-1$ auxiliary models: $\Delta_{\text{rank}}^{(i)} = \text{rank}_p^{(i)} - \frac{1}{M-1} \sum_{m \neq p} \text{rank}_m^{(i)}$
  • Masks risky samples (likely reward hacking) using $M^{(i)} = \mathbb{1}[\Delta_{\text{rank}}^{(i)} > \tau]$, where $\tau$ is the $(1-\rho)$-th percentile of positive discrepancies (risk ratio $\rho$).
  • Total objective: $L = L_{\text{policy}} + \lambda_{\text{KL}} L_{\text{KL}}$, with the KL penalty applied only to masked samples.
  • Dynamic Reference Updates: The policy $\theta_{\text{old}}$ follows an EMA update. The reference policy $\theta_{\text{ref}}$ is conditionally reset ($\theta_{\text{ref}} \gets \theta$) when the policy deviation surpasses $\tau_{\text{KL}}$ or the epoch count reaches $K_{\max}$.
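A NumPy sketch of the selective-penalty machinery follows. The rank convention (higher rank = higher reward), the strict inequality at the threshold, and the EMA momentum value are assumptions not fixed by the text:

```python
import numpy as np

def rank_discrepancy(primary, aux):
    """Delta_rank^(i): rank under the primary reward model minus the mean
    rank under the auxiliary models (rank 0 = lowest reward)."""
    def ranks(r):
        return np.argsort(np.argsort(np.asarray(r))).astype(float)
    return ranks(primary) - np.mean([ranks(a) for a in aux], axis=0)

def risky_mask(delta, risk_ratio=0.2):
    """Flag samples whose positive discrepancy exceeds the (1 - rho)-th
    percentile of positive discrepancies; these receive the KL penalty."""
    pos = delta[delta > 0]
    if pos.size == 0:
        return np.zeros(delta.shape, dtype=bool)
    tau = np.percentile(pos, 100.0 * (1.0 - risk_ratio))
    return delta > tau

def ema_update(theta_old, theta, momentum=0.99):
    """EMA tracking of theta_old (momentum value is a guess)."""
    return momentum * theta_old + (1.0 - momentum) * theta
```

Intuitively, a sample the primary reward model ranks far above what the auxiliary models agree on is a likely reward-hacking case, and only those samples pay the KL regularization cost.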

Implementation Details:

  • Base models: Self-Forcing, Causal-Forcing, LongLive.
  • Parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA) with rank $r=256$ and $\alpha=256$.
  • Memory efficiency: A single frozen base model is shared for $v_\theta$ and $v_{\theta_{\text{old}}}$, switching lightweight LoRA adapters during the forward pass.
  • Training: 48 NVIDIA H200 GPUs, 48 prompts per epoch, group size $G=24$.

Empirical Validation / Results

Experiments validate effectiveness across short/long, single/multi-prompt settings.

1. Short-Video Single-Prompt Generation

  • Evaluation: VBench protocols (946 standard prompts), augmented prompt test set.
  • Quantitative Results (Table 1): Astrolabe consistently enhances all Self-Forcing variants, LongLive, and Causal-Forcing.
  • Key Metrics: Improvements in HPSv3 (aesthetics) and Motion Quality (MQ) while maintaining original inference throughput.
  • Qualitative Results (Figure 3): Generates videos with sharper textures and superior motion coherence.

Table 1: Quantitative results on VBench benchmarks.

| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ | Throughput ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| *Diffusion Models* | | | | | | |
| LTX-Video [15] | 80.00 | 82.30 | 70.79 | 8.32 | 1.34 | 8.98 |
| Wan2.1 [42] | 84.26 | 85.30 | 80.09 | 9.26 | 1.62 | 0.78 |
| *AR Models* | | | | | | |
| SkyReels-V2 [7] | 82.67 | 84.70 | 74.53 | 9.08 | 1.59 | 0.49 |
| MAGI-1 [40] | 79.18 | 82.04 | 67.74 | 7.95 | 1.52 | 0.19 |
| NOVA [10] | 80.12 | 80.39 | 79.05 | 8.21 | 1.63 | 0.88 |
| PyramidFlow [25] | 81.72 | 84.74 | 69.62 | 8.76 | 1.50 | 6.70 |
| *Distilled AR Models* | | | | | | |
| CausVid [55] | 81.20 | 84.05 | 69.80 | 7.56 | 1.22 | 17.0 |
| Reward Forcing [30] | 84.13 | 84.84 | 81.32 | 8.74 | 1.65 | 23.1 |
| Self-Forcing [22] | 83.74 | 84.48 | 80.77 | 9.36 | 1.65 | 17.0 |
| + Ours | 83.79 (+.05) | 84.51 (+.03) | 80.92 (+.15) | 10.72 (+1.36) | 1.71 (+.06) | 17.0 |
| LongLive [51] | 83.22 | 83.68 | 81.37 | 9.38 | 1.51 | 20.7 |
| + Ours | 84.93 (+1.71) | 85.83 (+2.15) | 81.36 (-.01) | 11.03 (+1.65) | 1.64 (+.13) | 20.7 |
| Causal Forcing [64] | 84.04 | 84.59 | 81.84 | 9.48 | 1.69 | 17.0 |
| + Ours | 84.46 (+.42) | 85.15 (+.56) | 81.72 (-.12) | 10.84 (+1.36) | 1.80 (+.11) | 17.0 |

2. Long-Video Single-Prompt Generation

  • Evaluation: VBench-Long protocols, generate 30-second videos.
  • Quantitative Results (Table 2): Improves performance across long-video benchmarks for all baselines.
  • Qualitative Results (Figure 4): Yields sharper textures and superior motion coherence over extended durations.

Table 2: Quantitative results on VBench-Long benchmarks.

| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 75.29 | 80.77 | 53.37 | 8.72 | 1.54 |
| FramePack [56] | 81.95 | 83.61 | 75.32 | 8.94 | 1.58 |
| Self-Forcing [22] | 81.59 | 83.82 | 72.70 | 9.12 | 1.61 |
| + Ours | 82.03 | 84.36 | 72.71 | 10.38 | 1.72 |
| LongLive [51] | 83.52 | 85.44 | 75.82 | 9.21 | 1.48 |
| + Ours | 84.07 | 86.12 | 75.87 | 10.67 | 1.64 |
| Causal Forcing [64] | 82.87 | 84.36 | 76.91 | 9.28 | 1.65 |
| + Ours | 84.24 | 86.18 | 76.48 | 10.52 | 1.74 |

3. Long-Video Multi-Prompt Generation

  • Evaluation: 100 groups of narrative scripts (6 successive 10-second prompts → 60-second videos). Evaluate clip-wise semantic adherence via CLIP scores.
  • Quantitative Results (Table 3): Improves overall generation quality, visual aesthetics, and long-range motion consistency.
  • Qualitative Results (Figure 6): Enhances frame-level aesthetics and temporal consistency during complex narrative transitions.

Table 3: Quantitative evaluation on long video generation. The last six columns are CLIP scores over the indicated 10-second intervals.

| Method | Quality ↑ | Consistency ↑ | Aesthetic ↑ | 0-10s | 10-20s | 20-30s | 30-40s | 40-50s | 50-60s |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 81.55 | 94.72 | 56.83 | 25.31 | 23.40 | 22.50 | 21.62 | 21.67 | 20.91 |
| FramePack [56] | 84.40 | 96.77 | 59.44 | 26.51 | 22.60 | 22.18 | 21.53 | 21.98 | 21.62 |
| Self-Forcing [22] | 83.94 | 95.74 | 58.45 | 26.24 | 24.87 | 23.46 | 21.92 | 22.05 | 21.07 |
| + Ours | 84.72 | 95.98 | 59.62 | 26.42 | 24.75 | 23.95 | 22.40 | 21.85 | 21.50 |
| LongLive [51] | 84.28 | 96.05 | 59.89 | 26.63 | 25.77 | 24.65 | 23.99 | 24.52 | 24.11 |
| + Ours | 85.15 | 96.16 | 60.75 | 26.80 | 26.15 | 24.45 | 24.55 | 24.30 | 24.65 |
| Causal Forcing [64] | 84.12 | 95.88 | 59.15 | 26.45 | 25.60 | 23.98 | 22.85 | 22.48 | 22.45 |
| + Ours | 84.95 | 95.63 | 60.32 | 26.58 | 25.12 | 23.85 | 23.40 | 23.10 | 22.95 |

4. Ablation Studies

  • Streaming Training Scheme (Table 4a): Clip-level group-wise sampling with detached context achieves best trade-off: reduces memory by ≈2× compared to clip-level full backpropagation while improving HPSv3 and MQ.
  • Reward Design & Regularization (Table 4c, Figure 7a): Single-reward optimization induces hacking (VQ-only collapses into static frames). Multi-reward formulation (VQ+MQ+TA) prevents overfitting and yields balanced improvements. Selective KL penalty with EMA updates outperforms uniform KL or no KL.
  • Removing Adaptive Weighting (Figure 8b): DiffusionNFT's adaptive weighting destabilizes the distilled AR setting (it causes the $x_0$ norm to explode and the reward to collapse). Removing it ensures steady improvements.
  • Impact of $\beta$ (Figure 7b): $\beta = 1$ yields higher visual and motion quality than $\beta = 0.1$.

Table 4: Ablation studies on each component. (a) Streaming Training

| Config | HPSv3 ↑ | MQ ↑ | Mem ↓ |
| :--- | :---: | :---: | :---: |
| Seq + Full BP | OOM | OOM | > 140 |
| Seq + Detach | 10.21 | 1.72 | 96.4 |
| Clip + Full BP | 10.58 | 1.76 | 112.3 |
| Clip + Detach (Ours) | | | |