Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
Summary (Overview)
- Core Contribution: Astrolabe is an efficient online reinforcement learning (RL) framework designed to align distilled autoregressive (AR) video models with human visual preferences without requiring expensive re-distillation or solver-coupled reverse-process optimization.
- Key Innovation: Introduces a forward-process RL formulation based on negative-aware fine-tuning, which establishes an implicit policy improvement direction by contrasting positive and negative samples at inference endpoints, avoiding the need for reverse-process unrolling and trajectory storage.
- Scalability Solution: Proposes a streaming training scheme with a rolling KV cache and clip-level group-wise sampling, enabling RL alignment for long videos by applying updates only to local windows while conditioning on detached historical context, ensuring long-range coherence with constant memory overhead.
- Robustness Enhancement: Integrates a multi-reward objective (Visual Quality, Motion Quality, Text Alignment) stabilized by uncertainty-aware selective regularization and dynamic reference updates to mitigate reward hacking and balance optimization across different quality dimensions.
- Empirical Validation: Demonstrates consistent improvements in generation quality (visual aesthetics, motion coherence, text alignment) across multiple distilled AR video models (Self-Forcing, Causal Forcing, LongLive) and various generation settings (short/long, single/multi-prompt) while preserving original inference speed.
Introduction and Theoretical Foundation
Background & Motivation: Recent diffusion models achieve high-quality video synthesis but suffer from prohibitive latency due to multi-step denoising and bidirectional attention, preventing real-time streaming generation. Distilled autoregressive (AR) video models (via Distribution Matching Distillation, DMD) enable efficient streaming inference via KV-caching. However, distillation only mimics the teacher's distribution and lacks optimization for human preference, leading to artifacts and unnatural motion.
Problem: Applying online RL to align these models is challenging. Reward-guided distillation lacks active exploration. Reverse-process RL methods (e.g., DanceGRPO, Flow-GRPO) require log-probability estimation along sampling trajectories, coupling to specific solvers and storing intermediate states, which introduces substantial memory/computation overhead and erodes streaming efficiency.
Goal: Develop an efficient, stable online RL framework for distilled AR video models that avoids these bottlenecks and scales to long videos.
Theoretical Foundation: The work builds upon:
- Autoregressive Video Diffusion Models: Factorize the joint distribution as $p(x^{1:N}) = \prod_{i=1}^{N} p(x^i \mid x^{<i})$. Each conditional is modeled via flow matching: $x_t^i = (1-t)\,x_0^i + t\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $t \in [0, 1]$. The model predicts the velocity field $v_\theta(x_t^i, t, x^{<i}) \approx \epsilon - x_0^i$.
- Forward-Process Reinforcement Learning (DiffusionNFT): Avoids reverse-process likelihood estimation by applying rewards directly to the forward process. Uses implicit positive and negative policies defined via interpolation of current and old velocity predictors.
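The linear flow-matching forward process described above can be sketched in a few lines. This is a minimal numpy illustration of the standard rectified-flow convention ($x_t = (1-t)x_0 + t\epsilon$, velocity target $\epsilon - x_0$); the function name and interface are assumptions, not the paper's code.

```python
import numpy as np

def flow_matching_pair(x0, rng, t=None):
    """Sketch of the linear flow-matching forward process (illustrative).

    x_t = (1 - t) * x0 + t * eps, with velocity target v = eps - x0,
    where x0 is a clean latent clip and eps is Gaussian noise.
    """
    eps = rng.standard_normal(x0.shape)
    if t is None:
        t = rng.uniform()                 # t ~ U[0, 1] unless given
    x_t = (1.0 - t) * x0 + t * eps        # noised sample on the line x0 -> eps
    v_target = eps - x0                   # velocity the model regresses toward
    return x_t, v_target, t
```

At $t = 0$ the sample is the clean latent, at $t = 1$ it is pure noise, so $x_1 = x_0 + v$ holds by construction.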
Methodology
Astrolabe combines memory-efficient streaming rollout, online RL optimization (clip-level forward-process RL & streaming long tuning), and reward design/regularization.
1. Memory-Efficient Streaming Rollout
To overcome bottlenecks of temporal credit assignment and memory overhead for long sequences:
- Rolling KV Cache with Frame Sinks: Maintains a restricted visual context window comprising:
- A frame sink of permanently retained initial frames, providing global semantic context.
- A rolling window of the most recent frames, providing local conditioning.
- KV memory therefore remains constant, independent of total video length.
- Clip-level Group-wise Sampling: At each autoregressive step, conditioned on the frozen KV cache of previously generated clips, the model decodes $G$ independent candidate clips in parallel, sharing the context prefix across candidates to reduce rollout cost.
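The frame-sink-plus-rolling-window bookkeeping can be illustrated with a toy cache. The class, parameter names, and sizes below are hypothetical stand-ins for the paper's actual KV-cache implementation; the point is that retained state stays at O(sink + window) regardless of video length.

```python
from collections import deque

class RollingKVCache:
    """Toy sketch of a rolling KV cache with frame sinks (illustrative only).

    The first `sink` frames are kept permanently for global context;
    only the `window` most recent frames are kept for local conditioning.
    """
    def __init__(self, sink=2, window=4):
        self.sink = sink
        self.sink_frames = []                 # permanently retained prefix
        self.recent = deque(maxlen=window)    # rolling local context

    def append(self, frame_kv):
        if len(self.sink_frames) < self.sink:
            self.sink_frames.append(frame_kv)
        else:
            self.recent.append(frame_kv)      # deque evicts the oldest frame

    def context(self):
        # Context seen by the model: sink frames + most recent frames.
        return self.sink_frames + list(self.recent)
```

Appending an arbitrarily long stream of frames never grows the context beyond `sink + window` entries.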
2. Online RL Optimization
- Clip-level Forward-Process RL:
- For each candidate clip, evaluate the composite reward $r_i$.
- Compute the group-relative advantage $A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$.
- Normalize: $\hat{A}_i = A_i / \operatorname{std}(\{r_j\}_{j=1}^{G})$.
- For the distilled model, a timestep $t$ is sampled from the model's few-step denoising schedule.
- Construct the noised sample $x_t = (1-t)\,x_0 + t\,\epsilon$ and predict the velocities $v_\theta$ (current policy) and $v_{\theta_{\text{old}}}$ (old policy).
- Implicit Policy Loss: Optimize using the implicit policy-improvement loss from DiffusionNFT (their Eq. 2), but discard DiffusionNFT's adaptive loss weighting, which triggers gradient explosion under the large discretization gaps of distilled AR settings.
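The group-relative advantage step above is GRPO-style normalization; a minimal numpy sketch, assuming that convention:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantages (illustrative sketch):
    subtract the group mean, divide by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Candidates scoring above the group mean get positive advantages (reinforced as positives); those below get negative advantages (treated as negatives).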
- Streaming Long Tuning: Simulates long-sequence inference while decoupling gradient computation.
- Perform a full forward pass to accumulate the KV cache up to the target step.
- At the active training window, detach the KV cache of preceding frames from the computation graph; it serves purely as historical context.
- Backpropagate gradients only through the active window, which bounds training memory.
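The effect of detaching the historical context can be illustrated with a toy scalar autodiff node. The `Var` class below is purely illustrative (standing in for an autograd framework's `detach`): gradients from the loss reach parameters inside the active window but stop at the detached history.

```python
class Var:
    """Minimal scalar reverse-mode autodiff node (illustrative only)."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # Record local derivatives for the product rule.
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def detach(self):
        return Var(self.value)            # same value, graph history dropped

    def backward(self, g=1.0):
        self.grad += g
        for parent, local in self.parents:
            parent.backward(g * local)

# "History" plays the role of the KV cache from earlier clips.
history = Var(2.0) * Var(3.0)             # computed in the forward pass
w = Var(5.0)                              # parameter in the active window
loss = history.detach() * w               # condition on detached context
loss.backward()
# w receives a gradient; the detached history receives none.
```

This mirrors the scheme above: the full forward pass still produces the history's values, but backpropagation touches only the active window.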
3. Reward Design and Regularization
Multi-reward Formulation: Composite reward integrates:
- Visual Quality (VQ): Mean HPSv3 score over top 30% of frames.
- Motion Quality (MQ): VideoAlign on grayscale inputs (focuses on motion dynamics).
- Text Alignment (TA): VideoAlign on RGB inputs for semantic correspondence.
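As a concrete reading of the VQ term, here is a minimal sketch that averages per-frame scores over the top 30% of frames. The function name, rounding of the top-k count, and tie handling are assumptions; the per-frame HPSv3 scores would come from the actual reward model.

```python
import numpy as np

def visual_quality(frame_scores, top_frac=0.3):
    """VQ sketch: mean of per-frame scores over the top `top_frac` fraction
    of frames (top 30% by default, as in the summary above)."""
    s = np.sort(np.asarray(frame_scores, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(top_frac * s.size)))               # top-k count
    return float(s[:k].mean())
```

Focusing on the best frames rewards peak visual fidelity rather than letting a few bad frames dominate the average.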
Uncertainty-Aware Selective KL Penalty:
- Quantifies sample uncertainty as the rank discrepancy between the primary reward model and auxiliary reward models.
- Masks risky samples (likely reward hacking) using $m_i = \mathbb{1}[\delta_i > \tau]$, where $\tau$ is a percentile threshold over the positive discrepancies determined by the risk ratio $\rho$.
- Total objective: the KL penalty is applied only to the masked samples, on top of the implicit policy loss.
- Dynamic Reference Updates: The policy follows an EMA update, and the reference policy is conditionally reset to the current EMA weights when the policy deviation surpasses a threshold or the epoch count reaches a preset limit.
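The selective mask could look like the following sketch. The percentile convention (thresholding at the $(1-\rho)$ quantile of positive discrepancies, so roughly the riskiest $\rho$ fraction is penalized) is an assumption, as are the names; the paper's exact definition may differ.

```python
import numpy as np

def selective_kl_mask(discrepancy, rho=0.2):
    """Uncertainty-aware selective mask (illustrative sketch).

    Flags samples whose rank discrepancy exceeds a percentile threshold
    computed over the positive discrepancies; `rho` is the risk ratio.
    Masked (True) samples receive the KL penalty.
    """
    d = np.asarray(discrepancy, dtype=float)
    pos = d[d > 0]
    if pos.size == 0:
        return np.zeros(d.shape, dtype=bool)   # nothing risky to penalize
    tau = np.quantile(pos, 1.0 - rho)          # assumed percentile convention
    return d > tau
```

Only the flagged samples contribute a KL term, so well-behaved samples are optimized without regularization drag.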
Implementation Details:
- Base models: Self-Forcing, Causal-Forcing, LongLive.
- Parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA) with rank $r$ and scaling $\alpha$.
- Memory efficiency: A single frozen base model is shared between the current and old policies, switching lightweight LoRA adapters during the forward pass.
- Training: 48 NVIDIA H200 GPUs, 48 prompts per epoch, with group-wise sampling of $G$ candidate clips per prompt.
Empirical Validation / Results
Experiments validate effectiveness across short/long, single/multi-prompt settings.
1. Short-Video Single-Prompt Generation
- Evaluation: VBench protocols (946 standard prompts), augmented prompt test set.
- Quantitative Results (Table 1): Astrolabe consistently enhances all Self-Forcing variants, LongLive, and Causal-Forcing.
- Key Metrics: Improvements in HPSv3 (aesthetics) and Motion Quality (MQ) while maintaining original inference throughput.
- Qualitative Results (Figure 3): Generates videos with sharper textures and superior motion coherence.
Table 1: Quantitative results on VBench benchmarks.
| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ | Throughput ↑ |
|---|---|---|---|---|---|---|
| Diffusion Models | ||||||
| LTX-Video [15] | 80.00 | 82.30 | 70.79 | 8.32 | 1.34 | 8.98 |
| Wan2.1 [42] | 84.26 | 85.30 | 80.09 | 9.26 | 1.62 | 0.78 |
| AR Models | ||||||
| SkyReels-V2 [7] | 82.67 | 84.70 | 74.53 | 9.08 | 1.59 | 0.49 |
| MAGI-1 [40] | 79.18 | 82.04 | 67.74 | 7.95 | 1.52 | 0.19 |
| NOVA [10] | 80.12 | 80.39 | 79.05 | 8.21 | 1.63 | 0.88 |
| PyramidFlow [25] | 81.72 | 84.74 | 69.62 | 8.76 | 1.50 | 6.70 |
| Distilled AR Models | ||||||
| CausVid [55] | 81.20 | 84.05 | 69.80 | 7.56 | 1.22 | 17.0 |
| Reward Forcing [30] | 84.13 | 84.84 | 81.32 | 8.74 | 1.65 | 23.1 |
| Self-Forcing [22] | 83.74 | 84.48 | 80.77 | 9.36 | 1.65 | 17.0 |
| + Ours | 83.79 (+.05) | 84.51 (+.03) | 80.92 (+.15) | 10.72 (+1.36) | 1.71 (+.06) | 17.0 |
| LongLive [51] | 83.22 | 83.68 | 81.37 | 9.38 | 1.51 | 20.7 |
| + Ours | 84.93 (+1.71) | 85.83 (+2.15) | 81.36 (-.01) | 11.03 (+1.65) | 1.64 (+.13) | 20.7 |
| Causal Forcing [64] | 84.04 | 84.59 | 81.84 | 9.48 | 1.69 | 17.0 |
| + Ours | 84.46 (+.42) | 85.15 (+.56) | 81.72 (-.12) | 10.84 (+1.36) | 1.80 (+.11) | 17.0 |
2. Long-Video Single-Prompt Generation
- Evaluation: VBench-Long protocols, generate 30-second videos.
- Quantitative Results (Table 2): Improves performance across long-video benchmarks for all baselines.
- Qualitative Results (Figure 4): Yields sharper textures and superior motion coherence over extended durations.
Table 2: Quantitative results on VBench-Long benchmarks.
| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ |
|---|---|---|---|---|---|
| SkyReels-V2 [7] | 75.29 | 80.77 | 53.37 | 8.72 | 1.54 |
| FramePack [56] | 81.95 | 83.61 | 75.32 | 8.94 | 1.58 |
| Self-Forcing [22] | 81.59 | 83.82 | 72.70 | 9.12 | 1.61 |
| + Ours | 82.03 | 84.36 | 72.71 | 10.38 | 1.72 |
| LongLive [51] | 83.52 | 85.44 | 75.82 | 9.21 | 1.48 |
| + Ours | 84.07 | 86.12 | 75.87 | 10.67 | 1.64 |
| Causal Forcing [64] | 82.87 | 84.36 | 76.91 | 9.28 | 1.65 |
| + Ours | 84.24 | 86.18 | 76.48 | 10.52 | 1.74 |
3. Long-Video Multi-Prompt Generation
- Evaluation: 100 groups of narrative scripts (6 successive 10-second prompts → 60-second videos). Evaluate clip-wise semantic adherence via CLIP scores.
- Quantitative Results (Table 3): Improves overall generation quality, visual aesthetics, and long-range motion consistency.
- Qualitative Results (Figure 6): Enhances frame-level aesthetics and temporal consistency during complex narrative transitions.
Table 3: Quantitative evaluation on long video generation (CLIP scores across 10-second intervals).
| Method | Quality Score ↑ | Consistency Score ↑ | Aesthetic Score ↑ | CLIP 0-10 ↑ | 10-20 ↑ | 20-30 ↑ | 30-40 ↑ | 40-50 ↑ | 50-60 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SkyReels-V2 [7] | 81.55 | 94.72 | 56.83 | 25.31 | 23.40 | 22.50 | 21.62 | 21.67 | 20.91 |
| FramePack [56] | 84.40 | 96.77 | 59.44 | 26.51 | 22.60 | 22.18 | 21.53 | 21.98 | 21.62 |
| Self-Forcing [22] | 83.94 | 95.74 | 58.45 | 26.24 | 24.87 | 23.46 | 21.92 | 22.05 | 21.07 |
| + Ours | 84.72 | 95.98 | 59.62 | 26.42 | 24.75 | 23.95 | 22.40 | 21.85 | 21.50 |
| LongLive [51] | 84.28 | 96.05 | 59.89 | 26.63 | 25.77 | 24.65 | 23.99 | 24.52 | 24.11 |
| + Ours | 85.15 | 96.16 | 60.75 | 26.80 | 26.15 | 24.45 | 24.55 | 24.30 | 24.65 |
| Causal Forcing [64] | 84.12 | 95.88 | 59.15 | 26.45 | 25.60 | 23.98 | 22.85 | 22.48 | 22.45 |
| + Ours | 84.95 | 95.63 | 60.32 | 26.58 | 25.12 | 23.85 | 23.40 | 23.10 | 22.95 |
4. Ablation Studies
- Streaming Training Scheme (Table 4a): Clip-level group-wise sampling with detached context achieves best trade-off: reduces memory by ≈2× compared to clip-level full backpropagation while improving HPSv3 and MQ.
- Reward Design & Regularization (Table 4c, Figure 7a): Single-reward optimization induces hacking (VQ-only collapses into static frames). Multi-reward formulation (VQ+MQ+TA) prevents overfitting and yields balanced improvements. Selective KL penalty with EMA updates outperforms uniform KL or no KL.
- Removing Adaptive Weighting (Figure 8b): DiffusionNFT's adaptive weighting destabilizes distilled AR setting (causes norm explosion and reward collapse). Removing it ensures steady improvements.
- Impact of the risk ratio (Figure 7b): the adopted risk ratio yields higher visual and motion quality than the alternative settings tested.
Table 4: Ablation studies on each component. (a) Streaming Training
| Config | HPSv3 ↑ | MQ ↑ | Mem (GB) ↓ |
|---|---|---|---|
| Seq + Full BP | OOM | OOM | > 140 |
| Seq + Detach | 10.21 | 1.72 | 96.4 |
| Clip + Full BP | 10.58 | 1.76 | 112.3 |
| Clip + Detach (Ours) | | | |