# Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

> Astrolabe introduces a forward-process RL framework that efficiently aligns distilled autoregressive video models with human preferences using streaming training and multi-reward optimization, without reverse-process unrolling.

- **Source:** [arXiv](https://arxiv.org/abs/2603.17051)
- **Published:** 2026-03-24
- **Permalink:** https://picx.dev/p/lzUWw4
- **Whiteboard:** https://picx.dev/p/lzUWw4/image

## Summary (Overview)
* **Core Contribution:** Astrolabe is an efficient online reinforcement learning (RL) framework designed to align distilled autoregressive (AR) video models with human visual preferences without requiring expensive re-distillation or solver-coupled reverse-process optimization.
* **Key Innovation:** Introduces a **forward-process RL formulation** based on negative-aware fine-tuning, which establishes an implicit policy improvement direction by contrasting positive and negative samples at inference endpoints, avoiding the need for reverse-process unrolling and trajectory storage.
* **Scalability Solution:** Proposes a **streaming training scheme** with a rolling KV cache and clip-level group-wise sampling, enabling RL alignment for long videos by applying updates only to local windows while conditioning on detached historical context, ensuring long-range coherence with constant memory overhead.
* **Robustness Enhancement:** Integrates a **multi-reward objective** (Visual Quality, Motion Quality, Text Alignment) stabilized by **uncertainty-aware selective regularization** and dynamic reference updates to mitigate reward hacking and balance optimization across different quality dimensions.
* **Empirical Validation:** Demonstrates consistent improvements in generation quality (visual aesthetics, motion coherence, text alignment) across multiple distilled AR video models (Self-Forcing, Causal Forcing, LongLive) and various generation settings (short/long, single/multi-prompt) while preserving original inference speed.

## Introduction and Theoretical Foundation
**Background & Motivation:** Recent diffusion models achieve high-quality video synthesis but suffer from prohibitive latency due to multi-step denoising and bidirectional attention, preventing real-time streaming generation. **Distilled autoregressive (AR) video models** (via Distribution Matching Distillation, DMD) enable efficient streaming inference via KV-caching. However, distillation only mimics the teacher's distribution and lacks optimization for human preference, leading to artifacts and unnatural motion.

**Problem:** Applying online RL to align these models is challenging. Reward-guided distillation lacks active exploration. Reverse-process RL methods (e.g., DanceGRPO, Flow-GRPO) require log-probability estimation along sampling trajectories, coupling to specific solvers and storing intermediate states, which introduces substantial memory/computation overhead and erodes streaming efficiency.

**Goal:** Develop an efficient, stable online RL framework for distilled AR video models that avoids these bottlenecks and scales to long videos.

**Theoretical Foundation:** The work builds upon:
1.  **Autoregressive Video Diffusion Models:** Factorize joint distribution as $p(x_{1:N}) = \prod_{i=1}^{N} p(x_i | x_{<i})$. Each conditional is modeled via flow matching: $x_t^i = (1 - t)x_i + t\epsilon_i$, where $\epsilon_i \sim N(0, I)$, $t \in [0,1]$. The model predicts the velocity field $v_\theta$.
2.  **Forward-Process Reinforcement Learning (DiffusionNFT):** Avoids reverse-process likelihood estimation by applying rewards directly to the forward process. Uses implicit positive and negative policies defined via interpolation of current and old velocity predictors.
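
The flow-matching interpolation in item 1 can be illustrated with a small sketch. It assumes the velocity regression target is the time derivative of the linear path, $\epsilon_i - x_i$; this summary does not state the paper's exact parameterization, and the helper names `noised_sample` and `velocity_target` are hypothetical.

```python
import random

def noised_sample(x, eps, t):
    """Linear path from the text: x_t = (1 - t) * x + t * eps (element-wise)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]

def velocity_target(x, eps):
    """Time derivative of the path above, d x_t / dt = eps - x (assumed target)."""
    return [ei - xi for xi, ei in zip(x, eps)]

rng = random.Random(0)
x = [0.5, -1.0, 2.0]                      # toy "clean" latent
eps = [rng.gauss(0.0, 1.0) for _ in x]    # Gaussian noise, eps ~ N(0, I)
t = 0.25

x_t = noised_sample(x, eps, t)
v = velocity_target(x, eps)

# Stepping backwards along v for time t recovers the clean sample:
# x_t - t * (eps - x) = x
recovered = [xt - t * vi for xt, vi in zip(x_t, v)]
```

At $t = 0$ the path returns the clean sample and at $t = 1$ it returns pure noise, which matches $t \in [0,1]$ in the definition above.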

## Methodology
Astrolabe combines **memory-efficient streaming rollout**, **online RL optimization** (clip-level forward-process RL & streaming long tuning), and **reward design/regularization**.

### 1. Memory-Efficient Streaming Rollout
To overcome bottlenecks of temporal credit assignment and memory overhead for long sequences:
* **Rolling KV Cache with Frame Sinks:** Maintains a restricted visual context window $C_n$ comprising:
    * A **frame sink** of $S$ permanently retained frames for global semantic context.
    * A **rolling window** of the $L$ most recent frames for local conditioning.
    * KV memory remains constant (independent of video length $N$).
* **Clip-level Group-wise Sampling:** At step $n$, using the frozen KV cache of $C_n$, the model decodes $G$ independent candidate clips in parallel:
    $$x_n^{(i)} \sim \pi_\theta(\cdot|C_n, c), \quad \text{for } i \in \{1, ..., G\}$$
    All candidates share the same context prefix, which reduces rollout cost.
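
A minimal sketch of the frame-sink plus rolling-window bookkeeping described above, working on frame indices only. The function name `context_indices` and the toy sizes are assumptions for illustration; the actual cache stores key/value tensors rather than indices.

```python
def context_indices(n, sink_size, window_size):
    """Frame indices kept as context before generating frame n.

    Keeps the first `sink_size` frames (the frame sink) plus the
    `window_size` most recent frames, without duplicating any frame.
    """
    sink = list(range(min(sink_size, n)))
    start = max(n - window_size, len(sink))
    recent = list(range(start, n))
    return sink + recent

# Toy sizes (hypothetical): S = 3 sink frames, L = 9 recent frames.
S, L = 3, 9
early = context_indices(5, S, L)      # short history: every frame is kept
late = context_indices(100, S, L)     # long history: sink + last L frames
sizes = [len(context_indices(n, S, L)) for n in range(0, 500)]
```

The context size never exceeds $S + L$, which is the sense in which the KV memory is independent of the video length $N$.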

### 2. Online RL Optimization
* **Clip-level Forward-Process RL:** 
    * For each candidate $x_n^{(i)}$, evaluate composite reward $R(x_n^{(i)}, c)$.
    * Compute advantage $A^{(i)} = R(x_n^{(i)}, c) - \frac{1}{G} \sum_{j=1}^{G} R(x_n^{(j)}, c)$.
    * Normalize to $[0, 1]$: $\tilde{r}_i = \text{clip}(A^{(i)} / A_{\max}, -1, 1) / 2 + 0.5$.
    * For the distilled model with $T=4$ sampling steps, the timestep $t$ is sampled from the distillation timestep set $T_{\text{distill}}$.
    * Construct noised sample $x_{n}^{t,(i)}$ to predict velocities $v_\theta$ and $v_{\theta_{\text{old}}}$.
    * **Implicit Policy Loss:** Optimizes $L_{\text{policy}}$ from DiffusionNFT (Eq. 2) but **discards the adaptive loss weighting** of DiffusionNFT, which triggers gradient explosion under the large discretization gaps of distilled AR settings.
    $$L_{\text{policy}} = \tilde{r} \| v^+ - v_{\text{target}} \|_2^2 + (1 - \tilde{r}) \| v^- - v_{\text{target}} \|_2^2$$
    where implicit policies are:
    $$v^+ = (1 - \beta) v_{\theta_{\text{old}}} + \beta v_\theta, \quad v^- = (1 + \beta) v_{\theta_{\text{old}}} - \beta v_\theta$$
    $\beta$ controls interpolation strength.
* **Streaming Long Tuning:** Simulates long-sequence inference while decoupling gradient computation.
    * Perform full forward pass to accumulate KV cache up to target step.
    * At active training window $x_n$, **detach KV cache of preceding frames $x_{<n}$** from computation graph (serves as historical context).
    * Backpropagate gradients only through the active window, which bounds training memory.
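
The clip-level update above can be sketched on plain Python lists. This is an illustration under stated assumptions, not the authors' implementation: the clip bounds $[-1, 1]$ are assumed so that $\tilde{r}$ lands in $[0, 1]$, velocities are toy vectors, and the helper names are hypothetical. Gradient handling (including the detached KV cache) is not shown.

```python
def normalized_rewards(rewards, a_max):
    """Group-mean advantage, then r_tilde = clip(A / a_max, -1, 1) / 2 + 0.5."""
    mean = sum(rewards) / len(rewards)
    out = []
    for r in rewards:
        adv = r - mean
        clipped = max(-1.0, min(1.0, adv / a_max))  # assumed bounds [-1, 1]
        out.append(clipped / 2.0 + 0.5)
    return out

def sq_dist(a, b):
    """Squared L2 distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def policy_loss(r_tilde, v_theta, v_old, v_target, beta):
    """r * ||v+ - v_target||^2 + (1 - r) * ||v- - v_target||^2."""
    v_pos = [(1 - beta) * o + beta * n for o, n in zip(v_old, v_theta)]
    v_neg = [(1 + beta) * o - beta * n for o, n in zip(v_old, v_theta)]
    return (r_tilde * sq_dist(v_pos, v_target)
            + (1 - r_tilde) * sq_dist(v_neg, v_target))

# Toy group of G = 3 candidate clips with scalar rewards.
r_tilde = normalized_rewards([1.0, 2.0, 3.0], a_max=1.0)

# With beta = 1, v+ equals the current prediction and v- is its
# reflection around the old prediction.
best = policy_loss(1.0, [1.0, 2.0], [0.0, 0.0], [1.0, 2.0], beta=1.0)
worst = policy_loss(0.0, [1.0, 2.0], [0.0, 0.0], [-1.0, -2.0], beta=1.0)
```

A sample with $\tilde{r}$ near 1 pulls $v_\theta$ toward the target, while a sample with $\tilde{r}$ near 0 pushes $v_\theta$ away from it by fitting the reflected prediction $v^-$ instead.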

### 3. Reward Design and Regularization
**Multi-reward Formulation:** Composite reward integrates:
1.  **Visual Quality (VQ):** Mean HPSv3 score over top 30% of frames.
2.  **Motion Quality (MQ):** VideoAlign on grayscale inputs (focuses on motion dynamics).
3.  **Text Alignment (TA):** VideoAlign on RGB inputs for semantic correspondence.
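
The VQ aggregation in item 1 can be sketched as a top-fraction mean over per-frame scores. The per-frame scores here are made-up numbers standing in for HPSv3 outputs, and rounding the frame count up with `ceil` is an assumption; the summary does not say how the 30% cut is rounded or how the three rewards are weighted.

```python
import math

def top_fraction_mean(frame_scores, fraction=0.3):
    """Mean of the highest-scoring `fraction` of frames (at least one frame)."""
    k = max(1, math.ceil(fraction * len(frame_scores)))
    top = sorted(frame_scores, reverse=True)[:k]
    return sum(top) / k

# Hypothetical per-frame scores for a 10-frame clip.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
vq = top_fraction_mean(scores, fraction=0.3)   # averages the 3 best frames
```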

**Uncertainty-Aware Selective KL Penalty:** 
* Quantifies sample uncertainty as rank discrepancy between primary reward model $p$ and $M-1$ auxiliary models:
    $$\Delta_{\text{rank}}^{(i)} = \text{rank}_p^{(i)} - \frac{1}{M-1} \sum_{m \neq p} \text{rank}_m^{(i)}$$
* Masks risky samples (likely reward hacking) using $M^{(i)} = 1[\Delta_{\text{rank}}^{(i)} > \tau]$, where $\tau$ is the $(1-\rho)$-th percentile of positive discrepancies (risk ratio $\rho$).
* **Total objective:** $L = L_{\text{policy}} + \lambda_{\text{KL}} L_{\text{KL}}$, where the KL penalty is applied **only to masked samples**.
* **Dynamic Reference Updates:** Policy $\theta_{\text{old}}$ follows EMA update. Reference policy $\theta_{\text{ref}}$ conditionally resets ($\theta_{\text{ref}} \gets \theta$) when policy deviation surpasses $\tau_{\text{KL}}$ or epochs reach $K_{\max}$.
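
The selective masking and reference updates above can be sketched as follows. Several details are assumptions, since the summary leaves them open: ranks are ascending in reward (a higher rank means a higher score), the percentile uses a simple nearest-rank cut, the policy deviation is passed in as a precomputed number, and all function names are hypothetical.

```python
def ascending_ranks(scores):
    """Rank 0 = lowest score, rank len-1 = highest (assumed convention)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def rank_discrepancy(primary, auxiliaries):
    """Primary rank minus the mean rank over the auxiliary reward models."""
    rp = ascending_ranks(primary)
    aux = [ascending_ranks(a) for a in auxiliaries]
    return [rp[i] - sum(r[i] for r in aux) / len(aux)
            for i in range(len(primary))]

def risk_mask(deltas, rho):
    """Mask samples whose discrepancy exceeds tau, the (1 - rho) cut of positives."""
    positives = sorted(d for d in deltas if d > 0)
    if not positives:
        return [0] * len(deltas)
    cut = int((1.0 - rho) * len(positives))   # positives left unmasked
    tau = positives[cut - 1] if cut > 0 else 0.0
    return [1 if d > tau else 0 for d in deltas]

def ema_update(old, new, decay):
    """EMA step for theta_old on a flat list of parameters."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]

def maybe_reset_reference(ref, theta, deviation, epochs, tau_kl, k_max):
    """Reset theta_ref to theta when deviation or epoch count crosses its limit."""
    if deviation > tau_kl or epochs >= k_max:
        return list(theta), 0
    return ref, epochs

# Toy group of 4 samples: the primary model likes sample 3 most,
# the single auxiliary model likes it least.
deltas = rank_discrepancy([0.1, 0.2, 0.3, 0.9], [[0.4, 0.3, 0.2, 0.1]])
mask = risk_mask(deltas, rho=0.5)
```

In this toy group only the last sample is masked, so only that sample would receive the KL penalty.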

**Implementation Details:** 
* Base models: Self-Forcing, Causal Forcing, LongLive.
* Parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA) with rank $r=256$, $\alpha=256$.
* Memory efficiency: A single frozen base model is shared between $v_\theta$ and $v_{\theta_{\text{old}}}$; the lightweight LoRA adapters are switched during the forward pass.
* Training: 48 NVIDIA H200 GPUs, 48 prompts/epoch, group size $G=24$.
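
The shared-base design can be sketched with a toy linear layer that holds one frozen weight and several named LoRA adapters. The class `LoRALinear`, the adapter names, and the tiny matrices are assumptions for illustration. The update follows the standard LoRA form $Wx + \frac{\alpha}{r} BAx$, so $r = 256$ and $\alpha = 256$ give a scale of 1.

```python
def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen base weight plus switchable low-rank adapters (toy sketch)."""

    def __init__(self, weight, rank, alpha):
        self.weight = weight          # frozen base weight, shape (out, in)
        self.scale = alpha / rank     # standard LoRA scaling alpha / r
        self.adapters = {}            # name -> (A of shape (r, in), B of shape (out, r))
        self.active = None            # None means the base model alone

    def add_adapter(self, name, a, b):
        self.adapters[name] = (a, b)

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.active is not None:
            a, b = self.adapters[self.active]
            delta = matvec(b, matvec(a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y

# Toy rank-1 example with two adapters sharing one base weight.
layer = LoRALinear(weight=[[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=1)
layer.add_adapter("theta", a=[[1.0, 1.0]], b=[[1.0], [0.0]])
layer.add_adapter("theta_old", a=[[1.0, 1.0]], b=[[0.0], [0.0]])

x = [1.0, 2.0]
layer.active = "theta"
y_new = layer.forward(x)       # current policy prediction
layer.active = "theta_old"
y_old = layer.forward(x)       # old policy prediction, same base weight
```

Only the small adapter matrices differ between the two policies, so the base weights are stored once.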

## Empirical Validation / Results
Experiments validate effectiveness across short/long, single/multi-prompt settings.

### 1. Short-Video Single-Prompt Generation
* **Evaluation:** VBench protocols (946 standard prompts), augmented prompt test set.
* **Quantitative Results (Table 1):** Astrolabe consistently enhances all Self-Forcing variants, LongLive, and Causal Forcing.
* **Key Metrics:** Improvements in HPSv3 (aesthetics) and Motion Quality (MQ) while maintaining original inference throughput.
* **Qualitative Results (Figure 3):** Generates videos with sharper textures and superior motion coherence.

**Table 1: Quantitative results on VBench benchmarks.**
| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ | Throughput ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Diffusion Models** | | | | | | |
| LTX-Video [15] | 80.00 | 82.30 | 70.79 | 8.32 | 1.34 | 8.98 |
| Wan2.1 [42] | 84.26 | 85.30 | 80.09 | 9.26 | 1.62 | 0.78 |
| **AR Models** | | | | | | |
| SkyReels-V2 [7] | 82.67 | 84.70 | 74.53 | 9.08 | 1.59 | 0.49 |
| MAGI-1 [40] | 79.18 | 82.04 | 67.74 | 7.95 | 1.52 | 0.19 |
| NOVA [10] | 80.12 | 80.39 | 79.05 | 8.21 | 1.63 | 0.88 |
| PyramidFlow [25] | 81.72 | 84.74 | 69.62 | 8.76 | 1.50 | 6.70 |
| **Distilled AR Models** | | | | | | |
| CausVid [55] | 81.20 | 84.05 | 69.80 | 7.56 | 1.22 | 17.0 |
| Reward Forcing [30] | 84.13 | 84.84 | 81.32 | 8.74 | 1.65 | 23.1 |
| Self-Forcing [22] | 83.74 | 84.48 | 80.77 | 9.36 | 1.65 | 17.0 |
| **+ Ours** | **83.79 (+.05)** | **84.51 (+.03)** | **80.92 (+.15)** | **10.72 (+1.36)** | **1.71 (+.06)** | **17.0** |
| LongLive [51] | 83.22 | 83.68 | 81.37 | 9.38 | 1.51 | 20.7 |
| **+ Ours** | **84.93 (+1.71)** | **85.83 (+2.15)** | **81.36 (-.01)** | **11.03 (+1.65)** | **1.64 (+.13)** | **20.7** |
| Causal Forcing [64] | 84.04 | 84.59 | 81.84 | 9.48 | 1.69 | 17.0 |
| **+ Ours** | **84.46 (+.42)** | **85.15 (+.56)** | **81.72 (-.12)** | **10.84 (+1.36)** | **1.80 (+.11)** | **17.0** |

### 2. Long-Video Single-Prompt Generation
* **Evaluation:** VBench-Long protocols, generate 30-second videos.
* **Quantitative Results (Table 2):** Improves performance across long-video benchmarks for all baselines.
* **Qualitative Results (Figure 4):** Yields sharper textures and superior motion coherence over extended durations.

**Table 2: Quantitative results on VBench-Long benchmarks.**
| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 75.29 | 80.77 | 53.37 | 8.72 | 1.54 |
| FramePack [56] | 81.95 | 83.61 | 75.32 | 8.94 | 1.58 |
| Self-Forcing [22] | 81.59 | 83.82 | 72.70 | 9.12 | 1.61 |
| **+ Ours** | **82.03** | **84.36** | **72.71** | **10.38** | **1.72** |
| LongLive [51] | 83.52 | 85.44 | 75.82 | 9.21 | 1.48 |
| **+ Ours** | **84.07** | **86.12** | **75.87** | **10.67** | **1.64** |
| Causal Forcing [64] | 82.87 | 84.36 | 76.91 | 9.28 | 1.65 |
| **+ Ours** | **84.24** | **86.18** | **76.48** | **10.52** | **1.74** |

### 3. Long-Video Multi-Prompt Generation
* **Evaluation:** 100 groups of narrative scripts (6 successive 10-second prompts → 60-second videos). Evaluate clip-wise semantic adherence via CLIP scores.
* **Quantitative Results (Table 3):** Improves overall generation quality, visual aesthetics, and long-range motion consistency.
* **Qualitative Results (Figure 6):** Enhances frame-level aesthetics and temporal consistency during complex narrative transitions.

**Table 3: Quantitative evaluation on long video generation (CLIP Scores across intervals).**
| Method | Quality Score ↑ | Consistency Score ↑ | Aesthetic Score ↑ | CLIP ↑ (0-10s) | CLIP ↑ (10-20s) | CLIP ↑ (20-30s) | CLIP ↑ (30-40s) | CLIP ↑ (40-50s) | CLIP ↑ (50-60s) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 81.55 | 94.72 | 56.83 | 25.31 | 23.40 | 22.50 | 21.62 | 21.67 | 20.91 |
| FramePack [56] | 84.40 | 96.77 | 59.44 | 26.51 | 22.60 | 22.18 | 21.53 | 21.98 | 21.62 |
| Self-Forcing [22] | 83.94 | 95.74 | 58.45 | 26.24 | 24.87 | 23.46 | 21.92 | 22.05 | 21.07 |
| **+ Ours** | **84.72** | **95.98** | **59.62** | **26.42** | **24.75** | **23.95** | **22.40** | **21.85** | **21.50** |
| LongLive [51] | 84.28 | 96.05 | 59.89 | 26.63 | 25.77 | 24.65 | 23.99 | 24.52 | 24.11 |
| **+ Ours** | **85.15** | **96.16** | **60.75** | **26.80** | **26.15** | **24.45** | **24.55** | **24.30** | **24.65** |
| Causal Forcing [64] | 84.12 | 95.88 | 59.15 | 26.45 | 25.60 | 23.98 | 22.85 | 22.48 | 22.45 |
| **+ Ours** | **84.95** | **95.63** | **60.32** | **26.58** | **25.12** | **23.85** | **23.40** | **23.10** | **22.95** |

### 4. Ablation Studies
* **Streaming Training Scheme (Table 4a):** Clip-level group-wise sampling with detached context achieves best trade-off: reduces memory by ≈2× compared to clip-level full backpropagation while improving HPSv3 and MQ.
* **Reward Design & Regularization (Table 4c, Figure 7a):** Single-reward optimization induces hacking (VQ-only collapses into static frames). Multi-reward formulation (VQ+MQ+TA) prevents overfitting and yields balanced improvements. Selective KL penalty with EMA updates outperforms uniform KL or no KL.
* **Removing Adaptive Weighting (Figure 8b):** DiffusionNFT's adaptive weighting destabilizes distilled AR setting (causes $x_0$ norm explosion and reward collapse). Removing it ensures steady improvements.
* **Impact of $\beta$ (Figure 7b):** $\beta=1$ yields higher visual and motion quality compared to $\beta=0.1$.

**Table 4: Ablation studies on each component.**
**(a) Streaming Training**
| Config | HPSv3 ↑ | MQ ↑ | Mem ↓ |
| :--- | :---: | :---: | :---: |
| Seq + Full BP | OOM | OOM | > 140 |
| Seq + Detach | 10.21 | 1.72 | 96.4 |
| Clip + Full BP | 10.58 | 1.76 | 112.3 |
| Clip+Detach (Ours) |

---

_Markdown view of https://picx.dev/p/lzUWw4, served by PicX — AI-generated visual whiteboard summaries of research papers._