Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Summary (Overview)

  • Core Contribution: Astrolabe is an efficient online reinforcement learning (RL) framework designed to align distilled autoregressive (AR) video models with human visual preferences without requiring expensive re-distillation or solver-coupled reverse-process optimization.
  • Key Innovation: Introduces a forward-process RL formulation based on negative-aware fine-tuning, which establishes an implicit policy improvement direction by contrasting positive and negative samples at inference endpoints, avoiding the need for reverse-process unrolling and trajectory storage.
  • Scalability Solution: Proposes a streaming training scheme with a rolling KV cache and clip-level group-wise sampling, enabling RL alignment for long videos by applying updates only to local windows while conditioning on detached historical context, ensuring long-range coherence with constant memory overhead.
  • Robustness Enhancement: Integrates a multi-reward objective (Visual Quality, Motion Quality, Text Alignment) stabilized by uncertainty-aware selective regularization and dynamic reference updates to mitigate reward hacking and balance optimization across different quality dimensions.
  • Empirical Validation: Demonstrates consistent improvements in generation quality (visual aesthetics, motion coherence, text alignment) across multiple distilled AR video models (Self-Forcing, Causal Forcing, LongLive) and various generation settings (short/long, single/multi-prompt) while preserving original inference speed.

Introduction and Theoretical Foundation

Background & Motivation: Recent diffusion models achieve high-quality video synthesis but suffer from prohibitive latency due to multi-step denoising and bidirectional attention, preventing real-time streaming generation. Distilled autoregressive (AR) video models, obtained via Distribution Matching Distillation (DMD), enable efficient streaming inference through KV caching. However, distillation only mimics the teacher's distribution and lacks optimization for human preference, leading to artifacts and unnatural motion.

Problem: Applying online RL to align these models is challenging. Reward-guided distillation lacks active exploration. Reverse-process RL methods (e.g., DanceGRPO, Flow-GRPO) require log-probability estimation along sampling trajectories, coupling to specific solvers and storing intermediate states, which introduces substantial memory/computation overhead and erodes streaming efficiency.

Goal: Develop an efficient, stable online RL framework for distilled AR video models that avoids these bottlenecks and scales to long videos.

Theoretical Foundation: The work builds upon:

  1. Autoregressive Video Diffusion Models: Factorize the joint distribution as $p(x_{1:N}) = \prod_{i=1}^{N} p(x_i \mid x_{<i})$. Each conditional is modeled via flow matching: $x_t^i = (1 - t)x_i + t\epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, I)$ and $t \in [0,1]$. The model predicts the velocity field $v_\theta$.
  2. Forward-Process Reinforcement Learning (DiffusionNFT): Avoids reverse-process likelihood estimation by applying rewards directly to the forward process. Uses implicit positive and negative policies defined via interpolation of current and old velocity predictors.
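As a concrete illustration of the flow-matching forward process above, here is a minimal NumPy sketch (shapes and function names are illustrative only); under the linear interpolation $x_t = (1-t)x + t\epsilon$, the regression target for the velocity field is $\epsilon - x$:

```python
import numpy as np

def forward_interpolate(x, eps, t):
    """Flow-matching forward process: x_t = (1 - t) * x + t * eps."""
    return (1.0 - t) * x + t * eps

def target_velocity(x, eps):
    """Velocity of the linear path, d x_t / d t = eps - x, which the
    velocity predictor v_theta is trained to regress."""
    return eps - x

rng = np.random.default_rng(0)
x_clean = rng.normal(size=(4, 8))   # toy latent frame
noise = rng.normal(size=(4, 8))     # eps ~ N(0, I)
x_half = forward_interpolate(x_clean, noise, 0.5)  # halfway to pure noise
```

At $t=0$ the interpolation recovers the clean latent and at $t=1$ pure noise, so a single scalar $t$ smoothly indexes the forward corruption level.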

Methodology

Astrolabe combines memory-efficient streaming rollout, online RL optimization (clip-level forward-process RL & streaming long tuning), and reward design/regularization.

1. Memory-Efficient Streaming Rollout

To overcome bottlenecks of temporal credit assignment and memory overhead for long sequences:

  • Rolling KV Cache with Frame Sinks: Maintains a restricted visual context window $C_n$ comprising:
    • A frame sink of $S$ permanently retained frames for global semantic context.
    • A rolling window of the $L$ most recent frames for local conditioning.
    • KV memory remains constant (independent of video length $N$).
  • Clip-level Group-wise Sampling: At step $n$, using the frozen KV cache of $C_n$, the model decodes $G$ independent candidate clips in parallel: $x_n^{(i)} \sim \pi_\theta(\cdot \mid C_n, c)$ for $i \in \{1, \dots, G\}$. Sharing the context prefix across candidates reduces rollout cost.
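The sink-plus-rolling-window cache can be sketched with a toy container in pure Python (here `frame_kv` stands in for a frame's key/value tensors; the class and its parameters are illustrative):

```python
from collections import deque

class RollingKVCache:
    """Toy context manager: keep S sink frames forever plus the L most recent."""

    def __init__(self, num_sinks, window):
        self.num_sinks = num_sinks
        self.sinks = []                      # first S frames, kept permanently
        self.recent = deque(maxlen=window)   # rolling window of last L frames

    def append(self, frame_kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(frame_kv)      # fill the frame sink first
        else:
            self.recent.append(frame_kv)     # deque evicts the oldest frame

    def context(self):
        """Context C_n: sinks + rolling window, size S + L regardless of N."""
        return self.sinks + list(self.recent)

cache = RollingKVCache(num_sinks=2, window=3)
for frame_id in range(10):
    cache.append(frame_id)
# Context stays at S + L = 5 entries no matter how many frames arrive.
```

The `deque(maxlen=...)` eviction is what keeps the KV footprint constant as the video grows; only the $S$ sink frames escape it.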

2. Online RL Optimization

  • Clip-level Forward-Process RL:
    • For each candidate $x_n^{(i)}$, evaluate the composite reward $R(x_n^{(i)}, c)$.
    • Compute the group-relative advantage $A^{(i)} = R(x_n^{(i)}, c) - \frac{1}{G} \sum_{j=1}^{G} R(x_n^{(j)}, c)$.
    • Normalize: $\tilde{r}_i = \text{clip}(A^{(i)} / A_{\max}) / 2 + 0.5$.
    • For the distilled model with $T=4$, the timestep $t$ is sampled from $T_{\text{distill}}$.
    • Construct the noised sample $x_n^{t,(i)}$ to predict the velocities $v_\theta$ and $v_{\theta_{\text{old}}}$.
    • Implicit Policy Loss: Optimize $L_{\text{policy}}$ from DiffusionNFT (Eq. 2), discarding DiffusionNFT's adaptive loss weighting, which triggers gradient explosions under the large discretization gaps of distilled AR settings:
    $$L_{\text{policy}} = \tilde{r} \, \| v^+ - v_{\text{target}} \|_2^2 + (1 - \tilde{r}) \, \| v^- - v_{\text{target}} \|_2^2$$
    where the implicit policies are $v^+ = (1 - \beta) v_{\theta_{\text{old}}} + \beta v_\theta$ and $v^- = (1 + \beta) v_{\theta_{\text{old}}} - \beta v_\theta$, with $\beta$ controlling the interpolation strength.
  • Streaming Long Tuning: Simulates long-sequence inference while decoupling gradient computation.
    • Perform full forward pass to accumulate KV cache up to target step.
    • At the active training window $x_n$, detach the KV cache of preceding frames $x_{<n}$ from the computation graph (it serves as historical context only).
    • Backpropagate gradients only through the active window. Bounds training memory.
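The clip-level update above can be sketched in NumPy; names are illustrative, and the `clip` range $[-1, 1]$ (so that $\tilde{r} \in [0, 1]$) is an assumption, since the text does not state the bounds:

```python
import numpy as np

def group_advantage(rewards):
    """A^(i) = R^(i) - mean_j R^(j): advantage relative to the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def normalize_reward(adv, a_max=1.0):
    """r~ = clip(A / A_max) / 2 + 0.5, assumed to clip into [-1, 1]."""
    return np.clip(adv / a_max, -1.0, 1.0) / 2.0 + 0.5

def nft_policy_loss(v_theta, v_old, v_target, r_tilde, beta=1.0):
    """Implicit positive/negative policies and the reward-weighted loss."""
    v_pos = (1.0 - beta) * v_old + beta * v_theta   # v^+
    v_neg = (1.0 + beta) * v_old - beta * v_theta   # v^-
    pos = np.sum((v_pos - v_target) ** 2)
    neg = np.sum((v_neg - v_target) ** 2)
    return r_tilde * pos + (1.0 - r_tilde) * neg

rewards = [0.9, 0.5, 0.7]                            # one group of G = 3 clips
r_tilde = normalize_reward(group_advantage(rewards))  # per-candidate weights
```

Note that at $\beta = 1$ the positive policy $v^+$ collapses to $v_\theta$ itself, so high-reward candidates directly pull the current predictor toward the target velocity.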

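The streaming long-tuning recipe — accumulate a detached rolling history, backpropagate only through the active window — can be sketched in PyTorch with a toy linear layer standing in for one AR decoding step (all shapes and the window size are illustrative):

```python
import torch

torch.manual_seed(0)
step = torch.nn.Linear(4, 4)  # stand-in for one AR decoding step

history = None                # detached KV-style context
outputs = []
for frame in torch.randn(6, 4):
    ctx = history.mean(dim=0) if history is not None else torch.zeros(4)
    out = step(frame + ctx)
    outputs.append(out)
    # Detach before caching: gradients never flow into past windows,
    # so training memory is bounded by the active window alone.
    h = out.detach().unsqueeze(0)
    history = h if history is None else torch.cat([history, h])[-3:]  # rolling window

# Backpropagate only through the active (final) window.
loss = outputs[-1].pow(2).mean()
loss.backward()
```

Because the cached history is detached, its tensors carry no autograd graph, yet the layer still receives a gradient from the active window's loss.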
3. Reward Design and Regularization

Multi-reward Formulation: Composite reward integrates:

  1. Visual Quality (VQ): Mean HPSv3 score over top 30% of frames.
  2. Motion Quality (MQ): VideoAlign on grayscale inputs (focuses on motion dynamics).
  3. Text Alignment (TA): VideoAlign on RGB inputs for semantic correspondence.
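The VQ term is simple enough to state directly in code; the use of a ceiling for the 30% cutoff is an assumption:

```python
import numpy as np

def visual_quality(frame_scores, top_frac=0.3):
    """VQ reward: mean per-frame HPSv3 score over the top 30% of frames."""
    s = np.sort(np.asarray(frame_scores, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(top_frac * s.size)))               # top-k cutoff
    return float(s[:k].mean())
```

Averaging only the best frames rewards peak visual fidelity without penalizing a clip for a few weaker frames.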

Uncertainty-Aware Selective KL Penalty:

  • Quantifies sample uncertainty as the rank discrepancy between the primary reward model $p$ and $M-1$ auxiliary models: $\Delta_{\text{rank}}^{(i)} = \text{rank}_p^{(i)} - \frac{1}{M-1} \sum_{m \neq p} \text{rank}_m^{(i)}$
  • Masks risky samples (likely reward hacking) using $M^{(i)} = \mathbb{1}[\Delta_{\text{rank}}^{(i)} > \tau]$, where $\tau$ is the $(1-\rho)$-th percentile of positive discrepancies (risk ratio $\rho$).
  • Total objective: $L = L_{\text{policy}} + \lambda_{\text{KL}} L_{\text{KL}}$, with the KL penalty applied only to masked samples.
  • Dynamic Reference Updates: The policy $\theta_{\text{old}}$ follows an EMA update. The reference policy $\theta_{\text{ref}}$ is conditionally reset ($\theta_{\text{ref}} \gets \theta$) when the policy deviation surpasses $\tau_{\text{KL}}$ or the epoch count reaches $K_{\max}$.
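A NumPy sketch of the selective-penalty machinery follows. The rank convention (higher rank = higher reward), the strict inequality at the threshold, and the EMA momentum value are assumptions not fixed by the text:

```python
import numpy as np

def rank_discrepancy(primary, aux):
    """Delta_rank^(i): rank under the primary reward model minus the mean
    rank under the auxiliary models (rank 0 = lowest reward)."""
    def ranks(r):
        return np.argsort(np.argsort(np.asarray(r))).astype(float)
    return ranks(primary) - np.mean([ranks(a) for a in aux], axis=0)

def risky_mask(delta, risk_ratio=0.2):
    """Flag samples whose positive discrepancy exceeds the (1 - rho)-th
    percentile of positive discrepancies; these receive the KL penalty."""
    pos = delta[delta > 0]
    if pos.size == 0:
        return np.zeros(delta.shape, dtype=bool)
    tau = np.percentile(pos, 100.0 * (1.0 - risk_ratio))
    return delta > tau

def ema_update(theta_old, theta, momentum=0.99):
    """EMA tracking of theta_old (momentum value is a guess)."""
    return momentum * theta_old + (1.0 - momentum) * theta
```

Intuitively, a sample the primary reward model ranks far above what the auxiliary models agree on is a likely reward-hacking case, and only those samples pay the KL regularization cost.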

Implementation Details:

  • Base models: Self-Forcing, Causal-Forcing, LongLive.
  • Parameter-efficient fine-tuning: Low-Rank Adaptation (LoRA) with rank $r=256$ and $\alpha=256$.
  • Memory efficiency: A single frozen base model is shared for $v_\theta$ and $v_{\theta_{\text{old}}}$, switching lightweight LoRA adapters during the forward pass.
  • Training: 48 NVIDIA H200 GPUs, 48 prompts per epoch, group size $G=24$.

Empirical Validation / Results

Experiments validate effectiveness across short/long, single/multi-prompt settings.

1. Short-Video Single-Prompt Generation

  • Evaluation: VBench protocols (946 standard prompts), augmented prompt test set.
  • Quantitative Results (Table 1): Astrolabe consistently enhances all Self-Forcing variants, LongLive, and Causal-Forcing.
  • Key Metrics: Improvements in HPSv3 (aesthetics) and Motion Quality (MQ) while maintaining original inference throughput.
  • Qualitative Results (Figure 3): Generates videos with sharper textures and superior motion coherence.

Table 1: Quantitative results on VBench benchmarks.

| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ | Throughput ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| *Diffusion Models* | | | | | | |
| LTX-Video [15] | 80.00 | 82.30 | 70.79 | 8.32 | 1.34 | 8.98 |
| Wan2.1 [42] | 84.26 | 85.30 | 80.09 | 9.26 | 1.62 | 0.78 |
| *AR Models* | | | | | | |
| SkyReels-V2 [7] | 82.67 | 84.70 | 74.53 | 9.08 | 1.59 | 0.49 |
| MAGI-1 [40] | 79.18 | 82.04 | 67.74 | 7.95 | 1.52 | 0.19 |
| NOVA [10] | 80.12 | 80.39 | 79.05 | 8.21 | 1.63 | 0.88 |
| PyramidFlow [25] | 81.72 | 84.74 | 69.62 | 8.76 | 1.50 | 6.70 |
| *Distilled AR Models* | | | | | | |
| CausVid [55] | 81.20 | 84.05 | 69.80 | 7.56 | 1.22 | 17.0 |
| Reward Forcing [30] | 84.13 | 84.84 | 81.32 | 8.74 | 1.65 | 23.1 |
| Self-Forcing [22] | 83.74 | 84.48 | 80.77 | 9.36 | 1.65 | 17.0 |
| + Ours | 83.79 (+.05) | 84.51 (+.03) | 80.92 (+.15) | 10.72 (+1.36) | 1.71 (+.06) | 17.0 |
| LongLive [51] | 83.22 | 83.68 | 81.37 | 9.38 | 1.51 | 20.7 |
| + Ours | 84.93 (+1.71) | 85.83 (+2.15) | 81.36 (-.01) | 11.03 (+1.65) | 1.64 (+.13) | 20.7 |
| Causal Forcing [64] | 84.04 | 84.59 | 81.84 | 9.48 | 1.69 | 17.0 |
| + Ours | 84.46 (+.42) | 85.15 (+.56) | 81.72 (-.12) | 10.84 (+1.36) | 1.80 (+.11) | 17.0 |

2. Long-Video Single-Prompt Generation

  • Evaluation: VBench-Long protocols, generate 30-second videos.
  • Quantitative Results (Table 2): Improves performance across long-video benchmarks for all baselines.
  • Qualitative Results (Figure 4): Yields sharper textures and superior motion coherence over extended durations.

Table 2: Quantitative results on VBench-Long benchmarks.

| Method | Total ↑ | Quality ↑ | Semantic ↑ | HPSv3 ↑ | MQ ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 75.29 | 80.77 | 53.37 | 8.72 | 1.54 |
| FramePack [56] | 81.95 | 83.61 | 75.32 | 8.94 | 1.58 |
| Self-Forcing [22] | 81.59 | 83.82 | 72.70 | 9.12 | 1.61 |
| + Ours | 82.03 | 84.36 | 72.71 | 10.38 | 1.72 |
| LongLive [51] | 83.52 | 85.44 | 75.82 | 9.21 | 1.48 |
| + Ours | 84.07 | 86.12 | 75.87 | 10.67 | 1.64 |
| Causal Forcing [64] | 82.87 | 84.36 | 76.91 | 9.28 | 1.65 |
| + Ours | 84.24 | 86.18 | 76.48 | 10.52 | 1.74 |

3. Long-Video Multi-Prompt Generation

  • Evaluation: 100 groups of narrative scripts (6 successive 10-second prompts → 60-second videos). Evaluate clip-wise semantic adherence via CLIP scores.
  • Quantitative Results (Table 3): Improves overall generation quality, visual aesthetics, and long-range motion consistency.
  • Qualitative Results (Figure 6): Enhances frame-level aesthetics and temporal consistency during complex narrative transitions.

Table 3: Quantitative evaluation on long video generation. The last six columns are CLIP scores over the indicated 10-second intervals.

| Method | Quality ↑ | Consistency ↑ | Aesthetic ↑ | 0-10s | 10-20s | 20-30s | 30-40s | 40-50s | 50-60s |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SkyReels-V2 [7] | 81.55 | 94.72 | 56.83 | 25.31 | 23.40 | 22.50 | 21.62 | 21.67 | 20.91 |
| FramePack [56] | 84.40 | 96.77 | 59.44 | 26.51 | 22.60 | 22.18 | 21.53 | 21.98 | 21.62 |
| Self-Forcing [22] | 83.94 | 95.74 | 58.45 | 26.24 | 24.87 | 23.46 | 21.92 | 22.05 | 21.07 |
| + Ours | 84.72 | 95.98 | 59.62 | 26.42 | 24.75 | 23.95 | 22.40 | 21.85 | 21.50 |
| LongLive [51] | 84.28 | 96.05 | 59.89 | 26.63 | 25.77 | 24.65 | 23.99 | 24.52 | 24.11 |
| + Ours | 85.15 | 96.16 | 60.75 | 26.80 | 26.15 | 24.45 | 24.55 | 24.30 | 24.65 |
| Causal Forcing [64] | 84.12 | 95.88 | 59.15 | 26.45 | 25.60 | 23.98 | 22.85 | 22.48 | 22.45 |
| + Ours | 84.95 | 95.63 | 60.32 | 26.58 | 25.12 | 23.85 | 23.40 | 23.10 | 22.95 |

4. Ablation Studies

  • Streaming Training Scheme (Table 4a): Clip-level group-wise sampling with detached context achieves best trade-off: reduces memory by ≈2× compared to clip-level full backpropagation while improving HPSv3 and MQ.
  • Reward Design & Regularization (Table 4c, Figure 7a): Single-reward optimization induces hacking (VQ-only collapses into static frames). Multi-reward formulation (VQ+MQ+TA) prevents overfitting and yields balanced improvements. Selective KL penalty with EMA updates outperforms uniform KL or no KL.
  • Removing Adaptive Weighting (Figure 8b): DiffusionNFT's adaptive weighting destabilizes the distilled AR setting (it causes the $x_0$ norm to explode and the reward to collapse). Removing it ensures steady improvements.
  • Impact of $\beta$ (Figure 7b): $\beta = 1$ yields higher visual and motion quality than $\beta = 0.1$.

Table 4: Ablation studies on each component. (a) Streaming Training

| Config | HPSv3 ↑ | MQ ↑ | Mem ↓ |
| :--- | :---: | :---: | :---: |
| Seq + Full BP | OOM | OOM | > 140 |
| Seq + Detach | 10.21 | 1.72 | 96.4 |
| Clip + Full BP | 10.58 | 1.76 | 112.3 |
| Clip + Detach (Ours) | | | |