LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
Summary (Overview)
- NVFP4-based Co-design: Introduces an end-to-end NVFP4 (4-bit floating point) parallel infrastructure for both training and inference of long video generation, addressing speed and memory bottlenecks.
- Balanced SP for Training: Proposes Balanced Sequence Parallelism (SP), which co-designs efficient teacher-forcing with SP execution by pairing clean-history and noisy-target temporal chunks on each GPU, enabling load balancing and SP-aware VAE encoding.
- Clean Algorithmic Pipeline: Leverages high-quality infrastructure to enable a direct, single-stage fine-tuning of a bidirectional diffusion model into a long, multi-shot, interactive autoregressive (AR) model, bypassing complex multi-stage processes (ODE initialization, DMD) used in prior work.
- Efficient Inference System: Enables W4A4 NVFP4 inference, quantizes the KV cache to NVFP4 for memory savings, and boosts end-to-end throughput with asynchronous streaming VAE decoding and parallel dequantization.
- Strong Performance & Efficiency: Achieves up to 2.15× training speedup and 1.84× inference speedup. The 5B parameter model (LongLive-2.0-5B) attains 45.7 FPS for 720p video generation while maintaining strong benchmark scores on VBench and VBench-Long.
Introduction and Theoretical Foundation
Long video generation faces prohibitive GPU memory consumption and low computational efficiency in both training (due to massive datasets) and inference (due to real-time latency demands). Existing work focuses on algorithmic design but largely neglects infrastructure optimization.
Limitations of Existing Work:
- Infrastructure: Lack of joint co-design between training and inference. Inference quantization methods typically use Post-Training Quantization (PTQ), leading to misalignment and suboptimal performance.
- Algorithm: Prevailing training pipelines (e.g., Self-Forcing, Causal-Forcing) are overly complicated, requiring multi-stage processes like ODE initialization and Distribution Matching Distillation (DMD).
Theoretical Basis: The work builds upon efficient parallel teacher-forcing formulations for autoregressive (AR) diffusion models. It trains a chunk-level AR model that denoises the current noisy chunk conditioned on clean generated history, using a block-sparse AR mask to supervise all noisy chunks in one forward pass. The key innovation is co-designing this AR training layout with sequence-parallel execution.
Methodology
The framework co-designs algorithms with NVFP4-based parallel infrastructure for both training and inference (Figure 2).
1. Training Infrastructure
Sequence-Parallel AR Training (Balanced SP):
Traditional SP applied to the concatenated DiT sequence [z_clean; z_noisy] creates workload imbalance and replicates VAE encoding. Balanced SP assigns each GPU the clean and noisy latents from the same temporal chunk.
- Each rank p prepares its local clean latent chunk and applies noise locally to get the matched noisy chunk. The owned DiT sequence is:

  $$[z_{clean}^{(p)}; z_{noisy}^{(p)}] \in \mathbb{R}^{(L/P) \times H \times d}$$

  where P is the SP group size, L is the total token length, H is the number of heads, and d is the head dimension.
- SP-aware VAE Encoding: Each rank encodes only its local raw-video chunk X^{(p)} plus a left halo covering the VAE's temporal receptive field, then discards the halo. This reduces per-rank VAE cost from O(F) to O(F/P + h) for F latent frames and halo size h.
- Natural Teacher-Forcing Mask: After Ulysses All-to-All communication, the global token order becomes interleaved: [z_clean^(0), z_noisy^(0), ..., z_clean^(P-1), z_noisy^(P-1)]. Instead of permuting back to [all clean; all noisy], the AR mask is constructed directly on this communication-native order and compiled with `flex_attention`.
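The partitioning and mask layout described above can be sketched at chunk granularity. This is a minimal illustration rather than the paper's implementation: the helper names `halo_ranges` and `interleaved_tf_mask` are hypothetical, frame units are abstract, and the attention rule (clean chunks attend causally to clean chunks, each noisy chunk attends to strictly earlier clean chunks plus itself) is the usual teacher-forcing rule, assumed here rather than taken from the source.

```python
def halo_ranges(num_frames, num_ranks, halo):
    """Per-rank (encode_start, own_start, own_end) frame ranges.

    Each rank encodes frames [encode_start, own_end) and keeps only
    [own_start, own_end); the left halo is discarded after encoding.
    Assumes num_frames is divisible by num_ranks.
    """
    per_rank = num_frames // num_ranks
    ranges = []
    for p in range(num_ranks):
        own_start = p * per_rank
        own_end = own_start + per_rank
        encode_start = max(0, own_start - halo)  # rank 0 has no left context
        ranges.append((encode_start, own_start, own_end))
    return ranges


def interleaved_tf_mask(num_chunks):
    """Block-level mask over [clean_0, noisy_0, ..., clean_{P-1}, noisy_{P-1}].

    mask[q][k] is True when query block q may attend to key block k.
    """
    n = 2 * num_chunks
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_chunk, q_is_noisy = divmod(q, 2)  # even index = clean, odd = noisy
        for k in range(n):
            k_chunk, k_is_noisy = divmod(k, 2)
            if q_is_noisy:
                # Noisy target: strictly earlier clean history, plus itself.
                mask[q][k] = (not k_is_noisy and k_chunk < q_chunk) or k == q
            else:
                # Clean history: causal attention over clean chunks only.
                mask[q][k] = (not k_is_noisy) and k_chunk <= q_chunk
    return mask
```

With 16 frames, 4 ranks, and a halo of 2, every rank except the first encodes 6 frames and keeps 4, which matches the O(F/P + h) per-rank cost. In a real system the same predicate would be written at token granularity and handed to `flex_attention` as a mask function.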
NVFP4 Training:
NVFP4 uses a 4-bit floating-point (E2M1) format with hierarchical scaling. A dequantized tensor \hat{X} is represented as:

$$\hat{X} = \alpha_{FP32} \cdot \alpha_{FP8} \cdot X_{FP4}$$

where X_{FP4} is the 4-bit E2M1 payload, \alpha_{FP8} is a block-wise (16 elements) scale in FP8 E4M3, and \alpha_{FP32} is a tensor-wise global scale in FP32. NVFP4 accelerates GEMMs and reduces memory, with gains increasing as video length grows.
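To make the two-level scaling concrete, the following sketch quantizes values in blocks of 16 onto the E2M1 grid and dequantizes them again. It is a simplified illustration under stated assumptions: the function names are hypothetical, the block scale is kept as a Python float instead of being rounded to FP8 E4M3, and the global scale is simply passed in rather than derived from the tensor.

```python
# Magnitudes representable by a 4-bit E2M1 value (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements sharing one block-wise scale


def quantize_block(block, global_scale=1.0):
    """Return (codes, block_scale) for one block of up to 16 values."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    block_scale = amax / (6.0 * global_scale) if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / (block_scale * global_scale)
        mag = min(E2M1_GRID, key=lambda g: abs(g - target))  # round to nearest
        codes.append(-mag if v < 0 else mag)
    return codes, block_scale


def dequantize_block(codes, block_scale, global_scale=1.0):
    """Reconstruct values as global_scale * block_scale * code."""
    return [global_scale * block_scale * c for c in codes]


def quantize(values, global_scale=1.0):
    """Split a flat list into blocks of 16 and quantize each independently."""
    blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
    return [quantize_block(b, global_scale) for b in blocks]
```

Values that already sit on the scaled grid round-trip exactly; everything else is rounded to the nearest representable magnitude, which is where the quantization error comes from.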
- Multi-Shot AR NVFP4 Training: The AR generator is trained on real multi-shot data using end-to-end NVFP4 quantization for linear layers (2D block scaling for weights, 1D for activations/gradients). Sensitive operations (reductions, normalization, optimizer states) remain in higher precision. Random Hadamard Transform (RHT) stabilizes the weight-gradient GEMM path.
- Few-step Distillation in NVFP4: For DMD, both teacher and student operate in W4A4 NVFP4. The backbone is frozen, and only LoRA adapters are optimized:

  $$\mathbf{W} = Q_{search}(\mathbf{W}_0) + \mathbf{B}\mathbf{A}$$

  where \mathbf{W}_0 is the pretrained weight, Q_{search} is scale-search-based NVFP4 quantization, and \mathbf{A}, \mathbf{B} are trainable low-rank matrices of rank r.
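A frozen quantized backbone with a trainable low-rank branch can be sketched as below. This assumes the standard LoRA form, in which the layer output is the quantized-weight product plus B(Ax); the function names are hypothetical, the quantized weight is passed in already quantized, and plain Python lists stand in for tensors.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x, w_q, A, B):
    """y = w_q @ x + B @ (A @ x); only A and B would receive gradients."""
    base = matvec(w_q, x)             # frozen, pre-quantized backbone weight
    update = matvec(B, matvec(A, x))  # low-rank branch, rank r = len(A)
    return [b + u for b, u in zip(base, update)]


w_q = [[1.0, 0.5], [0.0, 2.0]]  # stands in for the quantized pretrained weight
A = [[1.0, 1.0]]                # r x d_in with r = 1
x = [2.0, 4.0]

y_init = lora_forward(x, w_q, A, [[0.0], [0.0]])     # zero B: backbone only
y_trained = lora_forward(x, w_q, A, [[0.5], [1.0]])  # after a B update
```

With B at zero, as in the usual LoRA initialization, the layer reproduces the quantized backbone exactly, so training starts from the quantized model's behavior and only the adapters move.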
2. Inference Infrastructure
NVFP4 Inference: The generator executes in W4A4 NVFP4, either as a quantized backbone with a separate LoRA branch or as a merged model. This reduces memory traffic and offers up to 4× theoretical GEMM throughput speedup.
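Parallel KV Quantization: The KV cache is quantized at the frame-chunk level. For layer \ell, cached chunk c is:

$$K_c^{(\ell)}, V_c^{(\ell)} \in \mathbb{R}^{T_c \times H \times d}$$

with T_c tokens, H heads, and head dimension d, which is reshaped to \mathbb{R}^{(T_c H) \times d} and quantized independently with NVFP4 micro-block scaling. A simple K-smoothing step is applied to the keys before quantization.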
This achieves close to a 3.6× KV-cache compression ratio. A customized parallel dequantization kernel enables efficient in-window reconstruction.
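The reported ratio is consistent with a simple bit count. The sketch below assumes a BF16 baseline cache and counts only the 4-bit payload plus one 8-bit block scale per 16 elements; the per-tensor FP32 scale and any smoothing metadata are ignored as negligible, which is an assumption of this estimate.

```python
BF16_BITS = 16   # baseline cache precision (assumed)
FP4_BITS = 4     # E2M1 payload per element
SCALE_BITS = 8   # one FP8 E4M3 scale per block
BLOCK = 16       # elements per block

# Amortized storage cost per cached element.
bits_per_element = FP4_BITS + SCALE_BITS / BLOCK  # 4.5 bits

# Ideal compression ratio relative to the BF16 cache.
compression_ratio = BF16_BITS / bits_per_element
print(f"{bits_per_element} bits/element, {compression_ratio:.2f}x compression")
```

The ideal ratio comes out just under 3.6×, so the block-scale overhead is what keeps the result below a flat 4×.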
Asynchronous Streaming Decoding: A dedicated GPU handles VAE decoding asynchronously alongside the DiT SP cluster. While the DiT cluster denoises chunk c+1, the VAE node decodes chunk c. This reduces end-to-end latency from C(t_DiT + t_VAE) to approximately C \cdot t_DiT + t_VAE for C chunks and hides decoding overhead.
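The latency expressions can be checked with a small deterministic simulation of the two-stage pipeline. This is a sketch with hypothetical function names and idealized assumptions: constant per-chunk times, no transfer cost between the DiT cluster and the VAE GPU, and a VAE that can only start a chunk once the DiT has finished it and the previous decode is done.

```python
def sync_latency(num_chunks, t_dit, t_vae):
    """Denoise and decode each chunk back to back on the same resources."""
    return num_chunks * (t_dit + t_vae)


def async_latency(num_chunks, t_dit, t_vae):
    """Decode chunk c on a separate device while chunk c+1 is denoised."""
    dit_done = 0.0
    vae_done = 0.0
    for _ in range(num_chunks):
        dit_done += t_dit
        # Decoding waits for both the latent and the previous decode.
        vae_done = max(dit_done, vae_done) + t_vae
    return vae_done


print(sync_latency(4, 2.0, 1.0), async_latency(4, 2.0, 1.0))
```

The approximation C · t_DiT + t_VAE holds when decoding a chunk is no slower than denoising one. If t_VAE exceeds t_DiT, the decoder becomes the bottleneck and the total grows with C · t_VAE instead.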
3. Algorithm-level Designs
Multi-Shot Interactive AR Training: Uses chunk-level generation where each temporal latent chunk Z_i is bound to an individual text prompt T_i (CrossAttn(Z_i, T_i)). This supports prompt switches at chunk boundaries and interactive editing.
Multi-Shot Attention Sink: For streaming inference, a novel sink mechanism uses two cooperating anchor sets (Figure 7):
- Global Sink (\mathcal{A}_g): First S_g frames of the video, fixed to preserve global identity.
- Shot-Level Sink (\mathcal{A}_s): First S_s frames of the current shot, re-bound at every scene cut to maintain local temporal coherence.

The effective key/value set at step t is \mathcal{K}_{eff}(t) = \mathcal{A}_g \cup \mathcal{A}_s \cup KV[t-W, t), where KV[t-W, t) is the sliding local window of width W. This integrates seamlessly with chunk-wise prompting for interactive generation.
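The union of the two sinks and the local window can be sketched as index bookkeeping. The function name `effective_kv` and the choice to index by frame are assumptions for illustration; the source does not specify the units of t and W.

```python
def effective_kv(t, shot_start, s_g, s_s, window):
    """Indices visible at step t: global sink, shot-level sink, local window."""
    global_sink = set(range(0, min(s_g, t)))                     # fixed anchors
    shot_sink = set(range(shot_start, min(shot_start + s_s, t)))  # per-shot anchors
    local = set(range(max(0, t - window), t))                    # sliding window
    return sorted(global_sink | shot_sink | local)


# Mid-shot: the current shot started at index 12.
mid_shot = effective_kv(t=20, shot_start=12, s_g=2, s_s=2, window=4)

# Right after a scene cut at index 20, the shot-level sink is re-bound.
after_cut = effective_kv(t=21, shot_start=20, s_g=2, s_s=2, window=4)
```

The visible set never exceeds S_g + S_s + W entries, so cache size stays bounded regardless of video length, while the global anchors persist across every cut.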
Empirical Validation / Results
1. Training Efficiency
Table 1: AR training efficiency of LongLive-2.0. (The Speedup column compares NVFP4 Balanced SP against BF16 w/ SP)
| Input Length | BF16 w/o SP | BF16 w/ SP | BF16 Balanced SP | NVFP4 Balanced SP | Speedup |
|---|---|---|---|---|---|
| 16s | 75.3 | 52.2 | 45.8 | 40.1 | 1.3× |
| 32s | 202.7 | 162.7 | 136.8 | 119.3 | 1.4× |
| 64s | OOM | 1372.9 | 1196.5 | 639.5 | 2.1× |
- NVFP4+Balanced SP provides the fastest training, with gains most pronounced at long sequences (2.15× speedup at 64s).
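The Speedup column can be reproduced directly from the table entries, which also shows where the 2.15× headline figure comes from. The values below are copied from Table 1; the variable names are illustrative.

```python
# Table 1 entries, keyed by input length.
bf16_sp = {"16s": 52.2, "32s": 162.7, "64s": 1372.9}
nvfp4_balanced_sp = {"16s": 40.1, "32s": 119.3, "64s": 639.5}

# Speedup of NVFP4 Balanced SP relative to BF16 with standard SP.
speedup = {k: bf16_sp[k] / nvfp4_balanced_sp[k] for k in bf16_sp}

for length, ratio in speedup.items():
    print(f"{length}: {ratio:.2f}x")
```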
Table 2: Progressively quantizing models in DMD training. (Peak per-GPU memory)
| Generator | Real | Fake | Peak Memory | Ratio ↓ |
|---|---|---|---|---|
| BF16 | BF16 | BF16 | 70.5 GB | - |
| NVFP4 | BF16 | BF16 | 63.3 GB | 0.90× |
| NVFP4+LoRA | NVFP4 | BF16 | 57.2 GB | 0.81× |
| NVFP4+LoRA | NVFP4 | NVFP4+LoRA | 49.0 GB | 0.69× |
- Progressive NVFP4 conversion reduces peak memory by 21.5 GB per GPU (to 69% of baseline).
2. Inference Efficiency
Table 3: Inference efficiency under progressively enabled optimizations. (On NVIDIA GB200)
| Inference Settings | FPS ↑ | 16s E2E Gen. (s) / Mem. (GB) | 32s E2E Gen. (s) / Mem. (GB) | 64s E2E Gen. (s) / Mem. (GB) |
|---|---|---|---|---|
| BF16 | 24.8 | 26.6 / 36.4 | 53.2 / 36.4 | 112.9 / 36.4 |
| NVFP4 | 32.0 | 22.9 / 29.7 | 46.6 / 29.7 | 96.0 / 29.7 |
| + NVFP4 KV Cache | 29.7 | 23.8 / 19.4 | 48.9 / 19.4 | 99.5 / 19.4 |
| + Async Decoding | 29.7 | 15.9 / 19.4 | 29.1 / 19.4 | 57.6 / 19.4 |
| 3 Steps | 35.2 | 12.7 / 19.4 | 23.2 / 19.4 | 46.0 / 19.4 |
| 2 Steps | 45.7 | 11.2 / 19.4 | 19.2 / 19.4 | 36.3 / 19.4 |
- NVFP4 with KV cache quantization reduces peak memory from 29.7 GB to 19.4 GB.
- Asynchronous decoding significantly reduces end-to-end latency.
- The 2-step system achieves 45.7 FPS with a 19.4 GB memory footprint for 64s videos.
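The headline inference numbers follow from the first and last rows of Table 3. The sketch below recomputes them from the table values; the variable names are illustrative.

```python
# First row (BF16) and last row (2 Steps) of Table 3.
bf16 = {"fps": 24.8, "e2e_64s": 112.9, "mem_gb": 36.4}
two_step = {"fps": 45.7, "e2e_64s": 36.3, "mem_gb": 19.4}

fps_speedup = two_step["fps"] / bf16["fps"]          # throughput gain
e2e_speedup = bf16["e2e_64s"] / two_step["e2e_64s"]  # 64s end-to-end gain
mem_ratio = two_step["mem_gb"] / bf16["mem_gb"]      # remaining memory share

print(f"{fps_speedup:.2f}x FPS, {e2e_speedup:.2f}x E2E, {mem_ratio:.2f}x memory")
```

The FPS ratio matches the reported 1.84× inference speedup. The end-to-end gain at 64s is larger because asynchronous decoding shortens wall-clock time without changing the FPS figure.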
3. Performance Evaluation
Table 4: Comparison on VBench (short video).
| Model | Precision | #Steps | #Params | Resolution | Throughput (FPS) ↑ | Total Score ↑ |
|---|---|---|---|---|---|---|
| Self-Forcing [26] | BF16 | 4 | 1.3B | 832×480 | 21.2 | 84.31 |
| LongLive [64] | BF16 | 4 | 1.3B | 832×480 | 20.7 | 84.87 |
| LongLive-2.0 | BF16 | 4 | 5B | 1280×720 | 24.8 | 85.06 |
| LongLive-2.0 | NVFP4 | 4 | 5B | 1280×720 | 29.7 | 84.51 |
| LongLive-2.0 | NVFP4 | 2 | 5B | 1280×720 | 45.7 | 83.14 |
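- LongLive-2.0 achieves the highest throughput in the comparison (24.8 to 45.7 FPS) while running a 5B model at 1280×720, whereas the baselines are 1.3B models at 832×480.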
- NVFP4 with 2-step denoising enables real-time generation at 45.7 FPS.
Table 5: Comparison on VBench-Long (60s video). (Avg. Rank computed over 6 metrics; lower is better)
| Method | Avg. Rank ↓ | Subject Consistency ↑ | Background Consistency ↑ |
|---|---|---|---|
| Self-Forcing [26] | 5.83 | 95.84 | 95.27 |
| LongLive [64] | 4.17 | 97.13 | 95.89 |
| LongLive-2.0 | 3.67 | 97.48 | 97.00 |
| → LongLive-2.0 NVFP4 | 3.83 | 97.62 | 96.97 |
- LongLive-2.0 achieves the best average rank (3.67) and the top background consistency (97.00), while its NVFP4 variant ranks second (3.83) with the top subject consistency (97.62), demonstrating strong long-range generation ability.
Theoretical and Practical Implications
Theoretical Implications:
- Demonstrates that strong infrastructure can simplify algorithmic pipelines. The co-design of Balanced SP and NVFP4 enables direct, single-stage AR training, challenging the necessity of complex multi-stage distillation pipelines prevalent in the field.
- Establishes the viability and advantages of end-to-end low-precision (FP4) training for large-scale generative video models, aligning training and inference precision to avoid quality degradation.
Practical Implications:
- Significant Efficiency Gains: Provides up to 2.15× training and 1.84× inference speedups with substantial memory reduction, lowering the computational barrier for long video generation research and deployment.
- Real-Time High-Resolution Generation: Enables real-time (45.7 FPS) generation of 720p long videos, making interactive applications more feasible.
- Hardware-Aware Deployment: Offers a full NVFP4 stack for Blackwell GPUs and provides Sequence Parallelism inference as an efficient alternative for non-Blackwell architectures.
- System-Level Optimization: Highlights the importance of end-to-end system optimization, including asynchronous decoding and KV cache quantization, for practical throughput.
Conclusion
LongLive-2.0 presents a comprehensive algorithm-infrastructure co-design system that addresses the efficiency bottlenecks in long video generation. Its core contributions are:
- Balanced SP for efficient, load-balanced sequence-parallel AR training.
- End-to-end NVFP4 integration for training and inference, reducing memory and accelerating computation.
- A clean training pipeline that directly fine-tunes models for long, multi-shot, interactive generation.
- An inference system with W4A4 NVFP4, quantized KV cache, and asynchronous decoding for high throughput.
The system achieves state-of-the-art efficiency (45.7 FPS for a 5B model) while maintaining strong benchmark performance. It is the first end-to-end NVFP4 training and inference system tailored for long video generation.
Limitations: The acceleration from NVFP4 inference is hardware-dependent and requires Blackwell GPUs (e.g., GB200) for native support. On non-Blackwell GPUs, SP inference serves as an alternative acceleration path.
Broader Impacts: The system reduces the computational cost and resource threshold for video generation research. The infrastructure itself introduces no additional negative social implications; it shares the ethical considerations of existing video generation models.