LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Summary (Overview)

  • NVFP4-based Co-design: Introduces an end-to-end NVFP4 (4-bit floating point) parallel infrastructure for both training and inference of long video generation, addressing speed and memory bottlenecks.
  • Balanced SP for Training: Proposes Balanced Sequence Parallelism (SP), which co-designs efficient teacher-forcing with SP execution by pairing clean-history and noisy-target temporal chunks on each GPU, enabling load balancing and SP-aware VAE encoding.
  • Clean Algorithmic Pipeline: Leverages high-quality infrastructure to enable a direct, single-stage fine-tuning of a bidirectional diffusion model into a long, multi-shot, interactive autoregressive (AR) model, bypassing complex multi-stage processes (ODE initialization, DMD) used in prior work.
  • Efficient Inference System: Enables W4A4 NVFP4 inference, quantizes the KV cache to NVFP4 for memory savings, and boosts end-to-end throughput with asynchronous streaming VAE decoding and parallel dequantization.
  • Strong Performance & Efficiency: Achieves up to 2.15× training speedup and 1.84× inference speedup. The 5B parameter model (LongLive-2.0-5B) attains 45.7 FPS for 720p video generation while maintaining strong benchmark scores on VBench and VBench-Long.

Introduction and Theoretical Foundation

Long video generation faces prohibitive GPU memory consumption and low computational efficiency in both training (due to massive datasets) and inference (due to real-time latency demands). Existing work focuses on algorithmic design but largely neglects infrastructure optimization.

Limitations of Existing Work:

  1. Infrastructure: Lack of joint co-design between training and inference. Inference quantization methods typically use Post-Training Quantization (PTQ), leading to misalignment and suboptimal performance.
  2. Algorithm: Prevailing training pipelines (e.g., Self-Forcing, Causal-Forcing) are overly complicated, requiring multi-stage processes such as ODE initialization and Distribution Matching Distillation (DMD).

Theoretical Basis: The work builds upon efficient parallel teacher-forcing formulations for autoregressive (AR) diffusion models. It trains a chunk-level AR model that denoises the current noisy chunk conditioned on clean generated history, using a block-sparse AR mask to supervise all noisy chunks in one forward pass. The key innovation is co-designing this AR training layout with sequence-parallel execution.

Methodology

The framework co-designs algorithms with NVFP4-based parallel infrastructure for both training and inference (Figure 2).

1. Training Infrastructure

Sequence-Parallel AR Training (Balanced SP): Traditional SP applied to the concatenated DiT sequence [z_clean; z_noisy] creates workload imbalance and replicates VAE encoding. Balanced SP assigns each GPU the clean and noisy latents from the same temporal chunk.

  • Each rank p prepares its local clean latent chunk and applies noise locally to get the matched noisy chunk. The owned DiT sequence is \mathbf{z}^{(p)} = [\mathbf{z}^{(p)}_{clean}, \mathbf{z}^{(p)}_{noisy}] \in \mathbb{R}^{\frac{L}{P} \times H \times d}, where P is the SP group size, L is the total token length, H is the number of heads, and d is the head dimension.
  • SP-aware VAE Encoding: Each rank encodes only its local raw-video chunk X^{(p)} plus a left halo covering the VAE's temporal receptive field, then discards the halo. This reduces per-rank VAE cost from O(F) to O(F/P + h) for F latent frames and halo size h.
  • Natural Teacher-Forcing Mask: After Ulysses All-to-All communication, the global token order becomes interleaved [z_clean^(0), z_noisy^(0), ..., z_clean^(P-1), z_noisy^(P-1)]. Instead of permuting back to [all clean; all noisy], the AR mask is constructed directly on this communication-native order and compiled with flex_attention.
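
The mask on the interleaved order can be sketched as an index predicate. This is a minimal illustration, not the authors' code: the visibility rule (clean chunks attend causally to clean chunks, each noisy chunk attends to strictly earlier clean chunks and to itself, attention inside a chunk is bidirectional) is an assumption based on the description of teacher forcing above, and the names block_info, ar_mask, and dense_mask are hypothetical. The predicate takes (q_idx, kv_idx) in the same index-based style as a flex_attention mask_mod, but no tensor code is shown.

```python
def block_info(idx, chunk_len):
    """Map a global token index to (chunk_id, is_noisy) for the interleaved
    order [clean(0), noisy(0), clean(1), noisy(1), ...].
    chunk_len is the number of tokens in one clean (or one noisy) chunk."""
    block = idx // chunk_len
    return block // 2, block % 2 == 1


def ar_mask(q_idx, kv_idx, chunk_len):
    """Return True if the query token may attend to the key token.
    Assumed rule, not taken verbatim from the source."""
    q_chunk, q_noisy = block_info(q_idx, chunk_len)
    k_chunk, k_noisy = block_info(kv_idx, chunk_len)
    if k_noisy:
        # Noisy keys are visible only to queries in the same noisy chunk.
        return q_noisy and q_chunk == k_chunk
    if q_noisy:
        # Noisy queries see clean history from strictly earlier chunks.
        return k_chunk < q_chunk
    # Clean queries see clean keys causally at chunk granularity.
    return k_chunk <= q_chunk


def dense_mask(num_chunks, chunk_len):
    """Materialize the full boolean mask (for inspection on tiny sizes only)."""
    n = 2 * num_chunks * chunk_len
    return [[ar_mask(q, k, chunk_len) for k in range(n)] for q in range(n)]
```

With two chunks of two tokens each, indices 0-1 are clean(0), 2-3 noisy(0), 4-5 clean(1), and 6-7 noisy(1), so a query at index 6 sees indices 0, 1, 6, and 7. Because the rule depends only on chunk id and clean/noisy parity, no permutation back to [all clean; all noisy] is needed.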

NVFP4 Training: NVFP4 uses a 4-bit floating-point (E2M1) format with hierarchical scaling. A dequantized tensor \hat{X} is represented as:

\hat{X} = \hat{X}_{FP4} \cdot \alpha_{FP8} \cdot \alpha_{FP32}, \quad \hat{X}_{FP4} \in \mathcal{F}_{E2M1}

where \alpha_{FP8} is a block-wise scale (one per 16 elements) stored in FP8 E4M3 and \alpha_{FP32} is a tensor-wise global scale in FP32. NVFP4 accelerates GEMMs and reduces memory, with gains increasing as video length grows.
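
The hierarchical scaling can be illustrated with a small simulated ("fake") quantizer in plain Python. This is a sketch under stated assumptions rather than the kernel used in the paper: the block scale is chosen as amax / 6 (6 being the largest E2M1 magnitude), it is kept in ordinary floating point instead of being rounded to FP8 E4M3, and the function names are hypothetical.

```python
# Magnitudes representable by E2M1 (a sign bit is stored separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one block-wise scale


def quantize_block(block):
    """Return (E2M1 codes, block scale) for up to 16 values.
    Assumption: scale = amax / 6, kept in float (no FP8 rounding)."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return codes, scale


def fake_quantize(values, global_scale=1.0):
    """Quantize then dequantize: x_hat = code * block_scale * global_scale."""
    out = []
    for i in range(0, len(values), BLOCK):
        block = [v / global_scale for v in values[i:i + BLOCK]]
        codes, scale = quantize_block(block)
        out.extend(c * scale * global_scale for c in codes)
    return out
```

Values that already sit on the scaled E2M1 grid round-trip exactly. Any other value moves by at most one block scale, because the widest gap in the grid is 2 (between 4 and 6).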

  • Multi-Shot AR NVFP4 Training: The AR generator is trained on real multi-shot data using end-to-end NVFP4 quantization for linear layers (2D block scaling for weights, 1D for activations/gradients). Sensitive operations (reductions, normalization, optimizer states) remain in higher precision. Random Hadamard Transform (RHT) stabilizes the weight-gradient GEMM path.
  • Few-step Distillation in NVFP4: For DMD, both teacher and student operate in W4A4 NVFP4. The backbone is frozen, and only LoRA adapters are optimized: \mathbf{W} \simeq \text{Dequant}(Q_{search}(\mathbf{W}_0)) + \Delta\mathbf{W}, \quad \Delta\mathbf{W} = \frac{\alpha_{LoRA}}{r}\mathbf{BA}, where \mathbf{W}_0 is the pretrained weight, Q_{search} is scale-search-based NVFP4 quantization, and \mathbf{A}, \mathbf{B} are trainable low-rank matrices of rank r.
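
The LoRA formula above can be checked on a toy example. The sketch below composes the effective weight from an already dequantized frozen matrix and a low-rank update. The quantizer itself is omitted, the helper names are hypothetical, and nested lists stand in for tensors.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def effective_weight(W_deq, B, A, alpha, r):
    """W = W_deq + (alpha / r) * B @ A.
    W_deq: frozen dequantized weight (out x in); B: out x r; A: r x in."""
    delta = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_deq, delta)]


W_deq = [[1.0, 0.0], [0.0, 1.0]]  # stands in for Dequant(Q_search(W_0))
B = [[1.0], [2.0]]                # rank r = 1
A = [[0.5, 0.5]]
W = effective_weight(W_deq, B, A, alpha=2.0, r=1)
```

Only A and B would receive gradients in such a setup. If B starts at zero, as is common in LoRA implementations (the source does not state its initialization), the effective weight initially equals the quantized backbone.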

2. Inference Infrastructure

NVFP4 Inference: The generator executes in W4A4 NVFP4, either as a quantized backbone with a separate LoRA branch or as a merged model. This reduces memory traffic and offers up to 4× theoretical GEMM throughput speedup.

Parallel KV Quantization: The KV cache is quantized at the frame-chunk level. For layer \ell, cached chunk c is:

\mathbf{K}^{\ell,c}, \mathbf{V}^{\ell,c} \in \mathbb{R}^{T_c \times H \times d}

which is reshaped to \mathbb{R}^{(T_c H) \times d} and quantized independently with NVFP4 micro-block scaling. A simple K-smoothing is applied first:

\bar{\mathbf{K}}^{\ell,c}[t, h, :] = \mathbf{K}^{\ell,c}[t, h, :] - \frac{1}{d}\sum_{u=1}^{d} \mathbf{K}^{\ell,c}[t, h, u]

This achieves close to a 3.6× KV-cache compression ratio. A customized parallel dequantization kernel enables efficient in-window reconstruction.
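
The smoothing step and the storage arithmetic can be sketched as follows. Two assumptions are made that the text does not state: the uncompressed cache uses 16 bits per element, and the accounting counts only the 4-bit codes plus one 8-bit scale per 16 elements, ignoring the per-tensor FP32 scale and any stored means. Function names are hypothetical.

```python
def k_smooth(k_row):
    """Subtract the mean over the head dimension from one K[t, h, :] row,
    following the K-smoothing formula above. The mean is returned as well
    so a caller could restore it; how the source handles it is not stated."""
    mean = sum(k_row) / len(k_row)
    return [v - mean for v in k_row], mean


def nvfp4_bits_per_element(block=16, code_bits=4, scale_bits=8):
    """Average storage per element: 4-bit code plus a shared 8-bit scale."""
    return code_bits + scale_bits / block


def compression_ratio(baseline_bits=16):
    """Ratio of an assumed 16-bit cache to the NVFP4 cache."""
    return baseline_bits / nvfp4_bits_per_element()
```

Under these assumptions the cost is 4.5 bits per element and the ratio is 16 / 4.5, about 3.56, which is consistent with the reported figure of close to 3.6×.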

Asynchronous Streaming Decoding: A dedicated GPU handles VAE decoding asynchronously alongside the DiT SP cluster. While the DiT cluster denoises chunk c+1, the VAE node decodes chunk c. This reduces end-to-end latency from C(t_DiT + t_VAE) to approximately C \cdot t_DiT + t_VAE for C chunks and hides decoding overhead.
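
The latency claim can be reproduced with a small two-stage pipeline model. This is an illustrative timing model with constant per-chunk costs, not a measurement of the actual system, and the function names are hypothetical.

```python
def sequential_latency(num_chunks, t_dit, t_vae):
    """Denoise and decode each chunk back to back: C * (t_DiT + t_VAE)."""
    return num_chunks * (t_dit + t_vae)


def pipelined_latency(num_chunks, t_dit, t_vae):
    """The DiT stage produces chunk c+1 while the VAE stage decodes chunk c."""
    dit_done = 0.0
    vae_done = 0.0
    for _ in range(num_chunks):
        dit_done += t_dit
        # Decoding starts once the chunk is denoised and the decoder is free.
        vae_done = max(dit_done, vae_done) + t_vae
    return vae_done
```

For 4 chunks with t_DiT = 2.0 and t_VAE = 1.0, the sequential model gives 12.0 and the pipelined model gives 9.0, matching C \cdot t_DiT + t_VAE. The approximation holds while t_VAE does not exceed t_DiT; otherwise the decoder becomes the bottleneck and the total approaches t_DiT + C \cdot t_VAE.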

3. Algorithm-level Designs

Multi-Shot Interactive AR Training: Uses chunk-level generation where each temporal latent chunk Z_i is bound to an individual text prompt T_i (CrossAttn(Z_i, T_i)). This supports prompt switches at chunk boundaries and interactive editing.

Multi-Shot Attention Sink: For streaming inference, a novel sink mechanism uses two cooperating anchor sets (Figure 7):

  • Global Sink (\mathcal{A}_g): First S_g frames of the video, fixed to preserve global identity.
  • Shot-Level Sink (\mathcal{A}_s): First S_s frames of the current shot, re-bound at every scene cut to maintain local temporal coherence. The effective key/value set at step t is \mathcal{K}_{eff}(t) = \mathcal{A}_g \cup \mathcal{A}_s \cup KV[t-W, t). This integrates seamlessly with chunk-wise prompting for interactive generation.
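
The effective key/value set can be written out at frame granularity. The sketch below is an assumption-laden illustration: it indexes whole frames rather than tokens, clamps each range to frames that exist before step t, and uses hypothetical parameter names.

```python
def effective_kv_frames(t, shot_start, s_global, s_shot, window):
    """Frame indices visible at step t: global sink, shot-level sink,
    and the sliding window [t - window, t)."""
    global_sink = range(0, min(s_global, t))
    shot_sink = range(shot_start, min(shot_start + s_shot, t))
    recent = range(max(0, t - window), t)
    return sorted(set(global_sink) | set(shot_sink) | set(recent))
```

With S_g = 2, S_s = 2, W = 3, and a shot that began at frame 10, step 20 sees frames 0, 1, 10, 11, 17, 18, and 19. At a scene cut only shot_start changes, so the global anchors stay fixed while the shot-level anchors are re-bound.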

Empirical Validation / Results

1. Training Efficiency

Table 1: AR training efficiency of LongLive-2.0. (Speedup is measured over BF16 w/ SP.)

Input Length | BF16 w/o SP | BF16 w/ SP | BF16 Balanced SP | NVFP4 Balanced SP | Speedup
16s | 75.3 | 52.2 | 45.8 | 40.1 | 1.3×
32s | 202.7 | 162.7 | 136.8 | 119.3 | 1.4×
64s | OOM | 1372.9 | 1196.5 | 639.5 | 2.1×
  • NVFP4+Balanced SP provides the fastest training, with gains most pronounced at long sequences (2.15× speedup at 64s).
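
The speedup figures follow from the table entries by simple division. The check below treats the entries as time-like costs where lower is better, which is an assumption because the units are not stated here; the dictionary and function names are hypothetical.

```python
# Table 1 entries for the baseline (BF16 w/ SP) and NVFP4 + Balanced SP.
BF16_SP = {"16s": 52.2, "32s": 162.7, "64s": 1372.9}
NVFP4_BALANCED_SP = {"16s": 40.1, "32s": 119.3, "64s": 639.5}


def speedup(length):
    """Baseline cost divided by NVFP4 + Balanced SP cost."""
    return BF16_SP[length] / NVFP4_BALANCED_SP[length]


for length in ("16s", "32s", "64s"):
    print(length, round(speedup(length), 2))
```

The 64s row gives 1372.9 / 639.5, roughly 2.15, which is the headline training speedup.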

Table 2: Progressively quantizing models in DMD training. (Peak per-GPU memory)

Generator | Real | Fake | Peak Memory | Ratio ↓
BF16 | BF16 | BF16 | 70.5 GB | -
NVFP4 | BF16 | BF16 | 63.3 GB | 0.90×
NVFP4+LoRA | NVFP4 | BF16 | 57.2 GB | 0.81×
NVFP4+LoRA | NVFP4 | NVFP4+LoRA | 49.0 GB | 0.69×
  • Progressive NVFP4 conversion reduces peak memory by 21.5 GB per GPU (to 69% of baseline).

2. Inference Efficiency

Table 3: Inference efficiency under progressively enabled optimizations. (On NVIDIA GB200)

Inference Settings | FPS ↑ | 16s E2E Gen. (s) / Mem. (GB) | 32s E2E Gen. (s) / Mem. (GB) | 64s E2E Gen. (s) / Mem. (GB)
BF16 | 24.8 | 26.6 / 36.4 | 53.2 / 36.4 | 112.9 / 36.4
NVFP4 | 32.0 | 22.9 / 29.7 | 46.6 / 29.7 | 96.0 / 29.7
+ NVFP4 KV Cache | 29.7 | 23.8 / 19.4 | 48.9 / 19.4 | 99.5 / 19.4
+ Async Decoding | 29.7 | 15.9 / 19.4 | 29.1 / 19.4 | 57.6 / 19.4
3 Steps | 35.2 | 12.7 / 19.4 | 23.2 / 19.4 | 46.0 / 19.4
2 Steps | 45.7 | 11.2 / 19.4 | 19.2 / 19.4 | 36.3 / 19.4
  • NVFP4 with KV cache quantization reduces peak memory from 29.7 GB to 19.4 GB.
  • Asynchronous decoding significantly reduces end-to-end latency.
  • The 2-step system achieves 45.7 FPS with a 19.4 GB memory footprint for 64s videos.

3. Performance Evaluation

Table 4: Comparison on VBench (short video).

Model | Precision | #Steps | #Params | Resolution | Throughput (FPS) ↑ | Total Score ↑
Self-Forcing [26] | BF16 | 4 | 1.3B | 832×480 | 21.2 | 84.31
LongLive [64] | BF16 | 4 | 1.3B | 832×480 | 20.7 | 84.87
LongLive-2.0 | BF16 | 4 | 5B | 1280×720 | 24.8 | 85.06
LongLive-2.0 | NVFP4 | 4 | 5B | 1280×720 | 29.7 | 84.51
LongLive-2.0 | NVFP4 | 2 | 5B | 1280×720 | 45.7 | 83.14
  • LongLive-2.0 achieves the highest throughput in the comparison while generating at 1280×720, versus 832×480 for the baselines.
  • NVFP4 with 2-step denoising enables real-time generation at 45.7 FPS.

Table 5: Comparison on VBench-Long (60s video). (Avg. Rank computed over 6 metrics; lower is better)

Method | Avg. Rank ↓ | Subject Consistency ↑ | Background Consistency ↑
Self-Forcing [26] | 5.83 | 95.84 | 95.27
LongLive [64] | 4.17 | 97.13 | 95.89
LongLive-2.0 | 3.67 | 97.48 | 97.00
→ LongLive-2.0 NVFP4 | 3.83 | 97.62 | 96.97
  • LongLive-2.0 achieves the best average rank, demonstrating strong long-range generation ability, with top scores in subject and background consistency.

Theoretical and Practical Implications

Theoretical Implications:

  1. Demonstrates that strong infrastructure can simplify algorithmic pipelines. The co-design of Balanced SP and NVFP4 enables direct, single-stage AR training, challenging the necessity of complex multi-stage distillation pipelines prevalent in the field.
  2. Establishes the viability and advantages of end-to-end low-precision (FP4) training for large-scale generative video models, aligning training and inference precision to avoid quality degradation.

Practical Implications:

  1. Significant Efficiency Gains: Provides up to 2.15× training and 1.84× inference speedups with substantial memory reduction, lowering the computational barrier for long video generation research and deployment.
  2. Real-Time High-Resolution Generation: Enables real-time (45.7 FPS) generation of 720p long videos, making interactive applications more feasible.
  3. Hardware-Aware Deployment: Offers a full NVFP4 stack for Blackwell GPUs and provides Sequence Parallelism inference as an efficient alternative for non-Blackwell architectures.
  4. System-Level Optimization: Highlights the importance of end-to-end system optimization, including asynchronous decoding and KV cache quantization, for practical throughput.

Conclusion

LongLive-2.0 presents a comprehensive algorithm-infrastructure co-design system that addresses the efficiency bottlenecks in long video generation. Its core contributions are:

  • Balanced SP for efficient, load-balanced sequence-parallel AR training.
  • End-to-end NVFP4 integration for training and inference, reducing memory and accelerating computation.
  • A clean training pipeline that directly fine-tunes models for long, multi-shot, interactive generation.
  • An inference system with W4A4 NVFP4, quantized KV cache, and asynchronous decoding for high throughput.

The system achieves state-of-the-art efficiency (45.7 FPS for a 5B model) while maintaining strong benchmark performance. It is the first end-to-end NVFP4 training and inference system tailored for long video generation.

Limitations: The acceleration from NVFP4 inference is hardware-dependent, requiring Blackwell GPUs (e.g., GB200) for native support. On non-Blackwell GPUs, SP inference is used as an alternative acceleration path.

Broader Impacts: The system reduces computational costs and resource thresholds for video generation research. The infrastructure itself introduces no additional negative social implications; it shares the ethical considerations of existing video generation models.