LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

Summary (Overview)

NVFP4-based Co-design: Introduces an end-to-end NVFP4 (4-bit floating point) parallel infrastructure for both training and inference of long video generation, addressing speed and memory bottlenecks.
Balanced SP for Training: Proposes Balanced Sequence Parallelism (SP), which co-designs efficient teacher-forcing with SP execution by pairing clean-history and noisy-target temporal chunks on each GPU, enabling load balancing and SP-aware VAE encoding.
Clean Algorithmic Pipeline: Leverages high-quality infrastructure to enable a direct, single-stage fine-tuning of a bidirectional diffusion model into a long, multi-shot, interactive autoregressive (AR) model, bypassing complex multi-stage processes (ODE initialization, DMD) used in prior work.
Efficient Inference System: Enables W4A4 NVFP4 inference, quantizes the KV cache to NVFP4 for memory savings, and boosts end-to-end throughput with asynchronous streaming VAE decoding and parallel dequantization.
Strong Performance & Efficiency: Achieves up to 2.15× training speedup and 1.84× inference speedup. The 5B parameter model (LongLive-2.0-5B) attains 45.7 FPS for 720p video generation while maintaining strong benchmark scores on VBench and VBench-Long.

Introduction and Theoretical Foundation

Long video generation faces prohibitive GPU memory consumption and low computational efficiency in both training (due to massive datasets) and inference (due to real-time latency demands). Existing work focuses on algorithmic design but largely neglects infrastructure optimization.

Limitations of Existing Work:

Infrastructure: Lack of joint co-design between training and inference. Inference quantization methods typically use Post-Training Quantization (PTQ), leading to misalignment and suboptimal performance.
Algorithm: Prevailing training pipelines (e.g., Self-Forcing, Causal-Forcing) are overly complicated, requiring multi-stage processes such as ODE initialization and Distribution Matching Distillation (DMD).

Theoretical Basis: The work builds upon efficient parallel teacher-forcing formulations for autoregressive (AR) diffusion models. It trains a chunk-level AR model that denoises the current noisy chunk conditioned on clean generated history, using a block-sparse AR mask to supervise all noisy chunks in one forward pass. The key innovation is co-designing this AR training layout with sequence-parallel execution.

Methodology

The framework co-designs algorithms with NVFP4-based parallel infrastructure for both training and inference (Figure 2).

1. Training Infrastructure

Sequence-Parallel AR Training (Balanced SP): Traditional SP applied to the concatenated DiT sequence [z_clean; z_noisy] creates workload imbalance and replicates VAE encoding. Balanced SP assigns each GPU the clean and noisy latents from the same temporal chunk.

Each rank p prepares its local clean latent chunk and applies noise locally to get the matched noisy chunk. The owned DiT sequence is: $\mathbf{z}^{(p)} = [\mathbf{z}^{(p)}_{clean}, \mathbf{z}^{(p)}_{noisy}] \in \mathbb{R}^{\frac{L}{P} \times H \times d}$ where P is SP group size, L is total token length, H is number of heads, and d is head dimension.
SP-aware VAE Encoding: Each rank encodes only its local raw-video chunk X^{(p)} plus a left halo covering the VAE's temporal receptive field, then discards the halo. This reduces per-rank VAE cost from O(F) to O(F/P + h) for F latent frames and halo size h.
Natural Teacher-Forcing Mask: After Ulysses All-to-All communication, the global token order becomes interleaved: [z_clean^(0), z_noisy^(0), ..., z_clean^(P-1), z_noisy^(P-1)]. Instead of permuting back to [all clean; all noisy], the AR mask is constructed directly on this communication-native order and compiled with flex_attention.
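
The per-rank layout above can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the function names are hypothetical, the frame count is assumed to divide evenly across ranks, and the visibility rule (clean tokens see clean history up to and including their own chunk; noisy tokens see strictly earlier clean chunks plus their own noisy chunk) is the usual teacher-forcing convention, assumed here rather than quoted from the source. In a PyTorch implementation a predicate like `may_attend` would typically be expressed as a `mask_mod` for `flex_attention`.

```python
def local_frames_with_halo(rank, num_ranks, num_frames, halo):
    """Return (halo_start, start, end) for one rank.

    Frames [halo_start, end) are encoded locally; only [start, end) are kept
    once the left halo is discarded. Assumes num_frames % num_ranks == 0.
    """
    per_rank = num_frames // num_ranks
    start = rank * per_rank
    end = start + per_rank
    halo_start = max(0, start - halo)  # rank 0 has no left context to borrow
    return halo_start, start, end


def interleaved_labels(num_chunks, tokens_per_chunk):
    """Label tokens in the order [clean_0, noisy_0, clean_1, noisy_1, ...]."""
    labels = []
    for chunk in range(num_chunks):
        labels += [("clean", chunk)] * tokens_per_chunk
        labels += [("noisy", chunk)] * tokens_per_chunk
    return labels


def may_attend(query, key):
    """Assumed teacher-forcing rule on (kind, chunk) labels."""
    q_kind, q_chunk = query
    k_kind, k_chunk = key
    if k_kind == "clean":
        # Clean queries see clean history including their own chunk;
        # noisy queries see only strictly earlier clean chunks.
        return k_chunk <= q_chunk if q_kind == "clean" else k_chunk < q_chunk
    # Noisy keys are visible only to noisy queries of the same chunk.
    return q_kind == "noisy" and k_chunk == q_chunk


def teacher_forcing_mask(num_chunks, tokens_per_chunk):
    """Boolean mask[q][k] built directly on the interleaved order."""
    labels = interleaved_labels(num_chunks, tokens_per_chunk)
    return [[may_attend(q, k) for k in labels] for q in labels]
```

With 4 ranks, 32 frames, and a halo of 3, rank 2 encodes frames 13 to 23 and keeps 16 to 23, so its encoding cost scales with F/P + h rather than F. Because the mask is defined on chunk labels rather than absolute positions, no permutation back to [all clean; all noisy] is needed.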

NVFP4 Training: NVFP4 uses a 4-bit floating-point (E2M1) format with hierarchical scaling. A dequantized tensor \hat{X} is represented as:

\hat{X} = \hat{X}_{FP4} \cdot \alpha_{FP8} \cdot \alpha_{FP32}, \quad \hat{X}_{FP4} \in \mathcal{F}_{E2M1}

where \alpha_{FP8} is a block-wise scale (one per 16 elements) in FP8 E4M3 and \alpha_{FP32} is a tensor-wise global scale in FP32. NVFP4 accelerates GEMMs and reduces memory, with gains increasing as video length grows.
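
The two-level scaling can be made concrete with a minimal pure-Python sketch. The E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) and the block size of 16 follow the format described above; everything else is an assumption for illustration: scales are chosen from absolute maxima, values are rounded to the nearest grid point, and the block scale is kept as a Python float instead of being rounded to FP8 E4M3. Real kernels may use different scale selection and rounding.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK = 16        # elements sharing one block scale
FP4_MAX = 6.0     # largest E2M1 magnitude
E4M3_MAX = 448.0  # largest FP8 E4M3 magnitude


def round_to_e2m1(x):
    """Round x to the nearest signed E2M1 value (assumed round-to-nearest)."""
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def quantize(values):
    """Return (codes, block_scales, global_scale) for a flat list of floats."""
    amax = max((abs(v) for v in values), default=0.0)
    # Global scale chosen so block scales would fit the E4M3 range (assumption).
    global_scale = amax / (FP4_MAX * E4M3_MAX) if amax > 0 else 1.0
    codes, block_scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Block scale kept in full precision here; hardware stores it in E4M3.
        scale = block_amax / FP4_MAX / global_scale if block_amax > 0 else 1.0
        block_scales.append(scale)
        step = scale * global_scale
        codes.extend(round_to_e2m1(v / step) for v in block)
    return codes, block_scales, global_scale


def dequantize(codes, block_scales, global_scale):
    """X_hat = X_FP4 * alpha_FP8 * alpha_FP32, applied per element."""
    return [c * block_scales[i // BLOCK] * global_scale
            for i, c in enumerate(codes)]
```

Under these assumptions the reconstruction error of each element is at most one sixth of its block's absolute maximum, since the widest gap in the E2M1 grid is between 4 and 6.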

Multi-Shot AR NVFP4 Training: The AR generator is trained on real multi-shot data using end-to-end NVFP4 quantization for linear layers (2D block scaling for weights, 1D for activations/gradients). Sensitive operations (reductions, normalization, optimizer states) remain in higher precision. Random Hadamard Transform (RHT) stabilizes the weight-gradient GEMM path.
Few-step Distillation in NVFP4: For DMD, both teacher and student operate in W4A4 NVFP4. The backbone is frozen, and only LoRA adapters are optimized: $\mathbf{W} \simeq \text{Dequant}(Q_{search}(\mathbf{W}_0)) + \Delta\mathbf{W}, \quad \Delta\mathbf{W} = \frac{\alpha_{LoRA}}{r}\mathbf{BA}$ where \mathbf{W}_0 is the pretrained weight, Q_{search} is scale-search-based NVFP4 quantization, and \mathbf{A}, \mathbf{B} are trainable low-rank matrices of rank r.
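
The LoRA formula above can be sketched with plain nested lists. This is an illustration only: `effective_weight` and `matmul` are hypothetical helpers, the dequantized backbone weight is passed in as ordinary floats, and the quantization step itself is not modeled.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def effective_weight(w_dequant, lora_b, lora_a, alpha, rank):
    """Return Dequant(Q(W0)) + (alpha / rank) * B @ A.

    w_dequant: frozen dequantized backbone weight, shape (out, in)
    lora_b:    trainable matrix, shape (out, rank)
    lora_a:    trainable matrix, shape (rank, in)
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_dequant, delta)]
```

If B is initialized to zeros, as is common for LoRA, the effective weight initially equals the dequantized backbone, so optimization starts from the quantized model's behavior and only the low-rank correction is learned.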

2. Inference Infrastructure

NVFP4 Inference: The generator executes in W4A4 NVFP4, either as a quantized backbone with a separate LoRA branch or as a merged model. This reduces memory traffic and offers up to 4× theoretical GEMM throughput speedup.

Parallel KV Quantization: The KV cache is quantized at the frame-chunk level. For layer \ell, cached chunk c is:

\mathbf{K}^{\ell,c}, \mathbf{V}^{\ell,c} \in \mathbb{R}^{T_c \times H \times d}

which is reshaped to \mathbb{R}^{(T_c H) \times d} and quantized independently with NVFP4 micro-block scaling. A simple K-smoothing is applied first:

\bar{\mathbf{K}}^{\ell,c}[t, h, :] = \mathbf{K}^{\ell,c}[t, h, :] - \frac{1}{d}\sum_{u=1}^{d} \mathbf{K}^{\ell,c}[t, h, u]

This achieves close to a 3.6× KV-cache compression ratio. A customized parallel dequantization kernel enables efficient in-window reconstruction.
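
A short sketch, under stated assumptions, of the K-smoothing step and of where a ratio near 3.6× can come from. The helper names are hypothetical. The arithmetic assumes a 16-bit (BF16) baseline cache, 4-bit codes, and one 8-bit scale per 16 elements; it ignores the per-tensor FP32 scale and any storage for the subtracted means, which would lower the ratio slightly.

```python
def smooth_k_row(row):
    """Subtract the mean over the head dimension from one K[t, h, :] row.

    Returns the centered row and the mean, which would need to be kept
    if the original values are to be reconstructed (assumption).
    """
    mean = sum(row) / len(row)
    return [v - mean for v in row], mean


def nvfp4_bits_per_element(block_size=16, code_bits=4, scale_bits=8):
    """Average storage per element: 4-bit code plus amortized block scale."""
    return code_bits + scale_bits / block_size


def kv_compression_ratio(baseline_bits=16):
    """Compression relative to an assumed 16-bit baseline cache."""
    return baseline_bits / nvfp4_bits_per_element()
```

With these numbers each element costs 4.5 bits on average, giving 16 / 4.5 ≈ 3.56, which is consistent with the reported "close to 3.6×".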

Asynchronous Streaming Decoding: A dedicated GPU handles VAE decoding asynchronously alongside the DiT SP cluster. While the DiT cluster denoises chunk c+1, the VAE node decodes chunk c. This reduces end-to-end latency from C(t_DiT + t_VAE) to approximately C \cdot t_DiT + t_VAE for C chunks and hides decoding overhead.
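
The latency claim can be checked with a simple two-stage pipeline model. This is a sketch with hypothetical function names, assuming constant per-chunk times and ignoring transfer cost between the DiT cluster and the VAE GPU.

```python
def sequential_latency(num_chunks, t_dit, t_vae):
    """Denoise and decode each chunk back to back: C * (t_DiT + t_VAE)."""
    return num_chunks * (t_dit + t_vae)


def pipelined_latency(num_chunks, t_dit, t_vae):
    """Decode chunk c on a separate device while chunk c+1 is denoised."""
    dit_done = 0.0  # time at which the DiT stage finishes the current chunk
    vae_done = 0.0  # time at which the VAE stage finishes the current chunk
    for _ in range(num_chunks):
        dit_done += t_dit
        # Decoding starts once the chunk is denoised and the decoder is free.
        vae_done = max(dit_done, vae_done) + t_vae
    return vae_done
```

For 8 chunks with t_DiT = 1.0 and t_VAE = 0.5, the sequential schedule takes 12.0 time units while the pipelined one takes 8.5, matching C · t_DiT + t_VAE. The approximation holds only when t_VAE ≤ t_DiT; if decoding is slower than denoising, the decoder becomes the bottleneck and the total is t_DiT + C · t_VAE instead.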

3. Algorithm-level Designs

Multi-Shot Interactive AR Training: Uses chunk-level generation where each temporal latent chunk Z_i is bound to an individual text prompt T_i (CrossAttn(Z_i, T_i)). This supports prompt switches at chunk boundaries and interactive editing.

Multi-Shot Attention Sink: For streaming inference, a novel sink mechanism uses two cooperating anchor sets (Figure 7):

Global Sink (\mathcal{A}_g): First S_g frames of the video, fixed to preserve global identity.
Shot-Level Sink (\mathcal{A}_s): First S_s frames of the current shot, re-bound at every scene cut to maintain local temporal coherence.

The effective key/value set at step t is \mathcal{K}_{eff}(t) = \mathcal{A}_g \cup \mathcal{A}_s \cup KV[t-W, t), where W is the local attention window. This integrates seamlessly with chunk-wise prompting for interactive generation.
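
The effective key/value set can be sketched as a union of frame indices. The function and parameter names are hypothetical, frames are used as the indexing unit, and anchors are clipped to frames that already exist at step t (an assumption the source does not spell out).

```python
def effective_kv_frames(t, window, shot_start, global_sink, shot_sink):
    """Frame indices visible at step t: A_g ∪ A_s ∪ KV[t - W, t).

    t:           index of the frame being generated
    window:      local window size W
    shot_start:  first frame of the current shot (updated at each scene cut)
    global_sink: S_g, number of leading video frames kept as global anchors
    shot_sink:   S_s, number of leading frames of the current shot kept
    """
    global_anchor = set(range(0, min(global_sink, t)))
    shot_anchor = set(range(shot_start, min(shot_start + shot_sink, t)))
    local_window = set(range(max(0, t - window), t))
    return sorted(global_anchor | shot_anchor | local_window)
```

For example, at t = 40 with W = 6, S_g = S_s = 2, and a shot that began at frame 24, the model attends to frames 0, 1, 24, 25, and 34 through 39. At a scene cut only `shot_start` changes, so the global anchors persist across shots while the shot anchors are replaced.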

Empirical Validation / Results

1. Training Efficiency

Table 1: AR training efficiency of LongLive-2.0. (The Speedup column compares NVFP4 Balanced SP against BF16 w/ SP.)

Input Length	BF16 w/o SP	BF16 w/ SP	BF16 Balanced SP	NVFP4 Balanced SP	Speedup
16s	75.3	52.2	45.8	40.1	1.3×
32s	202.7	162.7	136.8	119.3	1.4×
64s	OOM	1372.9	1196.5	639.5	2.1×

NVFP4+Balanced SP provides the fastest training, with gains most pronounced at long sequences (2.15× speedup at 64s).
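
As a quick consistency check, the Speedup column can be recomputed from the Table 1 entries, taking BF16 w/ SP as the baseline. The dictionary below simply copies the table values; the helper name is hypothetical.

```python
# (BF16 w/ SP, NVFP4 Balanced SP) per input length, copied from Table 1.
TABLE_1 = {
    "16s": (52.2, 40.1),
    "32s": (162.7, 119.3),
    "64s": (1372.9, 639.5),
}


def speedups(table):
    """Ratio of the baseline cost to the NVFP4 Balanced SP cost."""
    return {length: base / fast for length, (base, fast) in table.items()}
```

Rounded to one decimal place this gives 1.3, 1.4, and 2.1, and the 64s ratio rounds to 2.15 at two decimals, matching the headline figure.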

Table 2: Progressively quantizing models in DMD training. (Peak per-GPU memory)

Generator	Real	Fake	Peak Memory	Ratio ↓
BF16	BF16	BF16	70.5 GB	-
NVFP4	BF16	BF16	63.3 GB	0.90×
NVFP4+LoRA	NVFP4	BF16	57.2 GB	0.81×
NVFP4+LoRA	NVFP4	NVFP4+LoRA	49.0 GB	0.69×

Progressive NVFP4 conversion reduces peak memory by 21.5 GB per GPU (to 69% of baseline).

2. Inference Efficiency

Table 3: Inference efficiency under progressively enabled optimizations. (On NVIDIA GB200)

Inference Settings	FPS ↑	16s E2E Gen. (s) / Mem. (GB)	32s E2E Gen. (s) / Mem. (GB)	64s E2E Gen. (s) / Mem. (GB)
BF16	24.8	26.6 / 36.4	53.2 / 36.4	112.9 / 36.4
NVFP4	32.0	22.9 / 29.7	46.6 / 29.7	96.0 / 29.7
+ NVFP4 KV Cache	29.7	23.8 / 19.4	48.9 / 19.4	99.5 / 19.4
+ Async Decoding	29.7	15.9 / 19.4	29.1 / 19.4	57.6 / 19.4
3 Steps	35.2	12.7 / 19.4	23.2 / 19.4	46.0 / 19.4
2 Steps	45.7	11.2 / 19.4	19.2 / 19.4	36.3 / 19.4

NVFP4 with KV cache quantization reduces peak memory from 29.7 GB to 19.4 GB.
Asynchronous decoding significantly reduces end-to-end latency.
The 2-step system achieves 45.7 FPS with a 19.4 GB memory footprint for 64s videos.

3. Performance Evaluation

Table 4: Comparison on VBench (short video).

Model	Precision	#Steps	#Params	Resolution	Throughput (FPS) ↑	Total Score ↑
Self-Forcing [26]	BF16	4	1.3B	832×480	21.2	84.31
LongLive [64]	BF16	4	1.3B	832×480	20.7	84.87
LongLive-2.0	BF16	4	5B	1280×720	24.8	85.06
LongLive-2.0	NVFP4	4	5B	1280×720	29.7	84.51
LongLive-2.0	NVFP4	2	5B	1280×720	45.7	83.14

LongLive-2.0 achieves the highest throughput at 720p resolution.
NVFP4 with 2-step denoising enables real-time generation at 45.7 FPS.

Table 5: Comparison on VBench-Long (60s video). (Avg. Rank computed over 6 metrics; lower is better)

Method	Avg. Rank ↓	Subject Consistency ↑	Background Consistency ↑
Self-Forcing [26]	5.83	95.84	95.27
LongLive [64]	4.17	97.13	95.89
LongLive-2.0	3.67	97.48	97.00
→ LongLive-2.0 NVFP4	3.83	97.62	96.97

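Among the rows shown, LongLive-2.0 in BF16 achieves the best average rank (3.67) and the highest background consistency (97.00), while its NVFP4 variant is a close second in average rank (3.83) and has the highest subject consistency (97.62), demonstrating strong long-range generation ability.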

Theoretical and Practical Implications

Theoretical Implications:

Demonstrates that strong infrastructure can simplify algorithmic pipelines. The co-design of Balanced SP and NVFP4 enables direct, single-stage AR training, challenging the necessity of complex multi-stage distillation pipelines prevalent in the field.
Establishes the viability and advantages of end-to-end low-precision (FP4) training for large-scale generative video models, aligning training and inference precision to avoid quality degradation.

Practical Implications:

Significant Efficiency Gains: Provides up to 2.15× training and 1.84× inference speedups with substantial memory reduction, lowering the computational barrier for long video generation research and deployment.
Real-Time High-Resolution Generation: Enables real-time (45.7 FPS) generation of 720p long videos, making interactive applications more feasible.
Hardware-Aware Deployment: Offers a full NVFP4 stack for Blackwell GPUs and provides Sequence Parallelism inference as an efficient alternative for non-Blackwell architectures.
System-Level Optimization: Highlights the importance of end-to-end system optimization, including asynchronous decoding and KV cache quantization, for practical throughput.

Conclusion

LongLive-2.0 presents a comprehensive algorithm-infrastructure co-design system that addresses the efficiency bottlenecks in long video generation. Its core contributions are:

Balanced SP for efficient, load-balanced sequence-parallel AR training.
End-to-end NVFP4 integration for training and inference, reducing memory and accelerating computation.
A clean training pipeline that directly fine-tunes models for long, multi-shot, interactive generation.
An inference system with W4A4 NVFP4, quantized KV cache, and asynchronous decoding for high throughput.

The system achieves state-of-the-art efficiency (45.7 FPS for a 5B model) while maintaining strong benchmark performance. It is the first end-to-end NVFP4 training and inference system tailored for long video generation.

Limitations: The acceleration from NVFP4 inference is hardware-dependent: native support requires Blackwell GPUs (e.g., GB200). On non-Blackwell GPUs, SP inference serves as the alternative acceleration path.

Broader Impacts: The system reduces the computational cost and resource threshold for video generation research. The infrastructure itself is described as introducing no negative social implications of its own; it shares the ethical considerations of existing video generation models.