PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Summary (Overview)
- Novel KV-Cache Management: Introduces a Three-Partition KV Cache strategy that categorizes generation history into Sink tokens (full resolution, global anchors), Mid tokens (highly compressed, dynamically selected), and Recent tokens (full resolution, local coherence), bounding memory usage to ~4 GB for arbitrarily long videos.
- Efficient Context Compression: Proposes a Dual-Branch Compression module fusing progressive 3D convolutions (HR branch) with low-resolution VAE re-encoding (LR branch), achieving a 128× spatiotemporal volume compression (~32× token reduction) for intermediate history.
- Robust Long-Video Generation: Enables 24× temporal extrapolation, generating coherent 2-minute (120s) videos at 832×480 resolution and 16 FPS from training on merely 5-second clips, while achieving state-of-the-art VBench scores (e.g., Overall Consistency: 26.07, Dynamic Degree: 56.25).
- Seamless Position Correction: Introduces Incremental RoPE Adjustment to correct positional discontinuities caused by dynamic token eviction/selection, ensuring stable generation with negligible overhead (<0.1% inference time).
- Dynamic Context Prioritization: Employs Dynamic Context Selection based on query-key affinity to route only the most informative compressed mid tokens into the active attention context, improving subject consistency and CLIP scores.
Introduction and Theoretical Foundation
Autoregressive video diffusion models have advanced short-clip generation but face critical bottlenecks for long-video synthesis:
- Error Accumulation: Small prediction errors compound iteratively during autoregressive denoising, leading to progressive quality degradation and semantic drift.
- Unbounded Memory Growth: The Key-Value (KV) cache scales linearly with video length. For a 2-minute, 832×480 video at 16 FPS, the full attention context grows to ~749K tokens, requiring ~138 GB of KV storage, exceeding single-GPU memory.
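The ~749K-token and ~138 GB figures can be sanity-checked with a back-of-envelope sketch. The VAE compression factors (4× temporal, 8× spatial), 2×2 patch embedding, layer count (30), hidden width (1536), and bf16 storage below are typical Wan2.1-T2V-1.3B values assumed for illustration, not stated in this section:

```python
# Back-of-envelope KV-cache estimate for full attention over a long video.
# Hyperparameters are assumed (typical Wan2.1-T2V-1.3B values).

def kv_cache_bytes(seconds, fps=16, vae_t=4, vae_s=8, patch_hw=2,
                   height=480, width=832, layers=30, dim=1536, bytes_per=2):
    """Token count and KV bytes for caching the entire video context."""
    latent_frames = seconds * fps // vae_t                  # temporal VAE compression
    tokens_per_frame = (height // vae_s // patch_hw) * (width // vae_s // patch_hw)
    tokens = latent_frames * tokens_per_frame
    # 2 tensors (K and V), per layer, per channel, in bf16
    return tokens, tokens * 2 * layers * dim * bytes_per

tokens, nbytes = kv_cache_bytes(120)
print(f"{tokens / 1e3:.0f}K tokens, {nbytes / 1e9:.0f} GB")  # 749K tokens, 138 GB
```

Under these assumptions the estimate reproduces the figures quoted above, which is why a bounded cache rather than a bigger GPU is the only way out at this scale.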
Existing methods like Self-Forcing or DeepForcing either suffer from severe error accumulation beyond their training horizon or rely on aggressive history truncation, leading to irreversible loss of critical intermediate memory. This creates a fundamental dilemma: mitigating error accumulation requires extensive context, yet hardware constraints force discarding memory.
PackForcing addresses this by building upon insights from DeepForcing (attention sinks, participative compression) but proposes to compress unselected intermediate tokens rather than irreversibly dropping them. The core theoretical foundation is a principled three-partition KV cache design that applies tailored policies based on temporal role and information density, enabling bounded memory while preserving comprehensive historical context.
Methodology
3.1 Preliminaries
The base model builds upon the Flow Matching framework. Given a clean video latent $x_0$ and standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, the noisy latent at noise level $t \in [0, 1]$ is:

$$x_t = (1 - t)\, x_0 + t\, \epsilon$$

A neural network $v_\theta$ is trained to predict the velocity field $v = \epsilon - x_0$.
For KV caching, a video sequence is partitioned into non-overlapping blocks $\{B_1, \dots, B_N\}$, each containing $f$ frames. During generation of block $B_i$, each transformer layer $l$ attends to the KV pairs cached from all previous blocks:

$$K^{(l)}_{<i} = [K^{(l)}_1; \dots; K^{(l)}_{i-1}], \qquad V^{(l)}_{<i} = [V^{(l)}_1; \dots; V^{(l)}_{i-1}],$$

where $[\cdot\,;\cdot]$ denotes concatenation along the token dimension. The attention operation is:

$$\mathrm{Attn}(Q_i, K, V) = \mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{d}}\right) V, \qquad K = [K^{(l)}_{<i}; K^{(l)}_i],\; V = [V^{(l)}_{<i}; V^{(l)}_i].$$
This leads to linear KV cache growth, which is the fundamental scaling bottleneck.
3.2 Three-Partition KV Cache
The core idea is to decouple generation history into three functional partitions with tailored policies (Fig. 2a).
- Sink Tokens (full resolution, never evicted): The earliest generated frames (the first $N_s$ blocks) serve as critical semantic anchors. For layer $l$:

  $$\mathcal{C}^{(l)}_{\mathrm{sink}} = \{(K^{(l)}_j, V^{(l)}_j)\}_{j \le N_s}$$

  These tokens lock in scene layout, subject identity, and global style. With $N_s = 2$ (two blocks), they consume <2% of the total token budget but provide a stable global reference.
- Compressed Mid Tokens (~32× token reduction, dynamically routed): The vast majority of the video history between the sink and recent window is represented by highly compressed KV pairs $(\tilde{K}^{(l)}_j, \tilde{V}^{(l)}_j)$ produced by the dual-branch module. Dynamic Context Selection routes only the top-$k$ most informative blocks into the active set:

  $$\mathcal{C}^{(l)}_{\mathrm{mid}} = \{(\tilde{K}^{(l)}_j, \tilde{V}^{(l)}_j) : j \in \mathrm{TopK}(\text{affinity})\}$$

  Each compressed block carries ~32× fewer tokens than its full-resolution counterpart: the 128× spatiotemporal volume compression, measured against the patchified token grid, amounts to a 2× temporal and 4×4 spatial token reduction.
- Recent & Current Tokens (dual-resolution shifting): The most recently generated $N_r$ blocks and the current block $B_i$ are kept pristine for high-fidelity local dynamics:

  $$\mathcal{C}^{(l)}_{\mathrm{recent}} = \{(K^{(l)}_j, V^{(l)}_j)\}_{i - N_r \le j < i}$$

  A low-resolution backup is computed concurrently for seamless transition into the mid buffer.
Bounded Attention Context: The final context for layer $l$ is the concatenation:

$$\mathcal{C}^{(l)} = [\mathcal{C}^{(l)}_{\mathrm{sink}};\; \mathcal{C}^{(l)}_{\mathrm{mid}};\; \mathcal{C}^{(l)}_{\mathrm{recent}}]$$

This enforces a constant token count for attention, independent of the total video length, ensuring $O(1)$ per-block attention complexity.
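The three-partition assembly above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: tensor shapes, the affinity scores, and the top-$k$ routing interface are assumptions, and the compressed mid blocks stand in for the output of the dual-branch module described next:

```python
# Minimal sketch of assembling the bounded three-partition attention context:
# full-resolution sink + top-k compressed mid blocks + full-resolution recent.
import torch

def build_context(sink_kv, mid_kv_blocks, recent_kv, affinity, k=4):
    """
    sink_kv, recent_kv: (K, V) pairs of shape [n_tokens, d], kept at full res.
    mid_kv_blocks: list of compressed (K, V) pairs, one per historical block.
    affinity: per-block relevance scores (e.g., query-key affinity), shape [n_blocks].
    """
    idx = torch.topk(affinity, k=min(k, len(mid_kv_blocks))).indices.tolist()
    idx.sort()  # restore temporal order after selection
    mid_k = torch.cat([mid_kv_blocks[i][0] for i in idx])
    mid_v = torch.cat([mid_kv_blocks[i][1] for i in idx])
    K = torch.cat([sink_kv[0], mid_k, recent_kv[0]])
    V = torch.cat([sink_kv[1], mid_v, recent_kv[1]])
    return K, V  # constant size regardless of total video length
```

Because the number of sink tokens, selected mid blocks, and recent tokens is fixed, the returned context has the same size at minute 2 as at second 5.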
3.3 Dual-Branch HR Compression
The mid partition requires massive token reduction while retaining structural and semantic information. The Dual-Branch Compression module (Fig. 2b, 2d) aggregates fine-grained structure (HR branch) and coarse semantics (LR branch).
- HR Branch: Progressive 3D Convolution: Operates directly on the VAE latent. Applies a cascade of strided 3D convolutions: first a 2× temporal compression, then three stages of 2× spatial compression, and a final projection to hidden dimension $d$. This yields a structurally rich representation $F_{\mathrm{HR}}$ with a 128× volume reduction ($2 \times 8 \times 8 = 128$).
- LR Branch: Pixel-Space Re-encoding: Decodes the latent back to pixel frames, applies 3D average pooling (2× temporally, 4× spatially), and re-encodes the pooled frames with the frozen VAE encoder followed by patch embedding to obtain $F_{\mathrm{LR}}$. This preserves perceptual layout better than direct latent pooling.
Feature Fusion: The two outputs are fused via element-wise addition:

$$F_{\mathrm{mid}} = F_{\mathrm{HR}} + F_{\mathrm{LR}}$$
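The HR-branch cascade can be sketched as a stack of strided `nn.Conv3d` layers. Channel widths, activation choice, and the latent channel count (16, typical for the Wan VAE) are illustrative assumptions; only the stride schedule (one 2× temporal stage, three 2× spatial stages) follows the description above:

```python
# Sketch of the HR branch: strided 3D convolutions giving a 128x volume
# reduction (2x temporal, 8x per spatial axis). Widths are assumptions.
import torch
import torch.nn as nn

class HRBranch(nn.Module):
    def __init__(self, c_in=16, d=1536):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(c_in, 64, kernel_size=3, stride=(2, 1, 1), padding=1),  # 2x temporal
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(1, 2, 2), padding=1),   # 2x spatial
            nn.SiLU(),
            nn.Conv3d(128, 256, kernel_size=3, stride=(1, 2, 2), padding=1),  # 2x spatial
            nn.SiLU(),
            nn.Conv3d(256, d, kernel_size=3, stride=(1, 2, 2), padding=1),    # 2x spatial, project to d
        )

    def forward(self, z):  # z: [B, C, T, H, W] VAE latent
        return self.net(z)

x = torch.randn(1, 16, 4, 64, 64)
y = HRBranch()(x)
vol_in = x.shape[2] * x.shape[3] * x.shape[4]
vol_out = y.shape[2] * y.shape[4] * y.shape[4]
print(vol_in // vol_out)  # 128
```

Each strided stage halves the indicated axes, so the spatiotemporal volume shrinks by exactly 2 × 8 × 8 = 128 regardless of the input resolution (up to padding effects at odd sizes).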
3.4 Dual-Resolution Shifting and Incremental RoPE Adjustment
A Dual-Resolution Shifting pipeline preserves long-term memory: during chunk generation, a full-resolution KV cache and a reduced-resolution backup are computed concurrently. Old full-resolution tokens are replaced by new ones, while pre-computed compressed tokens slide into the mid partition.
The Position Misalignment Problem: When $e$ blocks ($e \cdot f$ frames) are evicted to maintain the capacity budget, a position gap opens between the sink keys and the earliest surviving mid key, breaking positional continuity.
Incremental RoPE Adjustment: Exploits the multiplicative property of Rotary Position Embeddings (RoPE) and the fact that eviction shifts only the temporal axis. A highly efficient, temporal-only RoPE adjustment is applied to the sink keys:

$$\hat{K}^{(l)}_{\mathrm{sink}} = R(\Delta t, 0, 0)\, K^{(l)}_{\mathrm{sink}},$$

where the spatial components of $R(\Delta t, 0, 0)$ are identity rotations, leaving spatial positions unchanged. This operation costs <0.1% of total inference time.
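The multiplicative property being exploited is that composing two RoPE rotations adds their position angles, so a cached key embedded at position $p$ can be re-indexed to $p - \Delta$ by one extra rotation. A minimal 1D sketch (real video RoPE is 3D over $t,h,w$; here only temporal-style channels are shown, and the frequency schedule is the standard assumed one):

```python
# Sketch: incremental RoPE shift via rotation composition. Rotating cached,
# already-embedded keys by the angle of -delta equals re-embedding the raw
# keys at position p - delta.
import torch

def rope_angles(positions, dim):
    # Standard RoPE frequency schedule: dim/2 rotation pairs.
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * freqs[None, :]        # [n, dim/2]

def rotate(x, ang):
    # Apply 2D rotations to consecutive channel pairs of x ([n, dim]).
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

def shift_temporal(keys, delta, dim):
    # One cheap extra rotation re-indexes every cached key by -delta.
    ang = rope_angles(torch.full((keys.shape[0],), -float(delta)), dim)
    return rotate(keys, ang)

# Consistency check: shift-after-embedding == embedding at shifted positions.
raw = torch.randn(4, 32)
p = torch.tensor([10, 11, 12, 13])
embedded = rotate(raw, rope_angles(p, 32))
assert torch.allclose(shift_temporal(embedded, 3, 32),
                      rotate(raw, rope_angles(p - 3, 32)), atol=1e-5)
```

Since the adjustment is a single element-wise rotation over the (tiny) sink partition, its negligible cost relative to full attention is unsurprising.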
3.5 Dynamic Context Selection
To prioritize visually critical keyframes, a dynamic context selection mechanism based on query-key affinity is introduced. Unlike destructive pruning, it employs a non-destructive soft-selection, retrieving only the top- most relevant mid-blocks for attention, while keeping unselected tokens archived for potential future reactivation. To ensure negligible overhead (<1%), affinity scoring occurs only at the first denoising step of each block, with optimizations like query token subsampling and using half the attention heads.
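The scoring step can be sketched as follows. The exact affinity statistic, subsampling size, and head fraction are illustrative assumptions; only the overall recipe (score mid blocks against current queries, using subsampled queries and a subset of heads) follows the description above:

```python
# Sketch of affinity-based mid-block scoring with query subsampling and a
# reduced head count. Shapes and the scoring statistic are assumptions.
import torch

def score_mid_blocks(queries, mid_keys, q_subsample=64, heads_frac=0.5):
    """
    queries:  [H, Nq, d] current-block queries (first denoising step only).
    mid_keys: list of [H, Nb, d] compressed keys, one entry per mid block.
    returns:  [n_blocks] affinity score per block.
    """
    H = queries.shape[0]
    q = queries[: max(1, int(H * heads_frac))]              # use a subset of heads
    if q.shape[1] > q_subsample:                            # subsample query tokens
        idx = torch.randperm(q.shape[1])[:q_subsample]
        q = q[:, idx]
    scores = []
    for k in mid_keys:
        k = k[: q.shape[0]]
        a = torch.einsum("hqd,hkd->hqk", q, k) / q.shape[-1] ** 0.5
        scores.append(a.amax(dim=-1).mean())                # peak affinity per query
    return torch.stack(scores)
```

Because scoring runs once per block (not per denoising step) over a fraction of queries and heads, its cost stays well under the stated 1% overhead budget.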
3.6 Empirical Analysis of Temporal Attention Patterns
Empirical investigation of attention distribution during generation (Fig. 3) reveals two critical insights justifying the three-partition design:
- Attention demand persists across the entire video history, invalidating naive FIFO eviction strategies (Fig. 3c shows near-flat late-stage importance with mean=0.499).
- Highly attended tokens are sparsely and dynamically distributed, exhibiting a high Jaccard distance (0.75) between consecutive selection steps (Fig. 3d).
These observations motivate aggressively compressing the sporadically queried yet globally essential mid-range tokens to preserve comprehensive context within bounded memory.
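The Jaccard distance used in the second observation is the standard set statistic; a distance of 0.75 between consecutive top-$k$ selections means only a small fraction of selected blocks carries over from one step to the next. A minimal sketch (the example sets are illustrative):

```python
# Jaccard distance between two top-k selections: 1 - |A ∩ B| / |A ∪ B|.
# High values mean the set of highly attended blocks changes rapidly.

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

# Two consecutive selections sharing half their blocks:
print(round(jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6}), 3))  # 0.667
```

At the reported 0.75, consecutive selections of equal size share roughly one block in seven, which is what rules out any static keep/evict split of the history.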
Empirical Validation / Results
4.1 Experimental Settings
- Backbone: Wan2.1-T2V-1.3B, generating 832×480 videos at 16 FPS.
- Training: 3,000 iterations on a 20-latent-frame temporal window (~5 seconds).
- Cache Partitions: a 2-block sink, with fixed mid and recent budgets as defined in Sec. 3.2.
- Generation: Chunk-wise, with a fixed number of latent frames per block and a small number of distilled denoising steps.
- Evaluation: VBench-Long protocol with 128 prompts from MovieGen, evaluated at 60s and 120s durations using 7 VBench metrics and temporal CLIP scores.
4.2 Main Results
Quantitative Comparison (VBench): Table 1 shows PackForcing excels in motion synthesis, achieving the highest Dynamic Degree at both 60s (56.25) and 120s (54.12). It also demonstrates superior stability over extended horizons, with marginal performance declines compared to significant drops in baselines like Self-Forcing.
Table 1: Quantitative comparison on 60 s and 120 s benchmarks (7 VBench metrics). Best results in bold.
| Method | Dyn. Deg. ↑ | Mot. Smth. ↑ | Over. Cons. ↑ | Img. Qual. ↑ | Aest. Qual. ↑ | Subj. Cons. ↑ | Back. Cons. ↑ |
|---|---|---|---|---|---|---|---|
| *60 s generation* | | | | | | | |
| CausVid | 48.43 | 98.04 | 23.36 | 65.69 | 60.63 | 84.53 | 89.84 |
| LongLive | 44.53 | **98.70** | 25.73 | 69.06 | **63.30** | 92.00 | 92.97 |
| Self-Forcing | 35.93 | 98.26 | 24.92 | 66.62 | 57.15 | 80.41 | 86.95 |
| Rolling Forcing | 33.59 | **98.70** | 25.73 | **71.06** | 61.43 | 91.62 | 93.00 |
| Deep Forcing | 53.67 | 98.56 | 21.75 | 67.75 | 58.88 | **92.55** | **93.80** |
| PackForcing (ours) | **56.25** | 98.29 | **26.07** | 69.36 | 62.56 | 90.49 | 93.46 |
| *120 s generation* | | | | | | | |
| CausVid | 50.00 | 98.11 | 23.13 | 65.41 | 60.11 | 83.24 | 87.83 |
| LongLive | 44.53 | **98.72** | 25.95 | 69.59 | **63.00** | 91.54 | **93.73** |
| Self-Forcing | 30.46 | 98.12 | 23.42 | 62.49 | 51.68 | 74.40 | 83.57 |
| Rolling Forcing | 35.15 | 98.65 | 25.45 | **70.58** | 60.62 | 90.14 | 92.40 |
| Deep Forcing | 52.84 | 98.22 | 21.38 | 68.21 | 57.96 | 91.95 | 92.55 |
| PackForcing (ours) | **54.12** | 98.35 | **26.05** | 69.67 | 61.98 | **92.84** | 91.88 |
Long-Range Consistency (CLIP Scores): Table 2 shows PackForcing maintains the highest and most stable text-video alignment throughout 60-second generation, with a marginal decline of only 1.14 points (34.04 to 32.90), compared to severe drops in baselines (e.g., Self-Forcing: 6.77-point drop).
Table 2: CLIP Score comparison on long video generation (60s).
| Method | 0–10 s | 10–20 s | 20–30 s | 30–40 s | 40–50 s | 50–60 s | Overall |
|---|---|---|---|---|---|---|---|
| CausVid | 32.65 | 31.78 | 31.47 | 31.13 | 30.81 | 30.79 | 31.44 |
| LongLive | 33.95 | 33.38 | 33.14 | 33.51 | 33.45 | 33.36 | 33.46 |
| Self-Forcing | 33.89 | 33.23 | 31.66 | 29.99 | 28.37 | 27.12 | 30.71 |
| Rolling Forcing | 33.85 | 33.39 | 32.94 | 32.78 | 32.49 | 32.25 | 32.95 |
| Deep Forcing | 33.47 | 33.29 | 32.38 | 32.28 | 32.26 | 32.27 | 32.33 |
| PackForcing (ours) | 34.04 | 33.99 | 33.70 | 33.37 | 33.24 | 32.90 | 33.54 |
Qualitative Comparison: Figure 4 presents sampled frames from a 120-second generation. PackForcing consistently maintains strict subject identity and high visual fidelity throughout the sequence, while baselines suffer from progressive degradation (color shifts, loss of details, subject inconsistency, or restricted motion).
4.3 Ablation Studies
Systematic ablations on the 60s benchmark evaluate each critical component (qualitative results in Fig. 5).
- Effect of Sink Tokens: Table 3 shows that removing attention sinks causes severe semantic drift (Overall CLIP drops from 35.09 to 31.24, Subject Consistency from 93.11 to 74.72). An excessively large sink maximizes consistency but stifles motion (Dynamic Degree plummets to 35.16). A 2-block sink achieves the optimal balance.
Table 3: Quantitative ablation results of sink size.
| Sink Size | Overall CLIP ↑ | Dyn. Deg. ↑ | Mot. Smth. ↑ | Over. Cons. ↑ | Img. Qual. ↑ | Aest. Qual. ↑ | Subj. Cons. ↑ | Back. Cons. ↑ |
|---|---|---|---|---|---|---|---|---|