PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Summary (Overview)

  • Novel KV-Cache Management: Introduces a Three-Partition KV Cache strategy that categorizes generation history into Sink tokens (full resolution, global anchors), Mid tokens (highly compressed, dynamically selected), and Recent tokens (full resolution, local coherence), bounding memory usage to ~4 GB for arbitrarily long videos.
  • Efficient Context Compression: Proposes a Dual-Branch Compression module fusing progressive 3D convolutions (HR branch) with low-resolution VAE re-encoding (LR branch), achieving a 128× spatiotemporal volume compression (~32× token reduction) for intermediate history.
  • Robust Long-Video Generation: Enables 24× temporal extrapolation, generating coherent 2-minute (120 s) videos at 832×480 resolution and 16 FPS from training on merely 5-second clips, while achieving state-of-the-art VBench scores (e.g., Overall Consistency: 26.07, Dynamic Degree: 56.25).
  • Seamless Position Correction: Introduces Incremental RoPE Adjustment to correct positional discontinuities caused by dynamic token eviction/selection, ensuring stable generation with negligible overhead (<0.1% inference time).
  • Dynamic Context Prioritization: Employs Dynamic Context Selection based on query-key affinity to route only the most informative compressed mid tokens into the active attention context, improving subject consistency and CLIP scores.

Introduction and Theoretical Foundation

Autoregressive video diffusion models have advanced short-clip generation but face critical bottlenecks for long-video synthesis:

  1. Error Accumulation: Small prediction errors compound iteratively during autoregressive denoising, leading to progressive quality degradation and semantic drift.
  2. Unbounded Memory Growth: The Key-Value (KV) cache scales linearly with video length. For a 2-minute, 832×480 video at 16 FPS, the full attention context grows to ~749K tokens, requiring ~138 GB of KV storage, exceeding single-GPU memory.

Existing methods such as Self-Forcing or Deep Forcing either suffer from severe error accumulation beyond their training horizon or rely on aggressive history truncation, leading to irreversible loss of critical intermediate memory. This creates a fundamental dilemma: mitigating error accumulation requires extensive context, yet hardware constraints force discarding memory.
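The token and memory figures above can be sanity-checked with a short back-of-envelope script. The architecture constants below (4× VAE temporal and 8× spatial compression plus 2× patchification, 30 transformer layers, hidden size 1536, fp16 KV) are assumptions consistent with a Wan2.1-1.3B-like backbone, not values stated in this paragraph:

```python
# Back-of-envelope check of the quoted full-attention KV-cache footprint.
# Assumed constants (Wan2.1-T2V-1.3B-like; not stated in the text above):
LAYERS, HIDDEN, BYTES = 30, 1536, 2     # depth, width, fp16
VAE_T, DOWN = 4, 16                     # 4x temporal; 8x VAE spatial * 2x patchify

frames = 120 * 16                       # 2 minutes at 16 FPS
latent_frames = frames // VAE_T         # 480 latent frames
tokens_per_frame = (480 // DOWN) * (832 // DOWN)  # 30 * 52 = 1560 tokens
total_tokens = latent_frames * tokens_per_frame   # ~749K tokens

kv_bytes = total_tokens * 2 * LAYERS * HIDDEN * BYTES  # K and V, every layer
print(total_tokens, kv_bytes / 1e9)     # ~748,800 tokens, ~138 GB
```

Both quoted figures (~749K tokens, ~138 GB) fall out directly under these assumptions.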

PackForcing addresses this by building upon insights from Deep Forcing (attention sinks, participative compression) but proposes to compress unselected intermediate tokens rather than irreversibly dropping them. The core theoretical foundation is a principled three-partition KV cache design that applies tailored policies based on temporal role and information density, enabling bounded memory while preserving comprehensive historical context.

Methodology

3.1 Preliminaries

The base model builds upon the Flow Matching framework. Given a clean video latent $x_0$ and standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$, the noisy latent at noise level $\sigma \in [0, 1]$ is:

$$x_\sigma = (1 - \sigma)\,x_0 + \sigma\,\epsilon.$$

A neural network $f_\theta$ is trained to predict the velocity field $v_\theta(x_\sigma, \sigma) \approx \epsilon - x_0$.

For KV Caching, a video sequence is partitioned into non-overlapping blocks, each containing $B_f$ frames. During generation of block $i$, each transformer layer $l$ attends to the KV pairs cached from all previous blocks:

$$\mathcal{C}^l = \{(K_j^l, V_j^l)\}_{j=1}^{i-1},$$

where $K_j^l, V_j^l \in \mathbb{R}^{n \times N_h \times d_h}$. The attention operation is:

$$\mathrm{Attn}(Q_i, \mathcal{C}^l) = \mathrm{softmax}\!\left(\frac{Q_i K_{1:i}^\top}{\sqrt{d_h}}\right) V_{1:i}.$$

This leads to linear KV cache growth, which is the fundamental scaling bottleneck.
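The preliminaries above can be sketched in a few lines of plain PyTorch. This is an illustrative toy (single head, 2-D tensors), not the paper's implementation; it shows the flow-matching corruption and how a current block's queries attend over an ever-growing KV cache:

```python
import torch
import torch.nn.functional as F

def noisy_latent(x0, sigma):
    """Flow-matching corruption: x_sigma = (1 - sigma) * x0 + sigma * eps."""
    eps = torch.randn_like(x0)
    return (1 - sigma) * x0 + sigma * eps, eps

def attend_with_cache(q, cached_kv, k_cur, v_cur):
    """Current block's queries attend over all cached blocks plus itself.

    q, k_cur, v_cur: (n, d) tensors for the current block.
    cached_kv: list of (K_j, V_j) pairs from previous blocks."""
    k = torch.cat([kj for kj, _ in cached_kv] + [k_cur], dim=0)  # token axis
    v = torch.cat([vj for _, vj in cached_kv] + [v_cur], dim=0)  # grows with history
    attn = F.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1)
    return attn @ v
```

Because `k` and `v` concatenate every previous block, their length, and hence the attention cost, grows linearly with the number of generated blocks, which is exactly the bottleneck the three-partition cache removes.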

3.2 Three-Partition KV Cache

The core idea is to decouple generation history into three functional partitions with tailored policies (Fig. 2a).

  • Sink Tokens (Full resolution, never evicted): The earliest generated frames ($N_{\text{sink}}$ of them) serve as critical semantic anchors. For layer $l$:

    $$\mathcal{C}^l_{\text{sink}} = \{(K_j^l, V_j^l)\}_{j=1}^{N_{\text{sink}}/B_f}, \quad |\mathcal{C}^l_{\text{sink}}| = \frac{N_{\text{sink}}}{B_f}\, n.$$

    These tokens lock in scene layout, subject identity, and global style. With $N_{\text{sink}} = 8$ (two blocks), they consume <2% of the total token budget but provide a stable global reference.

  • Compressed Mid Tokens (~32× token reduction & dynamically routed): The vast majority of video history between the sink and recent window is represented by highly compressed KV pairs $(\tilde{K}_j^l, \tilde{V}_j^l)$ produced by the dual-branch module. Dynamic Context Selection routes only the $N_{\text{mid}}$ most informative blocks into the active set $\mathcal{S}_{\text{mid}}$:

    $$\mathcal{C}^l_{\text{mid}} = \{(\tilde{K}_j^l, \tilde{V}_j^l)\}_{j \in \mathcal{S}_{\text{mid}}}, \quad |\mathcal{C}^l_{\text{mid}}| \le N_{\text{mid}} \cdot N_c.$$

    Here, $N_c$ is the token count per compressed block: $N_c = \lfloor B_f/2 \rfloor \times \lfloor h/4 \rfloor \times \lfloor w/4 \rfloor$. With default settings ($B_f = 4$, $h = 30$, $w = 52$), $N_c = 182$ tokens, a ~32× reduction from the original $n = 6{,}240$ tokens per block.

  • Recent & Current Tokens (Dual-resolution shifting): The most recently generated frames ($N_{\text{recent}}$ of them) and the current block $i$ are kept pristine for high-fidelity local dynamics:

    $$\mathcal{C}^l_{\text{rc}} = \{(K_j^l, V_j^l)\}_{j=i - N_{\text{recent}}/B_f}^{i}, \quad |\mathcal{C}^l_{\text{rc}}| = \left(\frac{N_{\text{recent}}}{B_f} + 1\right) n.$$

    A low-resolution backup is computed concurrently for a seamless transition into the mid buffer.

Bounded Attention Context: The final context for layer $l$ is the concatenation:

$$\mathcal{C}^l = \mathcal{C}^l_{\text{sink}} \,\Vert\, \mathcal{C}^l_{\text{mid}} \,\Vert\, \mathcal{C}^l_{\text{rc}}.$$

This enforces a constant token count for attention, independent of total video length $T$, ensuring $O(1)$ per-block attention complexity.

3.3 Dual-Branch HR Compression

The mid partition requires massive token reduction while retaining structural and semantic information. The Dual-Branch Compression module (Fig. 2b, 2d) aggregates fine-grained structure (HR branch) and coarse semantics (LR branch).

  • HR Branch: Progressive 3D Convolution: Operates directly on the VAE latent $z \in \mathbb{R}^{B \times C \times T \times H \times W}$. A cascade of strided 3D convolutions first performs a 2× temporal compression, followed by three stages of 2× spatial compression and a final $1 \times 1 \times 1$ projection to hidden dimension $d = 1536$. This yields a structurally rich representation $h_{\text{HR}}$ with a 128× volume reduction ($2 \times 8 \times 8$).
  • LR Branch: Pixel-Space Re-encoding: Decodes the latent $z$ back to pixel frames, applies 3D average pooling (2× temporally, 4× spatially), and re-encodes the pooled frames with the frozen VAE encoder followed by patch embedding to obtain $h_{\text{LR}}$. This preserves perceptual layout better than direct latent pooling.

Feature Fusion: The two outputs are fused via element-wise addition:

$$\tilde{h} = h_{\text{HR}} + h_{\text{LR}} \in \mathbb{R}^{B \times N_c \times d}.$$
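The HR branch's stride pattern can be sketched as below. Channel widths, kernel sizes, and the activation are assumptions; the text fixes only the 2× temporal / 8× spatial (128×) volume reduction and the $d = 1536$ projection:

```python
import torch
import torch.nn as nn

class HRBranch(nn.Module):
    """Progressive 3D-conv compressor sketch: one 2x temporal stage, three 2x
    spatial stages, then a 1x1x1 projection to the model width d.

    Layer hyperparameters are illustrative, not the paper's implementation."""
    def __init__(self, c_in=16, c_mid=128, d=1536):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(c_in, c_mid, kernel_size=(2, 1, 1), stride=(2, 1, 1)),  # 2x temporal
            nn.SiLU(),
            nn.Conv3d(c_mid, c_mid, kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # 2x spatial
            nn.SiLU(),
            nn.Conv3d(c_mid, c_mid, kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # 2x spatial
            nn.SiLU(),
            nn.Conv3d(c_mid, c_mid, kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # 2x spatial
            nn.SiLU(),
            nn.Conv3d(c_mid, d, kernel_size=1),                                # 1x1x1 projection
        )

    def forward(self, z):                    # z: (B, C, T, H, W) VAE latent
        h = self.net(z)                      # (B, d, T/2, H/8, W/8)
        return h.flatten(2).transpose(1, 2)  # (B, Nc, d) token sequence
```

Assuming a 60×104 latent grid (which patch-embeds to the 30×52 token grid quoted earlier), a block latent of shape (1, 16, 4, 60, 104) compresses to 2 × 7 × 13 = 182 tokens, matching $N_c$.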

3.4 Dual-Resolution Shifting and Incremental RoPE Adjustment

A Dual-Resolution Shifting pipeline preserves long-term memory: during chunk generation, a full-resolution KV cache and a reduced-resolution backup are computed concurrently. Old full-resolution tokens are replaced by new ones, while pre-computed compressed tokens slide into the mid partition.

The Position Misalignment Problem: When evicting $\Delta$ blocks ($\delta = \Delta B_f$ frames) to maintain the capacity budget $N_{\text{mid}}$, a position gap opens between the sink keys (positions $0, \ldots, N_{\text{sink}} - 1$) and the earliest surviving mid key (position $N_{\text{sink}} + \delta$), breaking positional continuity.

Incremental RoPE Adjustment: Exploits the multiplicative property of Rotary Position Embeddings (RoPE) and the fact that eviction shifts positions only along the temporal axis. A highly efficient, temporal-only RoPE adjustment is applied to the sink keys:

$$k'_{\text{sink}} = k_{\text{sink}} \odot e^{i\left(\theta_t(\delta),\, \mathbf{1}_h,\, \mathbf{1}_w\right)},$$

where $\mathbf{1}_h, \mathbf{1}_w$ denote identity rotations that leave the spatial positions unchanged. This operation costs <0.1% of total inference time.
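The temporal-only rotation can be sketched as follows. An interleaved even/odd rotation-pair layout for the temporal slice of the key head dimension is assumed; the multiplicative property means rotating by $\delta$ composes additively with the rotation already baked into the keys:

```python
import torch

def shift_temporal_rope(k_sink, delta, theta_t):
    """Rotate sink keys by an extra temporal offset `delta` (in evicted frames),
    leaving spatial RoPE components untouched.

    k_sink:  (..., 2 * n_pairs) temporal slice of the key head dimension,
             stored as interleaved (even, odd) rotation pairs.
    theta_t: (n_pairs,) temporal RoPE base frequencies."""
    ang = delta * theta_t                        # rotation angle per pair
    cos, sin = torch.cos(ang), torch.sin(ang)
    k_even, k_odd = k_sink[..., 0::2], k_sink[..., 1::2]
    out = torch.empty_like(k_sink)
    out[..., 0::2] = k_even * cos - k_odd * sin  # complex multiply by e^{i*ang}
    out[..., 1::2] = k_even * sin + k_odd * cos
    return out
```

Because rotations compose, shifting by $\delta_1$ and then $\delta_2$ equals a single shift by $\delta_1 + \delta_2$, so the adjustment can be applied incrementally after every eviction at negligible cost.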

3.5 Dynamic Context Selection

To prioritize visually critical keyframes, a dynamic context selection mechanism based on query-key affinity is introduced. Unlike destructive pruning, it employs non-destructive soft selection, retrieving only the top-$K$ most relevant mid blocks for attention while keeping unselected tokens archived for potential future reactivation. To keep overhead negligible (<1%), affinity scoring occurs only at the first denoising step of each block, with optimizations such as query-token subsampling and using half the attention heads.
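A minimal sketch of such affinity-based routing, including the two stated overhead tricks (query subsampling, half the heads); the exact scoring function is an assumption, here the per-query peak affinity averaged over queries and heads:

```python
import torch

def select_mid_blocks(q, mid_keys, top_k, q_subsample=64, half_heads=True):
    """Score compressed mid blocks by query-key affinity and return the
    indices of the top-k blocks to route into the attention context.

    q:        (H, Nq, d) current-block queries
    mid_keys: list of (H, Nc, d) compressed key tensors, one per mid block.
    Scoring uses only half the heads and a random query subsample, mirroring
    the paper's stated overhead optimizations; the score itself is assumed."""
    if half_heads:
        q = q[: q.shape[0] // 2]
    if q.shape[1] > q_subsample:                  # subsample query tokens
        idx = torch.randperm(q.shape[1])[:q_subsample]
        q = q[:, idx]
    scores = []
    for k in mid_keys:
        if half_heads:
            k = k[: k.shape[0] // 2]
        # Peak affinity per query, averaged over queries and heads.
        scores.append((q @ k.transpose(-2, -1)).amax(dim=-1).mean())
    scores = torch.stack(scores)
    return torch.topk(scores, min(top_k, len(mid_keys))).indices
```

Unselected blocks are merely left out of this step's context; their compressed KV pairs remain archived and can win the top-$K$ race again at a later block.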

3.6 Empirical Analysis of Temporal Attention Patterns

Empirical investigation of attention distribution during generation (Fig. 3) reveals two critical insights justifying the three-partition design:

  1. Attention demand persists across the entire video history, invalidating naive FIFO eviction strategies (Fig. 3c shows near-flat late-stage importance with mean=0.499).
  2. Highly attended tokens are sparsely and dynamically distributed, exhibiting a high Jaccard distance (0.75) between consecutive selection steps (Fig. 3d).

These observations motivate aggressively compressing the sporadically queried yet globally essential mid-range tokens to preserve comprehensive context within bounded memory.
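For reference, the churn statistic in point 2 is simply the Jaccard distance between consecutive sets of selected block indices; a minimal helper:

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical selections, 1 means disjoint."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

# E.g., selecting blocks {1,2,3,4} then {3,4,5,6} gives distance 1 - 2/6 ≈ 0.67.
print(jaccard_distance({1, 2, 3, 4}, {3, 4, 5, 6}))
```

A measured value of 0.75 therefore means only about a quarter of the union of selected blocks survives from one selection step to the next.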

Empirical Validation / Results

4.1 Experimental Settings

  • Backbone: Wan2.1-T2V-1.3B, generating 832×480 videos at 16 FPS.
  • Training: 3,000 iterations on a 20-latent-frame temporal window (~5 seconds).
  • Cache Partitions: $N_{\text{sink}} = 8$, $N_{\text{recent}} = 4$, $N_{\text{top}} = 16$.
  • Generation: Chunk-wise with $B_f = 4$ latent frames per block and $S = 4$ distilled denoising steps.
  • Evaluation: VBench-Long protocol with 128 prompts from MovieGen, evaluated at 60 s and 120 s durations using 7 VBench metrics and temporal CLIP scores.

4.2 Main Results

Quantitative Comparison (VBench): Table 1 shows PackForcing excels in motion synthesis, achieving the highest Dynamic Degree at both 60s (56.25) and 120s (54.12). It also demonstrates superior stability over extended horizons, with marginal performance declines compared to significant drops in baselines like Self-Forcing.

Table 1: Quantitative comparison on 60 s and 120 s benchmarks (7 VBench metrics). Best results in bold.

60 s generation

| Method | Dyn. Deg. ↑ | Mot. Smth. ↑ | Over. Cons. ↑ | Img. Qual. ↑ | Aest. Qual. ↑ | Subj. Cons. ↑ | Back. Cons. ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CausVid | 48.43 | 98.04 | 23.36 | 65.69 | 60.63 | 84.53 | 89.84 |
| LongLive | 44.53 | **98.70** | 25.73 | 69.06 | **63.30** | 92.00 | 92.97 |
| Self-Forcing | 35.93 | 98.26 | 24.92 | 66.62 | 57.15 | 80.41 | 86.95 |
| Rolling Forcing | 33.59 | **98.70** | 25.73 | **71.06** | 61.43 | 91.62 | 93.00 |
| Deep Forcing | 53.67 | 98.56 | 21.75 | 67.75 | 58.88 | **92.55** | **93.80** |
| PackForcing (ours) | **56.25** | 98.29 | **26.07** | 69.36 | 62.56 | 90.49 | 93.46 |

120 s generation

| Method | Dyn. Deg. ↑ | Mot. Smth. ↑ | Over. Cons. ↑ | Img. Qual. ↑ | Aest. Qual. ↑ | Subj. Cons. ↑ | Back. Cons. ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CausVid | 50.00 | 98.11 | 23.13 | 65.41 | 60.11 | 83.24 | 87.83 |
| LongLive | 44.53 | **98.72** | 25.95 | 69.59 | **63.00** | 91.54 | **93.73** |
| Self-Forcing | 30.46 | 98.12 | 23.42 | 62.49 | 51.68 | 74.40 | 83.57 |
| Rolling Forcing | 35.15 | 98.65 | 25.45 | **70.58** | 60.62 | 90.14 | 92.40 |
| Deep Forcing | 52.84 | 98.22 | 21.38 | 68.21 | 57.96 | 91.95 | 92.55 |
| PackForcing (ours) | **54.12** | 98.35 | **26.05** | 69.67 | 61.98 | **92.84** | 91.88 |

Long-Range Consistency (CLIP Scores): Table 2 shows PackForcing attains the highest overall text-video alignment across 60-second generation, with a marginal decline of only 1.14 points (34.04 → 32.90), compared to severe drops in baselines (e.g., Self-Forcing: 6.77 points).

Table 2: CLIP Score comparison on long video generation (60s).

| Method | 0–10 s | 10–20 s | 20–30 s | 30–40 s | 40–50 s | 50–60 s | Overall |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CausVid | 32.65 | 31.78 | 31.47 | 31.13 | 30.81 | 30.79 | 31.44 |
| LongLive | 33.95 | 33.38 | 33.14 | 33.51 | 33.45 | 33.36 | 33.46 |
| Self-Forcing | 33.89 | 33.23 | 31.66 | 29.99 | 28.37 | 27.12 | 30.71 |
| Rolling Forcing | 33.85 | 33.39 | 32.94 | 32.78 | 32.49 | 32.25 | 32.95 |
| Deep Forcing | 33.47 | 33.29 | 32.38 | 32.28 | 32.26 | 32.27 | 32.33 |
| PackForcing (ours) | 34.04 | 33.99 | 33.70 | 33.37 | 33.24 | 32.90 | 33.54 |

Qualitative Comparison: Figure 4 presents sampled frames from a 120-second generation. PackForcing consistently maintains strict subject identity and high visual fidelity throughout the sequence, while baselines suffer from progressive degradation (color shifts, loss of details, subject inconsistency, or restricted motion).

4.3 Ablation Studies

Systematic ablations on the 60s benchmark evaluate each critical component (qualitative results in Fig. 5).

  • Effect of Sink Tokens: Table 3 shows that removing attention sinks ($N_{\text{sink}} = 0$) causes severe semantic drift (CLIP drops from 35.09 to 31.24, Subject Consistency from 93.11 to 74.72). An excessively large sink ($N_{\text{sink}} = 16$) maximizes consistency but stifles motion (Dynamic Degree plummets to 35.16). $N_{\text{sink}} = 8$ achieves the optimal balance.

Table 3: Quantitative ablation results of sink size.

| Sink Size | Overall CLIP ↑ | Dyn. Deg. ↑ | Mot. Smth. ↑ | Over. Cons. ↑ | Img. Qual. ↑ | Aest. Qual. ↑ | Subj. Cons. ↑ | Back. Cons. ↑ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |