Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

Summary (Overview)

  • Core Problem: Train-free, frame-level autoregressive methods (e.g., FIFO-Diffusion) enable infinite video generation with constant memory, but suffer from a training-inference gap (model handles multiple noise levels during inference vs. single level during training) and insufficient long-term consistency modeling.
  • Proposed Solution: MIGA – A novel train-free method featuring:
    • Two-Stage Training-Inference Alignment (TTA): Mitigates the noise span gap via Stage 1 Zigzag Iterative Denoising (slower noise level changes) and Stage 2 Unified Noise Level Denoising (aligns with training conditions).
    • Dual Consistency Enhancement (DCE): Improves long-term consistency via Self-Reflection (evaluates/corrects early high-noise frames using latent similarity) and Long-Range Frame Guidance (incorporates clean, distant frames to guide local denoising).
  • Key Results: Achieves state-of-the-art performance on the VBench and NarrLV benchmarks. For instance, with VideoCrafter2 on VBench, MIGA improves over FIFO-Diffusion by 4.74 points in Subject Consistency (92.92 to 97.66) and 1.98 points in Background Consistency (95.01 to 96.99).

Introduction and Theoretical Foundation

Recent foundation video diffusion models excel at short clips but are limited by fixed frame lengths. Training new models for long videos is computationally expensive. Train-free methods aim to extend these foundation models without retraining. Frame-level autoregressive frameworks like FIFO-Diffusion are promising as they generate videos iteratively frame-by-frame with constant memory, enabling infinite-length generation.

However, two key limitations hinder these frameworks:

  1. Training-Inference Gap: During training, the model denoises latents at a unified noise level. During autoregressive inference, it must handle a queue of latents with progressively increasing noise levels, creating a mismatch that leads to content drift and artifacts.
  2. Long-Term Consistency: Existing methods lack explicit modeling of dependencies between distant frames, leading to suboptimal temporal consistency over long sequences.

MIGA is proposed to inherit the benefits of autoregressive frameworks while addressing these limitations through novel alignment and consistency mechanisms.

Methodology

MIGA builds upon the frame-level autoregressive generation paradigm. Let the latent feature for frame $i$ at time step $\tau_t$ be $z^i_{\tau_t} \in \mathbb{R}^{l \times d}$. A foundation model with a noise prediction network $\epsilon_\theta(\cdot)$ and sampler $\phi(\cdot)$ is used.

Preliminaries: Train-Free Frame-Level Autoregressive Generation

Methods like FIFO-Diffusion maintain a latent queue $Q = \{z^1_{\tau_1}, \ldots, z^T_{\tau_T}\}$ of length $L = T$ (total denoising steps) with increasing noise levels. One inference step over the queue is:

$$\{z^1_{\tau_0}, \ldots, z^T_{\tau_{T-1}}\} = \Phi(\{z^1_{\tau_1}, \ldots, z^T_{\tau_T}\}, \{\tau_1, \ldots, \tau_T\}; \epsilon_\theta)$$

The clean latent $z^1_{\tau_0}$ is dequeued, and a new Gaussian noise latent $z^T_{\tau_T}$ is enqueued, enabling continuous generation. Since $T > f_0$ (the model's frame capacity), a sliding window approach is used for the sampler $\Phi(\cdot)$.
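The queue mechanics described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: latents are replaced by `(frame_id, noise_level)` pairs, the sampler is reduced to "lower every noise level by one", and the names `fifo_step` and `generate` are hypothetical rather than taken from FIFO-Diffusion or MIGA.

```python
from collections import deque


def fifo_step(queue, total_steps, next_frame_id):
    """One queue-level step on (frame_id, noise_level) pairs; level 0 means clean."""
    # Stand-in for the sampler: every latent moves one noise level down.
    stepped = deque((fid, lvl - 1) for fid, lvl in queue)
    # The head has reached level 0 and leaves the queue as a finished frame.
    clean_frame, lvl = stepped.popleft()
    assert lvl == 0, "head of the queue must be fully denoised"
    # A fresh Gaussian latent enters at the tail at the highest noise level.
    stepped.append((next_frame_id, total_steps))
    return clean_frame, stepped


def generate(num_frames, total_steps):
    """Emit num_frames frame ids while keeping the queue length fixed."""
    # Initial queue: the latent at position i sits at noise level i (i = 1..T).
    queue = deque((i, i) for i in range(1, total_steps + 1))
    finished = []
    next_id = total_steps + 1
    while len(finished) < num_frames:
        frame, queue = fifo_step(queue, total_steps, next_id)
        finished.append(frame)
        next_id += 1
    return finished, queue


frames, queue = generate(num_frames=5, total_steps=4)
print(frames)       # frame ids in the order they were completed
print(list(queue))  # remaining (frame_id, noise_level) pairs
```

The queue never grows beyond `total_steps` entries, which is the constant-memory property the text refers to, while every step exposes the model to all noise levels at once.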

Two-Stage Training-Inference Alignment (TTA)

To reduce the excessive noise span presented to the model during inference, MIGA decomposes generation into two stages.

Stage 1: Zigzag Iterative Denoising. The queue is initialized and maintained with a "zigzag" structure where the noise level changes only every $L_{zig}$ latents, slowing the rate of change.

$$Q_{s1} = \{ \underbrace{z^1_{\tau_e}, \cdots, z^{L_{zig}}_{\tau_e}}_{L_{zig}}, \underbrace{z^{L_{zig}+1}_{\tau_{e+1}}, \cdots, z^{2L_{zig}}_{\tau_{e+1}}}_{L_{zig}}, \cdots, \underbrace{z^{L-L_{zig}+1}_{\tau_T}, \cdots, z^{L}_{\tau_T}}_{L_{zig}} \}$$

At each iteration, $L_{zig}$ partially denoised latents are dequeued and $L_{zig}$ new Gaussian latents are enqueued.
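A minimal sketch of the zigzag queue, tracking only noise-level indices, may help make the structure concrete. It assumes that one iteration lowers every latent by exactly one noise level before the head group is dequeued; the function names and the example values are hypothetical.

```python
def zigzag_levels(start_level, total_steps, l_zig):
    """Noise-level indices e..T, each repeated l_zig times (the Q_s1 layout)."""
    return [lvl for lvl in range(start_level, total_steps + 1) for _ in range(l_zig)]


def zigzag_iteration(levels, total_steps, l_zig):
    """One Stage 1 iteration on a list of noise-level indices."""
    # Assumption: every latent in the queue is denoised by one level.
    stepped = [lvl - 1 for lvl in levels]
    # The first l_zig latents (now at level e-1) leave the queue together.
    dequeued, rest = stepped[:l_zig], stepped[l_zig:]
    # l_zig fresh Gaussian latents enter at the highest noise level.
    return dequeued, rest + [total_steps] * l_zig


levels = zigzag_levels(start_level=2, total_steps=4, l_zig=2)
dequeued, next_levels = zigzag_iteration(levels, total_steps=4, l_zig=2)
print(levels)       # initial zigzag layout
print(dequeued)     # group handed over at level e-1
print(next_levels)  # layout after the iteration
```

Under this assumption the layout after an iteration equals the layout before it, so the zigzag structure is maintained indefinitely.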

Stage 2: Denoising at a Unified Noise Level. After $n$ iterations in Stage 1, $nL_{zig}$ latents at the same noise level $\tau_{e-1}$ form queue $Q_{s2}$. A unified denoising process is then applied:

$$Q_{s2} = \{z^1_{\tau_{e-1}}, z^2_{\tau_{e-1}}, \ldots, z^{nL_{zig}}_{\tau_{e-1}}\}$$

This aligns perfectly with the model's training condition (noise span of 1).
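To see how the two stages shrink the noise span, the following sketch counts the distinct noise levels that fall inside any one model window. Counting distinct levels is an assumed reading of "noise span" (chosen so that a unified queue gives a span of 1), and the queue length 64, window size 16, and $L_{zig} = 4$ are illustrative values, not settings from the paper.

```python
def max_window_span(levels, window):
    """Largest number of distinct noise levels seen by any sliding window."""
    return max(
        len(set(levels[i:i + window]))
        for i in range(len(levels) - window + 1)
    )


fifo = list(range(1, 65))                                 # one new level per latent
zigzag = [lvl for lvl in range(1, 17) for _ in range(4)]  # level changes every 4 latents
unified = [8] * 64                                        # Stage 2: a single level

print(max_window_span(fifo, 16))     # 16
print(max_window_span(zigzag, 16))   # 5
print(max_window_span(unified, 16))  # 1
```

In this toy setting the zigzag layout cuts the span seen by a window from 16 levels to at most 5, and the unified queue reaches the single-level condition used in training.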

Dual Consistency Enhancement (DCE)

Self-Reflection. A test-time scaling approach that efficiently evaluates and corrects early high-noise latents to prevent future inconsistencies.

  • Consistency Metric: Uses cosine similarity between latent features, avoiding external models. For evaluation latents $q_{eval} \in \mathbb{R}^{f_{eval} \times l \times c}$ and reference latents $q_{ref} \in \mathbb{R}^{f_{ref} \times l \times c}$:
    $$q'_{eval} = \text{norm}_1(\text{mean}_2(q_{eval})), \quad q'_{ref} = \text{norm}_1(\text{mean}_2(q_{ref}))$$
    $$C_{score} = \text{mean}_1\left( \text{mean}_2\left( q'_{eval} {q'_{ref}}^T \right) \right)$$
    The paper finds this metric correlates strongly even at high noise levels.
  • Process: A judgment index $f_{judg}$ at the queue's tail is evaluated against preceding latents. If $C_{score}$ drops below a threshold $\delta_{adj}$, an expanded search with $n_{samp}$ candidates is triggered for correction, guided by earlier consistent latents.
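The metric and the threshold test can be sketched in plain Python. Several points here are assumptions rather than details confirmed by the text: the mean is taken over the token axis, each resulting per-frame vector is L2-normalized, and the correction step is reduced to picking the highest-scoring candidate. All function names are hypothetical.

```python
import math


def frame_descriptors(latents):
    """latents: nested list [frames][tokens][channels] -> one unit vector per frame."""
    descriptors = []
    for frame in latents:
        n_tokens, n_channels = len(frame), len(frame[0])
        # Assumed reading of mean_2: average over the token axis.
        mean_vec = [sum(tok[c] for tok in frame) / n_tokens for c in range(n_channels)]
        # Assumed reading of norm_1: L2-normalize each frame vector.
        norm = math.sqrt(sum(v * v for v in mean_vec)) or 1.0
        descriptors.append([v / norm for v in mean_vec])
    return descriptors


def consistency_score(q_eval, q_ref):
    """Mean cosine similarity over all (evaluation frame, reference frame) pairs."""
    d_eval, d_ref = frame_descriptors(q_eval), frame_descriptors(q_ref)
    sims = [sum(a * b for a, b in zip(e, r)) for e in d_eval for r in d_ref]
    return sum(sims) / len(sims)


def needs_correction(q_eval, q_ref, delta_adj):
    """True when the score falls below the threshold delta_adj."""
    return consistency_score(q_eval, q_ref) < delta_adj


def select_candidate(candidates, q_ref):
    """Index of the candidate most consistent with the reference latents."""
    scores = [consistency_score(cand, q_ref) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


reference = [[[1.0, 0.0], [1.0, 0.0]]]  # one frame, two tokens, two channels
drifted = [[[0.0, 2.0], [0.0, 2.0]]]    # orthogonal to the reference
if needs_correction(drifted, reference, delta_adj=0.5):
    best = select_candidate([drifted, reference], reference)
    print("resample, keep candidate", best)
```

Because both operands are unit vectors, the matrix product in $C_{score}$ reduces to pairwise cosine similarities, and the two nested means equal a single mean over all pairs.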

Long-Range Frame Guidance. When the model processes a local window of latents starting at position $l$, it explicitly incorporates $m_{guid}$ clean, sparsely sampled latents from earlier in the queue to guide denoising. For $l \in (m_{guid}, L - m_{guid}]$, the model input becomes:

$$q_{input} = [z^1, \ldots, z^{m_{guid}}, z^l, \ldots, z^{l+f_0-m_{guid}-1}]$$

This facilitates feature interaction between distant frames.
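The index bookkeeping behind $q_{input}$ can be written out as follows. This sketch follows the formula literally, taking the first $m_{guid}$ queue positions as guidance; it does not model how the guidance latents are sparsely sampled, and the function name and example values are hypothetical.

```python
def guided_window_indices(l, f0, m_guid, queue_len):
    """1-based queue positions fed to the model for a local window starting at l."""
    if not (m_guid < l <= queue_len - m_guid):
        raise ValueError("l must lie in (m_guid, L - m_guid]")
    guidance = list(range(1, m_guid + 1))    # z^1 .. z^{m_guid}
    local = list(range(l, l + f0 - m_guid))  # z^l .. z^{l + f0 - m_guid - 1}
    return guidance + local


indices = guided_window_indices(l=10, f0=16, m_guid=6, queue_len=64)
print(indices)
print(len(indices))  # equals f0, so the model still receives a full-size window
```

The local window shrinks by $m_{guid}$ positions so that the combined input stays at the model's frame capacity $f_0$.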

Empirical Validation / Results

Experiments were conducted on VBench (video quality) and NarrLV (narrative expressiveness) using foundation models VideoCrafter2 and Wan2.1.

Quantitative Results on VBench

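Table 1: Quantitative results on VBench. Best results in bold. S.C. = Subject Consistency, B.C. = Background Consistency, M.S. = Motion Smoothness, T.F. = Temporal Flickering, O.S. = Overall Score.

| Backbone | Method | Infinite | S.C. | B.C. | M.S. | T.F. | O.S. |
|---|---|---|---|---|---|---|---|
| VideoCrafter2 | FreePCA | | 93.57 | 95.24 | 93.73 | 91.27 | 93.45 |
| VideoCrafter2 | FreeLong | | 95.72 | 96.42 | 98.38 | 97.28 | 96.95 |
| VideoCrafter2 | FIFO-Diffusion | | 92.92 | 95.01 | 97.19 | 94.94 | 95.02 |
| VideoCrafter2 | ScalingNoise | | 94.29 | 95.52 | 97.86 | 96.12 | 95.95 |
| VideoCrafter2 | MIGA (ours) | | **97.66** | **96.99** | **98.60** | **98.03** | **97.82** |
| Wan2.1 | FIFO-Diffusion | | 92.67 | 93.37 | 98.03 | 97.09 | 95.29 |
| Wan2.1 | MIGA (ours) | | **96.46** | **95.50** | **98.85** | **98.14** | **97.24** |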

MIGA achieves SOTA across all metrics for both foundation models.
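As a quick arithmetic check on the VideoCrafter2 rows of Table 1, the per-metric gains over FIFO-Diffusion can be computed directly. The comparison of O.S. with the mean of the other four columns is an observation about the reported numbers, not a definition stated in the text.

```python
fifo = {"S.C.": 92.92, "B.C.": 95.01, "M.S.": 97.19, "T.F.": 94.94, "O.S.": 95.02}
miga = {"S.C.": 97.66, "B.C.": 96.99, "M.S.": 98.60, "T.F.": 98.03, "O.S.": 97.82}

# Absolute gain in points for every reported column.
gains = {name: round(miga[name] - fifo[name], 2) for name in fifo}
print(gains)


def mean_of_four(row):
    """Average of the four individual metrics, excluding O.S."""
    return sum(row[name] for name in ("S.C.", "B.C.", "M.S.", "T.F.")) / 4


# Both differences stay within rounding distance of zero.
print(abs(mean_of_four(fifo) - fifo["O.S."]) < 0.01)
print(abs(mean_of_four(miga) - miga["O.S."]) < 0.01)
```

The largest gain is in Subject Consistency, which matches the emphasis of the summary.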

Quantitative Results on NarrLV

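Table 2: Quantitative results on NarrLV under varying TNA settings. Each TNA setting reports s_att, t_att, and t_act.

| Backbone | Method | Infinite | TNA=2 s_att | TNA=2 t_att | TNA=2 t_act | TNA=3 s_att | TNA=3 t_att | TNA=3 t_act | TNA=4 s_att | TNA=4 t_att | TNA=4 t_act |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoCrafter2 | FreePCA | | 56.96 | 58.72 | 56.41 | 53.61 | 53.93 | 52.57 | 50.46 | 57.28 | 53.27 |
| VideoCrafter2 | FreeLong | | 59.43 | 59.57 | 55.95 | 56.57 | 59.82 | 56.57 | 54.13 | 60.53 | 54.13 |
| VideoCrafter2 | ScalingNoise | | 59.28 | 55.47 | 58.09 | 53.27 | 58.14 | 54.05 | 52.37 | 58.41 | 53.59 |
| VideoCrafter2 | FIFO-Diffusion | | 67.02 | 63.55 | 58.29 | 61.15 | 60.64 | 58.42 | 66.09 | 66.01 | 54.66 |
| VideoCrafter2 | MIGA (ours) | | 69.78 | 63.94 | 59.01 | 63.53 | 61.05 | 59.52 | 68.87 | 68.77 | 55.78 |
| Wan2.1 | FIFO-Diffusion | | 67.77 | 64.25 | 65.40 | 55.42 | 59.02 | 58.91 | 57.43 | 56.10 | 53.89 |
| Wan2.1 | MIGA (ours) | | 79.32 | 67.87 | 67.94 | 69.48 | 66.33 | 63.86 | 75.05 | 72.31 | 62.90 |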

MIGA demonstrates superior narrative expressiveness, especially with the more powerful Wan2.1 backbone.

Ablation Studies

Core components (TTA and DCE) are validated. FIFO-Diffusion is the baseline.

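Table 3 (from paper): Ablation results of core mechanisms. The first row matches the FIFO-Diffusion baseline and the last row matches full MIGA; the two middle rows each enable one mechanism.

| TTA | DCE | S.C. | B.C. | M.S. | T.F. | O.S. |
|---|---|---|---|---|---|---|
| no | no | 92.92 | 95.01 | 97.19 | 94.94 | 95.02 |
| | | 96.74 | 96.75 | 97.57 | 97.12 | 97.05 |
| | | 96.10 | 96.47 | 97.88 | 96.56 | 96.75 |
| yes | yes | 97.66 | 96.99 | 98.60 | 98.03 | 97.82 |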

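Table 4 (from paper): Ablation results of $L_{zig}$.

| $L_{zig}$ | S.C. | B.C. | M.S. | T.F. | O.S. |
|---|---|---|---|---|---|
| 1 | 94.23 | 94.52 | 97.98 | 96.47 | 95.80 |
| 2 | 94.24 | 95.93 | 98.55 | 97.90 | 96.66 |
| 4 | 95.37 | 95.96 | 98.65 | 98.02 | 97.00 |
| 6 | 95.14 | 96.04 | 98.60 | 97.97 | 96.94 |
| 8 | 95.54 | 95.96 | 98.56 | 97.90 | 96.99 |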

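Table 5 (from paper): Ablation results of $m_{guid}$.

| $m_{guid}$ | S.C. | B.C. | M.S. | T.F. | O.S. |
|---|---|---|---|---|---|
| 0 | 94.23 | 94.52 | 97.98 | 96.47 | 95.80 |
| 2 | 94.66 | 94.72 | 98.64 | 98.05 | 96.52 |
| 4 | 94.59 | 94.58 | 98.64 | 98.10 | 96.48 |
| 6 | 95.45 | 95.69 | 98.45 | 97.89 | 96.87 |
| 8 | 95.32 | 95.12 | 98.60 | 98.05 | 96.77 |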

Key Findings:

  • Both TTA and DCE individually improve performance, with combined use yielding the best results.
  • Increasing $L_{zig}$ (reducing noise span) improves performance up to a point: the overall score rises from 95.80 at $L_{zig}=1$ to 97.00 at $L_{zig}=4$, then plateaus (96.94 at 6, 96.99 at 8).
  • Long-range guidance raises the overall score at every tested $m_{guid} > 0$, peaking at $m_{guid}=6$ (96.87 vs. 95.80 without guidance).
  • Self-reflection threshold $\delta_{adj}$ controls the trade-off between performance gain and computational cost.
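The hyperparameter choices in these findings can be read straight off the O.S. columns of Tables 4 and 5. The short sketch below only restates the reported numbers and is not a tuning procedure from the paper.

```python
# O.S. column of Table 4, keyed by L_zig.
lzig_overall = {1: 95.80, 2: 96.66, 4: 97.00, 6: 96.94, 8: 96.99}
# O.S. column of Table 5, keyed by m_guid.
mguid_overall = {0: 95.80, 2: 96.52, 4: 96.48, 6: 96.87, 8: 96.77}

best_lzig = max(lzig_overall, key=lzig_overall.get)
best_mguid = max(mguid_overall, key=mguid_overall.get)

print(best_lzig, lzig_overall[best_lzig])
print(best_mguid, mguid_overall[best_mguid])
```

The margin between $L_{zig}=4$ and $L_{zig}=8$ is only 0.01 points, so the choice of 4 is best read as the start of a plateau rather than a sharp optimum.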

Qualitative Results

MIGA generates high-quality, consistent long videos (1000+ frames). Visual results show significant reduction in noise and content drift compared to the baseline (see Fig. 4 in the paper).

Theoretical and Practical Implications

  • Theoretical: Provides a principled framework to address the training-inference gap in autoregressive diffusion models by actively managing noise span. Introduces a self-supervised consistency metric based on latent similarity, demonstrating its correlation across noise levels.
  • Practical: Enables the generation of infinitely long, coherent videos using existing foundation models without any training, dramatically lowering the barrier for applications like film pre-visualization, game development, and long-form content creation. The method maintains constant memory usage.

Conclusion

MIGA successfully enhances train-free infinite-frame video generation by proposing:

  1. A Two-Stage Training-Inference Alignment (TTA) mechanism that proactively reduces the noise span mismatch.
  2. A Dual Consistency Enhancement (DCE) mechanism that ensures long-term coherence through self-reflection and long-range guidance.

Extensive experiments confirm MIGA's state-of-the-art performance in generating consistent, narrative-rich long videos. Future work may focus on incorporating additional conditioning (e.g., physical laws) to mitigate rare hallucination issues and generate even more realistic content.