Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

Summary (Overview)

Core Problem: Train-free, frame-level autoregressive methods (e.g., FIFO-Diffusion) enable infinite video generation with constant memory, but suffer from a training-inference gap (model handles multiple noise levels during inference vs. single level during training) and insufficient long-term consistency modeling.
Proposed Solution: MIGA – A novel train-free method featuring:
- Two-Stage Training-Inference Alignment (TTA): Mitigates the noise span gap via Stage 1 Zigzag Iterative Denoising (slower noise level changes) and Stage 2 Unified Noise Level Denoising (aligns with training conditions).
- Dual Consistency Enhancement (DCE): Improves long-term consistency via Self-Reflection (evaluates/corrects early high-noise frames using latent similarity) and Long-Range Frame Guidance (incorporates clean, distant frames to guide local denoising).
Key Results: Achieves state-of-the-art performance on VBench and NarrLV benchmarks. For instance, with VideoCrafter2 on VBench, MIGA improves over FIFO-Diffusion by 4.7 points in Subject Consistency (92.92 to 97.66) and 2.0 points in Background Consistency (95.01 to 96.99).

Introduction and Theoretical Foundation

Recent foundation video diffusion models excel at short clips but are limited by fixed frame lengths. Training new models for long videos is computationally expensive. Train-free methods aim to extend these foundation models without retraining. Frame-level autoregressive frameworks like FIFO-Diffusion are promising as they generate videos iteratively frame-by-frame with constant memory, enabling infinite-length generation.

However, two key limitations hinder these frameworks:

Training-Inference Gap: During training, the model denoises latents at a unified noise level. During autoregressive inference, it must handle a queue of latents with progressively increasing noise levels, creating a mismatch that leads to content drift and artifacts.
Long-Term Consistency: Existing methods lack explicit modeling of dependencies between distant frames, leading to suboptimal temporal consistency over long sequences.

MIGA is proposed to inherit the benefits of autoregressive frameworks while addressing these limitations through novel alignment and consistency mechanisms.

Methodology

MIGA builds upon the frame-level autoregressive generation paradigm. Let the latent feature for frame $i$ at time step $\tau_t$ be $z^i_{\tau_t} \in \mathbb{R}^{l \times d}$. A foundation model with a noise prediction network $\epsilon_\theta(\cdot)$ and sampler $\Phi(\cdot)$ is used.

Preliminaries: Train-Free Frame-Level Autoregressive Generation

Methods like FIFO-Diffusion maintain a latent queue $Q = \{z^1_{\tau_1}, ..., z^T_{\tau_T}\}$ of length $L = T$ (total denoising steps) with increasing noise levels. One inference step over the queue is:

$$\{z^1_{\tau_0}, ..., z^T_{\tau_{T-1}}\} = \Phi(\{z^1_{\tau_1}, ..., z^T_{\tau_T}\}, \{\tau_1, ..., \tau_T\}; \epsilon_\theta)$$

The clean latent $z^1_{\tau_0}$ is dequeued, and a new Gaussian noise latent $z^T_{\tau_T}$ is enqueued, enabling continuous generation. Because the queue length $T$ exceeds the model's frame capacity $f_0$, the sampler $\Phi(\cdot)$ is applied over the queue with a sliding window.
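The queue mechanics described above can be sketched without any real model. The snippet below is a minimal illustration that tracks only noise indices, not latents; the function names `init_queue` and `fifo_step` are hypothetical, and the "denoising" is reduced to decrementing an index, which is an assumption made purely for illustration.

```python
from collections import deque

def init_queue(T):
    """Build the starting queue: frame i sits at noise index i (T = pure noise)."""
    return deque((frame_id, frame_id) for frame_id in range(1, T + 1))

def fifo_step(queue, next_frame_id, T):
    """One queue update, tracking noise indices only (no real latents)."""
    # Every latent moves one noise level down, standing in for one sampler call.
    stepped = deque((frame_id, tau - 1) for frame_id, tau in queue)
    # The head has reached noise index 0, so it is clean and leaves the queue.
    clean_frame, clean_tau = stepped.popleft()
    assert clean_tau == 0
    # A fresh pure-noise latent joins the tail, keeping the queue length at T.
    stepped.append((next_frame_id, T))
    return clean_frame, stepped

T = 4
queue = init_queue(T)
emitted = []
for step in range(3):
    frame, queue = fifo_step(queue, next_frame_id=T + 1 + step, T=T)
    emitted.append(frame)

print(emitted)                     # frames leave the queue in order
print([tau for _, tau in queue])   # the noise profile is the same after every step
```

The queue length never changes, which is where the constant-memory property comes from. Every step also exposes the model to all noise levels at once, which is the mismatch the next section addresses.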

Two-Stage Training-Inference Alignment (TTA)

To reduce the excessive noise span presented to the model during inference, MIGA decomposes generation into two stages.

Stage 1: Zigzag Iterative Denoising. The queue is initialized and maintained with a "zigzag" structure where the noise level changes only every $L_{zig}$ latents, slowing the rate of change.

$$Q_{s1} = \{ \underbrace{z^1_{\tau_e}, \cdots, z^{L_{zig}}_{\tau_e}}_{L_{zig}}, \underbrace{z^{L_{zig}+1}_{\tau_{e+1}}, \cdots, z^{2L_{zig}}_{\tau_{e+1}}}_{L_{zig}}, \cdots, \underbrace{z^{L-L_{zig}+1}_{\tau_T}, \cdots, z^{L}_{\tau_T}}_{L_{zig}} \}$$

At each iteration, $L_{zig}$ partially denoised latents are dequeued and $L_{zig}$ new Gaussian latents are enqueued.
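To see why the zigzag layout narrows what the model sees, the sketch below builds the noise-index profile of $Q_{s1}$ and counts how many distinct noise levels fall inside any window of $f_0$ consecutive latents. The helper names are hypothetical, and measuring the "noise span" as the number of distinct levels in a window is an assumption chosen so that a queue at a single level has a span of 1.

```python
def zigzag_levels(e, T, L_zig):
    """Noise indices of Q_s1: each level from e to T is repeated L_zig times."""
    return [tau for tau in range(e, T + 1) for _ in range(L_zig)]

def max_window_span(levels, f0):
    """Largest number of distinct noise levels in any f0 consecutive latents."""
    return max(len(set(levels[i:i + f0]))
               for i in range(len(levels) - f0 + 1))

f0 = 4  # illustrative frame capacity, not a value from the paper
for L_zig in (1, 2, 4):
    levels = zigzag_levels(e=1, T=8, L_zig=L_zig)
    print(L_zig, len(levels), max_window_span(levels, f0))
```

With `L_zig=1` the profile reduces to the plain FIFO-style queue, where every latent in a window has a different noise level. Larger `L_zig` values lengthen the queue but reduce the number of levels per window.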

Stage 2: Denoising at a Unified Noise Level. After $n$ iterations in Stage 1, $nL_{zig}$ latents at the same noise level $\tau_{e-1}$ form queue $Q_{s2}$. A unified denoising process is then applied:

$$Q_{s2} = \{z^1_{\tau_{e-1}}, z^2_{\tau_{e-1}}, ..., z^{nL_{zig}}_{\tau_{e-1}}\}$$

Because every latent in $Q_{s2}$ shares one noise level, this stage matches the model's training condition (noise span of 1).
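The hand-off between the two stages can be illustrated with the same index-only bookkeeping. This is a sketch under stated assumptions: the function names are hypothetical, latents are represented as `(frame_id, noise_index)` pairs, and one Stage 1 iteration is modelled as lowering every noise index by one.

```python
def init_zigzag_queue(e, T, L_zig):
    """Q_s1 as (frame_id, noise_index) pairs, L_zig latents per noise level."""
    levels = [tau for tau in range(e, T + 1) for _ in range(L_zig)]
    return [(frame_id, tau) for frame_id, tau in enumerate(levels, start=1)]

def stage1_iteration(queue, T, L_zig, next_id):
    """Lower every latent by one level, dequeue L_zig, enqueue L_zig fresh ones."""
    stepped = [(frame_id, tau - 1) for frame_id, tau in queue]
    done, rest = stepped[:L_zig], stepped[L_zig:]
    fresh = [(next_id + k, T) for k in range(L_zig)]
    return done, rest + fresh

def collect_stage2_queue(e, T, L_zig, n):
    """Run n Stage 1 iterations and gather the dequeued latents into Q_s2."""
    queue = init_zigzag_queue(e, T, L_zig)
    next_id = len(queue) + 1
    q_s2 = []
    for _ in range(n):
        done, queue = stage1_iteration(queue, T, L_zig, next_id)
        next_id += L_zig
        q_s2.extend(done)
    return q_s2, queue

q_s2, q_s1 = collect_stage2_queue(e=3, T=6, L_zig=2, n=3)
print(q_s2)   # n * L_zig latents, all at noise index e - 1
print(q_s1)   # Stage 1 queue keeps its zigzag profile
```

In this toy run the Stage 2 queue holds six latents at a single noise index, so a subsequent joint denoising pass would see one noise level only.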

Dual Consistency Enhancement (DCE)

Self-Reflection. A test-time scaling approach that efficiently evaluates and corrects early high-noise latents to prevent future inconsistencies.

Consistency Metric: Uses cosine similarity between latent features, avoiding external models. For evaluation latents $q_{eval} \in \mathbb{R}^{f_{eval} \times l \times c}$ and reference latents $q_{ref} \in \mathbb{R}^{f_{ref} \times l \times c}$:

$$q'_{eval} = \text{norm}_1(\text{mean}_2(q_{eval})), \quad q'_{ref} = \text{norm}_1(\text{mean}_2(q_{ref}))$$

$$C_{score} = \text{mean}_1\left( \text{mean}_2\left( q'_{eval} {q'_{ref}}^T \right) \right)$$

The paper finds this metric correlates strongly even at high noise levels.
Process: A judgment index $f_{judg}$ at the queue's tail is evaluated against preceding latents. If $C_{score}$ drops below a threshold $\delta_{adj}$, an expanded search with $n_{samp}$ candidates is triggered for correction, guided by earlier consistent latents.
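The metric and the threshold check can be sketched in plain Python. Two assumptions are made here and should be read as one interpretation of the notation, not the paper's code: $\text{mean}_2$ is taken as the average over the $l$ tokens of each frame, and $\text{norm}_1$ as L2 normalisation of the resulting $c$-dimensional vector, so that the matrix product yields pairwise cosine similarities. The best-of-candidates rule in `self_reflect` is likewise a hypothetical stand-in for the expanded search, whose details the summary does not give.

```python
import math

def frame_descriptor(frame):
    """Average an (l x c) latent over its l tokens, then L2-normalise."""
    c = len(frame[0])
    mean = [sum(token[j] for token in frame) / len(frame) for j in range(c)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0  # guard against zero vectors
    return [v / norm for v in mean]

def consistency_score(q_eval, q_ref):
    """Mean pairwise cosine similarity between evaluation and reference frames."""
    ev = [frame_descriptor(f) for f in q_eval]
    rf = [frame_descriptor(f) for f in q_ref]
    sims = [sum(a * b for a, b in zip(u, v)) for u in ev for v in rf]
    return sum(sims) / len(sims)

def self_reflect(q_eval, q_ref, delta_adj, candidates):
    """Keep q_eval if consistent enough; otherwise pick the best candidate.

    Assumed selection rule: highest C_score among the original and candidates.
    """
    score = consistency_score(q_eval, q_ref)
    if score >= delta_adj:
        return q_eval, score
    pool = [q_eval] + list(candidates)
    best = max(pool, key=lambda cand: consistency_score(cand, q_ref))
    return best, consistency_score(best, q_ref)

# Toy latents: each frame has l tokens of c = 2 features.
frame_a = [[1.0, 0.0], [3.0, 0.0]]
frame_b = [[0.0, 2.0]]
print(consistency_score([frame_a], [frame_a, frame_b]))
print(self_reflect([frame_b], [frame_a], delta_adj=0.5,
                   candidates=[[frame_b], [frame_a]]))
```

Because the score uses only the latents already in the queue, the check adds no external model; the cost is dominated by how often the threshold triggers the candidate search.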

Long-Range Frame Guidance. When the model processes a local window of latents starting at position $l$, it explicitly incorporates $m_{guid}$ clean, sparsely sampled latents from earlier in the queue to guide denoising. For $l \in (m_{guid}, L - m_{guid}]$, the model input becomes:

$$q_{input} = [z^1, ..., z^{m_{guid}}, z^l, ..., z^{l+f_0-m_{guid}-1}]$$

This facilitates feature interaction between distant frames.
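The index bookkeeping in the formula above can be checked with a few lines of Python. This is a sketch that follows the formula literally: the function name is hypothetical, positions are 1-based as in the text, and windows that would run past the end of the queue are not handled because the summary does not say how they are treated.

```python
def guided_window_indices(l, f0, m_guid, L):
    """1-based queue positions fed to the model for a window starting at l."""
    if not (m_guid < l <= L - m_guid):
        raise ValueError("l must lie in (m_guid, L - m_guid]")
    guidance = list(range(1, m_guid + 1))        # z^1 ... z^{m_guid}
    local = list(range(l, l + f0 - m_guid))      # z^l ... z^{l+f0-m_guid-1}
    return guidance + local

indices = guided_window_indices(l=5, f0=8, m_guid=2, L=20)
print(indices, len(indices))
```

The guidance latents take up $m_{guid}$ of the model's $f_0$ slots, so the local window shrinks to $f_0 - m_{guid}$ latents and the total input length stays at $f_0$.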

Empirical Validation / Results

Experiments were conducted on VBench (video quality) and NarrLV (narrative expressiveness) using foundation models VideoCrafter2 and Wan2.1.

Quantitative Results on VBench

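Table 1: Quantitative results on VBench (S.C.: Subject Consistency, B.C.: Background Consistency, M.S.: Motion Smoothness, T.F.: Temporal Flickering, O.S.: Overall Score). Best results in bold.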

Method	Infinite	S.C.	B.C.	M.S.	T.F.	O.S.
VideoCrafter2-Based
FreePCA	✗	93.57	95.24	93.73	91.27	93.45
FreeLong	✗	95.72	96.42	98.38	97.28	96.95
FIFO-Diffusion	✓	92.92	95.01	97.19	94.94	95.02
ScalingNoise	✓	94.29	95.52	97.86	96.12	95.95
MIGA (ours)	✓	97.66	96.99	98.60	98.03	97.82
Wan2.1-Based
FIFO-Diffusion	✓	92.67	93.37	98.03	97.09	95.29
MIGA (ours)	✓	96.46	95.50	98.85	98.14	97.24

MIGA achieves SOTA across all metrics for both foundation models.
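The per-metric gains over FIFO-Diffusion can be recomputed directly from the Table 1 rows. The short script below is only an arithmetic check; the dictionary layout and function names are illustrative. It also shows that, for these rows, the O.S. column matches the mean of the other four metrics.

```python
METRICS = ["S.C.", "B.C.", "M.S.", "T.F."]

# Rows copied from Table 1 (O.S. omitted because it is recomputed below).
TABLE1 = {
    ("VideoCrafter2", "FIFO-Diffusion"): [92.92, 95.01, 97.19, 94.94],
    ("VideoCrafter2", "MIGA"): [97.66, 96.99, 98.60, 98.03],
    ("Wan2.1", "FIFO-Diffusion"): [92.67, 93.37, 98.03, 97.09],
    ("Wan2.1", "MIGA"): [96.46, 95.50, 98.85, 98.14],
}

def gains(backbone):
    """Absolute gain of MIGA over FIFO-Diffusion for each metric, in points."""
    base = TABLE1[(backbone, "FIFO-Diffusion")]
    ours = TABLE1[(backbone, "MIGA")]
    return {m: round(o - b, 2) for m, o, b in zip(METRICS, ours, base)}

def overall(backbone, method):
    """Mean of the four per-dimension scores."""
    row = TABLE1[(backbone, method)]
    return sum(row) / len(row)

for backbone in ("VideoCrafter2", "Wan2.1"):
    print(backbone, gains(backbone), f"O.S.={overall(backbone, 'MIGA'):.2f}")
```

The largest gains are in Subject Consistency for both backbones, while Motion Smoothness, already high for the baseline, moves by less than 1.5 points.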

Quantitative Results on NarrLV

Table 2: Quantitative results on NarrLV under varying TNA settings.

Method	Infinite	TNA=2			TNA=3			TNA=4
		s_att	t_att	t_act	s_att	t_att	t_act	s_att	t_att	t_act
VideoCrafter2-Based
FreePCA	✗	56.96	58.72	56.41	53.61	53.93	52.57	50.46	57.28	53.27
FreeLong	✗	59.43	59.57	55.95	56.57	59.82	56.57	54.13	60.53	54.13
ScalingNoise	✓	59.28	55.47	58.09	53.27	58.14	54.05	52.37	58.41	53.59
FIFO-Diffusion	✓	67.02	63.55	58.29	61.15	60.64	58.42	66.09	66.01	54.66
MIGA (ours)	✓	69.78	63.94	59.01	63.53	61.05	59.52	68.87	68.77	55.78
Wan2.1-Based
FIFO-Diffusion	✓	67.77	64.25	65.40	55.42	59.02	58.91	57.43	56.10	53.89
MIGA (ours)	✓	79.32	67.87	67.94	69.48	66.33	63.86	75.05	72.31	62.90

MIGA demonstrates superior narrative expressiveness, especially with the more powerful Wan2.1 backbone.

Ablation Studies

Core components (TTA and DCE) are validated. FIFO-Diffusion is the baseline.

Table 3 (from paper): Ablation results of core mechanisms.

TTA	DCE	S.C.	B.C.	M.S.	T.F.	O.S.
✗	✗	92.92	95.01	97.19	94.94	95.02
✓	✗	96.74	96.75	97.57	97.12	97.05
✗	✓	96.10	96.47	97.88	96.56	96.75
✓	✓	97.66	96.99	98.60	98.03	97.82

Table 4 (from paper): Ablation results of $L_{zig}$.

$L_{zig}$	S.C.	B.C.	M.S.	T.F.	O.S.
1	94.23	94.52	97.98	96.47	95.80
2	94.24	95.93	98.55	97.90	96.66
4	95.37	95.96	98.65	98.02	97.00
6	95.14	96.04	98.60	97.97	96.94
8	95.54	95.96	98.56	97.90	96.99

Table 5 (from paper): Ablation results of $m_{guid}$.

$m_{guid}$	S.C.	B.C.	M.S.	T.F.	O.S.
0	94.23	94.52	97.98	96.47	95.80
2	94.66	94.72	98.64	98.05	96.52
4	94.59	94.58	98.64	98.10	96.48
6	95.45	95.69	98.45	97.89	96.87
8	95.32	95.12	98.60	98.05	96.77

Key Findings:

Both TTA and DCE individually improve performance, with combined use yielding the best results.
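Increasing $L_{zig}$ (reducing noise span) improves performance up to a point: O.S. rises from 95.80 at $L_{zig}=1$ to 97.00 at $L_{zig}=4$, then plateaus (96.94 at 6, 96.99 at 8).
Long-range guidance improves O.S. for every tested $m_{guid} > 0$, peaking at $m_{guid}=6$ (96.87 vs. 95.80 without guidance).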
Self-reflection threshold $\delta_{adj}$ controls the trade-off between performance gain and computational cost.

Qualitative Results

MIGA generates high-quality, consistent long videos (1000+ frames). Visual results show significant reduction in noise and content drift compared to the baseline (see Fig. 4 in the paper).

Theoretical and Practical Implications

Theoretical: Provides a principled framework to address the training-inference gap in autoregressive diffusion models by actively managing noise span. Introduces a self-supervised consistency metric based on latent similarity, demonstrating its correlation across noise levels.
Practical: Enables the generation of arbitrarily long, coherent videos using existing foundation models without any training, lowering the barrier for applications like film pre-visualization, game development, and long-form content creation. The method maintains constant memory usage.

Conclusion

MIGA successfully enhances train-free infinite-frame video generation by proposing:

A Two-Stage Training-Inference Alignment (TTA) mechanism that proactively reduces the noise span mismatch.
A Dual Consistency Enhancement (DCE) mechanism that ensures long-term coherence through self-reflection and long-range guidance.

Extensive experiments confirm MIGA's state-of-the-art performance in generating consistent, narrative-rich long videos. Future work may focus on incorporating additional conditioning (e.g., physical laws) to mitigate rare hallucination issues and generate even more realistic content.