Visual Summary | LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

Summary (Overview)

Novel Streaming Editing Framework: LiveEdit introduces a causal, chunk-by-chunk video editing pipeline that achieves real-time performance (12.66 FPS) while strictly preserving unedited background regions.
Three-Stage Distillation Pipeline: A progressive distillation strategy (Foundation Tuning → Teacher Forcing → DMD) transfers editing capabilities from a powerful bidirectional Diffusion Transformer (DiT) to an efficient 4-step causal DiT, resolving attention distribution shift.
AR-Oriented Mask Cache: An inference-time caching mechanism dynamically reuses self-attention features for static background tokens based on an L₂ distance mask, reducing per-frame latency to 79 ms without degrading visual quality.
State-of-the-Art Results: LiveEdit achieves the best Text Alignment (0.270) and Background Consistency (0.956) among the compared methods and keeps Dynamic Degree competitive (0.256 with the cache, 0.282 without it), while outperforming offline bidirectional models in instruction adherence.
Dedicated Benchmark: A new streaming video editing benchmark of 120 video-text pairs is established for fair evaluation, along with user studies confirming the method’s superiority in fidelity and temporal coherence.

Introduction and Theoretical Foundation

Streaming video editing, i.e., processing video chunk by chunk in real time, is essential for augmented reality and live applications but faces two core bottlenecks:

Attention distribution shift: Offline bidirectional diffusion models rely on future frames for temporal consistency. Directly truncating future keys and values for causal execution causes attention weights to spread uniformly over history (Fig. 3), leading to flickering and “forgetting.”
Spatial-temporal token redundancy: In streaming settings, unedited background regions remain static or undergo predictable motion. Standard diffusion pipelines blindly compute dense feed-forward network (FFN) and attention modules for every token, causing prohibitive latency and disrupting background stability.

Existing streaming generation methods (e.g., StreamDiffusion, StreamDiffusionV2) are designed for free-form synthesis, not for strict region-preserving editing. Similarly, offline editing models (InsV2V, LucyEdit) cannot operate causally and incur high latency.

LiveEdit addresses both issues by (a) progressively aligning a causal DiT with a bidirectional teacher via three-stage distillation, and (b) introducing a mask cache that reuses self-attention features for static regions, achieving both high fidelity and real-time speed.

Methodology

Three-Stage Distillation Pipeline

Let $\mathbf{z}_0 \in \mathbb{R}^{F \times C \times H \times W}$ be the input video latent sequence and $c$ the text embedding. The pipeline transfers editing knowledge from a bidirectional DiT to a 4-step causal DiT.

Stage 1 – Foundation Tuning (Editing Ability Acquisition):
A bidirectional DiT $\epsilon^{\text{bid}}_\theta$ is trained on 20K video-text pairs, with the source latent concatenated channel-wise to the noisy latent $z_t$. It learns complex editing mappings via full spatial-temporal attention under the MSE loss:

$$\mathcal{L}^{\text{bid}}_{\text{MSE}} = \mathbb{E}_{z_0,\epsilon\sim\mathcal{N}(0,I),t,c} \left[ \left\| \epsilon - \epsilon^{\text{bid}}_\theta(z_t, t, c) \right\|_2^2 \right]$$
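The Stage 1 objective can be illustrated with a minimal NumPy sketch that draws one Monte-Carlo sample of the noise-prediction loss. Several pieces are assumptions made only for illustration: the cosine schedule in `alpha_bar`, the variance-preserving noising rule, the placeholder `toy_denoiser` standing in for the DiT, and the use of a mean instead of a summed squared norm. None of these is specified in the source.

```python
import numpy as np

def alpha_bar(t):
    """Assumed cosine noise schedule; the source does not specify one."""
    return np.cos(0.5 * np.pi * t) ** 2

def toy_denoiser(x, t, c):
    """Hypothetical stand-in for the bidirectional DiT.

    x carries 2*C channels (noisy latent followed by source latent).
    This toy just returns the noisy half so the example runs end to end.
    """
    num_channels = x.shape[1] // 2
    return x[:, :num_channels]

def stage1_mse_sample(z0, z_src, c, rng, denoiser=toy_denoiser):
    """One Monte-Carlo sample of the noise-prediction MSE loss."""
    t = rng.uniform(0.0, 1.0)                        # random timestep in [0, 1)
    eps = rng.standard_normal(z0.shape)              # regression target
    a = alpha_bar(t)
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps   # noised latent
    x = np.concatenate([z_t, z_src], axis=1)         # channel-wise concatenation
    eps_hat = denoiser(x, t, c)
    return float(np.mean((eps - eps_hat) ** 2))      # mean, not sum, for readability

rng = np.random.default_rng(0)
F, C, H, W = 4, 3, 8, 8                              # frames, channels, height, width
z0 = rng.standard_normal((F, C, H, W))               # clean latent that gets noised
z_src = rng.standard_normal((F, C, H, W))            # source-video latent
loss = stage1_mse_sample(z0, z_src, c=None, rng=rng)
```

In a real training loop the loss would be averaged over a batch and back-propagated through the denoiser; the sketch only shows how the conditioning input is assembled.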

Stage 2 – Teacher Forcing for Chunk-Wise Causal Initialization:
The architecture is converted to a causal DiT $\epsilon^{\text{causal}}_\theta$ by introducing chunk-wise causal attention masks $M_{\text{causal}}$. The model is fine-tuned with teacher forcing to align its output distribution with the Stage 1 bidirectional prior, preventing structural collapse:

$$\mathcal{L}^{\text{causal}}_{\text{MSE}} = \mathbb{E}_{z_0,\epsilon,t,c} \left[ \left\| \epsilon - \epsilon^{\text{causal}}_\theta(z_t, t, c \mid M_{\text{causal}}) \right\|_2^2 \right]$$
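A chunk-wise causal mask of the kind described for Stage 2 might be built as in the sketch below. It assumes the usual convention that tokens attend bidirectionally within their own chunk and to all earlier chunks, but never to later ones. The function name and the `tokens_per_frame` parameter are hypothetical, and the source does not give the exact mask layout.

```python
import numpy as np

def chunk_causal_mask(num_frames, chunk_size, tokens_per_frame=1):
    """Boolean mask; entry [q, k] is True when query token q may attend to key token k."""
    frame_chunk = np.arange(num_frames) // chunk_size        # chunk index of each frame
    token_chunk = np.repeat(frame_chunk, tokens_per_frame)   # chunk index of each token
    # A query sees every key whose chunk is not later than its own.
    return token_chunk[:, None] >= token_chunk[None, :]

mask = chunk_causal_mask(num_frames=6, chunk_size=2)
print(mask.astype(int))
```

In an attention layer, positions where the mask is False would typically have their logits set to negative infinity before the softmax, so future chunks receive zero weight.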

Stage 3 – DMD for Streaming Video Editing:
A 4-step generator $G_\theta$ is initialized from Stage 2 weights (bypassing expensive ODE initialization). Distribution Matching Distillation (DMD) compresses inference using a frozen Real Score $\epsilon^{\text{real}}_\phi$ and a trainable Fake Score $\epsilon^{\text{fake}}_\psi$. The DMD gradient is:

$$\nabla_\theta \mathcal{L}_{\text{DMD}} = \mathbb{E}_{z_T, c} \left[ w(t) \left( \epsilon^{\text{real}}_\phi(z_t, t, c) - \epsilon^{\text{fake}}_\psi(z_t, t, c) \right) \nabla_\theta G_\theta(z_T, c) \right]$$

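where $w(t)$ is a timestep-dependent weighting and, as is standard in DMD, $z_t$ is the generator output $G_\theta(z_T, c)$ re-noised to timestep $t$. The final generator produces edits in 4 steps without classifier-free guidance.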
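A gradient of this form is often implemented through a surrogate regression loss whose derivative with respect to the generator output equals the weighted score difference. The NumPy sketch below checks that identity numerically on random vectors. It is a toy under stated assumptions: the score outputs are random placeholders rather than network predictions, and the source does not confirm that LiveEdit uses this surrogate trick.

```python
import numpy as np

def dmd_direction(eps_real, eps_fake, w):
    """Weighted difference w(t) * (eps_real - eps_fake) from the DMD gradient."""
    return w * (eps_real - eps_fake)

def surrogate_loss(x, target):
    """0.5 * ||x - target||^2, with target treated as a constant (stop-gradient)."""
    return 0.5 * np.sum((x - target) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)            # stand-in for the generator output G(z_T, c)
eps_real = rng.standard_normal(5)     # placeholder for the frozen real-score prediction
eps_fake = rng.standard_normal(5)     # placeholder for the trainable fake-score prediction
g = dmd_direction(eps_real, eps_fake, w=0.7)

# Fixing target = x - g makes d(loss)/dx equal to g at the current x.
target = x - g

# Central finite differences confirm the gradient of the surrogate loss.
h = 1e-6
fd_grad = np.array([
    (surrogate_loss(x + h * e, target) - surrogate_loss(x - h * e, target)) / (2 * h)
    for e in np.eye(5)
])
```

With an autograd framework, back-propagating this surrogate through the generator would then apply the chain-rule factor $\nabla_\theta G_\theta$ automatically.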

AR-Oriented Mask Cache

During streaming inference, for incoming chunk $k$, an editing mask $\mathbf{M}^k \in \{0,1\}^{H \times W}$ is derived from the L₂ distance between the previous chunk's source latent $\mathbf{z}^{k-1}_{\text{src}}$ and edited latent $\mathbf{z}^{k-1}_{\text{edit}}$:

$$M^k_{u,v} = \mathbb{I}\left( \left\| \mathbf{z}^{k-1}_{\text{edit},u,v} - \mathbf{z}^{k-1}_{\text{src},u,v} \right\|_2 > \tau \right)$$

where the threshold $\tau$ is set dynamically so that 70% of tokens are pruned. Tokens in unedited regions ($M^k_{u,v}=0$) bypass self-attention computation and reuse cached features from the previous chunk; only active editing tokens undergo full forward passes. The mechanism is applied exclusively to Self-Attention layers, as FFN caching leads to quality degradation (Table 3, Fig. 8).
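The mask construction and feature reuse described above could look roughly like the following NumPy sketch. It makes several assumptions not stated in the source: $\tau$ is taken as the 70th percentile of the per-position distances, active tokens attend to all tokens of the current chunk, and `toy_self_attention` is a hypothetical single-head stand-in for the real layer.

```python
import numpy as np

def editing_mask(z_src_prev, z_edit_prev, prune_ratio=0.7):
    """Boolean (H, W) mask from (C, H, W) latents; True marks active (edited) positions."""
    dist = np.linalg.norm(z_edit_prev - z_src_prev, axis=0)  # per-position L2 distance
    tau = np.quantile(dist, prune_ratio)                     # assumed dynamic threshold
    return dist > tau

def toy_self_attention(queries, keys_values):
    """Hypothetical single-head attention without learned projections."""
    logits = queries @ keys_values.T / np.sqrt(queries.shape[1])
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values

def cached_self_attention(tokens, active, cached_out):
    """Recompute attention only for active tokens; reuse cached outputs elsewhere."""
    out = cached_out.copy()
    if active.any():
        out[active] = toy_self_attention(tokens[active], tokens)
    return out

rng = np.random.default_rng(0)
C, H, W, D = 4, 10, 10, 16
z_src = rng.standard_normal((C, H, W))
z_edit = z_src + 0.01 * rng.standard_normal((C, H, W))      # near-static background
z_edit[:, 2:6, 3:8] += 1.0                                   # simulated edited region
mask = editing_mask(z_src, z_edit)                           # about 30% of positions active

tokens = rng.standard_normal((H * W, D))                     # current-chunk tokens
cached = rng.standard_normal((H * W, D))                     # outputs cached from chunk k-1
out = cached_self_attention(tokens, mask.reshape(-1), cached)
```

Because only the active rows are recomputed, the attention cost scales with the roughly 30% of tokens that survive pruning instead of the full token count.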

Empirical Validation / Results

Quantitative Comparison (Table 1)

Method	TA ↑	BC ↑	MS ↑	DD ↑	AQ ↑	IQ ↑
LucyEdit	0.253	0.943	0.990	0.266	0.529	0.707
VideoCoF	0.245	0.953	0.991	0.094	0.542	0.709
InsV2V	0.259	0.943	0.986	0.196	0.577	0.708
StreamDiffusion	0.239	0.886	0.975	0.239	0.590	0.717
StreamDiffusionV2	0.252	0.951	0.992	0.264	0.539	0.653
StreamV2V	0.244	0.934	0.989	0.153	0.548	0.712
Ours (W/o Cache)	0.265	0.956	0.991	0.282	0.584	0.720
Ours (W/ Cache)	0.270	0.956	0.992	0.256	0.581	0.708

LiveEdit achieves the best Text Alignment (TA, 0.270) and Background Consistency (BC, 0.956), while maintaining high Motion Smoothness (MS) and Dynamic Degree (DD). The cache variant slightly improves TA and MS and preserves BC, at the cost of a lower DD (0.256 vs. 0.282).
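The per-metric leaders in Table 1 can be read off mechanically. The short Python sketch below copies the table values verbatim and reports every method tied for the top score in each column; the dictionary layout and function name are illustrative only.

```python
METRICS = ["TA", "BC", "MS", "DD", "AQ", "IQ"]

# Values copied from Table 1 (higher is better for every column).
TABLE1 = {
    "LucyEdit":          [0.253, 0.943, 0.990, 0.266, 0.529, 0.707],
    "VideoCoF":          [0.245, 0.953, 0.991, 0.094, 0.542, 0.709],
    "InsV2V":            [0.259, 0.943, 0.986, 0.196, 0.577, 0.708],
    "StreamDiffusion":   [0.239, 0.886, 0.975, 0.239, 0.590, 0.717],
    "StreamDiffusionV2": [0.252, 0.951, 0.992, 0.264, 0.539, 0.653],
    "StreamV2V":         [0.244, 0.934, 0.989, 0.153, 0.548, 0.712],
    "Ours (W/o Cache)":  [0.265, 0.956, 0.991, 0.282, 0.584, 0.720],
    "Ours (W/ Cache)":   [0.270, 0.956, 0.992, 0.256, 0.581, 0.708],
}

def best_per_metric(rows):
    """Return {metric: (top_score, [methods tied at that score])}."""
    best = {}
    for i, metric in enumerate(METRICS):
        top = max(values[i] for values in rows.values())
        winners = [name for name, values in rows.items() if values[i] == top]
        best[metric] = (top, winners)
    return best

leaders = best_per_metric(TABLE1)
for metric, (score, names) in leaders.items():
    print(f"{metric}: {score:.3f}  {', '.join(names)}")
```

Read this way, the LiveEdit variants lead TA, BC, DD, and IQ, tie with StreamDiffusionV2 on MS, and trail StreamDiffusion on AQ.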

Ablation Study

Three-stage distillation (Table 2):

Stage 1 (Bidirectional, 100 NFEs, CFG) ⇢ 0.268 TA, 0.716 IQ.
Stage 2 (Causal, 100 NFEs, CFG) ⇢ 0.264 TA, 0.702 IQ.
Stage 3 (DMD, 4 NFEs, no CFG) ⇢ 0.265 TA, 0.720 IQ, with latency reduced to 7.89 s per 81 frames.
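The latency figures quoted in this summary can be converted between per-frame milliseconds and frames per second with simple arithmetic, as sketched below. The helper names are illustrative. The source does not state whether the 7.89 s figure was measured with the mask cache, so the two numbers are converted but not compared directly.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

def per_frame_ms(total_seconds, num_frames):
    """Average per-frame latency in milliseconds for a whole clip."""
    return 1000.0 * total_seconds / num_frames

cached_fps = fps_from_latency_ms(79.0)         # 79 ms per frame, about 12.66 FPS
stage3_ms = per_frame_ms(7.89, 81)             # 7.89 s for 81 frames, about 97.4 ms
stage3_fps = fps_from_latency_ms(stage3_ms)    # about 10.27 FPS

print(f"{cached_fps:.2f} FPS, {stage3_ms:.1f} ms/frame, {stage3_fps:.2f} FPS")
```

The first conversion reproduces the reported pairing of 79 ms per frame with 12.66 FPS.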

Cache placement (Table 3):

Method	TA ↑	BC ↑	MS ↑	DD ↑	AQ ↑	IQ ↑
W/o Cache	0.265	0.956	0.991	0.282	0.584	0.720
Cache on SA	0.270	0.956	0.992	0.256	0.581	0.708
Cache on FFN	0.236	0.841	0.982	0.017	0.440	0.513

Caching FFN features causes severe degradation (blurring, color distortion), confirming that Self-Attention layers exhibit high temporal redundancy (mean cosine similarity 0.893) while FFN does not (0.153, Fig. 8).
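A temporal-redundancy measurement of this kind might be computed as in the sketch below, which averages the token-wise cosine similarity between a layer's features on two consecutive chunks. The pairing of tokens by position and the function name are assumptions, since the source does not describe how the similarity was measured.

```python
import numpy as np

def mean_token_cosine(prev_feats, curr_feats, eps=1e-8):
    """Mean cosine similarity between matching rows of two (N, D) feature arrays."""
    dots = np.sum(prev_feats * curr_feats, axis=1)
    norms = np.linalg.norm(prev_feats, axis=1) * np.linalg.norm(curr_feats, axis=1)
    return float(np.mean(dots / (norms + eps)))   # eps guards against zero vectors

rng = np.random.default_rng(0)
prev = rng.standard_normal((256, 64))                     # features from chunk k-1
nearly_static = prev + 0.1 * rng.standard_normal((256, 64))
unrelated = rng.standard_normal((256, 64))

sim_static = mean_token_cosine(prev, nearly_static)       # close to 1: good cache candidate
sim_unrelated = mean_token_cosine(prev, unrelated)        # close to 0: poor cache candidate
```

Under this reading, a layer whose features score near 1 across chunks can be cached safely, while a layer scoring near 0 changes too much for reuse.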

Qualitative results (Fig. 6, 7): LiveEdit accurately modifies target regions (e.g., changing a coat from dark brown to silver white, adding gold-rimmed glasses) while preserving background lighting, shadows, and subject identity. Baselines either fail to apply edits (StreamV2V), cause color bleeding (InsV2V), or induce structural collapse (StreamDiffusion).

Theoretical and Practical Implications

Theoretical: The three-stage distillation pipeline provides a principled method to transfer bidirectional editing priors to causal streaming settings, resolving the attention distribution shift problem. The AR-oriented Mask Cache reveals a critical functional divergence between Self-Attention (redundant) and FFN (detail-sensitive) layers, guiding efficient temporal reuse.
Practical: With 12.66 FPS and 79 ms per-frame latency, LiveEdit is suitable for real-time interactive applications such as AR content creation, live video editing, and telepresence. Its strict background preservation helps edited outputs remain temporally coherent, suppressing flicker and structural drift.

Conclusion

LiveEdit introduces a streaming video editing framework that achieves high-fidelity, real-time causal editing. Its three-stage distillation pipeline transfers editing ability from a bidirectional DiT to a 4-step causal DiT, while the AR-oriented Mask Cache reduces computation by reusing self-attention features in static regions. Extensive evaluations demonstrate state-of-the-art performance in text alignment, background consistency, and inference speed. Future work may extend the framework to multi-modal control and explore adaptive cache strategies for dynamic scenes.