Summary (Overview)
- Novel Streaming Editing Framework: LiveEdit introduces a causal, chunk-by-chunk video editing pipeline that achieves real-time performance (12.66 FPS) while strictly preserving unedited background regions.
- Three-Stage Distillation Pipeline: A progressive distillation strategy (Foundation Tuning → Teacher Forcing → DMD) transfers editing capabilities from a powerful bidirectional Diffusion Transformer (DiT) to an efficient 4-step causal DiT, resolving attention distribution shift.
- AR-Oriented Mask Cache: An inference-time caching mechanism dynamically reuses self-attention features for static background tokens based on an L₂ distance mask, reducing per-frame latency to 79 ms without degrading visual quality.
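- State-of-the-Art Results: LiveEdit achieves the best Text Alignment (0.270) and Background Consistency (0.956) among all compared methods, outperforming offline bidirectional models in instruction adherence. Its no-cache variant reaches the highest Dynamic Degree (0.282); with the cache enabled, Dynamic Degree is 0.256, slightly below StreamDiffusionV2 (0.264).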
- Dedicated Benchmark: A new streaming video editing benchmark of 120 video-text pairs is established for fair evaluation, along with user studies confirming the method’s superiority in fidelity and temporal coherence.
Introduction and Theoretical Foundation
Streaming video editing, which processes video chunk by chunk in real time, is essential for augmented reality and live applications but faces two core bottlenecks:
- Attention distribution shift: Offline bidirectional diffusion models rely on future frames for temporal consistency. Directly truncating future keys and values for causal execution causes attention weights to spread uniformly over history (Fig. 3), leading to flickering and “forgetting.”
- Spatial-temporal token redundancy: In streaming settings, unedited background regions remain static or undergo predictable motion. Standard diffusion pipelines blindly compute dense feed-forward network (FFN) and attention modules for every token, causing prohibitive latency and disrupting background stability.
Existing streaming generation methods (e.g., StreamDiffusion, StreamDiffusionV2) are designed for free-form synthesis, not for strict region-preserving editing. Similarly, offline editing models (InsV2V, LucyEdit) cannot operate causally and incur high latency.
LiveEdit addresses both issues by (a) progressively aligning a causal DiT with a bidirectional teacher via three-stage distillation, and (b) introducing a mask cache that reuses self-attention features for static regions, achieving both high fidelity and real-time speed.
Methodology
Three-Stage Distillation Pipeline
The pipeline takes an input video latent sequence and a text embedding, and transfers editing knowledge from a bidirectional DiT to a 4-step causal DiT.
Stage 1 – Foundation Tuning (Editing Ability Acquisition):
A bidirectional DiT is trained on 20K video-text pairs, with the source latent and the noisy latent concatenated channel-wise as input. It learns complex editing mappings through full spatial-temporal attention under an MSE loss.
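The Stage 1 objective can be sketched as a standard denoising regression. All symbols here are assumptions introduced for illustration, since the summary does not preserve the paper's notation: $z_t$ is the noisy latent at timestep $t$, $z_{\mathrm{src}}$ the source latent, $[\cdot\,;\cdot]$ channel-wise concatenation, $c$ the text embedding, $f_\theta$ the bidirectional DiT, and $y_t$ the regression target (noise or velocity; the summary does not say which).

```latex
\mathcal{L}_{\mathrm{MSE}}
  = \mathbb{E}_{z_0,\, z_{\mathrm{src}},\, c,\, t}
    \left[
      \left\lVert
        f_\theta\!\left([\, z_t \,;\, z_{\mathrm{src}} \,],\; c,\; t\right) - y_t
      \right\rVert_2^2
    \right]
```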
Stage 2 – Teacher Forcing for Chunk-Wise Causal Initialization:
The architecture is converted to a causal DiT by introducing chunk-wise causal attention masks. The model is fine-tuned with teacher forcing to align its output distribution with the Stage 1 bidirectional prior, preventing structural collapse.
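A chunk-wise causal mask can be illustrated with a minimal sketch. It assumes the common convention that frames attend bidirectionally within their own chunk and causally to all earlier chunks; the function name, the frame-level granularity (real masks act on tokens), and the chunk size are hypothetical and not taken from the paper.

```python
def chunkwise_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Assumed convention: full attention inside a chunk, causal attention to
    earlier chunks, and no attention to later (future) chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Chunk index of every frame, e.g. [0, 0, 1, 1, 2, 2] for chunk_size=2.
    chunk_of = [frame // chunk_size for frame in range(num_frames)]
    return [
        [chunk_of[k] <= chunk_of[q] for k in range(num_frames)]
        for q in range(num_frames)
    ]


# Print the mask for 6 frames in chunks of 2 ("x" = allowed, "." = masked).
for row in chunkwise_causal_mask(num_frames=6, chunk_size=2):
    print("".join("x" if allowed else "." for allowed in row))
```

The block-lower-triangular pattern is what removes access to future keys and values, which is the setting in which the attention distribution shift arises.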
Stage 3 – DMD for Streaming Video Editing:
A 4-step generator is initialized from Stage 2 weights (bypassing expensive ODE initialization). Distribution Matching Distillation (DMD) compresses inference using a frozen Real Score network and a trainable Fake Score network. The DMD gradient is driven by the gap between the two scores, scaled by a timestep-dependent weighting. The final generator produces edits in 4 steps without classifier-free guidance.
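For reference, the DMD gradient in its commonly published form can be written as below. This is the generic formulation, not necessarily the paper's exact expression, and the symbols are assumptions: $G_\theta$ is the 4-step generator, $\hat{z}_t$ its output diffused to timestep $t$, $s_{\mathrm{real}}$ and $s_{\mathrm{fake}}$ the frozen Real Score and trainable Fake Score, and $w_t$ the timestep-dependent weighting (with any noise-schedule factor absorbed into it).

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DMD}}
  \approx \mathbb{E}_{t,\, \hat{z}_t}
    \left[
      w_t \left( s_{\mathrm{fake}}(\hat{z}_t, t) - s_{\mathrm{real}}(\hat{z}_t, t) \right)
      \frac{\partial G_\theta}{\partial \theta}
    \right]
```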
AR-Oriented Mask Cache
During streaming inference, an editing mask for each incoming chunk is derived from the L₂ distance between the previous chunk's source latent and edited latent. The threshold is set dynamically so that 70% of tokens are pruned. Tokens in unedited regions bypass self-attention computation and reuse cached features from the previous chunk; only active editing tokens undergo full forward passes. This mechanism is applied exclusively to Self-Attention layers, as FFN caching leads to quality degradation (Table 3, Fig. 8).
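The masking and reuse logic can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: tokens are lists of floats rather than tensors, the threshold is taken as the per-chunk quantile that prunes the requested ratio, and `editing_mask`, `cached_self_attention`, and the `attend` callable are hypothetical names.

```python
import math


def editing_mask(src_tokens, edit_tokens, prune_ratio=0.7):
    """Return one flag per token: True = active (edited), False = static.

    The `prune_ratio` fraction of tokens with the smallest L2 distance
    between source and edited latents is marked static.
    """
    dists = [math.dist(s, e) for s, e in zip(src_tokens, edit_tokens)]
    num_pruned = round(len(dists) * prune_ratio)
    # Token indices ordered from least to most changed.
    order = sorted(range(len(dists)), key=dists.__getitem__)
    active = [True] * len(dists)
    for idx in order[:num_pruned]:
        active[idx] = False
    return active


def cached_self_attention(tokens, active, cache, attend):
    """Recompute features for active tokens; reuse cached ones elsewhere.

    `attend(i, tokens)` stands in for the self-attention output of token i.
    """
    out = list(cache)  # copy so the previous chunk's cache is not mutated
    for i, is_active in enumerate(active):
        if is_active:
            out[i] = attend(i, tokens)
    return out


# Toy chunk: 10 two-dimensional tokens, the last 3 changed by the edit.
src = [[0.0, 0.0]] * 10
edit = [[0.0, 0.0]] * 7 + [[3.0, 4.0]] * 3
active = editing_mask(src, edit)
features = cached_self_attention(edit, active, ["cached"] * 10,
                                 lambda i, toks: "recomputed")
```

In this toy case only the three changed tokens are recomputed, mirroring the 70% pruning ratio described above.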
Empirical Validation / Results
Quantitative Comparison (Table 1)
Metrics: TA = Text Alignment, BC = Background Consistency, MS = Motion Smoothness, DD = Dynamic Degree, AQ = Aesthetic Quality, IQ = Imaging Quality. Higher is better for all.
| Method | TA ↑ | BC ↑ | MS ↑ | DD ↑ | AQ ↑ | IQ ↑ |
|---|---|---|---|---|---|---|
| LucyEdit | 0.253 | 0.943 | 0.990 | 0.266 | 0.529 | 0.707 |
| VideoCoF | 0.245 | 0.953 | 0.991 | 0.094 | 0.542 | 0.709 |
| InsV2V | 0.259 | 0.943 | 0.986 | 0.196 | 0.577 | 0.708 |
| StreamDiffusion | 0.239 | 0.886 | 0.975 | 0.239 | 0.590 | 0.717 |
| StreamDiffusionV2 | 0.252 | 0.951 | 0.992 | 0.264 | 0.539 | 0.653 |
| StreamV2V | 0.244 | 0.934 | 0.989 | 0.153 | 0.548 | 0.712 |
| Ours (W/o Cache) | 0.265 | 0.956 | 0.991 | 0.282 | 0.584 | 0.720 |
| Ours (W/ Cache) | 0.270 | 0.956 | 0.992 | 0.256 | 0.581 | 0.708 |
LiveEdit achieves the best Text Alignment (0.270) and Background Consistency (0.956) while maintaining high Motion Smoothness and Dynamic Degree. Enabling the cache slightly improves TA (0.265 to 0.270) and MS (0.991 to 0.992) and preserves BC, at the cost of lower DD (0.282 to 0.256) and IQ (0.720 to 0.708).
Ablation Study
Three-stage distillation (Table 2):
- Stage 1 (Bidirectional, 100 NFEs, CFG): TA 0.268, IQ 0.716.
- Stage 2 (Causal, 100 NFEs, CFG): TA 0.264, IQ 0.702.
- Stage 3 (DMD, 4 NFEs, no CFG): TA 0.265, IQ 0.720, with latency reduced to 7.89 s per 81 frames.
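The latency figures quoted in this summary can be cross-checked with simple arithmetic. The sketch below assumes that the 7.89 s figure refers to Stage 3 without the mask cache (its TA and IQ match the no-cache row) and that 12.66 FPS refers to the cached variant; the helper names are hypothetical.

```python
FRAMES_PER_CLIP = 81


def fps_from_clip_latency(seconds_per_clip, frames=FRAMES_PER_CLIP):
    """Throughput in frames per second for a clip processed in the given time."""
    return frames / seconds_per_clip


def per_frame_ms_from_fps(fps):
    """Average per-frame latency in milliseconds at the given throughput."""
    return 1000.0 / fps


no_cache_fps = fps_from_clip_latency(7.89)   # roughly 10.27 FPS
cached_ms = per_frame_ms_from_fps(12.66)     # roughly 79 ms per frame
cached_clip_s = FRAMES_PER_CLIP / 12.66      # roughly 6.40 s per 81 frames

print(f"{no_cache_fps:.2f} FPS, {cached_ms:.1f} ms, {cached_clip_s:.2f} s")
```

Under these assumptions, the cache takes an 81-frame clip from about 7.89 s to about 6.40 s, which is consistent with the reported 79 ms per frame.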
Cache placement (Table 3):
| Method | TA ↑ | BC ↑ | MS ↑ | DD ↑ | AQ ↑ | IQ ↑ |
|---|---|---|---|---|---|---|
| W/o Cache | 0.265 | 0.956 | 0.991 | 0.282 | 0.584 | 0.720 |
| Cache on SA | 0.270 | 0.956 | 0.992 | 0.256 | 0.581 | 0.708 |
| Cache on FFN | 0.236 | 0.841 | 0.982 | 0.017 | 0.440 | 0.513 |
Caching FFN features causes severe degradation (blurring, color distortion), confirming that Self-Attention layers exhibit high temporal redundancy (mean cosine similarity 0.893) while FFN does not (0.153, Fig. 8).
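The redundancy measurement behind these numbers can be sketched as a mean cosine similarity between a layer's features for corresponding tokens in consecutive chunks. This is an assumed reading of how the statistic is computed; the function names and the list-of-lists feature layout are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_temporal_similarity(prev_feats, curr_feats):
    """Average cosine similarity over tokens between two consecutive chunks."""
    sims = [cosine_similarity(p, c) for p, c in zip(prev_feats, curr_feats)]
    return sum(sims) / len(sims)


# Toy features: the first token is unchanged, the second becomes orthogonal.
prev_chunk = [[1.0, 0.0], [1.0, 0.0]]
curr_chunk = [[1.0, 0.0], [0.0, 1.0]]
print(mean_temporal_similarity(prev_chunk, curr_chunk))
```

A layer whose score stays close to 1 across chunks is a good caching candidate, while a low score, as reported for the FFN, suggests that reused features would be stale.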
Qualitative results (Fig. 6, 7): LiveEdit accurately modifies target regions (e.g., changing a coat from dark brown to silver white, adding gold-rimmed glasses) while preserving background lighting, shadows, and subject identity. Baselines either fail to apply edits (StreamV2V), cause color bleeding (InsV2V), or induce structural collapse (StreamDiffusion).
Theoretical and Practical Implications
- Theoretical: The three-stage distillation pipeline provides a principled method to transfer bidirectional editing priors to causal streaming settings, resolving the attention distribution shift problem. The AR-oriented Mask Cache reveals a critical functional divergence between Self-Attention (redundant) and FFN (detail-sensitive) layers, guiding efficient temporal reuse.
- Practical: With 12.66 FPS and 79 ms per-frame latency, LiveEdit is suitable for real-time interactive applications such as AR content creation, live video editing, and telepresence. The strict background preservation ensures that edited outputs remain temporally coherent, eliminating flicker and structural drift.
Conclusion
LiveEdit introduces a streaming video editing framework that achieves high-fidelity, real-time causal editing. Its three-stage distillation pipeline transfers editing ability from a bidirectional DiT to a 4-step causal DiT, while the AR-oriented Mask Cache reduces computation by reusing self-attention features in static regions. Extensive evaluations demonstrate state-of-the-art performance in text alignment, background consistency, and inference speed. Future work may extend the framework to multi-modal control and explore adaptive cache strategies for dynamic scenes.