SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Summary (Overview)

  • Proposes SAMA, a framework that factorizes instruction-guided video editing into two complementary capabilities: Semantic Anchoring (SA) for precise instruction-aware structural planning and Motion Alignment (MA) for faithful temporal dynamics preservation.
  • Introduces a two-stage training pipeline: (1) A Factorized Pre-training stage that learns inherent semantic and motion representations from unpaired image/video data, which alone yields strong zero-shot editing ability; (2) A Supervised Fine-tuning stage on paired editing data to refine performance.
  • Achieves state-of-the-art (SOTA) performance among open-source models on major benchmarks (VIE-Bench, OpenVE-Bench, ReCo-Bench) and is competitive with leading commercial systems (e.g., Kling-Omni, Runway).
  • Reduces reliance on brittle external priors (e.g., VLM features, structural conditions) by enabling the diffusion backbone to internalize robust semantic-motion representations directly.

Introduction and Theoretical Foundation

Instruction-guided video editing requires simultaneously applying fine-grained semantic changes based on a text instruction while preserving the temporally coherent motion of the source video. Current models struggle with this balance, often leading to artifacts, identity drift, or diluted edits. Existing approaches frequently inject explicit external priors to mitigate these issues, but this reliance becomes a bottleneck for robustness and generalization.

The core thesis of SAMA is that the difficulty stems from a lack of factorization between semantic structure planning and motion modeling. Semantic edits are often sparse and stable, determinable from a few anchor frames, while motion coherence follows physical dynamics that can be learned from large-scale raw videos. Based on this observation, SAMA decomposes the problem:

  • Semantic Anchoring (SA): Establishes a reliable visual anchor by predicting semantic tokens and video latents at sparse anchor frames.
  • Motion Alignment (MA): Internalizes temporal dynamics via pre-training on motion-centric video restoration tasks.

The framework is built upon a video diffusion transformer trained via flow matching. The training objective minimizes the expected flow matching loss:

\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t, x_0, x_1} \| v_\theta(x_t, t) - (x_1 - x_0) \|_2^2,

where $x_1$ is the target video, $x_0$ is the Gaussian prior, and the network $v_\theta$ learns to regress the vector field $x_1 - x_0$ from the intermediate state $x_t = t x_1 + (1-t) x_0$.
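The interpolant and regression target above can be sketched in a few lines. This is a toy NumPy illustration with hypothetical shapes and helper names (`flow_matching_loss`, `interpolate`), not the paper's implementation:

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between the predicted vector field and the target x1 - x0."""
    target = x1 - x0
    return np.mean((v_pred - target) ** 2)

def interpolate(x0, x1, t):
    """Intermediate state x_t = t * x1 + (1 - t) * x0 fed to the network."""
    return t * x1 + (1.0 - t) * x0

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))   # sample from the Gaussian prior
x1 = rng.standard_normal((4, 8))   # target video latents (toy shape)
t = 0.3
xt = interpolate(x0, x1, t)

# A perfect predictor regresses exactly x1 - x0, giving zero loss.
assert np.isclose(flow_matching_loss(x1 - x0, x0, x1), 0.0)
```

In practice `v_pred` would come from the DiT evaluated at `(xt, t)`; here the assertion just checks the loss's fixed point.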

Methodology

The SAMA framework integrates Semantic Anchoring and Motion Alignment into a two-stage training strategy.

1. Model Architecture & Tokenization:

  • Built upon the Wan2.1-T2V-14B video diffusion model.
  • Videos are encoded into VAE latents. Source ($z_s$) and noisy target ($z_t$) latent token sequences are concatenated to form an in-context V2V input: $z = [z_s; z_t]$.
  • Learned type embeddings (0 for source, 1 for semantic, 2 for target tokens) are added to disambiguate token roles, leading to faster convergence.
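A minimal sketch of the in-context input construction, assuming flat token sequences and a toy learned embedding table (the dimension `D` and the `build_in_context_input` helper are hypothetical):

```python
import numpy as np

D = 16  # latent/token dimension (toy value for this sketch)
rng = np.random.default_rng(1)

# Learned type embedding table: row 0 = source, 1 = semantic, 2 = target tokens.
type_embed = rng.standard_normal((3, D)) * 0.02

def build_in_context_input(z_src, s_sem, z_tgt):
    """Concatenate [source; semantic; target] tokens and add role embeddings."""
    tokens = np.concatenate([z_src, s_sem, z_tgt], axis=0)
    type_ids = np.concatenate([
        np.zeros(len(z_src), dtype=int),        # source tokens -> type 0
        np.ones(len(s_sem), dtype=int),         # semantic tokens -> type 1
        np.full(len(z_tgt), 2, dtype=int),      # target tokens -> type 2
    ])
    return tokens + type_embed[type_ids], type_ids

z_src = rng.standard_normal((10, D))  # source video latents z_s
s_sem = rng.standard_normal((3, D))   # projected semantic tokens
z_tgt = rng.standard_normal((10, D))  # noisy target latents z_t
z, ids = build_in_context_input(z_src, s_sem, z_tgt)
```

The role embeddings let attention layers distinguish otherwise interchangeable latent tokens, which is the stated reason for faster convergence.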

2. Semantic Anchoring (SA):

  • Goal: Provide instruction-consistent anchors on sparse frames to stabilize structural editing.
  • Process: For a video, $N$ frames are uniformly sampled as anchors. Each is encoded by a SigLIP image encoder into patch-level features, aggregated into $M$ local and 1 global semantic tokens, and projected into the latent space via an MLP to obtain $\hat{s}$.
  • Training: The projected semantic tokens $\hat{s}$ are prepended to the target latent sequence, noised, and fed into the DiT. A prediction head attached to the final DiT layer outputs predicted semantic tokens $s$.
  • Objective: Supervised by an $\ell_1$ loss combined with the flow matching loss: $\mathcal{L}_{\text{sem}} = \| \hat{s} - s \|_1$, $\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda \cdot \mathcal{L}_{\text{sem}}$, with $\lambda = 0.1$.
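The anchoring computation can be illustrated as follows. Mean pooling stands in for the paper's token aggregation and a random matrix stands in for the MLP projection; all shapes are toy values, not SAMA's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
P, C, M, D = 64, 32, 4, 16  # patches, SigLIP feature dim, local tokens, latent dim (toy)

def aggregate_semantic_tokens(patch_feats, M):
    """Pool patch features into M local tokens plus 1 global token (mean-pooling stand-in)."""
    groups = np.array_split(patch_feats, M, axis=0)
    local = np.stack([g.mean(axis=0) for g in groups])    # (M, C) local tokens
    global_tok = patch_feats.mean(axis=0, keepdims=True)  # (1, C) global token
    return np.concatenate([local, global_tok], axis=0)    # (M+1, C)

W = rng.standard_normal((C, D)) * 0.1  # stand-in for the MLP projection

patch_feats = rng.standard_normal((P, C))             # SigLIP features of one anchor frame
s_hat = aggregate_semantic_tokens(patch_feats, M) @ W  # projected semantic tokens

s_pred = s_hat + 0.01 * rng.standard_normal(s_hat.shape)  # stand-in for the DiT head output
l_sem = np.abs(s_hat - s_pred).mean()  # ell_1 supervision on semantic tokens
l_fm = 0.5                             # placeholder flow matching loss value
total = l_fm + 0.1 * l_sem             # combined objective with lambda = 0.1
```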

3. Motion Alignment (MA):

  • Goal: Align edited video motion with source dynamics.
  • Process: Applied during factorized pre-training. A motion-centric transformation $\mathcal{T}$ is applied only to the source video ($\tilde{V}_s = \mathcal{T}(V_s)$), while the target remains unchanged, forcing the model to learn motion recovery.
  • Pretext Tasks: Three restoration-style perturbations are used (see Fig. 3):
    1. Cube Inpainting: Mask a continuous temporal block.
    2. Speed Perturbation: Temporally accelerate the video.
    3. Tube Shuffle: Partition video into a 2x2x2 spatio-temporal grid and randomly permute tubes.
  • A short task token (e.g., [Complete the missing regions...]) is prepended to the editing instruction.
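The three pretext perturbations can be sketched on a toy clip. Helper names and all parameters (mask extent, speed factor, grid size) are illustrative choices for this sketch; the paper's exact settings are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)

def cube_inpaint(video, t0, t1):
    """Cube Inpainting: mask out a contiguous temporal block of frames."""
    out = video.copy()
    out[t0:t1] = 0.0
    return out

def speed_perturb(video, factor=2):
    """Speed Perturbation: accelerate the clip by keeping every `factor`-th frame."""
    return video[::factor]

def tube_shuffle(video):
    """Tube Shuffle: split into a 2x2x2 spatio-temporal grid and permute the 8 tubes."""
    T, H, W, C = video.shape
    t, h, w = T // 2, H // 2, W // 2
    tubes = [video[i*t:(i+1)*t, j*h:(j+1)*h, k*w:(k+1)*w]
             for i in range(2) for j in range(2) for k in range(2)]
    perm = rng.permutation(8)
    out = np.empty_like(video)
    for idx, p in enumerate(perm):
        i, j, k = idx // 4, (idx // 2) % 2, idx % 2
        out[i*t:(i+1)*t, j*h:(j+1)*h, k*w:(k+1)*w] = tubes[p]
    return out

video = rng.standard_normal((8, 4, 4, 3))  # toy (T, H, W, C) source clip
masked = cube_inpaint(video, 2, 5)
fast = speed_perturb(video, 2)
shuffled = tube_shuffle(video)
```

Each transform degrades only the source stream while the target stays clean, so matching the target requires recovering the original dynamics.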

4. Two-Stage Training Pipeline:

  • Stage 0: Factorized Pre-training: Trained on a mixture of instruction-based image editing data and large-scale text-to-video data (Koala-36M, MotionBench). Applies SA to all samples and MA only to video streams. This stage alone yields strong zero-shot editing capability.
  • Stage 1: Supervised Fine-tuning (SFT): Fine-tuned on paired video editing datasets (Ditto-1M, OpenVE-3M, ReCo-Data) mixed with some image data. SA remains enabled to maintain stable anchoring.
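A toy sketch of drawing training examples from each stage's data mixture. The dataset groups and approximate pair counts come from the paper's training data statistics, but size-proportional sampling is an assumption of this sketch, not the paper's stated recipe:

```python
import numpy as np

rng = np.random.default_rng(4)

# Per-stage data mixtures: dataset group -> approximate number of pairs.
STAGES = {
    "stage0_pretrain": {"image_editing": 2.5e6, "text_to_video_MA": 1.59e6},
    "stage1_sft": {"image_editing": 0.98e6, "video_editing": 4.96e6},
}

def sample_batch_sources(stage, batch_size):
    """Draw per-example data sources with probability proportional to dataset size."""
    mix = STAGES[stage]
    names = list(mix)
    p = np.array(list(mix.values()))
    p = p / p.sum()
    return list(rng.choice(names, size=batch_size, p=p))

batch = sample_batch_sources("stage1_sft", 16)
```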

Table: Training Data Statistics

| Training Stage | Dataset | # Pairs | Type |
|---|---|---|---|
| Stage 0: Factorized Pre-training | NHR-Edit, GPT-Image-Edit, X2Edit | ~2.5M | Image Editing |
| Stage 0: Factorized Pre-training | Koala-36M, MotionBench | ~1.59M | Text-to-Video (for MA) |
| Stage 1: Supervised Fine-tuning | NHR-Edit, Pico-Banana-400K | ~0.98M | Image Editing |
| Stage 1: Supervised Fine-tuning | Ditto-1M, OpenVE-3M, ReCo-Data | ~4.96M | Video Editing |

Empirical Validation / Results

SAMA was evaluated against leading open-source and commercial models on three benchmarks.

1. Quantitative Results on VIE-Bench: SAMA achieves the best overall performance among open-source models on Swap/Change and Remove tasks, and is highly competitive with top commercial systems (Kling-Omni, Runway).

Table: VIE-Bench Results (Selected Tasks)

| Method | Instruct Follow | Preservation | Quality | Avg. |
|---|---|---|---|---|
| Kling-Omni | 9.333 | 9.589 | 8.622 | 9.181 |
| Runway | 8.607 | 8.913 | 7.823 | 8.447 |
| SAMA (Ours) | 8.467 | 9.422 | 8.244 | 8.711 |
| UniVideo | 8.567 | 9.422 | 7.978 | 8.656 |
| InstructX | 8.446 | 8.683 | 7.919 | 8.349 |

| Method | Instruct Follow | Preservation | Quality | Avg. |
|---|---|---|---|---|
| SAMA (Ours) | 9.533 | 9.189 | 8.711 | 9.144 |
| Kling-Omni | 9.378 | 9.233 | 8.789 | 9.133 |
| UniVideo | 8.133 | 8.778 | 7.789 | 8.233 |
| InstructX | 8.627 | 8.668 | 7.672 | 8.322 |

2. Results on OpenVE-Bench and ReCo-Bench:

  • On OpenVE-Bench, SAMA achieves the top overall score among compared models.
  • On ReCo-Bench, SAMA attains the best or competitive scores across all editing tasks (Add, Replace, Remove, Style) and evaluation dimensions (Edit Accuracy, Video Naturalness, Video Quality).

3. Qualitative Results: Visual comparisons show SAMA follows fine-grained instructions more reliably (handling relative positions, attribute constraints) and maintains superior temporal consistency and motion preservation compared to prior methods.

4. Zero-shot Video Editing: The model after Stage 0 (Factorized Pre-training) demonstrates strong zero-shot editing capabilities across Replace, Add, Remove, and Style tasks, validating that robust editing emerges from learning disentangled semantic and motion representations.

5. Ablation Studies:

  • Semantic Anchoring (SA): Accelerates DiT convergence, stabilizes training (reduced loss variance), and improves quantitative scores. Models with SA produce higher-quality edits earlier in training.
  • Motion Alignment (MA): Improves temporal consistency under fast motion and alleviates motion blur. Qualitative results show clearer backgrounds and preserved motion dynamics with MA enabled.

Table: Ablation of SAMA Modules on VIE-Bench

| Method | Instruct Follow | Preservation | Quality | Overall |
|---|---|---|---|---|
| Baseline | 6.575 | 6.261 | 6.100 | 6.312 |
| w/ SA | 7.002 | 6.744 | 6.342 | 6.696 |
| w/ MA | 6.969 | 6.620 | 6.544 | 6.711 |
| SAMA (Full) | 7.402 | 6.998 | 6.884 | 7.095 |

Theoretical and Practical Implications

  • Theoretical: Proposes a novel, factorized perspective on instruction-guided video editing, separating the learnable problems of semantic intent planning and temporal dynamics modeling. This reduces the need for task-specific external priors.
  • Practical: Demonstrates that a two-stage training strategy—first learning general representations and then fine-tuning on task-specific data—is highly effective. The emergence of zero-shot ability after pre-training suggests a more data-efficient path to building robust video editors.
  • Impact: SAMA sets a new state-of-the-art for open-source video editing models, closing the gap with leading commercial systems. The release of code, models, and datasets will facilitate further research and application.

Conclusion

SAMA presents a successful framework for instruction-guided video editing by factorizing the problem into Semantic Anchoring and Motion Alignment. Through a two-stage training pipeline, it enables the model to internalize robust semantic and temporal representations, leading to SOTA performance. The framework's strong zero-shot capability confirms the validity of its factorized design. Future work may focus on extending SAMA to long-video editing, fast-motion scenarios, and exploring stronger semantic tokenization methods.