Visual Summary | DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Summary (Overview)

Open-domain subject-driven video generation (S2V) faces a key challenge: achieving high subject fidelity in in-domain scenarios while maintaining generative flexibility for cross-domain scenarios (e.g., style transfer, fantasy-to-real, complex interactions). Existing methods prioritize in-domain fidelity, limiting cross-domain adaptability.
DomainShuttle proposes a novel architecture that decouples video and reference image processing into independent branches, enabling both high fidelity and flexible controllability across open domains.
Three core components: Domain-MoT (decoupled attention with domain-aware AdaLN), Video-Reference DualRoPE (separate RoPE spaces for precise subject-level spatial modeling), and Cross-Pair Consistent Loss (CCL) (aligning multiple reference sets to extract intrinsic subject features).
Extensive experiments show DomainShuttle achieves state-of-the-art performance, with an 18.7% improvement in Cross-Domain Score over existing methods, along with superior human preference ratings.
The approach is validated on two base models (Wan2.1-14B and Wan2.2-14B) and demonstrates strong generalization across diverse scenarios: real-to-fantasy, fantasy-to-real, multi-subject interactions, and background preservation.

Introduction and Theoretical Foundation

Subject-driven text-to-video (S2V) generation aims to synthesize videos that preserve specified subject features (identity, style, domain semantics) from reference images, guided by text prompts. The paper defines two key scenarios:

In-domain: Retain as many subject features as possible without altering attributes or style.
Cross-domain: Preserve intrinsic subject features (e.g., hairstyle, skin color, shape) while allowing subject-irrelevant properties (lighting, style, domain attributes) to flexibly adapt according to text instructions.

Existing S2V methods [1-12] primarily focus on maximizing subject fidelity in in-domain scenarios, limiting editability in cross-domain applications. The authors argue that an ideal S2V method should "freely shuttle between different domains" — achieving high fidelity and generative flexibility simultaneously.

The key challenge is the entanglement between intrinsic subject features and domain-specific attributes in reference images, which makes flexible domain transfer difficult. DomainShuttle addresses this by introducing independent information processing paths for video and reference branches, along with domain-aware modeling.

Methodology

DomainShuttle is built on a DiT-based video generation model (Wan2.1/2.2) and consists of three main modules.

3.1 Preliminaries

The model is optimized using a flow-matching loss [33]:

\mathcal{L}_{\text{FM}} = \mathbb{E}_{t, z_0, z_1} \| G_\theta(z_t, t, c_t, c_r) - (z_1 - z_0) \|_2^2

where $t \in [0,1]$, $z_0$ is a prior sample, $z_1$ is the video latent, $z_t$ is the intermediate latent at time $t$, $c_t$ and $c_r$ are the text and reference image features, and $G_\theta$ is the learnable vector field.
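The loss above can be sketched for a single sample as follows. This is a minimal illustration, not the authors' implementation: it assumes the linear interpolation path $z_t = (1 - t) z_0 + t z_1$ (common in flow matching, but not stated in this summary), and `predict_velocity`, `oracle`, and `zero_model` are hypothetical stand-ins for $G_\theta$.

```python
import random

def interpolate(z0, z1, t):
    """Assumed linear path: z_t = (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def flow_matching_loss(predict_velocity, z0, z1, t, c_t=None, c_r=None):
    """Squared L2 distance between the predicted velocity and (z1 - z0)."""
    z_t = interpolate(z0, z1, t)
    pred = predict_velocity(z_t, t, c_t, c_r)
    target = [b - a for a, b in zip(z0, z1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target))

# Toy data: a 3-dimensional "video latent" and a Gaussian prior sample.
rng = random.Random(0)
z1 = [0.5, -1.0, 2.0]
z0 = [rng.gauss(0.0, 1.0) for _ in z1]
t = rng.random()

def oracle(z_t, t, c_t, c_r):
    # Hypothetical perfect model: returns the true velocity z1 - z0.
    return [b - a for a, b in zip(z0, z1)]

def zero_model(z_t, t, c_t, c_r):
    # Hypothetical untrained model: always predicts zero velocity.
    return [0.0] * len(z_t)

loss_oracle = flow_matching_loss(oracle, z0, z1, t)
loss_zero = flow_matching_loss(zero_model, z0, z1, t)
```

In training, the expectation over $t$, $z_0$, and $z_1$ would be approximated by averaging this quantity over a mini-batch.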

3.2 Model Architecture

3.2.1 Domain-MoT (Mixture-of-Transformers)

Domain-MoT decouples video latents and reference image features into two independent processing paths, with explicit domain attribute injection via Domain-aware AdaLN. This preserves the base video model's capability while allowing flexible personalization. The in-context self-attention mechanism uses independent QKV projections and RoPE for video ($v$) and reference ($r$) tokens:

\text{Softmax}\left( \frac{[R_v(Q_v); R_r(Q_r)] \cdot [R_v(K_v); R_r(K_r)]^\top}{\sqrt{d}} \right) [V_v; V_r]

where $R_v$ and $R_r$ denote RoPE applied to video and reference branches, with independent weight matrices $W^q, W^k, W^v$ for each branch. Textual cross-attention is frozen during training to preserve text guidance.
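The joint attention over concatenated video and reference tokens can be sketched in plain Python. This is an illustrative toy under stated assumptions: `joint_attention`, `no_rope`, and the weight dictionaries are hypothetical names, the RoPE functions are replaced by an identity placeholder, and there is a single head with no output projection.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def no_rope(vec, pos):
    """Placeholder for R_v / R_r; a real model would rotate vec by position."""
    return vec

def joint_attention(video_tokens, ref_tokens, W_video, W_ref,
                    rope_v=no_rope, rope_r=no_rope):
    """Each branch uses its own q/k/v matrices; attention runs over all tokens."""
    queries, keys, values = [], [], []
    branches = ((video_tokens, W_video, rope_v), (ref_tokens, W_ref, rope_r))
    for tokens, W, rope in branches:
        for pos, x in enumerate(tokens):
            queries.append(rope(matvec(W["q"], x), pos))
            keys.append(rope(matvec(W["k"], x), pos))
            values.append(matvec(W["v"], x))
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy weights: the reference branch has a different value projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
W_video = {"q": identity, "k": identity, "v": identity}
W_ref = {"q": identity, "k": identity, "v": [[2.0, 0.0], [0.0, 2.0]]}
out = joint_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], W_video, W_ref)
```

The first two rows of `out` correspond to video tokens and the last row to the reference token; every token attends to both branches, while the projections that produce its query, key, and value depend on the branch it belongs to.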

Domain-aware AdaLN modulates video and reference features differently:

\begin{aligned} \hat{f}_v &= g_v(t) \odot \left[ \text{LN}(f_v) \odot (1 + \gamma_v(t)) + \beta_v(t) \right] + f_v \\ \hat{f}_r &= g_r(t, a) \odot \left[ \text{LN}(f_r) \odot (1 + \gamma_r(t, a)) + \beta_r(t, a) \right] + f_r \end{aligned}

where $t$ denotes time features, $a \in \{A_1, A_2, \dots, A_K\}$ denotes one of $K$ domain attributes (real-world human, real-world object, background, fantasy subject), and $\gamma, \beta, g$ are the scale, shift, and gate parameters. The video-branch parameters are conditioned on $t$ only, while the reference-branch parameters are conditioned on both $t$ and $a$.
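A minimal sketch of the two modulation equations, assuming per-channel parameter vectors. The functions `video_params` and `reference_params` and the `DOMAIN_OFFSETS` table are hypothetical toy stand-ins for the learned mappings from $t$ (and $a$) to gate, scale, and shift; they are not taken from the paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(f, gate, scale, shift):
    """Compute g * (LN(f) * (1 + gamma) + beta) + f elementwise."""
    normed = layer_norm(f)
    return [g * (n * (1.0 + s) + b) + x
            for g, n, s, b, x in zip(gate, normed, scale, shift, f)]

# Hypothetical per-domain offsets standing in for learned attribute embeddings.
DOMAIN_OFFSETS = {
    "real_human": 0.1,
    "real_object": 0.2,
    "background": 0.3,
    "fantasy_subject": 0.4,
}

def video_params(t, dim):
    """Toy video-branch parameters: depend on time only."""
    return {"gate": [t] * dim, "scale": [0.5 * t] * dim, "shift": [0.0] * dim}

def reference_params(t, attribute, dim):
    """Toy reference-branch parameters: depend on time and domain attribute."""
    off = DOMAIN_OFFSETS[attribute]
    return {"gate": [t + off] * dim,
            "scale": [0.5 * t + off] * dim,
            "shift": [off] * dim}

f = [1.0, 2.0, 3.0, 4.0]
f_v_hat = adaln_modulate(f, **video_params(0.5, len(f)))
f_r_human = adaln_modulate(f, **reference_params(0.5, "real_human", len(f)))
f_r_fantasy = adaln_modulate(f, **reference_params(0.5, "fantasy_subject", len(f)))
```

With a zero gate the residual path returns the input unchanged, which is why gated modulation of this form can leave a pretrained branch intact at initialization.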

3.2.2 Video-Reference DualRoPE

Standard RoPE in DiT-based models [17, 20] assigns positional indices $(i, j, k)$ to each video token (frame, height, width). Existing methods treat reference images as additional video frames, ignoring subject identity across multiple references.

Video-Reference DualRoPE assigns reference tokens to a separate RoPE space:

\begin{aligned} R_v(i, j, k) &= \theta(i+1, j, k) \\ R_r(i, j, k) &= \theta(0, j + h \times (m+1), k + w \times (n+1)) \end{aligned}

where $m \in [0, M-1]$ indexes the reference subject, $n \in [0, N-1]$ indexes the reference image, and $\theta$ is the rotation function. The temporal index for reference images is fixed to 0 (video frames start from 1). An offset of $\Delta = (0, h, w)$ separates different subjects, while $\Delta = (0, 0, w)$ keeps images of the same subject close in latent space.
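The index assignment in the two equations above can be sketched directly. The function names are hypothetical, and the sketch assumes that $h$ and $w$ are the height and width of the token grid, which this summary does not define explicitly; only the positional indices are computed, not the rotation $\theta$ itself.

```python
def video_position(i, j, k):
    """Video token at frame i, row j, column k; temporal index starts at 1."""
    return (i + 1, j, k)

def reference_position(j, k, m, n, h, w):
    """Reference token at row j, column k of image n of subject m.

    The temporal index is fixed to 0; subjects are offset along the height
    axis and images of the same subject along the width axis.
    """
    return (0, j + h * (m + 1), k + w * (n + 1))

# Toy grid: 2 frames of 2x3 tokens, 2 subjects with 2 reference images each.
h, w = 2, 3
frames, subjects, images = 2, 2, 2

video = {video_position(i, j, k)
         for i in range(frames) for j in range(h) for k in range(w)}
reference = {reference_position(j, k, m, n, h, w)
             for m in range(subjects) for n in range(images)
             for j in range(h) for k in range(w)}
```

In this toy setting every reference token receives a unique index, and no reference index collides with a video index because the temporal coordinate differs.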

3.2.3 Cross-Pair Consistent Loss (CCL)

To extract intrinsic subject features unaffected by viewpoint, occlusion, illumination, or style, CCL aligns features from two different reference sets for the same video:

\mathcal{L}_C = \| G_\theta(z_t, t, c_t, c_r) - G^*_\theta(z_t, t, c_t, c^*_r) \|_2^2

where $c_r$ and $c^*_r$ are features from two randomly sampled reference sets. $G^*_\theta$ is frozen; $G_\theta$ is trainable. This forces the model to learn shared, intrinsic subject representations.
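A toy sketch of the consistency term, assuming the frozen copy is simply evaluated without gradient updates. `make_toy_model` and its `leak` parameter are hypothetical: `leak` controls how strongly reference-specific information shifts the prediction, which is the dependence the loss penalizes.

```python
def cross_pair_consistent_loss(trainable_model, frozen_model,
                               z_t, t, c_t, c_r, c_r_star):
    """Squared L2 distance between predictions under two reference sets."""
    pred = trainable_model(z_t, t, c_t, c_r)
    # The frozen copy only provides a target; it would receive no gradients.
    target = frozen_model(z_t, t, c_t, c_r_star)
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def make_toy_model(leak):
    """Hypothetical model whose output shifts by leak * sum(reference features)."""
    def model(z_t, t, c_t, c_r):
        return [z + leak * sum(c_r) for z in z_t]
    return model

z_t = [0.0, 1.0, 2.0]
c_r = [1.0, 0.0]        # features from the first reference set
c_r_star = [0.0, 2.0]   # features from a second set for the same video

invariant = make_toy_model(leak=0.0)
sensitive = make_toy_model(leak=1.0)

loss_invariant = cross_pair_consistent_loss(
    invariant, invariant, z_t, 0.5, None, c_r, c_r_star)
loss_sensitive = cross_pair_consistent_loss(
    sensitive, sensitive, z_t, 0.5, None, c_r, c_r_star)
```

A model whose prediction does not change between the two reference sets incurs zero loss, while one that reacts to set-specific details is penalized.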

3.3 Training Data

A 750K high-quality open-domain video personalization dataset is constructed from Phantom-Data [36], OpenS2V [37], and Ditto-1M [38], covering humans, objects, fantasy subjects, and backgrounds. Cross-pair configurations include both "multiple reference set → single video" and "single reference set → multiple videos".

Empirical Validation / Results

4.1 Experimental Setup

Base models: Wan2.1-14B-T2V and Wan2.2-14B-T2V.
Two-stage training: 2K steps on 200K image data (batch size 96), then 12K steps on 750K video data (batch size 64). Cross-attention frozen in stage 2.
Test set: 110 in-domain + 110 cross-domain samples.
Metrics: AES, MS (video quality); GMEScore (text controllability); DINO-I, CLIP-I (in-domain subject consistency); NANO-CLIP, Qwen-CLIP, CD-Score, Qwen-Score (cross-domain consistency).
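As a rough check on the training schedule above, the number of samples seen per stage can be compared with the dataset sizes. This sketch assumes the reported batch sizes are global and that each step consumes one batch; `epochs_seen` is a hypothetical helper.

```python
def epochs_seen(steps, batch_size, dataset_size):
    """Approximate passes over the dataset for a given step budget."""
    return steps * batch_size / dataset_size

stage1_samples = 2_000 * 96    # image stage
stage2_samples = 12_000 * 64   # video stage

stage1_epochs = epochs_seen(2_000, 96, 200_000)
stage2_epochs = epochs_seen(12_000, 64, 750_000)
```

Under these assumptions both stages amount to roughly one pass over their respective datasets.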

4.2 Main Results

Quantitative results (Table 1):

Method	AES	MS	GMEScore	NANO-CLIP	Qwen-CLIP	CD-Score	Qwen-Score	DINO-I	CLIP-I
Kling 1.6	0.515	0.965	0.596	0.621	0.640	0.725	0.771	0.401	0.672
VACE-Wan2.1	0.517	0.985	0.671	0.622	0.644	0.538	0.769	0.326	0.695
Phantom	0.515	0.972	0.660	0.602	0.645	0.506	0.703	0.322	0.701
Ours (Wan2.1)	0.510	0.977	0.689	0.627	0.647	0.787	0.781	0.405	0.703
Ours (Wan2.2)	0.516	0.987	0.705	0.636	0.658	0.861	0.829	0.400	0.690

DomainShuttle achieves the best text controllability (GMEScore) and cross-domain subject consistency. The Wan2.2 variant reaches a CD-Score of 0.861, an 18.7% relative improvement over the strongest baseline (Kling 1.6, 0.725).
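The relative CD-Score improvement can be recomputed from the Table 1 values. The sketch assumes the comparison is made against the baseline with the highest CD-Score, which the table suggests but this summary does not state outright.

```python
# CD-Score values copied from Table 1.
baselines = {"Kling 1.6": 0.725, "VACE-Wan2.1": 0.538, "Phantom": 0.506}
ours = {"Ours (Wan2.1)": 0.787, "Ours (Wan2.2)": 0.861}

best_name = max(baselines, key=baselines.get)
best_score = baselines[best_name]

# Relative gain of each DomainShuttle variant over the strongest baseline.
relative_gain = {name: (score - best_score) / best_score
                 for name, score in ours.items()}

for name, gain in relative_gain.items():
    print(f"{name}: {gain:.2%} over {best_name}")
```

The Wan2.2 variant lands between 18.7% and 18.8%, consistent with the reported figure; the Wan2.1 variant improves on the same baseline by a smaller margin.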

Qualitative results (Fig. 3, Fig. 4): DomainShuttle successfully transfers real-world subjects to fantasy domains (e.g., watercolor, 3D animation) while preserving intrinsic features, maps fantasy subjects to real-world objects (e.g., character on a bus), and handles complex interactions between real and fantasy subjects.

4.3 Ablation Study

Table 2: Ablation of essential modules

ID	Setting	GMEScore	NANO-CLIP	CD-Score	DINO-I	CLIP-I
0	Naive Method	0.664	0.601	0.697	0.356	0.675
1	+ Dual Self-Attn	0.671	0.609	0.715	0.367	0.683
2	+ Domain-MoT	0.687	0.627	0.783	0.396	0.697
3	+ VR-DualRoPE	0.691	0.629	0.813	0.394	0.688
4	+ CCL (Full)	0.705	0.636	0.861	0.400	0.690

Domain-MoT (ID-2 vs ID-1) provides the largest improvement, increasing CD-Score from 0.715 to 0.783.
VR-DualRoPE improves cross-domain subject consistency (CD-Score from 0.783 to 0.813) by correctly separating subject spatial relationships.
CCL further boosts CD-Score by 5.9% (0.813 → 0.861) by learning intrinsic features, with minimal impact on in-domain fidelity (DINO +1.5%, CLIP +0.3%).
The subject-decoupled offset strategy for VR-DualRoPE outperforms reference-decoupled offset (Fig. 5c).
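The per-module CD-Score contributions discussed above can be recomputed from Table 2. This is plain arithmetic on the reported numbers, assuming each row adds one module on top of the previous row.

```python
# (setting, CD-Score) pairs copied from Table 2, in cumulative order.
ablation = [
    ("Naive Method", 0.697),
    ("+ Dual Self-Attn", 0.715),
    ("+ Domain-MoT", 0.783),
    ("+ VR-DualRoPE", 0.813),
    ("+ CCL (Full)", 0.861),
]

# Absolute and relative gain contributed by each added module.
steps = []
for (_, prev), (name, cur) in zip(ablation, ablation[1:]):
    steps.append((name, cur - prev, (cur - prev) / prev))

largest_step = max(steps, key=lambda s: s[1])[0]
ccl_relative_gain = steps[-1][2]

for name, absolute, relative in steps:
    print(f"{name}: +{absolute:.3f} ({relative:.1%})")
```

The largest single step is the addition of Domain-MoT, and the CCL step corresponds to the 5.9% relative gain quoted in the text.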

4.4 Human Preference Evaluation

40 volunteers ranked videos across methods. DomainShuttle significantly outperforms baselines in video quality, text controllability, and open-domain subject consistency (Fig. 6).

Theoretical and Practical Implications

Theoretical contribution: The decoupled architecture (Domain-MoT) with domain-aware AdaLN provides a principled way to disentangle intrinsic subject features from subject-irrelevant ones in subject-driven generation, a problem that methods treating reference images as additional video frames leave largely unaddressed.
Video-Reference DualRoPE offers a novel spatial modeling strategy that explicitly separates subjects in the latent space while binding same-subject references — addressing a fundamental limitation of treating references as video frames.
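Cross-Pair Consistent Loss is a simple yet effective regularization that pushes the model toward invariant subject representations by matching its predictions across two reference sets for the same video. It is loosely analogous to contrastive learning, but uses only matched pairs and no negative examples.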
Practical impact: DomainShuttle enables creative applications such as AI filmmaking, advertising, and interactive content creation where subjects need to be flexibly transferred across domains while preserving identity. The 18.7% CD-Score improvement demonstrates significant practical value.
The method is demonstrated on two base models (Wan2.1 and Wan2.2), and its components are modular, suggesting potential for integration into future base models.

Conclusion

DomainShuttle introduces a novel architecture for open-domain subject-driven text-to-video generation that achieves both high fidelity and generative flexibility. Three key components — Domain-MoT (with domain-aware AdaLN), Video-Reference DualRoPE, and Cross-Pair Consistent Loss — work together to decouple intrinsic subject features from domain-specific attributes, enabling precise subject preservation while allowing flexible adaptation to text instructions. Extensive experiments show significant improvements over state-of-the-art methods, particularly in cross-domain scenarios (18.7% CD-Score improvement). The approach generalizes well across humans, objects, fantasy subjects, and backgrounds, demonstrating strong potential for creative applications. Future work could explore extending the framework to longer videos, more domains, and interactive user control.