Visual Summary | OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Summary (Overview)

OmniDirector proposes a unified framework for multi-shot camera cloning from reference videos without requiring cross-paired training data (videos with identical camera motion but different content).
A novel camera grid representation encodes camera motion as a visual grid video rendered from an empty 3D scene, enabling decoupling of camera signals from appearance and scalability to large-scale datasets.
A million-scale camera grid–video dataset (1.8M videos) is constructed automatically from internet videos, enabling robust training of a Multi-Modal Diffusion Transformer (MMDiT).
At inference, a hierarchical Prompt Expansion (PE) Agent merges camera motion, subject, and action signals into a single textual prompt for precise multi-modal control.
Extensive experiments show superior performance over baselines (CamCloneMaster, Seedance2.0, LTX-LoRA) in camera accuracy (e.g., T-Pre 72.74% vs. next best 52.21%), transition accuracy, and leakage reduction.

Introduction and Theoretical Foundation

Camera control is crucial for video generation, but existing methods trade off precision (explicit camera parameters) against accessibility (textual descriptions). Video-referenced camera cloning offers a balance, but current approaches either:

Rely on explicit parameter extraction that suffers from scale ambiguities across different scenes.
Train on cross-paired data (pairs of videos with identical camera motion but different content), which is extremely scarce in real-world scenarios and leads to limited generalization, especially for multi-shot videos.

To address these issues, the paper introduces the camera grid – a visual representation that encodes camera motion as a video of an empty 3D room with grid lines. This representation is:

General: Handles both single-shot and multi-shot videos uniformly.
Decoupled: The empty scene prevents information leakage (e.g., appearance, object motion) from the reference video.
Scalable & Compatible: Any video can be automatically paired with its camera grid, enabling large-scale data construction. The grid is a spatiotemporal signal like video, making it easy for diffusion models to process.

Methodology

Camera Grid Representation

Given camera extrinsics (\mathbf{P} = \{ \mathbf{R}_i, \mathbf{t}_i \}_{i=1}^T) from a reference video:

Spatial scene modeling: Define floor and ceiling at heights [ y_{\text{floor}} = \bar{y} - \Delta h, \quad y_{\text{ceiling}} = \bar{y} + \Delta h \tag{1,2} ] where (\bar{y}) is the average camera height and (\Delta h) is proportional to the median inter-pose distance.
Tunnel wall: Vertical lines connect floor and ceiling within an annular region [ W = \{ (x,z) \mid r < d_{\text{traj}}(x,z) < r + \delta \} \tag{3} ] where (d_{\text{traj}}) is the distance to the projected camera trajectory, (r) is the inner radius, and (\delta) is the wall thickness.
Rendering: For each frame, project grid endpoints via camera extrinsic ([\mathbf{R}_i|\mathbf{t}_i]) and intrinsic matrices.
Special effects:
- Fisheye: Use Kannala–Brandt model with distortion angle [ \theta = \arctan(r'/\zeta), \quad \theta_d = \theta(1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8) \tag{4,5} ]
- Dolly zoom: Focal length (\phi) scales with the distance to the subject (\rho), keeping subject size constant: [ \phi \propto \rho \tag{6} ]
Multi-shot handling: Detect transitions (TransNet-V2), render each sub-clip separately, and insert special white frames at transition points.
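The geometric steps above (Eqs. 1 to 6) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' renderer: the function names, the proportionality constant `k`, the nearest-sample approximation of (d_{\text{traj}}), and the world-to-camera convention (x_cam = R x + t) are all assumptions made for the example.

```python
import numpy as np

def scene_heights(cam_positions, k=2.0):
    """Eqs. (1)-(2): floor and ceiling placed symmetrically around the mean camera height.

    cam_positions: (T, 3) camera centers, with y as the vertical axis.
    k: hypothetical proportionality constant for delta_h (not given in the summary).
    """
    y_bar = cam_positions[:, 1].mean()
    step_lengths = np.linalg.norm(np.diff(cam_positions, axis=0), axis=1)
    delta_h = k * np.median(step_lengths)
    return y_bar - delta_h, y_bar + delta_h

def wall_mask(points_xz, traj_xz, r, delta):
    """Eq. (3): keep points whose distance to the projected trajectory lies in (r, r + delta).

    The distance is approximated by the nearest trajectory sample (assumption).
    """
    diff = points_xz[:, None, :] - traj_xz[None, :, :]
    d_traj = np.linalg.norm(diff, axis=-1).min(axis=1)
    return (d_traj > r) & (d_traj < r + delta)

def project_points(points_world, R, t, K):
    """Pinhole projection of grid endpoints, assuming x_cam = R @ x_world + t."""
    cam = points_world @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def kb_distorted_angle(r_prime, zeta, k):
    """Eqs. (4)-(5): Kannala-Brandt distorted angle for coefficients k = (k1, k2, k3, k4)."""
    k1, k2, k3, k4 = k
    theta = np.arctan(r_prime / zeta)
    return theta * (1 + k1 * theta**2 + k2 * theta**4 + k3 * theta**6 + k4 * theta**8)

def dolly_zoom_focal(phi0, rho0, rho):
    """Eq. (6): focal length proportional to subject distance, anchored at (phi0, rho0)."""
    return phi0 * rho / rho0
```

With all four distortion coefficients set to zero, `kb_distorted_angle` reduces to the undistorted angle (\theta), which is a convenient sanity check before adding fisheye strength.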

OmniDirector Architecture

Token concatenation: Encode camera grid (G) and reference image (I) via 3D-VAE into latents (\mathbf{z}_c, \mathbf{z}_I). Concatenate with noisy video latent (\mathbf{z}_v) along frame dimension: [ \mathbf{z}_{\text{vis}} = \text{Concat}(\mathbf{z}_I, \mathbf{z}_v, \mathbf{z}_c) \in \mathbb{R}^{(2T+1) \times H \times W \times C} \tag{7} ] Then patchify to tokens (\mathbf{Z}_{\text{vis}} \in \mathbb{R}^{N_{\text{vis}} \times D}).
MMDiT blocks: Joint attention between visual and text tokens across separate streams: [ \mathbf{Z}^{(l+1)}_{\text{vis}} = \text{FFN}\big(\text{Attention}(\text{LN}(\mathbf{Z}^{(l)}_{\text{vis}}), \mathbf{Z}^{(l)}_t)\big) + \mathbf{Z}^{(l)}_{\text{vis}} \tag{8} ]
Training: 30% of samples use self-reconstruction (model reconstructs camera grid itself) to enforce geometric understanding; remaining 70% use standard camera-conditioned video generation.
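The shape bookkeeping in Eq. (7) and the 30/70 training split can be illustrated with a short NumPy sketch. The patch size `p`, the helper names, and the assumption that the reference image latent occupies a single frame (which is what the (2T+1) frame count implies) are illustrative choices; in the real model the flattened patches would also pass through a learned projection to dimension (D), which is omitted here.

```python
import numpy as np

def concat_visual_latents(z_I, z_v, z_c):
    """Eq. (7): stack reference image, noisy video, and camera grid latents along the frame axis.

    z_I: (1, H, W, C); z_v and z_c: (T, H, W, C). Result: (2T + 1, H, W, C).
    """
    assert z_v.shape == z_c.shape, "video and grid latents must share a shape"
    return np.concatenate([z_I, z_v, z_c], axis=0)

def patchify(z_vis, p=2):
    """Cut each frame into p x p spatial patches and flatten them into tokens.

    p is a hypothetical patch size; the learned projection to D is not modeled.
    """
    F, H, W, C = z_vis.shape
    x = z_vis.reshape(F, H // p, p, W // p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (F, H/p, W/p, p, p, C)
    return x.reshape(F * (H // p) * (W // p), p * p * C)

def sample_training_mode(rng, p_self_recon=0.3):
    """Pick the objective for one sample: 30% self-reconstruction, 70% camera-conditioned."""
    return "self_reconstruction" if rng.random() < p_self_recon else "camera_conditioned"

T, H, W, C = 4, 8, 8, 16
z_vis = concat_visual_latents(np.zeros((1, H, W, C)), np.zeros((T, H, W, C)), np.zeros((T, H, W, C)))
tokens = patchify(z_vis, p=2)
print(z_vis.shape, tokens.shape)
```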

Hierarchical Prompt Expansion Agent (Inference)

Camera prompt generation: A fine-tuned Qwen3-VL generates camera motion descriptions (\mathcal{T}_c) directly from camera parameters (poses + transitions). It decomposes into:
- Inter-shot: Relationship between adjacent shots (e.g., transition style).
- Intra-shot: Per-shot motion description via pose analysis (translation/rotation axes, speed, arc shot detection using rules like Eqs. 9–10).
Semantic fusion: The agent integrates camera prompt (\mathcal{T}_c), reference image (I), and user prompt (\mathcal{T}_u) into a final cohesive text (\mathcal{T}_f) via Qwen3-VL.
Adaptive CFG: Use black background as unconditional visual input and "completely static camera" as negative text. Apply coarse-to-fine denoising: camera grid features dominate early (high-noise) stages to set global structure; other signals refine later.
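One way to picture the coarse-to-fine guidance described above is a classifier-free guidance rule with two terms whose weights depend on the noise level. The summary does not give the actual formula, so everything below is an assumption for illustration: the linear schedule, the maximum weights, and the names `eps_full`, `eps_no_cam` (prediction with the black-background grid), and `eps_neg_txt` (prediction with the "completely static camera" negative text).

```python
import numpy as np

def guidance_weights(sigma, w_cam_max=6.0, w_txt_max=6.0):
    """Hypothetical linear schedule: sigma in [0, 1], where 1 means pure noise.

    Camera guidance is strongest at high noise; text guidance takes over as noise falls.
    """
    sigma = float(np.clip(sigma, 0.0, 1.0))
    return w_cam_max * sigma, w_txt_max * (1.0 - sigma)

def adaptive_cfg(eps_full, eps_no_cam, eps_neg_txt, sigma):
    """Combine three model predictions with noise-dependent guidance weights."""
    w_cam, w_txt = guidance_weights(sigma)
    return (eps_full
            + w_cam * (eps_full - eps_no_cam)     # push away from the black-grid prediction
            + w_txt * (eps_full - eps_neg_txt))   # push away from the negative-text prediction
```

Under this sketch the camera term alone shapes the earliest steps and vanishes at the end, which matches the stated intent that the grid sets global structure first and other signals refine later.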

Empirical Validation / Results

Evaluation Metrics

Camera Control: Relative Rotation Error (RRE), Relative Translation Error (RTE), and the percentage of samples under an error threshold: R-Pre (RRE < 4°) and T-Pre (RTE < 20°).
Transition Accuracy: Tem-Pre (temporal alignment <3 frames), Sem-Pre (semantic type match via Gemini 3.1 Pro).
Leakage Rate: Percentage of frames/shots that leak content from the reference video.
GSB Pairwise Comparison: Good/Same/Bad vs. CamCloneMaster across camera, quality, narrative.
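The camera-control metrics can be sketched as below, using the common definitions of rotation error (geodesic angle of the relative rotation) and translation error (angle between translation directions, which fits the degree units in Table 1). The exact evaluation protocol is not given in this summary, so treat the function names and these definitions as assumptions; inputs are taken to be relative poses already.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_rel = R_pred @ R_gt.T
    cos = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_error_deg(t_pred, t_gt, eps=1e-8):
    """Angle (degrees) between two translation directions; scale is ignored."""
    denom = max(np.linalg.norm(t_pred) * np.linalg.norm(t_gt), eps)
    cos = np.dot(t_pred, t_gt) / denom
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def precision(errors, threshold):
    """Percentage of errors strictly below the threshold (R-Pre: 4 deg, T-Pre: 20 deg)."""
    return 100.0 * float(np.mean(np.asarray(errors) < threshold))
```

Because the translation error compares directions only, it sidesteps the scale ambiguity that affects explicit parameter extraction across scenes.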

Quantitative Results

Table 1: Quantitative Comparisons

Method	RRE(°)↓	R-Pre(%)↑	RTE(°)↓	T-Pre(%)↑	Tem-Pre(%)↑	Sem-Pre(%)↑	Frame Leakage(%)↓	Shot Leakage(%)↓
Seedance2.0	8.33	56.49	49.98	29.07	4.17	–	4.43	20.90
CamCloneMaster	4.11	74.14	27.45	52.21	2.20	–	1.60	11.59
LTX-LoRA	5.67	66.34	26.96	52.07	38.94	29.55	15.04	56.52
Ours	2.64	83.18	16.84	72.74	96.52	83.79	0.51	3.38

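OmniDirector outperforms all baselines on every reported metric, most notably T-Pre (72.74% vs. 52.21% for CamCloneMaster, a 39.3% relative improvement) and transition accuracy (Tem-Pre 96.52% vs. 38.94% for LTX-LoRA), while showing the lowest leakage (0.51% frame, 3.38% shot).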
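As a check on the 39.3% figure, the relative T-Pre gain over the next best baseline in Table 1 (CamCloneMaster) works out as: [ \frac{72.74 - 52.21}{52.21} = \frac{20.53}{52.21} \approx 0.393 ]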

Table 2: Ablation Studies

Setting	RRE↓	R-Pre↑	RTE↓	T-Pre↑	Tem-Pre↑	Sem-Pre↑	Shot Leakage↓
w/o Semantic Fusion	3.85	78.20	19.90	67.45	94.40	78.30	4.10
w/o Trans PE	2.71	81.50	17.10	71.25	93.35	38.45	3.45
w/o AdaCFG	4.15	74.55	21.41	62.30	94.10	80.20	3.83
Full	2.64	83.18	16.84	72.74	96.52	83.79	3.38

Every component contributes: removing semantic fusion, the inter-shot prompt (w/o Trans PE), or adaptive CFG degrades performance. The inter-shot prompt is critical for Sem-Pre, which drops from 83.79% to 38.45% without it.

Table 3: GSB Comparison Ours vs. CamCloneMaster

Dimension	(G+S)/T (%)	G/(G+B) (%)	(G+S)/(B+S)
Camera	88.52	86.29	3.19
Quality	95.69	90.82	1.67
Narrative	94.26	85.71	1.44
Average (3:1:1)	91.10	87.08	2.54

OmniDirector is preferred over CamCloneMaster across all three dimensions.
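The "Average (3:1:1)" row in Table 3 is consistent with a weighted mean in which the camera dimension counts three times as much as quality and narrative. The short check below reproduces it from the three rows above it; the variable names are illustrative.

```python
# Table 3 rows: ((G+S)/T %, G/(G+B) %, (G+S)/(B+S))
rows = {
    "Camera": (88.52, 86.29, 3.19),
    "Quality": (95.69, 90.82, 1.67),
    "Narrative": (94.26, 85.71, 1.44),
}
weights = {"Camera": 3, "Quality": 1, "Narrative": 1}

def weighted_average(rows, weights):
    """Column-wise weighted mean across the dimensions."""
    total = sum(weights.values())
    n_cols = len(next(iter(rows.values())))
    return tuple(
        sum(weights[name] * values[i] for name, values in rows.items()) / total
        for i in range(n_cols)
    )

avg = weighted_average(rows, weights)
print([f"{v:.2f}" for v in avg])
```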

Qualitative Results

Figure 6: OmniDirector accurately clones multi-shot camera motions (pull back, pan, orbit) while baselines fail or leak content.
Figure 7: Ablation visualizations show that adaptive CFG prevents slow camera rotation; semantic fusion ensures plausible content; inter-shot prompt prevents random scene transitions.

Emergent Capabilities

Without retraining, conditioning on raw reference videos or Canny edge sequences robustly drives camera motion, demonstrating zero-shot generalization and the ability to clone complex effects such as Hitchcock zoom and fisheye distortion.

Theoretical and Practical Implications

Theoretical: The camera grid representation bridges the modality gap between parametric camera control and visual signals, enabling diffusion models to learn camera dynamics from massive internet-scale data. The self-reconstruction training strategy forces the model to understand grid geometry rather than treat it as a weak condition.
Practical: OmniDirector provides an intuitive way for users to clone camera motions from any reference video (including multi-shot) without manual parameter tuning or cross-paired data collection. The hierarchical prompt agent allows seamless integration with other controls (subject, action). This opens applications in filmmaking, advertising, and content creation where complex camera choreography is needed.
Leakage reduction: Near-zero leakage (0.51% frame, 3.38% shot) ensures the generated video faithfully adopts new content while keeping only the camera motion.

Conclusion

OmniDirector achieves general multi-shot camera cloning without cross-paired data by introducing:

A camera grid visual representation that decouples camera motion from other signals and enables large-scale training.
A million-scale camera grid–video dataset for training MMDiTs.
A hierarchical Prompt Expansion Agent that integrates camera, subject, and action signals into a unified text prompt.

Extensive experiments demonstrate superior camera accuracy, transition precision, and low leakage. Future work includes exploring advanced temporal memory mechanisms (long-context cross-attention, memory banks) to handle significantly longer video sequences.