Summary (Overview)

  • OmniDirector proposes a unified framework for multi-shot camera cloning from reference videos without requiring cross-paired training data (videos with identical camera motion but different content).
  • A novel camera grid representation encodes camera motion as a visual grid video rendered from an empty 3D scene, enabling decoupling of camera signals from appearance and scalability to large-scale datasets.
  • A million-scale camera grid–video dataset (1.8M videos) is constructed automatically from internet videos, enabling robust training of a Multi-Modal Diffusion Transformer (MMDiT).
  • At inference, a hierarchical Prompt Expansion (PE) Agent merges camera motion, subject, and action signals into a single textual prompt for precise multi-modal control.
  • Extensive experiments show superior performance over baselines (CamCloneMaster, Seedance2.0, LTX-LoRA) in camera accuracy (e.g., T-Pre 72.74% vs. next best 52.21%), transition accuracy, and leakage reduction.

Introduction and Theoretical Foundation

Camera control is crucial for video generation but existing methods face trade-offs between precision (explicit camera parameters) and accessibility (textual descriptions). Video-referenced camera cloning offers a balance, but current approaches either:

  • Rely on explicit parameter extraction that suffers from scale ambiguities across different scenes.
  • Train on cross-paired data (pairs of videos with identical camera motion but different content), which is extremely scarce in real-world scenarios and leads to limited generalization, especially for multi-shot videos.

To address these issues, the paper introduces the camera grid – a visual representation that encodes camera motion as a video of an empty 3D room with grid lines. This representation is:

  • General: Handles both single-shot and multi-shot videos uniformly.
  • Decoupled: The empty scene prevents information leakage (e.g., appearance, object motion) from the reference video.
  • Scalable & Compatible: Any video can be automatically paired with its camera grid, enabling large-scale data construction. The grid is a spatiotemporal signal like video, making it easy for diffusion models to process.

Methodology

Camera Grid Representation

Given camera extrinsics (\mathbf{P} = \{ \mathbf{R}_i, \mathbf{t}_i \}_{i=1}^T) from a reference video:

  1. Spatial scene modeling: Define floor and ceiling at heights [ y_{\text{floor}} = \bar{y} - \Delta h, \quad y_{\text{ceiling}} = \bar{y} + \Delta h \tag{1,2} ] where (\bar{y}) is the average camera height and (\Delta h) is proportional to the median inter-pose distance.
  2. Tunnel wall: Vertical lines connect floor and ceiling within an annular region [ W = \{ (x,z) \mid r < d_{\text{traj}}(x,z) < r + \delta \} \tag{3} ] where (d_{\text{traj}}) is the distance to the projected camera trajectory, (r) is the inner radius, and (\delta) is the wall thickness.
  3. Rendering: For each frame, project grid endpoints via camera extrinsic ([\mathbf{R}_i|\mathbf{t}_i]) and intrinsic matrices.
  4. Special effects:
    • Fisheye: Use Kannala–Brandt model with distortion angle [ \theta = \arctan(r'/\zeta), \quad \theta_d = \theta(1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8) \tag{4,5} ]
    • Dolly zoom: Focal length (\phi) scales with the distance to the subject (\rho), [ \phi \propto \rho \tag{6} ] which keeps the subject size constant.
  5. Multi-shot handling: Detect transitions (TransNet-V2), render each sub-clip separately, and insert special white frames at transition points.
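
The geometry in steps 1, 3, and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's renderer: the function names, the `scale` constant for (\Delta h), the world-to-camera convention (x_cam = R x_world + t), and the reading of (r') as radial distance from the optical axis and (\zeta) as depth are all assumptions that the summary does not confirm.

```python
import math
from statistics import mean, median


def floor_ceiling(cam_centers, scale=1.0):
    """Eqs. (1)-(2): floor and ceiling heights around the mean camera height.

    cam_centers: list of (x, y, z) camera positions with y pointing up.
    scale: hypothetical proportionality constant for Delta h.
    """
    y_bar = mean(c[1] for c in cam_centers)
    # Distances between consecutive camera positions (inter-pose distances).
    steps = [math.dist(a, b) for a, b in zip(cam_centers, cam_centers[1:])]
    delta_h = scale * median(steps)
    return y_bar - delta_h, y_bar + delta_h


def project_pinhole(p_world, R, t, fx, fy, cx, cy):
    """Step 3: project one grid endpoint with extrinsics [R|t] and pinhole intrinsics.

    Assumes R, t map world to camera coordinates and that the point has z > 0.
    """
    x, y, z = (sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3))
    return fx * x / z + cx, fy * y / z + cy


def kb_distorted_angle(r_prime, zeta, k1, k2, k3, k4):
    """Eqs. (4)-(5): Kannala-Brandt distorted angle theta_d."""
    theta = math.atan2(r_prime, zeta)  # equals arctan(r'/zeta) for zeta > 0
    t2 = theta * theta
    return theta * (1 + k1 * t2 + k2 * t2**2 + k3 * t2**3 + k4 * t2**4)


def dolly_zoom_focal(f_ref, rho_ref, rho):
    """Eq. (6): focal length proportional to subject distance rho."""
    return f_ref * rho / rho_ref
```

With all distortion coefficients set to zero, `kb_distorted_angle` reduces to the undistorted angle, which is a convenient sanity check.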

OmniDirector Architecture

  • Token concatenation: Encode camera grid (G) and reference image (I) via 3D-VAE into latents (\mathbf{z}_c, \mathbf{z}_I). Concatenate with noisy video latent (\mathbf{z}_v) along frame dimension: [ \mathbf{z}_{\text{vis}} = \text{Concat}(\mathbf{z}_I, \mathbf{z}_v, \mathbf{z}_c) \in \mathbb{R}^{(2T+1) \times H \times W \times C} \tag{7} ] Then patchify to tokens (\mathbf{Z}_{\text{vis}} \in \mathbb{R}^{N_{\text{vis}} \times D}).
  • MMDiT blocks: Joint attention between visual and text tokens across separate streams: [ \mathbf{Z}^{(l+1)}_{\text{vis}} = \text{FFN}\big(\text{Attention}(\text{LN}(\mathbf{Z}^{(l)}_{\text{vis}}), \mathbf{Z}^{(l)}_t)\big) + \mathbf{Z}^{(l)}_{\text{vis}} \tag{8} ]
  • Training: 30% of samples use self-reconstruction (the model reconstructs the camera grid itself) to enforce geometric understanding; the remaining 70% use standard camera-conditioned video generation.
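
The shape bookkeeping behind Eq. (7) and the 30/70 training mix can be illustrated as follows. This is a sketch only: the function names, the example latent sizes, and the (1, 2, 2) patch size are hypothetical and are not taken from the paper.

```python
import random


def concat_latent_shape(T, H, W, C):
    """Shape of z_vis = Concat(z_I, z_v, z_c) along the frame axis (Eq. 7)."""
    frames = 1 + T + T  # reference image + noisy video + camera grid
    return (frames, H, W, C)


def num_visual_tokens(shape, patch=(1, 2, 2)):
    """N_vis after patchifying; the patch size is an assumed example value."""
    frames, height, width, _ = shape
    pt, ph, pw = patch
    return (frames // pt) * (height // ph) * (width // pw)


def sample_training_mode(rng, p_self_recon=0.3):
    """Pick the objective for one sample: 30% self-reconstruction, 70% generation."""
    if rng.random() < p_self_recon:
        return "self_reconstruction"
    return "camera_conditioned"


shape = concat_latent_shape(T=8, H=30, W=52, C=16)
print(shape)                      # (17, 30, 52, 16)
print(num_visual_tokens(shape))   # 6630
```

The frame count 2T+1 follows directly from stacking one reference-image latent, T video latents, and T camera grid latents.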

Hierarchical Prompt Expansion Agent (Inference)

  1. Camera prompt generation: A fine-tuned Qwen3-VL generates camera motion descriptions (\mathcal{T}_c) directly from camera parameters (poses + transitions). It decomposes into:
    • Inter-shot: Relationship between adjacent shots (e.g., transition style).
    • Intra-shot: Per-shot motion description via pose analysis (translation/rotation axes, speed, arc shot detection using rules like Eqs. 9–10).
  2. Semantic fusion: The agent integrates camera prompt (\mathcal{T}_c), reference image (I), and user prompt (\mathcal{T}_u) into a final cohesive text (\mathcal{T}_f) via Qwen3-VL.
  3. Adaptive CFG: Use black background as unconditional visual input and "completely static camera" as negative text. Apply coarse-to-fine denoising: camera grid features dominate early (high-noise) stages to set global structure; other signals refine later.
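
To make the guidance step concrete, the sketch below shows the standard classifier-free guidance combination together with one possible coarse-to-fine weighting. The summary does not specify the paper's actual schedule, so the linear decay, the weights `w_early` and `w_late`, and both function names are purely hypothetical; plain lists stand in for model outputs.

```python
def cfg_combine(pred_cond, pred_uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


def camera_guidance_scale(step, num_steps, w_early=7.5, w_late=3.0):
    """Hypothetical schedule: stronger camera guidance at early, high-noise steps.

    step runs from 0 (first, noisiest step) to num_steps - 1 (last step).
    """
    frac = step / max(num_steps - 1, 1)
    return w_early + frac * (w_late - w_early)


# The unconditional branch would use the black-background visual input and the
# "completely static camera" negative text described above.
cond, uncond = [0.5, -0.2], [0.25, 0.0]
for step in (0, 9):
    scale = camera_guidance_scale(step, num_steps=10)
    print(step, scale, cfg_combine(cond, uncond, scale))
```

A scale of 1 returns the conditional prediction unchanged, so values above 1 push the sample away from the static-camera, black-background branch.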

Empirical Validation / Results

Evaluation Metrics

  • Camera Control: Relative Rotation Error (RRE), Relative Translation Error (RTE), and precision thresholds: R-Pre (<4°), T-Pre (<20°).
  • Transition Accuracy: Tem-Pre (temporal alignment <3 frames), Sem-Pre (semantic type match via Gemini 3.1 Pro).
  • Leakage Rate: Percentage of frames/shots that leak content from the reference video.
  • GSB Pairwise Comparison: Good/Same/Bad vs. CamCloneMaster across camera, quality, narrative.
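
The summary does not spell out the formulas for the camera metrics. The sketch below uses common definitions as an assumption: rotation error as the geodesic angle between two rotation matrices, translation error as the angle between translation directions (consistent with RTE being reported in degrees), and precision as the percentage of samples whose error falls below the threshold. All function names are hypothetical.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_gt^T R_pred) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_deg(t_pred, t_gt):
    """Angle between two non-zero translation vectors, in degrees (scale-invariant)."""
    dot = sum(a * b for a, b in zip(t_pred, t_gt))
    norm = math.hypot(*t_pred) * math.hypot(*t_gt)
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def precision(errors, threshold):
    """Percentage of errors strictly below the threshold (e.g. 4 or 20 degrees)."""
    return 100.0 * sum(e < threshold for e in errors) / len(errors)
```

Measuring translation error as an angle sidesteps the scale ambiguity of estimated camera trajectories, since only the direction of motion is compared.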

Quantitative Results

Table 1: Quantitative Comparisons

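Method         | RRE(°)↓ | R-Pre(%)↑ | RTE(°)↓ | T-Pre(%)↑ | Tem-Pre(%)↑ | Sem-Pre(%)↑ | Frame Leakage(%)↓ | Shot Leakage(%)↓
Seedance2.0    | 8.33    | 56.49     | 49.98   | 29.07     | 4.17        | 4.43        | 20.90*            | *
CamCloneMaster | 4.11    | 74.14     | 27.45   | 52.21     | 2.20        | 1.60        | 11.59*            | *
LTX-LoRA       | 5.67    | 66.34     | 26.96   | 52.07     | 38.94       | 29.55       | 15.04             | 56.52
Ours           | 2.64    | 83.18     | 16.84   | 72.74     | 96.52       | 83.79       | 0.51              | 3.38

* Only one leakage value is given for this row; it is unclear whether it is the frame or the shot leakage.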

OmniDirector outperforms all baselines on every metric, most notably in T-Pre (72.74% vs. 52.21% for CamCloneMaster, a 39.3% relative improvement) and transition accuracy (Tem-Pre 96.52% vs. 38.94% for LTX-LoRA), with the lowest leakage.
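
The 39.3% figure is a relative gain over the best baseline T-Pre rather than a difference in percentage points. A quick check (the helper name is hypothetical):

```python
def relative_improvement(ours, best_baseline):
    """Relative gain of `ours` over `best_baseline`, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline


# T-Pre: 72.74 (Ours) vs. 52.21 (CamCloneMaster, the best baseline).
print(round(relative_improvement(72.74, 52.21), 1))  # 39.3
```

The absolute gap is 20.53 percentage points; dividing by the baseline value of 52.21 gives the relative figure.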

Table 2: Ablation Studies

Setting             | RRE↓ | R-Pre↑ | RTE↓  | T-Pre↑ | Tem-Pre↑ | Sem-Pre↑ | Shot Leakage↓
w/o Semantic Fusion | 3.85 | 78.20  | 19.90 | 67.45  | 94.40    | 78.30    | 4.10
w/o Trans PE        | 2.71 | 81.50  | 17.10 | 71.25  | 93.35    | 38.45    | 3.45
w/o AdaCFG          | 4.15 | 74.55  | 21.41 | 62.30  | 94.10    | 80.20    | 3.83
Full                | 2.64 | 83.18  | 16.84 | 72.74  | 96.52    | 83.79    | 3.38

All components (semantic fusion, the inter-shot prompt, and adaptive CFG) contribute to performance. The inter-shot prompt (the "w/o Trans PE" row) is critical for Sem-Pre, which drops from 83.79% to 38.45% when it is removed.

Table 3: GSB Comparison Ours vs. CamCloneMaster

Dimension       | (G+S)/T (%) | G/(G+B) (%) | (G+S)/(B+S)
Camera          | 88.52       | 86.29       | 3.19
Quality         | 95.69       | 90.82       | 1.67
Narrative       | 94.26       | 85.71       | 1.44
Average (3:1:1) | 91.10       | 87.08       | 2.54

Our method is preferred across all dimensions.
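
The "Average (3:1:1)" row is consistent with a weighted mean that counts the camera dimension three times as heavily as quality and narrative. That reading is an assumption inferred from the numbers, and the helper name is hypothetical:

```python
def weighted_average(values, weights):
    """Weighted mean of `values` with the given `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


weights = (3, 1, 1)  # camera : quality : narrative
print(round(weighted_average((88.52, 95.69, 94.26), weights), 2))  # 91.1 (reported as 91.10)
print(round(weighted_average((86.29, 90.82, 85.71), weights), 2))  # 87.08
print(round(weighted_average((3.19, 1.67, 1.44), weights), 2))     # 2.54
```

All three reported averages are reproduced under this weighting.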

Qualitative Results

  • Figure 6: OmniDirector accurately clones multi-shot camera motions (pull back, pan, orbit) while baselines fail or leak content.
  • Figure 7: Ablation visualizations show that adaptive CFG prevents slow camera rotation; semantic fusion ensures plausible content; inter-shot prompt prevents random scene transitions.

Emergent Capabilities

  • Without retraining, conditioning on raw reference videos or Canny edge sequences still drives camera motion robustly, demonstrating zero-shot generalization and the ability to clone complex effects such as the Hitchcock (dolly) zoom and fisheye distortion.

Theoretical and Practical Implications

  • Theoretical: The camera grid representation bridges the modality gap between parametric camera control and visual signals, enabling diffusion models to learn camera dynamics from massive internet-scale data. The self-reconstruction training strategy forces the model to understand grid geometry rather than treat it as a weak condition.
  • Practical: OmniDirector provides an intuitive way for users to clone camera motions from any reference video (including multi-shot) without manual parameter tuning or cross-paired data collection. The hierarchical prompt agent allows seamless integration with other controls (subject, action). This opens applications in filmmaking, advertising, and content creation where complex camera choreography is needed.
  • Leakage reduction: Near-zero leakage (0.51% frame, 3.38% shot) ensures the generated video faithfully adopts new content while keeping only the camera motion.

Conclusion

OmniDirector achieves general multi-shot camera cloning without cross-paired data by introducing:

  • A camera grid visual representation that decouples camera motion from other signals and enables large-scale training.
  • A million-scale camera grid–video dataset for training MMDiTs.
  • A hierarchical Prompt Expansion Agent that integrates camera, subject, and action signals into a unified text prompt.

Extensive experiments demonstrate superior camera accuracy, transition precision, and low leakage. Future work includes exploring advanced temporal memory mechanisms (long-context cross-attention, memory banks) to handle significantly longer video sequences.
