Summary (Overview)
- OmniDirector proposes a unified framework for multi-shot camera cloning from reference videos without requiring cross-paired training data (videos with identical camera motion but different content).
- A novel camera grid representation encodes camera motion as a visual grid video rendered from an empty 3D scene, enabling decoupling of camera signals from appearance and scalability to large-scale datasets.
- A million-scale camera grid–video dataset (1.8M videos) is constructed automatically from internet videos, enabling robust training of a Multi-Modal Diffusion Transformer (MMDiT).
- At inference, a hierarchical Prompt Expansion (PE) Agent merges camera motion, subject, and action signals into a single textual prompt for precise multi-modal control.
- Extensive experiments show superior performance over baselines (CamCloneMaster, Seedance2.0, LTX-LoRA) in camera accuracy (e.g., T-Pre 72.74% vs. next best 52.21%), transition accuracy, and leakage reduction.
Introduction and Theoretical Foundation
Camera control is crucial for video generation, but existing methods trade off precision (explicit camera parameters) against accessibility (textual descriptions). Video-referenced camera cloning offers a balance, but current approaches either:
- Rely on explicit parameter extraction that suffers from scale ambiguities across different scenes.
- Train on cross-paired data (pairs of videos with identical camera motion but different content), which is extremely scarce in real-world scenarios and leads to limited generalization, especially for multi-shot videos.
To address these issues, the paper introduces the camera grid – a visual representation that encodes camera motion as a video of an empty 3D room with grid lines. This representation is:
- General: Handles both single-shot and multi-shot videos uniformly.
- Decoupled: The empty scene prevents information leakage (e.g., appearance, object motion) from the reference video.
- Scalable & Compatible: Any video can be automatically paired with its camera grid, enabling large-scale data construction. The grid is a spatiotemporal signal like video, making it easy for diffusion models to process.
Methodology
Camera Grid Representation
Given camera extrinsics (\mathbf{P} = \{ \mathbf{R}_i, \mathbf{t}_i \}_{i=1}^T) from a reference video:
- Spatial scene modeling: Define floor and ceiling at heights [ y_{\text{floor}} = \bar{y} - \Delta h, \quad y_{\text{ceiling}} = \bar{y} + \Delta h \tag{1,2} ] where (\bar{y}) is the average camera height and (\Delta h) is proportional to the median inter-pose distance.
- Tunnel wall: Vertical lines connect floor and ceiling within an annular region [ W = \{ (x,z) \mid r < d_{\text{traj}}(x,z) < r + \delta \} \tag{3} ] where (d_{\text{traj}}) is the distance to the projected camera trajectory, (r) is the inner radius, and (\delta) is the wall thickness.
- Rendering: For each frame, project grid endpoints via camera extrinsic ([\mathbf{R}_i|\mathbf{t}_i]) and intrinsic matrices.
- Special effects:
  - Fisheye: Use the Kannala–Brandt model with distortion angle [ \theta = \arctan(r'/\zeta), \quad \theta_d = \theta(1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8) \tag{4,5} ]
  - Dolly zoom: The focal length (\phi) scales with the distance to the subject (\rho), which keeps the subject size constant: [ \phi \propto \rho \tag{6} ]
- Multi-shot handling: Detect transitions (TransNet-V2), render each sub-clip separately, and insert special white frames at transition points.
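The geometric pieces of Eqs. (1)-(5) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' renderer: the function names, the `scale` constant relating (\Delta h) to the median inter-pose distance, and the nearest-sample approximation of (d_{\text{traj}}) are all hypothetical, and camera centres are assumed to be already expressed in a world frame with the y axis pointing up.

```python
import numpy as np

def floor_ceiling_heights(cam_positions, scale=1.0):
    """Eqs. (1)-(2): floor and ceiling placed symmetrically around the mean camera height.

    cam_positions: (T, 3) camera centres (x, y, z), y up (assumed convention).
    scale: hypothetical proportionality constant; the summary gives no value.
    """
    y_bar = cam_positions[:, 1].mean()
    # Distances between consecutive camera poses.
    steps = np.linalg.norm(np.diff(cam_positions, axis=0), axis=1)
    delta_h = scale * np.median(steps)
    return y_bar - delta_h, y_bar + delta_h

def wall_mask(points_xz, traj_xz, r, delta):
    """Eq. (3): True where r < d_traj(x, z) < r + delta.

    d_traj is approximated by the distance to the nearest trajectory sample
    (an assumption; a per-segment distance would be more exact).
    """
    diff = points_xz[:, None, :] - traj_xz[None, :, :]   # (N, M, 2)
    d_traj = np.linalg.norm(diff, axis=-1).min(axis=1)   # (N,)
    return (d_traj > r) & (d_traj < r + delta)

def fisheye_angle(r_prime, zeta, k):
    """Eqs. (4)-(5): Kannala-Brandt distorted angle theta_d."""
    k1, k2, k3, k4 = k
    theta = np.arctan(r_prime / zeta)
    return theta * (1 + k1 * theta**2 + k2 * theta**4
                    + k3 * theta**6 + k4 * theta**8)
```

With all four distortion coefficients set to zero, `fisheye_angle` reduces to the undistorted angle (\theta), which is a quick sanity check on the polynomial.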
OmniDirector Architecture
- Token concatenation: Encode camera grid (G) and reference image (I) via 3D-VAE into latents (\mathbf{z}_c, \mathbf{z}_I). Concatenate with the noisy video latent (\mathbf{z}_v) along the frame dimension: [ \mathbf{z}_{\text{vis}} = \text{Concat}(\mathbf{z}_I, \mathbf{z}_v, \mathbf{z}_c) \in \mathbb{R}^{(2T+1) \times H \times W \times C} \tag{7} ] Then patchify to tokens (\mathbf{Z}_{\text{vis}} \in \mathbb{R}^{N_{\text{vis}} \times D}).
- MMDiT blocks: Joint attention between visual and text tokens across separate streams: [ \mathbf{Z}^{(l+1)}_{\text{vis}} = \text{FFN}\big(\text{Attention}(\text{LN}(\mathbf{Z}^{(l)}_{\text{vis}}), \mathbf{Z}^{(l)}_t)\big) + \mathbf{Z}^{(l)}_{\text{vis}} \tag{8} ]
- Training: 30% of samples use self-reconstruction (the model reconstructs the camera grid itself) to enforce geometric understanding; the remaining 70% use standard camera-conditioned video generation.
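To make the shapes in Eq. (7) concrete, the following sketch concatenates the three latents along the frame axis and flattens them into patch tokens, and also draws the 30%/70% training mode. It is an assumption-laden illustration: the function names and the `patch` size are hypothetical, random arrays stand in for 3D-VAE latents, and the learned linear projection that would map each patch to the model width (D) is omitted, so the token width here is simply `patch * patch * C`.

```python
import numpy as np

def concat_and_patchify(z_I, z_v, z_c, patch=2):
    """Frame-wise concatenation (Eq. 7) followed by spatial patchification.

    z_I: (1, H, W, C) reference-image latent
    z_v: (T, H, W, C) noisy video latent
    z_c: (T, H, W, C) camera-grid latent
    """
    z_vis = np.concatenate([z_I, z_v, z_c], axis=0)       # (2T+1, H, W, C)
    F, H, W, C = z_vis.shape
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    h, w = H // patch, W // patch
    # Split each frame into (h x w) patches and flatten every patch.
    tokens = (z_vis.reshape(F, h, patch, w, patch, C)
                   .transpose(0, 1, 3, 2, 4, 5)
                   .reshape(F * h * w, patch * patch * C))
    return z_vis, tokens

def sample_training_mode(rng, p_self_recon=0.3):
    """Pick self-reconstruction for roughly 30% of samples, as in the summary."""
    return "self_reconstruction" if rng.random() < p_self_recon else "camera_conditioned"
```

For example, with T = 4 and 8×8 latents, `z_vis` has 9 frames and the token sequence has 9 · 4 · 4 = 144 entries.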
Hierarchical Prompt Expansion Agent (Inference)
- Camera prompt generation: A fine-tuned Qwen3-VL generates camera motion descriptions (\mathcal{T}_c) directly from camera parameters (poses + transitions). The description decomposes into:
  - Inter-shot: Relationship between adjacent shots (e.g., transition style).
  - Intra-shot: Per-shot motion description via pose analysis (translation/rotation axes, speed, arc shot detection using rules like Eqs. 9–10).
- Semantic fusion: The agent integrates camera prompt (\mathcal{T}_c), reference image (I), and user prompt (\mathcal{T}_u) into a final cohesive text (\mathcal{T}_f) via Qwen3-VL.
- Adaptive CFG: Use black background as unconditional visual input and "completely static camera" as negative text. Apply coarse-to-fine denoising: camera grid features dominate early (high-noise) stages to set global structure; other signals refine later.
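One way to read the adaptive CFG step is as standard classifier-free guidance whose weight depends on the noise level, so that the camera-conditioned prediction dominates early and relaxes later. The sketch below follows that reading only; the linear schedule, the values `w_early` and `w_late`, and the function names are hypothetical and are not taken from the paper. Here `eps_uncond` stands for the model output given the black-background visual input and the negative text "completely static camera".

```python
import numpy as np

def guidance_scale(t, w_early=7.5, w_late=3.0):
    """Hypothetical linear schedule: t=1 is the highest-noise step, t=0 the last step."""
    t = float(np.clip(t, 0.0, 1.0))
    return w_late + (w_early - w_late) * t

def cfg_combine(eps_cond, eps_uncond, t):
    """Classifier-free guidance with a time-dependent weight.

    eps_cond:   prediction with camera grid + expanded prompt
    eps_uncond: prediction with black background + negative text
    """
    w = guidance_scale(t)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

A larger weight at high noise pushes the coarse layout toward the camera grid, while the smaller weight at low noise leaves more room for the remaining conditions to refine detail.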
Empirical Validation / Results
Evaluation Metrics
- Camera Control: Relative Rotation Error (RRE), Relative Translation Error (RTE), and precision thresholds: R-Pre (<4°), T-Pre (<20°).
- Transition Accuracy: Tem-Pre (temporal alignment <3 frames), Sem-Pre (semantic type match via Gemini 3.1 Pro).
- Leakage Rate: Percentage of frames/shots that leak content from the reference video.
- GSB Pairwise Comparison: Good/Same/Bad vs. CamCloneMaster across camera, quality, narrative.
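The summary does not spell out how RRE and RTE are computed. The sketch below uses the common definitions (geodesic angle between relative rotations, and the angle between translation directions, which is consistent with RTE being reported in degrees); treat these definitions and the function names as assumptions rather than the paper's exact protocol.

```python
import numpy as np

def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    R_rel = R_gt.T @ R_pred
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def translation_error_deg(t_gt, t_pred, eps=1e-8):
    """Angle (degrees) between translation directions; scale is ignored."""
    denom = np.linalg.norm(t_gt) * np.linalg.norm(t_pred) + eps
    cos = np.clip(np.dot(t_gt, t_pred) / denom, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def precision(errors_deg, threshold_deg):
    """Percentage of samples whose error is below the threshold (R-Pre, T-Pre)."""
    return 100.0 * float(np.mean(np.asarray(errors_deg) < threshold_deg))
```

Under these definitions, R-Pre would be `precision(rre_values, 4.0)` and T-Pre would be `precision(rte_values, 20.0)`.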
Quantitative Results
Table 1: Quantitative Comparisons
| Method | RRE(°)↓ | R-Pre(%)↑ | RTE(°)↓ | T-Pre(%)↑ | Tem-Pre(%)↑ | Sem-Pre(%)↑ | Frame Leakage(%)↓ | Shot Leakage(%)↓ |
|---|---|---|---|---|---|---|---|---|
| Seedance2.0 | 8.33 | 56.49 | 49.98 | 29.07 | 4.17 | – | 4.43 | 20.90 |
| CamCloneMaster | 4.11 | 74.14 | 27.45 | 52.21 | 2.20 | – | 1.60 | 11.59 |
| LTX-LoRA | 5.67 | 66.34 | 26.96 | 52.07 | 38.94 | 29.55 | 15.04 | 56.52 |
| Ours | 2.64 | 83.18 | 16.84 | 72.74 | 96.52 | 83.79 | 0.51 | 3.38 |
OmniDirector outperforms all baselines on every reported metric, most notably in T-Pre (72.74% vs. 52.21% for CamCloneMaster, a 39.3% relative improvement) and transition accuracy (Tem-Pre 96.52%), while keeping leakage lowest (0.51% frame, 3.38% shot).
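The 39.3% figure is a relative gain, not a difference in percentage points, and it can be re-derived from the T-Pre column of Table 1. The helper name below is illustrative.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# T-Pre: Ours (72.74) vs. the next best baseline, CamCloneMaster (52.21).
t_pre_gain = relative_improvement(72.74, 52.21)
print(round(t_pre_gain, 1))  # 39.3
```

The absolute gap is 20.53 percentage points; dividing by the baseline value of 52.21 gives the relative figure.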
Table 2: Ablation Studies
| Setting | RRE↓ | R-Pre↑ | RTE↓ | T-Pre↑ | Tem-Pre↑ | Sem-Pre↑ | Shot Leakage↓ |
|---|---|---|---|---|---|---|---|
| w/o Semantic Fusion | 3.85 | 78.20 | 19.90 | 67.45 | 94.40 | 78.30 | 4.10 |
| w/o Trans PE | 2.71 | 81.50 | 17.10 | 71.25 | 93.35 | 38.45 | 3.45 |
| w/o AdaCFG | 4.15 | 74.55 | 21.41 | 62.30 | 94.10 | 80.20 | 3.83 |
| Full | 2.64 | 83.18 | 16.84 | 72.74 | 96.52 | 83.79 | 3.38 |
All three components (semantic fusion, the inter-shot prompt, and adaptive CFG) contribute to performance. The inter-shot prompt (the "w/o Trans PE" row) is critical for Sem-Pre, which drops from 83.79% to 38.45% without it.
Table 3: GSB Comparison (Ours vs. CamCloneMaster)
| Dimension | (G+S)/T (%) | G/(G+B) (%) | (G+S)/(B+S) |
|---|---|---|---|
| Camera | 88.52 | 86.29 | 3.19 |
| Quality | 95.69 | 90.82 | 1.67 |
| Narrative | 94.26 | 85.71 | 1.44 |
| Average (3:1:1) | 91.10 | 87.08 | 2.54 |
Our method is preferred across all dimensions.
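The "Average (3:1:1)" row of Table 3 is consistent with a weighted mean in which the Camera dimension counts three times as much as Quality and Narrative. The short check below reproduces the row from the three per-dimension rows; the variable and function names are illustrative.

```python
# Per-dimension values from Table 3: ((G+S)/T %, G/(G+B) %, (G+S)/(B+S)).
rows = {
    "Camera":    (88.52, 86.29, 3.19),
    "Quality":   (95.69, 90.82, 1.67),
    "Narrative": (94.26, 85.71, 1.44),
}
weights = {"Camera": 3, "Quality": 1, "Narrative": 1}

def weighted_average(rows, weights):
    """Column-wise weighted mean over the dimensions."""
    total = sum(weights.values())
    n_cols = len(next(iter(rows.values())))
    return tuple(
        sum(weights[name] * values[col] for name, values in rows.items()) / total
        for col in range(n_cols)
    )

average_row = tuple(round(v, 2) for v in weighted_average(rows, weights))
print(average_row)  # (91.1, 87.08, 2.54)
```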
Qualitative Results
- Figure 6: OmniDirector accurately clones multi-shot camera motions (pull back, pan, orbit) while baselines fail or leak content.
- Figure 7: Ablation visualizations show that adaptive CFG prevents slow camera rotation; semantic fusion ensures plausible content; inter-shot prompt prevents random scene transitions.
Emergent Capabilities
- Without retraining, conditioning on raw reference videos or Canny edge sequences still drives camera motion robustly, demonstrating zero-shot generalization and the ability to clone complex effects such as Hitchcock zoom and fisheye distortion.
Theoretical and Practical Implications
- Theoretical: The camera grid representation bridges the modality gap between parametric camera control and visual signals, enabling diffusion models to learn camera dynamics from massive internet-scale data. The self-reconstruction training strategy forces the model to understand grid geometry rather than treat it as a weak condition.
- Practical: OmniDirector provides an intuitive way for users to clone camera motions from any reference video (including multi-shot) without manual parameter tuning or cross-paired data collection. The hierarchical prompt agent allows seamless integration with other controls (subject, action). This opens applications in filmmaking, advertising, and content creation where complex camera choreography is needed.
- Leakage reduction: Low leakage (0.51% frame, 3.38% shot) indicates that the generated video adopts new content and keeps only the camera motion from the reference.
Conclusion
OmniDirector achieves general multi-shot camera cloning without cross-paired data by introducing:
- A camera grid visual representation that decouples camera motion from other signals and enables large-scale training.
- A million-scale camera grid–video dataset for training MMDiTs.
- A hierarchical Prompt Expansion Agent that integrates camera, subject, and action signals into a unified text prompt.
Extensive experiments demonstrate superior camera accuracy, transition precision, and low leakage. Future work includes exploring advanced temporal memory mechanisms (long-context cross-attention, memory banks) to handle significantly longer video sequences.