# OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

> OmniDirector clones multi-shot camera motion from reference videos without cross-paired data, achieving top accuracy with a visual camera grid representation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.13432)
- **Published:** 2026-06-16
- **Permalink:** https://picx.dev/p/OtVz9E
- **Whiteboard:** https://picx.dev/p/OtVz9E/image

## Summary (Overview)
- **OmniDirector** proposes a unified framework for **multi-shot camera cloning** from reference videos without requiring cross-paired training data (videos with identical camera motion but different content).
- A novel **camera grid** representation encodes camera motion as a visual grid video rendered from an empty 3D scene, enabling decoupling of camera signals from appearance and scalability to large-scale datasets.
- A **million-scale camera grid–video dataset** (1.8M videos) is constructed automatically from internet videos, enabling robust training of a Multi-Modal Diffusion Transformer (MMDiT).
- A **hierarchical Prompt Expansion (PE) Agent** at inference harmoniously integrates camera motion, subject, and action signals into a unified textual prompt for precise multi-modal control.
- Extensive experiments show superior performance over baselines (CamCloneMaster, Seedance2.0, LTX-LoRA) in camera accuracy (e.g., **T-Pre 72.74%** vs. next best 52.21%), transition accuracy, and leakage reduction.

## Introduction and Theoretical Foundation
Camera control is crucial for video generation but existing methods face trade-offs between precision (explicit camera parameters) and accessibility (textual descriptions). **Video-referenced** camera cloning offers a balance, but current approaches either:
- Rely on explicit parameter extraction that suffers from **scale ambiguities** across different scenes.
- Train on cross-paired data (pairs of videos with identical camera motion but different content), which is **extremely scarce** in real-world scenarios and leads to limited generalization, especially for multi-shot videos.

To address these issues, the paper introduces the **camera grid** – a visual representation that encodes camera motion as a video of an empty 3D room with grid lines. This representation is:
- **General**: Handles both single-shot and multi-shot videos uniformly.
- **Decoupled**: The empty scene prevents information leakage (e.g., appearance, object motion) from the reference video.
- **Scalable & Compatible**: Any video can be automatically paired with its camera grid, enabling large-scale data construction. The grid is a spatiotemporal signal like video, making it easy for diffusion models to process.

## Methodology
### Camera Grid Representation
Given camera extrinsics \(\mathbf{P} = \{ \mathbf{R}_i, \mathbf{t}_i \}_{i=1}^T\) from a reference video:
1. **Spatial scene modeling**: Define floor and ceiling at heights
   \[
   y_{\text{floor}} = \bar{y} - \Delta h, \quad y_{\text{ceiling}} = \bar{y} + \Delta h \tag{1,2}
   \]
   where \(\bar{y}\) is the average camera height and \(\Delta h\) is proportional to the median inter-pose distance.
2. **Tunnel wall**: Vertical lines connect floor and ceiling within an annular region
   \[
   W = \{ (x,z) \mid r < d_{\text{traj}}(x,z) < r + \delta \} \tag{3}
   \]
   where \(d_{\text{traj}}\) is the distance to the projected camera trajectory, \(r\) is the inner radius, and \(\delta\) is the wall thickness.
3. **Rendering**: For each frame, project grid endpoints via camera extrinsic \([\mathbf{R}_i|\mathbf{t}_i]\) and intrinsic matrices.
4. **Special effects**:
   - **Fisheye**: Use Kannala–Brandt model with distortion angle
     \[
     \theta = \arctan(r'/\zeta), \quad \theta_d = \theta(1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8) \tag{4,5}
     \]
   - **Dolly zoom**: Focal length \(\phi\) scales with the distance to the subject \(\rho\), keeping the subject size constant:
     \[
     \phi \propto \rho \tag{6}
     \]
5. **Multi-shot handling**: Detect transitions (TransNet-V2), render each sub-clip separately, and insert special white frames at transition points.
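
The geometric steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the proportionality constant `k_height`, the nearest-pose approximation of \(d_{\text{traj}}\), and the reading of \(r'\) as radial offset and \(\zeta\) as depth are all hypothetical choices that the summary does not confirm.

```python
import math
from statistics import mean, median

def scene_heights(positions, k_height=2.0):
    """Eqs. (1)-(2): floor and ceiling placed symmetrically around the mean camera height.
    k_height is an assumed proportionality constant for delta_h."""
    y_bar = mean(p[1] for p in positions)
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    delta_h = k_height * median(steps)  # delta_h proportional to the median inter-pose distance
    return y_bar - delta_h, y_bar + delta_h

def dist_to_trajectory(x, z, positions):
    """Distance from (x, z) to the camera path projected onto the ground plane.
    Simplification: nearest sampled pose, not nearest point on the polyline."""
    return min(math.hypot(x - p[0], z - p[2]) for p in positions)

def in_wall(x, z, positions, r, delta):
    """Eq. (3): (x, z) belongs to the tunnel wall if it lies in the annulus."""
    return r < dist_to_trajectory(x, z, positions) < r + delta

def kannala_brandt_angle(r_prime, zeta, k=(0.0, 0.0, 0.0, 0.0)):
    """Eqs. (4)-(5): distorted angle theta_d from the incidence angle theta."""
    theta = math.atan(r_prime / zeta)
    t2 = theta * theta
    return theta * (1 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)

def dolly_zoom_focal(phi0, rho0, rho):
    """Eq. (6): focal length scaled linearly with subject distance."""
    return phi0 * rho / rho0

# Toy trajectory: camera moving along +x at height 1.
poses = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
y_floor, y_ceiling = scene_heights(poses)
```

With the dolly-zoom rule, a subject of fixed physical size keeps a constant projected size because the pinhole projection scales as \(\phi / \rho\).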

### OmniDirector Architecture
- **Token concatenation**: Encode camera grid \(G\) and reference image \(I\) via 3D-VAE into latents \(\mathbf{z}_c, \mathbf{z}_I\). Concatenate with noisy video latent \(\mathbf{z}_v\) along frame dimension:
  \[
  \mathbf{z}_{\text{vis}} = \text{Concat}(\mathbf{z}_I, \mathbf{z}_v, \mathbf{z}_c) \in \mathbb{R}^{(2T+1) \times H \times W \times C} \tag{7}
  \]
  Then patchify to tokens \(\mathbf{Z}_{\text{vis}} \in \mathbb{R}^{N_{\text{vis}} \times D}\).
- **MMDiT blocks**: Joint attention between visual and text tokens across separate streams:
  \[
  \mathbf{Z}^{(l+1)}_{\text{vis}} = \text{FFN}\big(\text{Attention}(\text{LN}(\mathbf{Z}^{(l)}_{\text{vis}}), \mathbf{Z}^{(l)}_t)\big) + \mathbf{Z}^{(l)}_{\text{vis}} \tag{8}
  \]
- **Training**: 30% of samples use self-reconstruction (the model reconstructs the camera grid itself) to enforce geometric understanding; the remaining 70% use standard camera-conditioned video generation.
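
The shape bookkeeping of Eq. (7) and the 30/70 training split can be checked with a small sketch. It assumes a spatial-only patchify with a hypothetical patch size `patch` and treats \(T\), \(H\), \(W\) as latent dimensions. The summary does not state the patch size or how the split is sampled, so both are illustrative.

```python
import random

def visual_token_count(T, H, W, patch=2):
    """Number of visual tokens after frame-wise concatenation and patchify.
    Frames: 1 (reference image) + T (noisy video) + T (camera grid) = 2T + 1."""
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    frames = 2 * T + 1
    return frames * (H // patch) * (W // patch)

def sample_training_mode(rng, p_recon=0.3):
    """Pick the training objective for one sample: 30% self-reconstruction,
    otherwise camera-conditioned video generation."""
    return "self_reconstruction" if rng.random() < p_recon else "camera_conditioned"

n_vis = visual_token_count(T=4, H=8, W=8, patch=2)  # 9 frames * 4 * 4 patches

rng = random.Random(0)
modes = [sample_training_mode(rng) for _ in range(10_000)]
recon_share = modes.count("self_reconstruction") / len(modes)
```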

### Hierarchical Prompt Expansion Agent (Inference)
1. **Camera prompt generation**: A fine-tuned Qwen3-VL generates camera motion descriptions \(\mathcal{T}_c\) directly from camera parameters (poses + transitions). It decomposes into:
   - **Inter-shot**: Relationship between adjacent shots (e.g., transition style).
   - **Intra-shot**: Per-shot motion description via pose analysis (translation/rotation axes, speed, arc shot detection using rules like Eqs. 9–10).
2. **Semantic fusion**: The agent integrates camera prompt \(\mathcal{T}_c\), reference image \(I\), and user prompt \(\mathcal{T}_u\) into a final cohesive text \(\mathcal{T}_f\) via Qwen3-VL.
3. **Adaptive CFG**: Use a black background as the unconditional visual input and "completely static camera" as the negative text. Apply coarse-to-fine denoising: camera grid features dominate the early (high-noise) stages to set the global structure; other signals refine the result later.
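
One way to picture the coarse-to-fine idea behind adaptive CFG is a two-term classifier-free guidance rule whose weights change with denoising progress. The summary does not give the paper's actual formula, so everything below is an assumption: the two-term decomposition, the linear schedules, the default weight values, and the names `guidance_weights` and `guided_prediction` are hypothetical.

```python
def guidance_weights(progress, w_cam=(6.0, 2.0), w_txt=(1.0, 5.0)):
    """Linearly interpolate guidance weights over denoising progress.
    progress = 0.0 is the first (high-noise) step, 1.0 the last step.
    Camera guidance starts strong and decays; text guidance does the opposite.
    All numeric values are illustrative assumptions."""
    cam = w_cam[0] + (w_cam[1] - w_cam[0]) * progress
    txt = w_txt[0] + (w_txt[1] - w_txt[0]) * progress
    return cam, txt

def guided_prediction(eps_uncond, eps_cam, eps_full, progress):
    """Two-term CFG on flat lists of floats.
    eps_uncond: prediction with black visual input and the negative text.
    eps_cam:    prediction with the camera grid only.
    eps_full:   prediction with camera grid and the fused text prompt."""
    cam, txt = guidance_weights(progress)
    return [
        u + cam * (c - u) + txt * (f - c)
        for u, c, f in zip(eps_uncond, eps_cam, eps_full)
    ]

early = guided_prediction([0.0], [1.0], [3.0], progress=0.0)
late = guided_prediction([0.0], [1.0], [3.0], progress=1.0)
```

In this toy setup the camera term carries most of the weight at the first step and the text term carries most of it at the last step, which mirrors the ordering described above without claiming to reproduce it.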

## Empirical Validation / Results
### Evaluation Metrics
- **Camera Control**: Relative Rotation Error (RRE), Relative Translation Error (RTE), and precision thresholds: R-Pre (<4°), T-Pre (<20°).
- **Transition Accuracy**: Tem-Pre (temporal alignment <3 frames), Sem-Pre (semantic type match via Gemini 3.1 Pro).
- **Leakage Rate**: Percentage of frames/shots that leak content from the reference video.
- **GSB Pairwise Comparison**: Good/Same/Bad vs. CamCloneMaster across camera, quality, narrative.
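
The camera metrics can be sketched with commonly used definitions: rotation error as the geodesic angle of the relative rotation, and translation error as the angle between translation directions. The summary does not spell out the paper's exact formulas, so these definitions and the helper names are assumptions. The thresholds (4°, 20°, 3 frames) are the ones listed above.

```python
import math

def rotation_error_deg(R_gt, R_pred):
    """Geodesic angle of R_gt^T @ R_pred for 3x3 rotation matrices (nested lists)."""
    trace = sum(
        R_gt[k][i] * R_pred[k][i] for i in range(3) for k in range(3)
    )  # trace(R_gt^T R_pred)
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))

def translation_error_deg(t_gt, t_pred):
    """Angle between two translation directions, in degrees."""
    norm = math.hypot(*t_gt) * math.hypot(*t_pred)
    if norm == 0.0:
        return 0.0  # convention chosen here for degenerate (static) cases
    dot = sum(a * b for a, b in zip(t_gt, t_pred))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def precision(errors, threshold):
    """Percentage of errors strictly below the threshold (R-Pre: 4, T-Pre: 20)."""
    return 100.0 * sum(e < threshold for e in errors) / len(errors)

def temporal_hit(pred_frame, gt_frame, tol=3):
    """Tem-Pre criterion: predicted transition within tol frames of the reference."""
    return abs(pred_frame - gt_frame) < tol

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rre = rotation_error_deg(identity, rot_z_90)
rte = translation_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```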

### Quantitative Results
**Table 1: Quantitative Comparisons**

| Method | RRE(°)↓ | R-Pre(%)↑ | RTE(°)↓ | T-Pre(%)↑ | Tem-Pre(%)↑ | Sem-Pre(%)↑ | Frame Leakage(%)↓ | Shot Leakage(%)↓ |
|---|---|---|---|---|---|---|---|---|
| Seedance2.0 | 8.33 | 56.49 | 49.98 | 29.07 | 4.17 | – | 4.43 | 20.90 |
| CamCloneMaster | 4.11 | 74.14 | 27.45 | 52.21 | 2.20 | – | 1.60 | 11.59 |
| LTX-LoRA | 5.67 | 66.34 | 26.96 | 52.07 | 38.94 | 29.55 | 15.04 | 56.52 |
| **Ours** | **2.64** | **83.18** | **16.84** | **72.74** | **96.52** | **83.79** | **0.51** | **3.38** |

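OmniDirector outperforms all baselines on every reported metric. The largest gains are in T-Pre (72.74% vs. 52.21% for CamCloneMaster, a 39.3% relative improvement) and transition accuracy (Tem-Pre 96.52% vs. 38.94% for LTX-LoRA), and it also has the lowest frame and shot leakage.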
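
The 39.3% figure is a relative gain over the next-best T-Pre in Table 1 (CamCloneMaster, 52.21%), not an absolute difference in percentage points. A short check, with a hypothetical helper name:

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

t_pre_gain = round(relative_improvement(72.74, 52.21), 1)
print(t_pre_gain)  # 39.3
```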

**Table 2: Ablation Studies**

| Setting | RRE↓ | R-Pre↑ | RTE↓ | T-Pre↑ | Tem-Pre↑ | Sem-Pre↑ | Shot Leakage↓ |
|---|---|---|---|---|---|---|---|
| w/o Semantic Fusion | 3.85 | 78.20 | 19.90 | 67.45 | 94.40 | 78.30 | 4.10 |
| w/o Trans PE | 2.71 | 81.50 | 17.10 | 71.25 | 93.35 | 38.45 | 3.45 |
| w/o AdaCFG | 4.15 | 74.55 | 21.41 | 62.30 | 94.10 | 80.20 | 3.83 |
| **Full** | **2.64** | **83.18** | **16.84** | **72.74** | **96.52** | **83.79** | **3.38** |

All components (semantic fusion, inter-shot prompt, adaptive CFG) contribute to performance, with inter-shot prompt critical for Sem-Pre (drop from 83.79% to 38.45%).

**Table 3: GSB Comparison Ours vs. CamCloneMaster**

| Dimension | (G+S)/T (%) | G/(G+B) (%) | (G+S)/(B+S) |
|---|---|---|---|
| Camera | 88.52 | 86.29 | 3.19 |
| Quality | 95.69 | 90.82 | 1.67 |
| Narrative | 94.26 | 85.71 | 1.44 |
| Average (3:1:1) | 91.10 | 87.08 | 2.54 |

OmniDirector is preferred over CamCloneMaster across all three dimensions.
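
The three GSB ratios and the 3:1:1 weighted average row can be reproduced as follows. The vote counts in the usage example are made up for illustration, and reading \(T\) as the total number of votes (G + S + B) is an assumption. Only the table values passed to `weighted_average` come from Table 3.

```python
def gsb_ratios(good, same, bad):
    """Return ((G+S)/T in %, G/(G+B) in %, (G+S)/(B+S)) from raw vote counts."""
    total = good + same + bad
    return (
        100.0 * (good + same) / total,
        100.0 * good / (good + bad),
        (good + same) / (bad + same),
    )

def weighted_average(values, weights=(3, 1, 1)):
    """Weighted mean over (Camera, Quality, Narrative) with 3:1:1 weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical vote counts, for illustration only.
example = gsb_ratios(good=6, same=2, bad=2)

# Table 3 columns in the order Camera, Quality, Narrative.
avg_gs_t = round(weighted_average([88.52, 95.69, 94.26]), 2)
avg_g_gb = round(weighted_average([86.29, 90.82, 85.71]), 2)
avg_ratio = round(weighted_average([3.19, 1.67, 1.44]), 2)
```

The rounded results match the "Average (3:1:1)" row, which confirms that the camera dimension is weighted three times as heavily as quality and narrative.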

### Qualitative Results
- **Figure 6**: OmniDirector accurately clones multi-shot camera motions (pull back, pan, orbit) while baselines fail or leak content.
- **Figure 7**: Ablation visualizations show that adaptive CFG prevents slow camera rotation; semantic fusion ensures plausible content; inter-shot prompt prevents random scene transitions.

### Emergent Capabilities
- Without retraining, conditioning on raw reference videos or Canny edge sequences robustly drives camera motion, demonstrating **zero-shot generalization** and the ability to clone complex effects such as Hitchcock zoom and fisheye distortion.

## Theoretical and Practical Implications
- **Theoretical**: The camera grid representation bridges the modality gap between parametric camera control and visual signals, enabling diffusion models to learn camera dynamics from massive internet-scale data. The self-reconstruction training strategy forces the model to understand grid geometry rather than treat it as a weak condition.
- **Practical**: OmniDirector provides an intuitive way for users to clone camera motions from any reference video (including multi-shot) without manual parameter tuning or cross-paired data collection. The hierarchical prompt agent allows seamless integration with other controls (subject, action). This opens applications in filmmaking, advertising, and content creation where complex camera choreography is needed.
- **Leakage reduction**: Near-zero leakage (0.51% frame, 3.38% shot) ensures the generated video faithfully adopts new content while keeping only the camera motion.

## Conclusion
OmniDirector achieves **general multi-shot camera cloning without cross-paired data** by introducing:
- A **camera grid** visual representation that decouples camera motion from other signals and enables large-scale training.
- A million-scale camera grid–video dataset for training MMDiTs.
- A **hierarchical Prompt Expansion Agent** that integrates camera, subject, and action signals into a unified text prompt.

Extensive experiments demonstrate superior camera accuracy, transition precision, and low leakage. Future work includes exploring advanced temporal memory mechanisms (long-context cross-attention, memory banks) to handle significantly longer video sequences.

---

_Markdown view of https://picx.dev/p/OtVz9E, served by PicX — AI-generated visual whiteboard summaries of research papers._
