3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
Summary (Overview)
- Problem: Existing subject-driven video generation methods treat subjects as 2D entities, lacking the 3D spatial priors needed for consistent novel-view synthesis. Direct fine-tuning on video sequences leads to temporal overfitting.
- Solution: A novel framework comprising 3DreamBooth and 3Dapter for 3D-aware video customization.
- 3DreamBooth: A 1-frame optimization paradigm that decouples spatial geometry from temporal motion, baking a robust 3D prior into the model without exhaustive video training.
- 3Dapter: A visual conditioning module that undergoes multi-view joint optimization with the main branch via an asymmetrical conditioning strategy, acting as a dynamic selective router for view-specific geometric hints.
- Key Contribution: Achieves high-fidelity, view-consistent video generation of customized 3D subjects from a few multi-view reference images, outperforming single-reference baselines in 3D geometric fidelity and identity preservation.
- Evaluation: Introduces 3D-CustomBench, a curated benchmark for 3D-consistent video customization, and demonstrates superior performance through quantitative metrics (Chamfer Distance, CLIP-I, DINO-I, GPT-4o evaluation) and qualitative comparisons.
Introduction and Theoretical Foundation
Creating dynamic, view-consistent videos of customized subjects is crucial for applications like VR/AR, virtual production, and e-commerce. While subject-driven customization has progressed, existing methods (e.g., DreamBooth, visual adapters) are predominantly 2D-centric, binding identity through single-view features or textual prompts. This approach fails for 3D object customization because it lacks comprehensive spatial priors, forcing the model to generate plausible but arbitrary details for unseen regions instead of preserving the true 3D identity.
The core challenge is the scarcity of multi-view video datasets. Fine-tuning on limited sequences often leads to temporal overfitting. The paper posits that modern video diffusion models possess implicit 3D priors (e.g., they naturally generate videos preserving 3D geometric consistency of objects like a "dog"). The goal is to explicitly leverage this inherent capability for customization by injecting a subject's multi-view identity.
Methodology
The framework consists of two main components optimized in a two-stage pipeline.
3DreamBooth: 1-Frame Optimization for 3D Identity Injection
This component fine-tunes the generative backbone (a pre-trained video Diffusion Transformer, DiT) via LoRA to internalize a subject's 3D identity from multi-view static images.
- Key Insight: Object identity is a spatial attribute. Using a 1-frame training paradigm (input T = 1) naturally bypasses the model's temporal attention mechanism, confining gradient updates to spatial representations and preserving pre-trained temporal priors. This avoids entangling spatial identity with temporal dynamics and prevents overfitting to specific motions.
- Training Process: Given a set of static multi-view images of a subject S = { s^{(i)} }_{i=1}^{N_s}, each image is treated as a single-frame video. All views use a consistent universal text prompt p containing a unique identifier V and a class noun C (e.g., "a video of a V C"). This forces the model to internalize multi-view variations into the token V.
- Optimization Objective: Trainable LoRA weights ϕ_{3DB} are injected into the transformer blocks while the original parameters θ are kept frozen. The objective is the velocity prediction loss
  L_{3DB} = E_{i, t, ε} [ ‖ v_{θ, ϕ_{3DB}}(z^{(i)}_t, t, p) − v̄ ‖²_2 ],
  where i is a sampled view index, z^{(i)}_t is the noisy latent of view s^{(i)} at timestep t, v̄ is the target velocity vector, and p is the text prompt.
- Limitation: This text-driven approach has an information bottleneck. The single token V struggles to encode high-frequency details (intricate textures, specific text), leading to slow optimization and loss of fine-grained textures.
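The 1-frame recipe above can be sketched in a few lines. This is a toy NumPy illustration, not the paper's code: a linear "denoiser" with a frozen base weight W (standing in for θ) and trainable low-rank factors A, B (standing in for ϕ_{3DB}), trained on single-frame latents with a rectified-flow velocity target v̄ = ε − s (an assumed convention; the paper only names v̄ as the target velocity).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d)) * 0.1   # theta: frozen base weight
W_frozen = W.copy()                 # kept to verify W never changes
A = rng.normal(size=(r, d)) * 0.1   # phi_3DB: trainable LoRA factor
B = np.zeros((d, r))                # zero-init so the LoRA update starts at 0

views = [rng.normal(size=d) for _ in range(8)]  # toy multi-view latents s^(i)

def one_frame_step(lr=0.05):
    """One 1-frame optimization step: sample a view i, a timestep t, and
    noise eps, then regress the velocity target on a T = 1 'video'."""
    global A, B
    s = views[rng.integers(len(views))]
    eps = rng.normal(size=d)
    t = rng.uniform()
    z_t = (1 - t) * s + t * eps          # noisy latent z_t^{(i)}
    v_bar = eps - s                      # assumed rectified-flow target
    err = (W + B @ A) @ z_t - v_bar      # velocity-prediction residual
    loss = float(np.mean(err ** 2))
    # Gradients flow only into A and B; the base weight W stays frozen.
    gB = np.outer(err, A @ z_t) * (2 / d)
    gA = np.outer(B.T @ err, z_t) * (2 / d)
    A -= lr * gA
    B -= lr * gB
    return loss

losses = [one_frame_step() for _ in range(200)]
```

Because the input is a single frame, no temporal dimension ever enters the update, which is the mechanism the paper credits for keeping temporal priors intact.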
3Dapter: Multi-View Conditioning Module
To overcome the bottleneck, 3Dapter is introduced as a visual conditioning module that directly injects reference image features.
- Architecture: Adopts a dual-branch forward pass inspired by controllable DiTs (e.g., OminiControl). A dedicated LoRA branch ϕ_{3Dapter} processes the condition images.
- Two-Stage Training:
  - Single-view Pre-training: Trained on a large-scale dataset of reference-target image pairs {( x^{(i)}, y^{(i)}, p^{(i)} )}_{i=1}^{N_D} with the objective
    L_{pre} = E_{i, t, ε} [ ‖ v_{θ, ϕ_{3Dapter}}(z^{(i)}_t, t, x^{(i)}, p^{(i)}) − v̄ ‖²_2 ].
    The interaction between reference (x), target (y), and text (p) is modeled by concatenating their Query (Q), Key (K), and Value (V) tensors along the sequence dimension and performing joint spatio-temporal attention
    Attention([Q_x, Q_y, Q_p], [K_x, K_y, K_p], [V_x, V_y, V_p]),
    where [·,·,·] denotes concatenation and Q, K, V ∈ ℝ^{(2N_{img} + N_{txt}) × d}.
  - Multi-view Joint Optimization: For a specific subject with multi-view images S, a subset of conditioning views X = { x^{(i)} }_{i=1}^{N_c} (where X ⊂ S and |X| = N_c, typically N_c = 4) is used, and 3DreamBooth and 3Dapter are jointly optimized. All conditioning images in X are processed through a single, shared 3Dapter. For each joint attention module, the tensors of all views are concatenated; for example, the joint Query tensor is
    Q_joint = [Q_{x^{(1)}}, …, Q_{x^{(N_c)}}, Q_y, Q_p].
    Distinct temporal indices are assigned via 3D Rotary Positional Encoding (RoPE) to each conditioning view to prevent feature entanglement.
- Emergent Behavior - Dynamic Selective Router: The joint attention mechanism learns to query and extract only relevant, view-specific geometric hints from the multi-view references to reconstruct the target view, filtering out conflicting signals from irrelevant views (see Fig. 5).
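The joint-attention concatenation and the "router" reading of its weights can be sketched as follows. This is a shape-level NumPy illustration with random projections standing in for learned ones (RoPE and multi-head structure omitted); the stream sizes and names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_img, n_txt, n_c = 8, 6, 4, 4   # toy token dim, tokens per image/text, views

def qkv(n):
    """Hypothetical per-stream Q/K/V tokens (random, standing in for learned)."""
    return tuple(rng.normal(size=(n, d)) for _ in range(3))

streams = [qkv(n_img) for _ in range(n_c)]  # conditioning views x^(1..N_c)
streams.append(qkv(n_img))                  # target y
streams.append(qkv(n_txt))                  # text p

# Concatenate along the sequence dimension: Q_joint = [Q_x1, ..., Q_xNc, Q_y, Q_p]
Q = np.concatenate([s[0] for s in streams])
K = np.concatenate([s[1] for s in streams])
V = np.concatenate([s[2] for s in streams])

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)       # softmax over the joint sequence
    return w @ V, w

out, weights = attention(Q, K, V)

# "Router" view: average attention mass the target tokens place on each view.
tgt = slice(n_c * n_img, (n_c + 1) * n_img)
per_view = [weights[tgt, i * n_img:(i + 1) * n_img].sum(axis=1).mean()
            for i in range(n_c)]
```

In the trained model these per-view attention masses are what concentrate on the geometrically relevant reference, which is the selective-routing behavior visualized in Fig. 5.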
Empirical Validation / Results
Evaluation Benchmark: 3D-CustomBench
A novel benchmark suite of 30 objects with complex 3D structures, rich textures, and full 360° coverage (from MVImgNet and custom captures). For each object, the full multi-view sequence (~30 images) is used for 3DreamBooth optimization, and N_c = 4 conditioning views are sampled for 3Dapter. GPT-4o generates one challenging validation prompt per object.
Metrics
- Multi-View Subject Fidelity: CLIP-I, DINO-I (bi-directional max cosine similarity between generated frames and condition views), and GPT-4o-as-a-Judge (evaluating Shape, Color, Detail, Overall Identity on a 1-5 scale).
- 3D Geometric Fidelity: Chamfer Distance (CD) between point clouds reconstructed from the ground-truth multi-view images (P_{gt}) and the generated 360° rotation videos (P_{gen}). CD averages Accuracy (dist(P_{gen} → P_{gt})) and Completeness (dist(P_{gt} → P_{gen})).
- Video Quality & Text Alignment: VBench metrics (Aesthetic Quality, Imaging Quality, Motion Smoothness) and ViCLIP score for video-text alignment.
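The CD metric as described decomposes into the two directed terms. A minimal NumPy sketch, assuming mean nearest-neighbor Euclidean distance for each direction (the paper does not spell out the exact distance convention):

```python
import numpy as np

def chamfer_distance(p_gen, p_gt):
    """Accuracy = mean nearest-neighbor distance dist(P_gen -> P_gt),
    Completeness = dist(P_gt -> P_gen), CD = their average.
    Brute-force O(N*M) pairwise distances; fine for small clouds."""
    d2 = ((p_gen[:, None, :] - p_gt[None, :, :]) ** 2).sum(-1)
    accuracy = np.sqrt(d2.min(axis=1)).mean()      # generated -> ground truth
    completeness = np.sqrt(d2.min(axis=0)).mean()  # ground truth -> generated
    return accuracy, completeness, 0.5 * (accuracy + completeness)

rng = np.random.default_rng(0)
p_gt = rng.normal(size=(500, 3))                   # toy ground-truth cloud
acc, comp, cd = chamfer_distance(p_gt.copy(), p_gt)  # identical clouds -> 0
```

Note the asymmetry the tables exploit: a model can score well on Accuracy while missing whole regions, which only Completeness (and hence CD) penalizes.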
Quantitative Results
Table 1: Multi-View Subject Fidelity
| Method | Views | CLIP-I ↑ | DINO-I ↑ | Shape ↑ | Color ↑ | Detail ↑ | Overall ↑ |
|---|---|---|---|---|---|---|---|
| VACE [18] | S | 0.8964 | 0.7395 | 4.39 ± 0.05 | 4.09 ± 0.09 | 3.35 ± 0.15 | 3.95 ± 0.11 |
| Phantom [24] | S | 0.8576 | 0.5861 | 3.48 ± 0.12 | 3.94 ± 0.13 | 3.03 ± 0.16 | 3.31 ± 0.15 |
| 3Dapter | S | 0.8647 | 0.5899 | 3.06 ± 0.03 | 3.09 ± 0.06 | 2.28 ± 0.08 | 2.67 ± 0.07 |
| 3DreamBooth | M | 0.8382 | 0.6530 | 4.18 ± 0.06 | 3.63 ± 0.09 | 3.14 ± 0.11 | 3.53 ± 0.07 |
| 3Dapter+3DB | M | 0.8871 | 0.7420 | 4.80 ± 0.03 | 4.53 ± 0.04 | 4.04 ± 0.13 | 4.57 ± 0.04 |
- The full model (3Dapter+3DB) achieves the best performance on most metrics, especially the human-centric GPT-4o evaluations.
- VACE's higher CLIP-I is attributed to CLIP prioritizing high-level semantics over geometric accuracy.
Table 2: 3D Geometric Fidelity
| Method | Views | Accuracy ↓ | Completeness ↓ | CD ↓ |
|---|---|---|---|---|
| VACE [18] | S | 0.0278 | 0.0427 | 0.0353 |
| Phantom [24] | S | 0.0289 | 0.0388 | 0.0338 |
| 3Dapter | S | 0.0315 | 0.0659 | 0.0487 |
| 3DreamBooth | M | 0.0156 | 0.0322 | 0.0239 |
| 3Dapter+3DB | M | 0.0182 | 0.0172 | 0.0177 |
- The full model achieves the lowest Chamfer Distance (0.0177), nearly halving the error of the best single-view method (Phantom, 0.0338).
- The significant lead in Completeness (0.0172) demonstrates effective recovery of the full 360° geometry.
Table 3: Video Quality and Text Alignment
| Method | Type | Aesthetic Quality ↑ | Imaging Quality ↑ | Motion Smoothness ↑ | ViCLIP ↑ |
|---|---|---|---|---|---|
| VACE [18] | S | 0.5915 | 70.84 | 0.9916 | 0.2663 |
| Phantom [24] | S | 0.5798 | 70.58 | 0.9934 | 0.2634 |
| 3Dapter | S | 0.6283 | 71.65 | 0.9944 | 0.2048 |
| 3DreamBooth | M | 0.5245 | 73.34 | 0.9928 | 0.2415 |
| 3Dapter+3DB | M | 0.5920 | 74.33 | 0.9918 | 0.2388 |
- The framework maintains high intrinsic video quality and competitive text alignment, outperforming baselines in Imaging Quality.
Qualitative Results & Ablation Studies
- Qualitative: Fig. 6 shows that baselines (VACE, Phantom) conditioned only on the first view produce inconsistent textures and geometries during rotation, while the proposed framework synthesizes full 360° geometry preserving identity.
- Ablation: Studies (Tables 1 & 2) show the necessity of both components. 3Dapter alone (single-view) provides strong aesthetic quality but lacks 3D consistency; 3DreamBooth alone (multi-view) ensures better geometry but struggles with fine-grained texture details.
- The joint optimization (3Dapter+3DB) yields the best trade-off, combining robust 3D structure with high-frequency feature injection.
Theoretical and Practical Implications
- Theoretical: Demonstrates that the implicit 3D priors in modern video diffusion models can be explicitly harnessed for subject customization through multi-view conditioning and a decoupled optimization strategy. Introduces the concept of a dynamic selective router within the joint attention mechanism for efficient multi-view feature extraction.
- Practical: Provides a computationally efficient framework for generating high-fidelity, view-consistent videos of customized 3D subjects from a minimal set of reference images. This has direct applications in:
- Immersive VR/AR content creation.
- Virtual production and dynamic advertising.
- Game development (animating custom characters).
- Next-generation e-commerce (product showcases).
Conclusion
The paper introduces a highly efficient framework for 3D-aware video customization. Key innovations are:
- 3DreamBooth: A 1-frame optimization paradigm that decouples spatial identity from temporal motion to embed subject-specific 3D priors without temporal overfitting.
- 3Dapter: A multi-view conditioning module that acts as a dynamic selective router to preserve intricate textures and accelerate convergence.
- Synergistic Approach: Joint optimization combines the strengths of both, achieving state-of-the-art 3D geometric fidelity and fast convergence.
- 3D-CustomBench: A curated evaluation benchmark for the emerging task.
The framework paves the way for advanced applications requiring faithful 3D subject integration into dynamic environments. Future directions may include scaling to more complex subjects (e.g., humans, scenes) and exploring conditional generation beyond orbital rotations.