AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model - Summary
Summary (Overview)
- Flexible Sparse-View Framework: AnyRecon is a scalable framework for 3D reconstruction from arbitrary, unordered, and sparse input views. It uses a video diffusion model that can condition on a flexible number of captured frames, overcoming the limitation of prior methods restricted to one or two views.
- Geometry-Aware Conditioning: The method introduces a closed-loop strategy that couples generation and reconstruction through an explicit, incrementally updated 3D Geometry Memory and a geometry-driven view selection mechanism based on spatial overlap and visibility, ensuring generation is guided by spatially informative observations.
- Efficient Architecture: To handle long-range conditioning and large scenes efficiently, AnyRecon employs a non-compressive latent encoding (removing temporal compression) to preserve frame-level details, context-window sparse attention to reduce quadratic complexity, and 4-step diffusion distillation for fast inference, achieving up to a 20x speedup.
- Superior Performance: Extensive experiments on DL3DV and Tanks and Temples datasets show AnyRecon outperforms state-of-the-art baselines (DifiX3D+, ViewCrafter, Uni3C) in interpolation and extrapolation tasks, delivering higher fidelity and consistency with significantly reduced inference time (~105 seconds per 40-frame sequence).
Introduction and Theoretical Foundation
Novel view synthesis and 3D reconstruction from sparse, casual captures (e.g., handheld videos) remain challenging. While neural representations like NeRF and 3D Gaussian Splatting offer high fidelity, they require dense, controlled multi-view inputs. Recent diffusion-based approaches mitigate sparsity by synthesizing novel views but face key limitations:
- Limited Conditioning: Many methods condition on only one or two captured RGB frames, weakening appearance fidelity and global context.
- Implicit Geometry: Methods relying solely on RGB images and poses struggle with precise spatial alignment.
- Scalability: Existing video diffusion frameworks are suboptimal for non-sequential inputs with large viewpoint gaps and cannot process large scenes all at once.
AnyRecon aims to enable high-quality, large-scale 3D reconstruction from sparse, arbitrary inputs. Its theoretical foundation rests on creating a persistent global scene memory from captured views and coupling the generative diffusion process with an explicit 3D geometric representation to maintain strict spatial control and consistency across long trajectories.
Methodology
The pipeline operates in an iterative generation-reconstruction loop (Fig. 2). Key components are:
1. Unordered Contextual Video Diffusion
- Inputs: For a target trajectory segment, the model takes the selected captured views and rendered geometric guidance (from a 3D point cloud) under the target viewpoints.
- Global Scene Memory: Retrieved reference views are prepended to the sequence, forming a persistent Key-Value (KV) memory cache within the transformer, enabling flexible long-range conditioning.
- Non-Compressive Latent Encoding: Uses a frame-wise 2D VAE instead of a temporally compressive 3D-VAE. This preserves a one-to-one mapping between latent tokens and pixel coordinates, crucial for handling large viewpoint gaps without feature entanglement.
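The shape arithmetic behind this choice can be sketched at a high level. The snippet below is a minimal illustration, not the paper's implementation: the resolution, latent channel count, and temporal stride of 4 are assumptions chosen for illustration; only the frame-wise vs. temporally compressed distinction comes from the text.

```python
# Hypothetical shapes: spatial downsample 8x, 16 latent channels (assumed values).
C_LAT = 16

def encode_2d_framewise(T, H, W):
    # Frame-wise 2D VAE: every input frame keeps its own latent slice,
    # so each latent token maps back to exactly one frame's pixels.
    return (T, H // 8, W // 8, C_LAT)

def encode_3d_compressive(T, H, W, t_stride=4):
    # Temporally compressive 3D VAE: groups of t_stride frames are merged
    # into one latent, entangling features across large viewpoint gaps.
    return (1 + (T - 1) // t_stride, H // 8, W // 8, C_LAT)

print(encode_2d_framewise(40, 480, 832))    # (40, 60, 104, 16): one latent per frame
print(encode_3d_compressive(40, 480, 832))  # (10, 60, 104, 16): 4 frames per latent
```

With unordered, wide-baseline inputs there is no temporal redundancy to exploit, so the one-to-one frame-to-latent mapping is worth the extra tokens.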
2. Efficient Sparse Attention & 4-Step Sampling
- Context-Window Sparse Attention: To manage the expanded token sequence length and avoid quadratic attention complexity, each target frame attends only to a local temporal window and the retrieved subset of reference views.
- 4-Step Diffusion Sampling: Employs Distribution Matching Distillation (DMD) to distill a pre-trained model into a student network for fast, 4-step inference. The generator and critic losses take the form $\mathcal{L}_{\text{gen}} = \mathbb{E}\big[\tfrac{1}{Z}\,\| \hat{x}_{\text{stu}} - \mathrm{sg}\big(\hat{x}_{\text{stu}} - \eta\,(\hat{x}_{\text{tea}} - \hat{x}_{\text{cri}})\big) \|_2^2\big]$ and $\mathcal{L}_{\text{critic}} = \mathbb{E}\big[\| \hat{x}_{\text{cri}} - \mathrm{sg}(\hat{x}_{\text{stu}}) \|_2^2\big]$, where $\hat{x}_{\text{stu}}$, $\hat{x}_{\text{tea}}$, $\hat{x}_{\text{cri}}$ are denoised predictions from the student, teacher, and critic respectively; $\eta$ is the step size; $Z$ is a normalization factor; and $\mathrm{sg}(\cdot)$ is the stop-gradient operator.
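The context-window attention pattern described above can be sketched as a frame-level mask. This is a hedged illustration, not the paper's code: the function name, the reference-before-target layout, and the window size of 2 are assumptions; the structure (every target frame sees all retrieved references plus a local window of targets) follows the text.

```python
import numpy as np

def sparse_attention_mask(n_ref, n_tgt, window=2):
    """Frame-level attention mask (True = attend).

    Frames 0..n_ref-1 are retrieved reference views (the global KV memory);
    frames n_ref..n_ref+n_tgt-1 are target frames. Each target frame attends
    to all references plus target frames within +/- `window`.
    """
    n = n_ref + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :n_ref] = True                    # every frame sees the reference memory
    for i in range(n_ref, n):
        lo, hi = max(n_ref, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local temporal window among targets
    return mask

m = sparse_attention_mask(n_ref=3, n_tgt=8, window=2)
print(int(m.sum()), "of", m.size, "attend pairs")  # 67 of 121 attend pairs
```

Because the window is constant and the reference set is a small top-K subset, attention cost grows roughly linearly in the number of target frames rather than quadratically.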
3. Geometry-Aware Conditioning Strategy
- 3D Geometry Memory Update: An explicit point cloud is initialized from the sparse inputs. After generating novel views for a segment, a feed-forward model (e.g., [22]) extracts new 3D points from these views to update the geometry memory. This incremental update prevents geometric drift across segments.
- Geometry-Driven View Selection: Instead of using image similarity or FOV heuristics, views are selected from the capture bank based on their geometric contribution to the target viewpoint. For each candidate view $v_i$, a score is computed as $s_i = |\mathcal{V}_t^{(i)}| / |\mathcal{V}_t|$, where $\mathcal{V}_t$ is the set of points visible from the target view, and $\mathcal{V}_t^{(i)}$ is the subset of $\mathcal{V}_t$ reconstructed from capture view $v_i$. The top-$K$ scoring views are selected as references.
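The selection score reduces to a visibility-weighted counting problem. Below is a minimal sketch under assumed data structures (the function name and the per-point source-view array are hypothetical; the paper's pipeline works on an explicit point cloud rather than plain index arrays):

```python
import numpy as np

def view_scores(visible_ids, source_view_of_point, n_views):
    """Score each capture view by its geometric contribution to the target.

    visible_ids: indices of geometry-memory points visible from the target
      viewpoint (the set V_t).
    source_view_of_point[p]: the capture view that reconstructed point p.
    Returns s_i = |V_t^(i)| / |V_t| for every candidate view i.
    """
    scores = np.zeros(n_views)
    for p in visible_ids:
        scores[source_view_of_point[p]] += 1
    return scores / max(len(visible_ids), 1)

# Toy example: 6 visible points, reconstructed by views 0,0,1,1,1,2.
src = np.array([0, 0, 1, 1, 1, 2])
s = view_scores(np.arange(6), src, n_views=4)
top_k = np.argsort(-s)[:2]   # the top-K views become the reference set
print(s, top_k)              # view 1 scores highest (3/6), then view 0 (2/6)
```

Scoring by visible-point contribution, rather than image similarity, directly rewards views that actually cover the target frustum.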
Empirical Validation / Results
Datasets & Training: Trained on DL3DV-10K [11]. For each 40-frame clip, conditioning views are randomly selected, with 50% probability from the first 20 frames (narrow-baseline) and 50% from the entire clip (wide-baseline).
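The narrow/wide-baseline sampling scheme can be expressed in a few lines. This is an illustrative sketch, not the training code: the function name, the number of conditioning views (3), and the RNG plumbing are assumptions; the 50/50 split between the first 20 frames and the full 40-frame clip comes from the text.

```python
import random

def sample_conditioning_views(clip_len=40, n_cond=3, narrow_len=20, rng=random):
    """Sample conditioning frame indices for one training clip.

    With 50% probability draw from the first `narrow_len` frames
    (narrow-baseline), otherwise from the whole clip (wide-baseline).
    """
    pool = range(narrow_len) if rng.random() < 0.5 else range(clip_len)
    return sorted(rng.sample(pool, n_cond))

random.seed(0)
print(sample_conditioning_views())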
Implementation: The Wan2.1-I2V-14B [18] model is fine-tuned using LoRA (rank 32). Training involves: 1) Full attention fine-tuning (100k iterations), 2) Sparse attention warm-up (10k iterations), 3) DMD2 distillation (30k iterations).
Comparison with State-of-the-Art: Evaluated on DL3DV-Evaluation and Tanks and Temples [9] under Interpolation (conditioning on frames 1, 21, 40) and Extrapolation (conditioning on frames 1, 11, 21, 31) settings. Metrics: PSNR, SSIM, LPIPS.
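Of the three reported metrics, PSNR is simple enough to state exactly (SSIM and LPIPS require windowed statistics and a pretrained network, respectively). A minimal reference implementation, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB; higher is better.
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

gt = np.zeros((4, 4))
pred = gt + 0.1                  # uniform error of 0.1 -> MSE = 0.01
print(round(psnr(pred, gt), 2))  # 20.0
```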
Table 1: Quantitative Comparison Results
| Method | Interpolation (PSNR↑/SSIM↑/LPIPS↓) | Extrapolation (PSNR↑/SSIM↑/LPIPS↓) | Time (s) ↓ |
|---|---|---|---|
| DL3DV Dataset | |||
| DifiX3D+ [23] | 17.88 / 0.551 / 0.290 | 18.74 / 0.576 / 0.261 | 1200 |
| ViewCrafter [29] | 15.86 / 0.463 / 0.394 | 15.51 / 0.459 / 0.406 | 170 |
| Uni3C [2] | 16.33 / 0.471 / 0.319 | 15.69 / 0.457 / 0.344 | 340 |
| Ours (AnyRecon) | 20.95 / 0.656 / 0.151 | 21.16 / 0.660 / 0.158 | 105 |
| Tanks and Temples Dataset | |||
| DifiX3D+ [23] | 19.43 / 0.629 / 0.163 | 18.67 / 0.594 / 0.190 | 1200 |
| ViewCrafter [29] | 15.85 / 0.474 / 0.364 | 15.83 / 0.481 / 0.361 | 170 |
| Uni3C [2] | 16.77 / 0.514 / 0.263 | 16.54 / 0.502 / 0.274 | 340 |
| Ours (AnyRecon) | 20.37 / 0.639 / 0.158 | 20.30 / 0.629 / 0.181 | 105 |
Qualitative Results (Fig. 6 & 7): AnyRecon effectively completes missing regions and hallucinates plausible new content with structural and appearance consistency, outperforming baselines that show artifacts, color shifts, or geometric inconsistencies.
Ablation Studies:
- Temporal Compression (TC): Ablation shows full or partial TC degrades quality by discarding high-frequency details (Fig. 3c,d). Non-compressive encoding is essential.
- Efficiency Strategies: Combining 4-step distillation and sparse attention achieves a 20x speedup (90s vs. 1820s) with minimal quality drop (Table 2).
- Global Scene Memory: Conditioning only on rendered geometry (w/o memory) leads to texture loss and color shifts. Including raw captured views in memory preserves high-fidelity details (Table 3, Fig. 8).
Table 2: Ablation on Temporal Compression & Efficiency
| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time (s)* ↓ |
|---|---|---|---|---|
| 50 Steps, Full TC | 20.16 | 0.616 | 0.179 | 210 + (15) |
| 50 Steps, Partial TC | 21.10 | 0.661 | 0.153 | 270 + (15) |
| 50 Steps, w/o TC (Full Attention) | 21.57 | 0.687 | 0.140 | 1820 + (15) |
| 4 Steps, w/o TC (Full Attention) | 21.32 | 0.673 | 0.148 | 140 + (15) |
| 4 Steps, w/o TC (Sparse Attention) | 20.95 | 0.656 | 0.151 | 90 + (15) |
*Time format: DiT inference time + (encoder/decoder overhead of 15 s).
Table 3: Ablation on Global Scene Memory
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o Global Scene Memory | 20.18 | 0.634 | 0.205 |
| w/ Global Scene Memory | 20.95 | 0.656 | 0.151 |
Theoretical and Practical Implications
- Theoretical: AnyRecon demonstrates the effectiveness of tightly coupling generative priors with explicit 3D geometry, forming a closed loop that mitigates error accumulation. It shows that breaking the temporal continuity assumption in video diffusion (via non-compressive encoding and global memory) is crucial for spatial tasks like reconstruction.
- Practical: The framework enables the conversion of casual, sparse real-world captures (e.g., from smartphones) into high-quality, explorable 3D assets. Its efficiency (105s per sequence) and scalability make it more practical for real-world applications compared to slower iterative refinement methods.
Conclusion
AnyRecon presents a scalable framework for robust 3D reconstruction from arbitrary sparse inputs. Its core innovations are:
- A video diffusion model supporting flexible conditioning via a global scene memory and non-compressive encoding.
- A geometry-aware conditioning loop with an explicit 3D memory and visibility-based view retrieval.
- Efficient design choices (sparse attention, 4-step distillation) enabling fast, high-fidelity synthesis.
The method significantly advances the state-of-the-art in sparse-view reconstruction, handling interpolation, extrapolation, and large-scale scenes consistently. A key limitation is its dependence on an initial geometry with basic structural coherence; when input views barely overlap, initialization can fail and results degrade. Future work may focus on improving robustness under extreme sparsity and exploring more efficient 3D representations.