AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model - Summary

Summary (Overview)

  • Flexible Sparse-View Framework: AnyRecon is a scalable framework for 3D reconstruction from arbitrary, unordered, and sparse input views. It uses a video diffusion model that can condition on a flexible number of captured frames, overcoming the limitation of prior methods restricted to one or two views.
  • Geometry-Aware Conditioning: The method introduces a closed-loop strategy that couples generation and reconstruction through an explicit, incrementally updated 3D Geometry Memory and a geometry-driven view selection mechanism based on spatial overlap and visibility, ensuring generation is guided by spatially informative observations.
  • Efficient Architecture: To handle long-range conditioning and large scenes efficiently, AnyRecon employs a non-compressive latent encoding (removing temporal compression) to preserve frame-level details, context-window sparse attention to reduce quadratic complexity, and 4-step diffusion distillation for fast inference, achieving up to a 20x speedup.
  • Superior Performance: Extensive experiments on DL3DV and Tanks and Temples datasets show AnyRecon outperforms state-of-the-art baselines (DifiX3D+, ViewCrafter, Uni3C) in interpolation and extrapolation tasks, delivering higher fidelity and consistency with significantly reduced inference time (~105 seconds per 40-frame sequence).

Introduction and Theoretical Foundation

Novel view synthesis and 3D reconstruction from sparse, casual captures (e.g., handheld videos) remain challenging. While neural representations like NeRF and 3D Gaussian Splatting offer high fidelity, they require dense, controlled multi-view inputs. Recent diffusion-based approaches mitigate sparsity by synthesizing novel views but face key limitations:

  1. Limited Conditioning: Many methods condition on only one or two captured RGB frames, weakening appearance fidelity and global context.
  2. Implicit Geometry: Methods relying solely on RGB images and poses struggle with precise spatial alignment.
  3. Scalability: Existing video diffusion frameworks are suboptimal for non-sequential inputs with large viewpoint gaps and cannot process large scenes all at once.

AnyRecon aims to enable high-quality, large-scale 3D reconstruction from sparse, arbitrary inputs. Its theoretical foundation rests on creating a persistent global scene memory from captured views and coupling the generative diffusion process with an explicit 3D geometric representation to maintain strict spatial control and consistency across long trajectories.

Methodology

The pipeline operates in an iterative generation-reconstruction loop (Fig. 2). Key components are:

1. Unordered Contextual Video Diffusion

  • Inputs: For a target trajectory segment, the model takes selected captured views I_sel and rendered geometric guidance I_render (from a 3D point cloud) under the target viewpoints V_novel.
  • Global Scene Memory: Retrieved reference views I_sel are prepended to the sequence, forming a persistent Key-Value (KV) memory cache within the transformer, enabling flexible long-range conditioning.
  • Non-Compressive Latent Encoding: Uses a frame-wise 2D VAE instead of a temporally compressive 3D-VAE. This preserves a one-to-one mapping between latent tokens and pixel coordinates, crucial for handling large viewpoint gaps without feature entanglement.
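
The shape difference is easy to see in a toy sketch. The encoders below are placeholders standing in for real VAEs, and the downsampling factors are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def encode_framewise(video, spatial_ds=8, latent_ch=16):
    """Frame-wise 2D encoding (sketch): each frame maps to its own latent,
    so latent index f corresponds exactly to input frame f."""
    f, h, w, _ = video.shape
    # Placeholder for a per-frame 2D VAE: spatial downsampling only.
    return np.zeros((f, h // spatial_ds, w // spatial_ds, latent_ch))

def encode_temporal(video, temporal_ds=4, spatial_ds=8, latent_ch=16):
    """Temporally compressive 3D encoding (sketch): groups of temporal_ds
    frames collapse into one latent, entangling distant viewpoints."""
    f, h, w, _ = video.shape
    return np.zeros((f // temporal_ds, h // spatial_ds, w // spatial_ds, latent_ch))

video = np.zeros((40, 256, 448, 3))      # a 40-frame clip
print(encode_framewise(video).shape[0])  # 40 latents: one per frame
print(encode_temporal(video).shape[0])   # 10 latents: 4 frames fused per latent
```

With large viewpoint gaps between adjacent frames, fusing 4 frames into one latent mixes unrelated views, which is exactly what the frame-wise encoding avoids.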

2. Efficient Sparse Attention & 4-Step Sampling

  • Context-Window Sparse Attention: To manage the expanded token sequence length L and avoid O(L²) complexity, each target frame attends only to a local temporal window and the retrieved subset of reference views I_sel.
  • 4-Step Diffusion Sampling: Employs Distribution Matching Distillation (DMD) to distill a pre-trained model into a student network for fast, 4-step inference. The generator loss L_gen and critic loss L_critic are:

    $$L_{gen} = \mathbb{E}_{z_t,t}\left[\frac{1}{2}\left\|\hat{x}_{\theta}(z_t) - \mathrm{sg}\!\left(\hat{x}_{\theta}(z_t) + \eta\,\frac{\hat{x}_{\psi}(z_t) - \hat{x}_{\phi}(z_t)}{\sigma_{\mathrm{norm}}}\right)\right\|_2^2\right]$$

    $$L_{critic} = \mathbb{E}_{z_t,t}\left[\left\|\hat{x}_{\phi}(z_t) - x_{\mathrm{clean}}\right\|_2^2\right]$$

    where x̂_θ, x̂_ψ, x̂_φ are the denoised predictions of the student, teacher, and critic, respectively; η is a step size; σ_norm is a normalization factor; and sg denotes stop-gradient.
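
To make the two objectives concrete, here is a minimal numpy sketch. The expectation over (z_t, t) is approximated by a batch mean, and the function and argument names are illustrative, not the paper's code:

```python
import numpy as np

def dmd_losses(x_student, x_teacher, x_critic, x_clean, eta=1.0, sigma_norm=1.0):
    """Compute the DMD generator and critic objectives for one batch.

    x_student, x_teacher, x_critic: denoised predictions x̂_θ, x̂_ψ, x̂_φ at z_t.
    In an autograd framework the regression target sits inside stop-gradient
    (sg); plain numpy has no gradients, so the forward value is identical.
    """
    # Target shifted along the distribution-matching direction (teacher - critic).
    target = x_student + eta * (x_teacher - x_critic) / sigma_norm
    l_gen = 0.5 * np.mean(np.sum((x_student - target) ** 2, axis=-1))
    # Critic regresses its own denoised prediction onto the clean sample.
    l_critic = np.mean(np.sum((x_critic - x_clean) ** 2, axis=-1))
    return l_gen, l_critic
```

Note that with sg applied, the generator gradient comes only from the η(x̂_ψ − x̂_φ)/σ_norm mismatch between teacher and critic, which is the distribution-matching signal.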

3. Geometry-Aware Conditioning Strategy

  • 3D Geometry Memory Update: An explicit point cloud M_geo is initialized from the sparse inputs. After generating novel views Î_novel for a segment, a feed-forward model (e.g., π³ [22]) extracts new 3D points from these views to update M_geo. This incremental update prevents geometric drift across segments.
  • Geometry-Driven View Selection: Instead of using image similarity or FOV heuristics, views are selected from the capture bank I_cap based on their geometric contribution to the target viewpoint. For each candidate view i, a score s_i is computed:

    $$s_i = \frac{|V_{novel} \cap S_i|}{|V_{novel}|}$$

    where V_novel is the set of points visible from the target view and S_i is the subset of points in M_geo reconstructed from capture view i. The top-k views are selected as I_sel.
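
The selection rule fits in a few lines of Python. The point ids and the `source_of_point` map (point id → contributing capture view, a stand-in for tracking which view each M_geo point came from) are illustrative assumptions:

```python
def select_views(visible_novel, source_of_point, num_views, k=2):
    """Rank capture views by s_i = |V_novel ∩ S_i| / |V_novel| and keep the
    top-k, where S_i holds the M_geo points reconstructed from view i."""
    scores = []
    for i in range(num_views):
        s_set = {p for p, src in source_of_point.items() if src == i}
        scores.append(len(visible_novel & s_set) / len(visible_novel))
    # sorted() is stable, so ties keep the lower view index first.
    return sorted(range(num_views), key=lambda i: -scores[i])[:k]

# Toy example: 5 points in M_geo from 3 capture views, 4 visible from the target.
visible = {0, 1, 2, 3}
origin = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
print(select_views(visible, origin, num_views=3))  # view 0 covers most visible points
```

Because the score is computed against the target's visible point set rather than image features, a view can rank highly even if it looks dissimilar, as long as it contributed geometry the target actually sees.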

Empirical Validation / Results

Datasets & Training: Trained on DL3DV-10K [11]. For each 40-frame clip, N ∈ [2, 4] conditioning views are randomly selected, with 50% probability from the first 20 frames (narrow baseline) and 50% from the entire clip (wide baseline).
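
A minimal sketch of that sampling scheme (the helper name and seeding are illustrative, not from the paper's code):

```python
import random

def sample_conditioning_views(clip_len=40, seed=None):
    """Draw N ∈ [2, 4] conditioning frame indices: with probability 0.5 from
    the first half of the clip (narrow baseline), otherwise from the whole
    clip (wide baseline)."""
    rng = random.Random(seed)
    n = rng.randint(2, 4)  # inclusive on both ends
    pool = range(clip_len // 2) if rng.random() < 0.5 else range(clip_len)
    return sorted(rng.sample(pool, n))
```

Mixing narrow- and wide-baseline draws exposes the model to both small and large viewpoint gaps during training.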

Implementation: The Wan2.1-I2V-14B [18] model is fine-tuned using LoRA (rank 32). Training involves: 1) full-attention fine-tuning (100k iterations), 2) sparse-attention warm-up (10k iterations, block size 2×8×8), 3) DMD2 distillation (30k iterations).

Comparison with State-of-the-Art: Evaluated on DL3DV-Evaluation and Tanks and Temples [9] under Interpolation (conditioning on frames 1, 21, 40) and Extrapolation (conditioning on frames 1, 11, 21, 31) settings. Metrics: PSNR, SSIM, LPIPS.

Table 1: Quantitative Comparison Results

| Method | Interpolation (PSNR↑ / SSIM↑ / LPIPS↓) | Extrapolation (PSNR↑ / SSIM↑ / LPIPS↓) | Time (s) ↓ |
|---|---|---|---|
| **DL3DV Dataset** | | | |
| DifiX3D+ [23] | 17.88 / 0.551 / 0.290 | 18.74 / 0.576 / 0.261 | 1200 |
| ViewCrafter [29] | 15.86 / 0.463 / 0.394 | 15.51 / 0.459 / 0.406 | 170 |
| Uni3C [2] | 16.33 / 0.471 / 0.319 | 15.69 / 0.457 / 0.344 | 340 |
| Ours (AnyRecon) | 20.95 / 0.656 / 0.151 | 21.16 / 0.660 / 0.158 | 105 |
| **Tanks and Temples Dataset** | | | |
| DifiX3D+ [23] | 19.43 / 0.629 / 0.163 | 18.67 / 0.594 / 0.190 | 1200 |
| ViewCrafter [29] | 15.85 / 0.474 / 0.364 | 15.83 / 0.481 / 0.361 | 170 |
| Uni3C [2] | 16.77 / 0.514 / 0.263 | 16.54 / 0.502 / 0.274 | 340 |
| Ours (AnyRecon) | 20.37 / 0.639 / 0.158 | 20.30 / 0.629 / 0.181 | 105 |

Qualitative Results (Fig. 6 & 7): AnyRecon effectively completes missing regions and hallucinates plausible new content with structural and appearance consistency, outperforming baselines that show artifacts, color shifts, or geometric inconsistencies.

Ablation Studies:

  • Temporal Compression (TC): Ablation shows full or partial TC degrades quality by discarding high-frequency details (Fig. 3c,d). Non-compressive encoding is essential.
  • Efficiency Strategies: Combining 4-step distillation and sparse attention achieves a 20x speedup (90s vs. 1820s) with minimal quality drop (Table 2).
  • Global Scene Memory: Conditioning only on rendered geometry (w/o memory) leads to texture loss and color shifts. Including raw captured views in memory preserves high-fidelity details (Table 3, Fig. 8).

Table 2: Ablation on Temporal Compression & Efficiency

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time (s)* ↓ |
|---|---|---|---|---|
| 50 Steps, Full TC | 20.16 | 0.616 | 0.179 | 210 + (15) |
| 50 Steps, Partial TC | 21.10 | 0.661 | 0.153 | 270 + (15) |
| 50 Steps, w/o TC (Full Attention) | 21.57 | 0.687 | 0.140 | 1820 + (15) |
| 4 Steps, w/o TC (Full Attention) | 21.32 | 0.673 | 0.148 | 140 + (15) |
| 4 Steps, w/o TC (Sparse Attention) | 20.95 | 0.656 | 0.151 | 90 + (15) |

*Time format: "DiT inference time + (encoder/decoder overhead, 15 s)"

Table 3: Ablation on Global Scene Memory

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o Global Scene Memory | 20.18 | 0.634 | 0.205 |
| w/ Global Scene Memory | 20.95 | 0.656 | 0.151 |

Theoretical and Practical Implications

  • Theoretical: AnyRecon demonstrates the effectiveness of tightly coupling generative priors with explicit 3D geometry, forming a closed loop that mitigates error accumulation. It shows that breaking the temporal continuity assumption in video diffusion (via non-compressive encoding and global memory) is crucial for spatial tasks like reconstruction.
  • Practical: The framework enables the conversion of casual, sparse real-world captures (e.g., from smartphones) into high-quality, explorable 3D assets. Its efficiency (105s per sequence) and scalability make it more practical for real-world applications compared to slower iterative refinement methods.

Conclusion

AnyRecon presents a scalable framework for robust 3D reconstruction from arbitrary sparse inputs. Its core innovations are:

  1. A video diffusion model supporting flexible conditioning via a global scene memory and non-compressive encoding.
  2. A geometry-aware conditioning loop with an explicit 3D memory and visibility-based view retrieval.
  3. Efficient design choices (sparse attention, 4-step distillation) enabling fast, high-fidelity synthesis.

The method significantly advances the state of the art in sparse-view reconstruction, handling interpolation, extrapolation, and large-scale scenes consistently. A key limitation is its dependence on an initial geometry with basic structural coherence: when input views overlap minimally, initialization can fail and results degrade. Future work may focus on improving robustness under extreme sparsity and exploring more efficient 3D representations.