AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model - Summary

Summary (Overview)

  • Flexible Sparse-View Framework: AnyRecon is a scalable framework for 3D reconstruction from arbitrary, unordered, and sparse input views. It uses a video diffusion model that can condition on a flexible number of captured frames, overcoming the limitation of prior methods restricted to one or two views.
  • Geometry-Aware Conditioning: The method introduces a closed-loop strategy that couples generation and reconstruction through an explicit, incrementally updated 3D Geometry Memory and a geometry-driven view selection mechanism based on spatial overlap and visibility, ensuring generation is guided by spatially informative observations.
  • Efficient Architecture: To handle long-range conditioning and large scenes efficiently, AnyRecon employs a non-compressive latent encoding (removing temporal compression) to preserve frame-level details, context-window sparse attention to reduce quadratic complexity, and 4-step diffusion distillation for fast inference, achieving up to a 20x speedup.
  • Superior Performance: Extensive experiments on DL3DV and Tanks and Temples datasets show AnyRecon outperforms state-of-the-art baselines (DifiX3D+, ViewCrafter, Uni3C) in interpolation and extrapolation tasks, delivering higher fidelity and consistency with significantly reduced inference time (~105 seconds per 40-frame sequence).

Introduction and Theoretical Foundation

Novel view synthesis and 3D reconstruction from sparse, casual captures (e.g., handheld videos) remain challenging. While neural representations like NeRF and 3D Gaussian Splatting offer high fidelity, they require dense, controlled multi-view inputs. Recent diffusion-based approaches mitigate sparsity by synthesizing novel views but face key limitations:

  1. Limited Conditioning: Many methods condition on only one or two captured RGB frames, weakening appearance fidelity and global context.
  2. Implicit Geometry: Methods relying solely on RGB images and poses struggle with precise spatial alignment.
  3. Scalability: Existing video diffusion frameworks are suboptimal for non-sequential inputs with large viewpoint gaps and cannot process large scenes all at once.

AnyRecon aims to enable high-quality, large-scale 3D reconstruction from sparse, arbitrary inputs. Its theoretical foundation rests on creating a persistent global scene memory from captured views and coupling the generative diffusion process with an explicit 3D geometric representation to maintain strict spatial control and consistency across long trajectories.

Methodology

The pipeline operates in an iterative generation-reconstruction loop (Fig. 2). Key components are:

1. Unordered Contextual Video Diffusion

  • Inputs: For a target trajectory segment, the model takes selected captured views I_sel and rendered geometric guidance I_render (from a 3D point cloud) under the target viewpoints V_novel.
  • Global Scene Memory: Retrieved reference views I_sel are prepended to the sequence, forming a persistent Key-Value (KV) memory cache within the transformer, enabling flexible long-range conditioning.
  • Non-Compressive Latent Encoding: Uses a frame-wise 2D VAE instead of a temporally compressive 3D-VAE. This preserves a one-to-one mapping between latent tokens and pixel coordinates, crucial for handling large viewpoint gaps without feature entanglement.
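
The shape difference is easy to see in a toy sketch. The encoders below are placeholders standing in for real VAEs, and the downsampling factors are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def encode_framewise(video, spatial_ds=8, latent_ch=16):
    """Frame-wise 2D encoding (sketch): each frame maps to its own latent,
    so latent index f corresponds exactly to input frame f."""
    f, h, w, _ = video.shape
    # Placeholder for a per-frame 2D VAE: spatial downsampling only.
    return np.zeros((f, h // spatial_ds, w // spatial_ds, latent_ch))

def encode_temporal(video, temporal_ds=4, spatial_ds=8, latent_ch=16):
    """Temporally compressive 3D encoding (sketch): groups of temporal_ds
    frames collapse into one latent, entangling distant viewpoints."""
    f, h, w, _ = video.shape
    return np.zeros((f // temporal_ds, h // spatial_ds, w // spatial_ds, latent_ch))

video = np.zeros((40, 256, 448, 3))      # a 40-frame clip
print(encode_framewise(video).shape[0])  # 40 latents: one per frame
print(encode_temporal(video).shape[0])   # 10 latents: 4 frames fused per latent
```

With large viewpoint gaps between adjacent frames, fusing 4 frames into one latent mixes unrelated views, which is exactly what the frame-wise encoding avoids.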

2. Efficient Sparse Attention & 4-Step Sampling

  • Context-Window Sparse Attention: To manage the expanded token sequence length L and avoid O(L²) complexity, each target frame attends only to a local temporal window and the retrieved subset of reference views I_sel.
  • 4-Step Diffusion Sampling: Employs Distribution Matching Distillation (DMD) to distill a pre-trained model into a student network for fast, 4-step inference. The generator loss L_gen and critic loss L_critic are:

    $$L_{gen} = \mathbb{E}_{z_t,t}\left[\frac{1}{2}\left\|\hat{x}_{\theta}(z_t) - \mathrm{sg}\!\left(\hat{x}_{\theta}(z_t) + \eta\,\frac{\hat{x}_{\psi}(z_t) - \hat{x}_{\phi}(z_t)}{\sigma_{\mathrm{norm}}}\right)\right\|_2^2\right]$$

    $$L_{critic} = \mathbb{E}_{z_t,t}\left[\left\|\hat{x}_{\phi}(z_t) - x_{\mathrm{clean}}\right\|_2^2\right]$$

    where x̂_θ, x̂_ψ, x̂_φ are the denoised predictions of the student, teacher, and critic, respectively; η is a step size; σ_norm is a normalization factor; and sg denotes stop-gradient.
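
To make the two objectives concrete, here is a minimal numpy sketch. The expectation over (z_t, t) is approximated by a batch mean, and the function and argument names are illustrative, not the paper's code:

```python
import numpy as np

def dmd_losses(x_student, x_teacher, x_critic, x_clean, eta=1.0, sigma_norm=1.0):
    """Compute the DMD generator and critic objectives for one batch.

    x_student, x_teacher, x_critic: denoised predictions x̂_θ, x̂_ψ, x̂_φ at z_t.
    In an autograd framework the regression target sits inside stop-gradient
    (sg); plain numpy has no gradients, so the forward value is identical.
    """
    # Target shifted along the distribution-matching direction (teacher - critic).
    target = x_student + eta * (x_teacher - x_critic) / sigma_norm
    l_gen = 0.5 * np.mean(np.sum((x_student - target) ** 2, axis=-1))
    # Critic regresses its own denoised prediction onto the clean sample.
    l_critic = np.mean(np.sum((x_critic - x_clean) ** 2, axis=-1))
    return l_gen, l_critic
```

Note that with sg applied, the generator gradient comes only from the η(x̂_ψ − x̂_φ)/σ_norm mismatch between teacher and critic, which is the distribution-matching signal.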

3. Geometry-Aware Conditioning Strategy

  • 3D Geometry Memory Update: An explicit point cloud M_geo is initialized from the sparse inputs. After generating novel views Î_novel for a segment, a feed-forward model (e.g., π³ [22]) extracts new 3D points from these views to update M_geo. This incremental update prevents geometric drift across segments.
  • Geometry-Driven View Selection: Instead of using image similarity or FOV heuristics, views are selected from the capture bank I_cap based on their geometric contribution to the target viewpoint. For each candidate view i, a score s_i is computed:

    $$s_i = \frac{|V_{novel} \cap S_i|}{|V_{novel}|}$$

    where V_novel is the set of points visible from the target view and S_i is the subset of points in M_geo reconstructed from capture view i. The top-k views are selected as I_sel.
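
The selection rule fits in a few lines of Python. The point ids and the `source_of_point` map (point id → contributing capture view, a stand-in for tracking which view each M_geo point came from) are illustrative assumptions:

```python
def select_views(visible_novel, source_of_point, num_views, k=2):
    """Rank capture views by s_i = |V_novel ∩ S_i| / |V_novel| and keep the
    top-k, where S_i holds the M_geo points reconstructed from view i."""
    scores = []
    for i in range(num_views):
        s_set = {p for p, src in source_of_point.items() if src == i}
        scores.append(len(visible_novel & s_set) / len(visible_novel))
    # sorted() is stable, so ties keep the lower view index first.
    return sorted(range(num_views), key=lambda i: -scores[i])[:k]

# Toy example: 5 points in M_geo from 3 capture views, 4 visible from the target.
visible = {0, 1, 2, 3}
origin = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}
print(select_views(visible, origin, num_views=3))  # view 0 covers most visible points
```

Because the score is computed against the target's visible point set rather than image features, a view can rank highly even if it looks dissimilar, as long as it contributed geometry the target actually sees.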

Empirical Validation / Results

Datasets & Training: Trained on DL3DV-10K [11]. For each 40-frame clip, N ∈ [2, 4] conditioning views are randomly selected, with 50% probability from the first 20 frames (narrow baseline) and 50% from the entire clip (wide baseline).
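
A minimal sketch of that sampling scheme (the helper name and seeding are illustrative, not from the paper's code):

```python
import random

def sample_conditioning_views(clip_len=40, seed=None):
    """Draw N ∈ [2, 4] conditioning frame indices: with probability 0.5 from
    the first half of the clip (narrow baseline), otherwise from the whole
    clip (wide baseline)."""
    rng = random.Random(seed)
    n = rng.randint(2, 4)  # inclusive on both ends
    pool = range(clip_len // 2) if rng.random() < 0.5 else range(clip_len)
    return sorted(rng.sample(pool, n))
```

Mixing narrow- and wide-baseline draws exposes the model to both small and large viewpoint gaps during training.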

Implementation: The Wan2.1-I2V-14B [18] model is fine-tuned using LoRA (rank 32). Training involves: 1) full-attention fine-tuning (100k iterations), 2) sparse-attention warm-up (10k iterations, block size 2×8×8), 3) DMD2 distillation (30k iterations).

Comparison with State-of-the-Art: Evaluated on DL3DV-Evaluation and Tanks and Temples [9] under Interpolation (conditioning on frames 1, 21, 40) and Extrapolation (conditioning on frames 1, 11, 21, 31) settings. Metrics: PSNR, SSIM, LPIPS.

Table 1: Quantitative Comparison Results

| Method | Interpolation (PSNR↑ / SSIM↑ / LPIPS↓) | Extrapolation (PSNR↑ / SSIM↑ / LPIPS↓) | Time (s) ↓ |
|---|---|---|---|
| **DL3DV Dataset** | | | |
| DifiX3D+ [23] | 17.88 / 0.551 / 0.290 | 18.74 / 0.576 / 0.261 | 1200 |
| ViewCrafter [29] | 15.86 / 0.463 / 0.394 | 15.51 / 0.459 / 0.406 | 170 |
| Uni3C [2] | 16.33 / 0.471 / 0.319 | 15.69 / 0.457 / 0.344 | 340 |
| Ours (AnyRecon) | 20.95 / 0.656 / 0.151 | 21.16 / 0.660 / 0.158 | 105 |
| **Tanks and Temples Dataset** | | | |
| DifiX3D+ [23] | 19.43 / 0.629 / 0.163 | 18.67 / 0.594 / 0.190 | 1200 |
| ViewCrafter [29] | 15.85 / 0.474 / 0.364 | 15.83 / 0.481 / 0.361 | 170 |
| Uni3C [2] | 16.77 / 0.514 / 0.263 | 16.54 / 0.502 / 0.274 | 340 |
| Ours (AnyRecon) | 20.37 / 0.639 / 0.158 | 20.30 / 0.629 / 0.181 | 105 |

Qualitative Results (Fig. 6 & 7): AnyRecon effectively completes missing regions and hallucinates plausible new content with structural and appearance consistency, outperforming baselines that show artifacts, color shifts, or geometric inconsistencies.

Ablation Studies:

  • Temporal Compression (TC): Ablation shows full or partial TC degrades quality by discarding high-frequency details (Fig. 3c,d). Non-compressive encoding is essential.
  • Efficiency Strategies: Combining 4-step distillation and sparse attention achieves a 20x speedup (90s vs. 1820s) with minimal quality drop (Table 2).
  • Global Scene Memory: Conditioning only on rendered geometry (w/o memory) leads to texture loss and color shifts. Including raw captured views in memory preserves high-fidelity details (Table 3, Fig. 8).

Table 2: Ablation on Temporal Compression & Efficiency

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time (s)* ↓ |
|---|---|---|---|---|
| 50 Steps, Full TC | 20.16 | 0.616 | 0.179 | 210 + (15) |
| 50 Steps, Partial TC | 21.10 | 0.661 | 0.153 | 270 + (15) |
| 50 Steps, w/o TC (Full Attention) | 21.57 | 0.687 | 0.140 | 1820 + (15) |
| 4 Steps, w/o TC (Full Attention) | 21.32 | 0.673 | 0.148 | 140 + (15) |
| 4 Steps, w/o TC (Sparse Attention) | 20.95 | 0.656 | 0.151 | 90 + (15) |

*Time format: "DiT inference time + (encoder/decoder overhead, 15 s)"

Table 3: Ablation on Global Scene Memory

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| w/o Global Scene Memory | 20.18 | 0.634 | 0.205 |
| w/ Global Scene Memory | 20.95 | 0.656 | 0.151 |

Theoretical and Practical Implications

  • Theoretical: AnyRecon demonstrates the effectiveness of tightly coupling generative priors with explicit 3D geometry, forming a closed loop that mitigates error accumulation. It shows that breaking the temporal continuity assumption in video diffusion (via non-compressive encoding and global memory) is crucial for spatial tasks like reconstruction.
  • Practical: The framework enables the conversion of casual, sparse real-world captures (e.g., from smartphones) into high-quality, explorable 3D assets. Its efficiency (105s per sequence) and scalability make it more practical for real-world applications compared to slower iterative refinement methods.

Conclusion

AnyRecon presents a scalable framework for robust 3D reconstruction from arbitrary sparse inputs. Its core innovations are:

  1. A video diffusion model supporting flexible conditioning via a global scene memory and non-compressive encoding.
  2. A geometry-aware conditioning loop with an explicit 3D memory and visibility-based view retrieval.
  3. Efficient design choices (sparse attention, 4-step distillation) enabling fast, high-fidelity synthesis.

The method significantly advances the state of the art in sparse-view reconstruction, handling interpolation, extrapolation, and large-scale scenes consistently. A key limitation is its dependence on an initial geometry with basic structural coherence: when input views overlap minimally, initialization can fail and results degrade. Future work may focus on improving robustness under extreme sparsity and exploring more efficient 3D representations.