Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction
Summary (Overview)
- Novel Denoising Space: Proposes Geometry-Aware Representation Denoising (GARD), a framework that performs multi-view restoration via diffusion directly in the geometry-aware feature space of a frozen feed-forward 3D reconstruction model, rather than in pixel space or compressed VAE latents.
- Joint Recovery: Enables simultaneous recovery of accurate 3D scene geometry (depth, pose) and high-quality multi-view RGB images through dedicated decoders from the denoised representations.
- Enhanced Robustness: Addresses the vulnerability of feed-forward 3D reconstructors to real-world degradations (e.g., motion blur) by denoising the intermediate features that are crucial for geometric reasoning.
- Superior Performance: Demonstrates state-of-the-art results on the Depth Anything 3 (DA3) benchmark across pose estimation, 3D reconstruction, and image restoration tasks under severe degradation.
- Key Components: Employs an interpolated flow matching loss (starting from noisy degraded features) and an attention alignment loss to enforce cross-view geometric consistency during denoising.
Introduction and Theoretical Foundation
Multi-view 3D reconstruction is a fundamental computer vision task with applications in navigation, robotics, and AR/VR. While recent feed-forward models (e.g., Depth Anything 3) have advanced the field by directly inferring scene geometry from multi-view images, their performance degrades significantly under real-world degradations like motion blur. These degradations obscure fine textures and structural cues, disrupting the geometric consistency learned by the models.
Existing approaches to improve robustness follow a "restore-then-reconstruct" paradigm, where degraded images are restored before being fed to the reconstructor. However, these have key limitations:
- Single-view restoration models cannot leverage complementary multi-view information or enforce cross-view consistency.
- Multi-view restoration models (including recent diffusion-based ones) often operate in heavily compressed VAE latent spaces, which act as information bottlenecks, losing fine-grained details essential for geometry.
Theoretical motivation comes from advances in Representation Autoencoders (RAEs), which show that high-dimensional, semantically rich representations (like those from pretrained encoders) are superior to compressed VAE latents for preserving structural and semantic information. Feed-forward 3D reconstructors naturally learn geometry-aware feature representations through cross-view transformer attention. The paper posits that performing restoration directly in this geometry-aware feature space is a more suitable domain for recovering both accurate geometry and high-quality imagery.
Methodology
The GARD framework denoises the intermediate features of a frozen feed-forward reconstructor, which consists of a multi-view encoder and a geometry decoder.
1. Task Formulation
Given degraded multi-view images, the goal is to recover:
- Restored images
- Underlying 3D scene geometry, namely depth and camera pose.
2. Framework Overview
The process is illustrated in Figure 3(a):
- Degraded inputs are encoded by the multi-layer encoder to produce layer-wise features.
- At one specific encoder layer, the GARD denoiser refines the degraded feature.
- This refined feature is propagated through the remaining encoder layers to produce restored features.
- A set of four feature levels is extracted from the restored features.
- Two decoders process these feature levels:
  - The original geometry decoder predicts the scene geometry.
  - A separately trained RGB image decoder (adapted from [17]) predicts the restored images.
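The steps above can be sketched as a single forward pass. This is a minimal illustration, not the authors' implementation: `run_gard_pipeline`, its parameter names, the toy stand-in components, and the chosen layer and level indices are all hypothetical, and whether the denoised layer is itself one of the four extracted levels is an assumption.

```python
import numpy as np

def run_gard_pipeline(images, encoder_layers, denoiser, geometry_decoder,
                      rgb_decoder, denoise_layer, level_indices):
    """Toy forward pass: encode, denoise one layer's feature, finish encoding, decode."""
    feat = images
    features = []
    for idx, layer in enumerate(encoder_layers):
        feat = layer(feat)
        if idx == denoise_layer:
            # Only this layer's feature is refined; later layers see the refined version.
            feat = denoiser(feat)
        features.append(feat)
    # Pick the feature levels that both decoders consume.
    levels = [features[i] for i in level_indices]
    return geometry_decoder(levels), rgb_decoder(levels)


# Stand-in components (all hypothetical): each encoder layer adds 1,
# and the "denoiser" subtracts a fixed offset.
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))  # 4 views, 8-dim toy tokens
encoder_layers = [lambda x: x + 1.0 for _ in range(6)]
denoiser = lambda f: f - 0.5
geometry_decoder = lambda levels: float(np.mean([lvl.mean() for lvl in levels]))
rgb_decoder = lambda levels: levels[-1]

geometry, restored = run_gard_pipeline(
    images, encoder_layers, denoiser, geometry_decoder, rgb_decoder,
    denoise_layer=2, level_indices=[2, 3, 4, 5])
```

Because the reconstructor's encoder and geometry decoder are only called and never updated, the sketch reflects the frozen-backbone design: the denoiser and the RGB decoder are the only added components.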
3. GARD Denoiser Architecture & Training
The denoiser is a multi-view diffusion model based on the DiT design from RAE, augmented with global attention layers to enable cross-view context aggregation (Fig. 3(c)).
Interpolated Flow Matching Loss: Instead of starting denoising from pure Gaussian noise, the process begins from a noise-perturbed degraded feature, which retains some structural prior. The training objective is a flow matching loss that penalizes the difference between the velocity predicted by the denoiser and the ground-truth velocity.
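The exact equations are not reproduced here, so the following is only a sketch of one common way to set up such a loss, under stated assumptions: the path is a linear interpolation from the noise-perturbed degraded feature (at time 0) to the clean feature (at time 1), the ground-truth velocity is their difference, and the loss is a mean squared error. The function name, the `sigma` noise scale, and the time convention are hypothetical and may differ from the paper.

```python
import numpy as np

def interpolated_flow_matching_loss(velocity_model, clean_feat, degraded_feat,
                                    sigma, rng):
    """Flow matching loss whose source point is a noise-perturbed degraded feature."""
    # Source of the path: degraded feature plus Gaussian noise (assumed form).
    noise = rng.normal(size=degraded_feat.shape)
    start = degraded_feat + sigma * noise
    # One time value per sample, broadcast over the remaining dimensions.
    t = rng.uniform(size=(clean_feat.shape[0],) + (1,) * (clean_feat.ndim - 1))
    # Linear interpolation between the noisy degraded feature and the clean one.
    x_t = (1.0 - t) * start + t * clean_feat
    # For a linear path the ground-truth velocity is constant in t.
    target_velocity = clean_feat - start
    predicted_velocity = velocity_model(x_t, t)
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


rng = np.random.default_rng(0)
clean = rng.normal(size=(2, 5, 3))  # (batch, tokens, channels), toy sizes
degraded = clean + 0.3 * rng.normal(size=clean.shape)
zero_model = lambda x_t, t: np.zeros_like(x_t)  # untrained stand-in
loss = interpolated_flow_matching_loss(zero_model, clean, degraded,
                                       sigma=0.1, rng=rng)
```

Starting from the degraded feature rather than pure noise shortens the path the denoiser has to learn, which is the structural prior the text refers to.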
Attention Alignment Loss: To explicitly encourage learning of cross-view correspondences, the global attention maps in the denoiser are aligned with geometrically consistent target correspondence maps derived from clean input point clouds. The loss is defined on the global attention map at a given layer of the denoiser.
The total loss combines the interpolated flow matching loss and the attention alignment loss.
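As a rough illustration of how attention maps could be aligned with point-cloud correspondences, the sketch below builds a soft target map from pairwise 3D distances and scores attention maps with a cross-entropy. Everything here is an assumption made for illustration: the softmax-over-distance target, the `temperature` parameter, the cross-entropy form, averaging over layers, and the weighted sum in `total_loss` are not taken from the paper.

```python
import numpy as np

def correspondence_targets(points, temperature):
    """Row-stochastic target map from 3D points: nearby points get high weight."""
    # points: (N, 3), one 3D point per token across all views.
    sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_alignment_loss(attention_maps, target, eps=1e-8):
    """Mean cross-entropy between the target map and each layer's attention map."""
    losses = [-np.sum(target * np.log(attn + eps), axis=1).mean()
              for attn in attention_maps]
    return float(np.mean(losses))

def total_loss(flow_loss, align_loss, align_weight):
    # Weighted sum; the weighting scheme is an assumption of this sketch.
    return flow_loss + align_weight * align_loss


rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))  # 6 tokens with known 3D positions
target = correspondence_targets(points, temperature=0.5)
uniform_attn = np.full((6, 6), 1.0 / 6.0)
loss_uniform = attention_alignment_loss([uniform_attn], target)
loss_aligned = attention_alignment_loss([target], target)
```

An attention map that already matches the target scores lower than a uniform one, which is the behavior an alignment term is meant to reward.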
Empirical Validation / Results
Experiments are conducted on the Depth Anything 3 (DA3) benchmark under severe motion blur degradation. The feed-forward reconstructor used is DA3-GIANT-1.1. Comparisons are made against:
- Single-view restoration baselines: Restormer, HI-Diff, InstructIR, MoCE-IR.
- Multi-view restoration baselines: Video restoration models (VRT, FMA-Net) and a VAE-based multi-view denoiser (VAE_MVD).
1. Pose Estimation
GARD significantly outperforms all baselines in camera pose estimation accuracy (AUC5, AUC30).
Table 1: Quantitative Pose Estimation Results (AUC5 ↑ / AUC30 ↑)
| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 87.20 / 96.65 | 53.45 / 84.68 | 92.44 / 98.70 | 42.47 / 86.91 | 82.66 / 92.95 |
| LQ Input | 4.10 / 32.90 | 16.72 / 61.38 | 20.83 / 66.43 | 7.55 / 51.39 | 34.55 / 71.02 |
| GARD (Ours) | 12.00 / 67.22 | 35.75 / 74.68 | 62.24 / 92.37 | 35.55 / 84.73 | 56.44 / 87.45 |
| Best Baseline | 3.89 / 30.17 | 21.73 / 63.25 | 54.80 / 85.91 | 24.94 / 76.50 | 39.20 / 76.12 |
Qualitatively, GARD produces more accurate and consistent camera trajectories (Fig. 5).
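One way to read Table 1 is to ask how much of the accuracy lost to degradation is won back. The short calculation below uses only the AUC30 values from the table; the "gap recovered" ratio and the helper name `gap_recovered` are illustrative conveniences, not a metric reported in the paper.

```python
# AUC30 values copied from Table 1: (HQ input, LQ input, GARD).
auc30 = {
    "HiRoom":    (96.65, 32.90, 67.22),
    "ETH3D":     (84.68, 61.38, 74.68),
    "DTU":       (98.70, 66.43, 92.37),
    "7Scenes":   (86.91, 51.39, 84.73),
    "ScanNet++": (92.95, 71.02, 87.45),
}

def gap_recovered(hq, lq, restored):
    """Fraction of the HQ-to-LQ accuracy drop that the restored result wins back."""
    return (restored - lq) / (hq - lq)

recovered = {name: gap_recovered(*vals) for name, vals in auc30.items()}
for name, frac in recovered.items():
    print(f"{name}: {frac:.1%} of the AUC30 gap recovered")
```

On these numbers the recovered fraction ranges from roughly 54% on HiRoom to roughly 94% on 7Scenes, so the benefit varies considerably across datasets.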
2. 3D Reconstruction
GARD achieves the best reconstruction quality (lower Overall error, higher F-score).
Table 2: Quantitative 3D Reconstruction Results (Overall ↓ / F-score ↑)
| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 0.069 / 84.05 | 0.812 / 60.81 | 2.475 / - | 0.159 / 45.15 | 0.265 / 50.25 |
| LQ Input | 1.634 / 11.74 | 1.564 / 37.50 | 6.611 / - | 0.363 / 18.40 | 0.335 / 24.13 |
| GARD (Ours) | 0.293 / 18.25 | 1.136 / 45.79 | 4.760 / - | 0.190 / 36.08 | 0.277 / 35.77 |
| Best Baseline | 0.750 / 12.41 | 1.493 / 37.15 | 5.563 / - | 0.259 / 29.80 | 0.319 / 30.45 |
Qualitatively, GARD produces more complete and accurate 3D point clouds (Fig. 6).
3. Image Restoration
GARD achieves higher PSNR than the best baseline on all five datasets and lower LPIPS on four of them; on ETH3D the best baseline has the lower LPIPS (0.611 vs. 0.635).
Table 3: Quantitative Image Restoration Results (PSNR ↑ / LPIPS ↓)
| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| GARD (Ours) | 21.89 / 0.362 | 21.88 / 0.635 | 21.25 / 0.418 | 22.67 / 0.249 | 22.19 / 0.345 |
| Best Baseline | 19.76 / 0.493 | 21.37 / 0.611 | 20.54 / 0.434 | 21.74 / 0.404 | 21.50 / 0.379 |
Qualitatively, GARD recovers finer details and sharper images (Fig. 7).
4. Ablation Studies
Ablation on Training Components (Table 4): The full model (with both Interpolated Flow and Attention Alignment) performs best. The combination is crucial: interpolated flow provides a structural prior from the degraded input, which the attention alignment can then effectively leverage.
Ablation on Number of Input Views (Table 5): Performance in both pose estimation and 3D reconstruction monotonically improves as the number of input views increases (from 4 to 50), demonstrating that richer cross-view information substantially benefits the geometric restoration.
Theoretical and Practical Implications
- Theoretical: Validates the core hypothesis that geometry-aware feature spaces from feed-forward 3D reconstructors are a superior domain for multi-view restoration compared to image space or VAE latents. These spaces inherently preserve cross-view consistency and structural details.
- Practical: Provides a robust framework for real-world 3D reconstruction where image degradations are common. It enables a unified pipeline for joint geometry and image recovery without retraining the base reconstruction model, offering a practical solution for applications in robotics, autonomous systems, and 3D content creation from imperfect captures.
Conclusion
The GARD framework successfully addresses the challenge of robust multi-view 3D reconstruction under degradation by performing diffusion-based denoising directly in the geometry-aware feature space of a feed-forward model. This approach leverages the model's inherent cross-view reasoning capabilities to jointly recover accurate 3D geometry and high-quality RGB images. Extensive experiments demonstrate state-of-the-art performance across multiple tasks and benchmarks.
Limitation and Future Work: The iterative nature of the diffusion denoiser limits inference speed. Future directions include:
- Designing more efficient denoiser architectures.
- Exploring multi-layer denoising strategies to further reduce error accumulation and improve quality.