Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Summary (Overview)

  • Novel Denoising Space: Proposes Geometry-Aware Representation Denoising (GARD), a framework that performs multi-view restoration via diffusion directly in the geometry-aware feature space of a frozen feed-forward 3D reconstruction model, rather than in pixel space or compressed VAE latents.
  • Joint Recovery: Enables simultaneous recovery of accurate 3D scene geometry (depth, pose) and high-quality multi-view RGB images through dedicated decoders from the denoised representations.
  • Enhanced Robustness: Addresses the vulnerability of feed-forward 3D reconstructors to real-world degradations (e.g., motion blur) by denoising the intermediate features that are crucial for geometric reasoning.
  • Superior Performance: Demonstrates state-of-the-art results on the Depth Anything 3 (DA3) benchmark across pose estimation, 3D reconstruction, and image restoration tasks under severe degradation.
  • Key Components: Employs an interpolated flow matching loss (starting from noisy degraded features) and an attention alignment loss to enforce cross-view geometric consistency during denoising.

Introduction and Theoretical Foundation

Multi-view 3D reconstruction is a fundamental computer vision task with applications in navigation, robotics, and AR/VR. While recent feed-forward models (e.g., Depth Anything 3) have advanced the field by directly inferring scene geometry from multi-view images, their performance degrades significantly under real-world degradations like motion blur. These degradations obscure fine textures and structural cues, disrupting the geometric consistency learned by the models.

Existing approaches to improve robustness follow a "restore-then-reconstruct" paradigm, where degraded images are restored before being fed to the reconstructor. However, these approaches have two key limitations:

  1. Single-view restoration models cannot leverage complementary multi-view information or enforce cross-view consistency.
  2. Multi-view restoration models (including recent diffusion-based ones) often operate in heavily compressed VAE latent spaces, which act as information bottlenecks, losing fine-grained details essential for geometry.

Theoretical motivation comes from advances in Representation Autoencoders (RAEs), which show that high-dimensional, semantically rich representations (like those from pretrained encoders) are superior to compressed VAE latents for preserving structural and semantic information. Feed-forward 3D reconstructors naturally learn geometry-aware feature representations through cross-view transformer attention. The paper posits that performing restoration directly in this geometry-aware feature space is a more suitable domain for recovering both accurate geometry and high-quality imagery.

Methodology

The GARD framework denoises the intermediate features of a frozen feed-forward reconstructor $\mathcal{F}(\cdot)$, which consists of a multi-view encoder $\mathcal{E}(\cdot)$ and a geometry decoder $\mathcal{D}(\cdot)$.

1. Task Formulation

Given $V$ degraded multi-view images $\mathbf{I}^{\text{deg}} \in \mathbb{R}^{V \times H \times W \times 3}$, the goal is to recover:

  • Restored images $\mathbf{I}^{\text{res}} \in \mathbb{R}^{V \times H \times W \times 3}$
  • Underlying 3D scene geometry $\mathcal{G} = \{\mathcal{G}^{\text{depth}}, \mathcal{G}^{\text{pose}}\}$, where $\mathcal{G}^{\text{depth}} \in \mathbb{R}^{V \times H \times W \times 1}$ and $\mathcal{G}^{\text{pose}} \in \mathbb{R}^{V \times 9}$.

2. Framework Overview

The process is illustrated in Figure 3(a):

  1. Degraded inputs $\mathbf{I}^{\text{deg}}$ are encoded by the $L$-layer encoder $\mathcal{E}(\cdot)$ to produce layer-wise features $\{\mathbf{z}^l_{\text{deg}}\}_{l=1}^L$.
  2. At a specific layer $K$ (chosen as $K=18$), the GARD denoiser $S_\theta(\cdot)$ refines the degraded feature: $\mathbf{z}^K_{\text{res}} = S_\theta(\mathbf{z}^K_{\text{deg}})$.
  3. This refined feature is propagated through the remaining encoder layers to produce restored features $\{\mathbf{z}^l_{\text{res}}\}_{l=K}^L$.
  4. A set of four feature levels $\mathcal{Z}_{\text{res}} = \{\mathbf{z}^l_{\text{res}}\}_{l \in \mathcal{M}}$ (with $|\mathcal{M}|=4$) is extracted.
  5. Two decoders process $\mathcal{Z}_{\text{res}}$:
    • The original geometry decoder $\mathcal{D}(\cdot)$ predicts $\mathcal{G}$.
    • A separately trained RGB image decoder $\mathcal{D}_{\text{rgb}}(\cdot)$ (adapted from [17]) predicts $\mathbf{I}^{\text{res}}$.
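The five steps above can be sketched as a toy data flow in NumPy. This is only an illustration of where the denoiser sits in the pipeline: every function here is a hypothetical stand-in (a residual linear map for each encoder layer, a simple smoothing step for $S_\theta$, pooling-based decoders), and the encoder depth `L = 24`, the tapped layers `M`, and all tensor sizes are assumptions. Only `K = 18` and the output shapes come from the text.

```python
import numpy as np

V, H, W = 4, 8, 8            # toy sizes: number of views and image resolution
N, C = 16, 32                # tokens per view and feature width (hypothetical)
L, K = 24, 18                # L is an assumed depth; K = 18 follows the text
M = (18, 20, 22, 24)         # hypothetical tapped layers, all at or after K

rng = np.random.default_rng(0)
layer_weights = [rng.normal(scale=0.05, size=(C, C)) for _ in range(L)]
patch_proj = rng.normal(scale=0.05, size=(H * W * 3 // N, C))


def encoder_layer(z, l):
    """Stand-in for encoder layer l (1-indexed): a residual linear map."""
    return z + np.tanh(z @ layer_weights[l - 1])


def denoiser(z_deg):
    """Stand-in for S_theta; the real model is a multi-view diffusion network."""
    return z_deg - 0.1 * (z_deg - z_deg.mean(axis=(0, 1), keepdims=True))


def pool(feats):
    """Average the tapped feature levels over tokens and levels: (V, C)."""
    return np.mean([f.mean(axis=1) for f in feats], axis=0)


def geometry_decoder(feats):
    """Stand-in for the frozen decoder D: depth (V,H,W,1) and pose (V,9)."""
    pooled = pool(feats)
    depth = np.ones((V, H, W, 1)) * pooled[:, 0].reshape(V, 1, 1, 1)
    return depth, pooled[:, :9]


def rgb_decoder(feats):
    """Stand-in for the separately trained decoder D_rgb: images (V,H,W,3)."""
    return np.ones((V, H, W, 3)) * pool(feats)[:, :3].reshape(V, 1, 1, 3)


def run_gard(images_deg):
    z = images_deg.reshape(V, N, -1) @ patch_proj   # toy tokenisation: (V, N, C)
    restored = {}
    for l in range(1, L + 1):                       # step 1: run the encoder
        z = encoder_layer(z, l)
        if l == K:
            z = denoiser(z)                         # step 2: refine layer-K feature
        if l >= K:
            restored[l] = z                         # step 3: restored features
    feats = [restored[l] for l in M]                # step 4: four feature levels
    depth, pose = geometry_decoder(feats)           # step 5a: geometry
    rgb = rgb_decoder(feats)                        # step 5b: restored images
    return rgb, depth, pose


rgb, depth, pose = run_gard(rng.uniform(size=(V, H, W, 3)))
print(rgb.shape, depth.shape, pose.shape)
```

The point of the sketch is structural: the encoder and geometry decoder are never updated, and the only learned modules in the loop are the denoiser at layer $K$ and the RGB decoder.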

3. GARD Denoiser Architecture & Training

The denoiser $S_\theta(\cdot)$ is a multi-view diffusion model based on the DiT$^{\text{DH}}$ design from RAE, augmented with global attention layers to enable cross-view context aggregation (Fig. 3(c)).

Interpolated Flow Matching Loss: Instead of starting denoising from pure Gaussian noise, the process begins from a noise-perturbed degraded feature, which retains some structural prior:

$$\tilde{\mathbf{z}}^K_{\text{deg}} = \mathbf{z}^K_{\text{deg}} + \alpha \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \alpha \in [0,1]$$

The training objective is:

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t, \mathbf{z}^K_{\text{deg}}, \mathbf{z}^K_{\text{clean}}} \left[ \| \mathbf{v}(\mathbf{z}_t, t) - \mathbf{v}^*(\mathbf{z}_t, t) \|_2^2 \right]$$

where $\mathbf{z}_t = (1-t)\tilde{\mathbf{z}}^K_{\text{deg}} + t \mathbf{z}^K_{\text{clean}}$, $t \sim \mathcal{U}(0,1)$, the predicted velocity is $\mathbf{v}(\mathbf{z}_t, t)=S_\theta(\mathbf{z}_t, t)$, and the ground-truth velocity is $\mathbf{v}^*(\mathbf{z}_t, t) = \mathbf{z}^K_{\text{clean}} - \tilde{\mathbf{z}}^K_{\text{deg}}$.
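A minimal NumPy sketch of this objective may make the construction concrete. It draws one noise sample and one timestep per call (a single-sample estimate of the expectation), and the reduction (squared L2 norm over the channel axis, averaged over views and tokens) is an assumption. The `euler_sample` function is also an assumption: the text does not specify the sampler, so the sketch uses plain Euler integration of the learned velocity from $t=0$ to $t=1$, which is a common choice for flow matching models. `velocity_fn` is a hypothetical callable standing in for $S_\theta$.

```python
import numpy as np


def perturb(z_deg, alpha, rng):
    """Noise-perturbed start point: z~_deg = z_deg + alpha * eps, eps ~ N(0, I)."""
    return z_deg + alpha * rng.standard_normal(z_deg.shape)


def flow_matching_loss(velocity_fn, z_deg, z_clean, alpha, rng):
    """Single-sample estimate of L_flow for one (degraded, clean) feature pair."""
    z_tilde = perturb(z_deg, alpha, rng)
    t = rng.uniform(0.0, 1.0)                     # t ~ U(0, 1)
    z_t = (1.0 - t) * z_tilde + t * z_clean       # point on the straight path
    v_target = z_clean - z_tilde                  # ground-truth velocity v*
    v_pred = velocity_fn(z_t, t)                  # stand-in for S_theta(z_t, t)
    # Assumed reduction: squared L2 norm over channels, mean over views/tokens.
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))


def euler_sample(velocity_fn, z_deg, alpha, rng, num_steps=10):
    """Assumed sampler: Euler steps from the perturbed degraded feature to t = 1."""
    z = perturb(z_deg, alpha, rng)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity_fn(z, i * dt)       # one network call per step
    return z
```

Because the path between $\tilde{\mathbf{z}}^K_{\text{deg}}$ and $\mathbf{z}^K_{\text{clean}}$ is a straight line, a velocity function that points exactly along it, for example `lambda z, t: (z_clean - z) / (1.0 - t)` for a fixed clean target, gives zero loss up to rounding and lets the Euler loop land on `z_clean`. Each Euler step costs one forward pass of the denoiser, so `num_steps` directly controls inference cost.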

Attention Alignment Loss: To explicitly encourage learning of cross-view correspondences, the global attention maps in the denoiser are aligned with geometrically consistent target correspondence maps $\mathbf{A}^*$ derived from clean input point clouds:

$$\mathcal{L}_{\text{attn}} = -\mathbb{E} \left[ \mathbf{A}^* \log \mathbf{A}^J \right]$$

where $\mathbf{A}^J$ is the global attention map at layer $J$ of the denoiser.

The total loss is: $\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_{\text{attn}} \mathcal{L}_{\text{attn}}$.
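The attention alignment term has the form of a cross-entropy between the target correspondence map and the attention map. The sketch below assumes both maps are row-stochastic matrices of shape (queries, keys), sums over keys, and averages over queries; that reduction, the `eps` stabiliser, and the function names are illustrative assumptions. How $\mathbf{A}^*$ is built from the clean point clouds is not covered here, so the target is simply passed in.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax, used to turn attention logits into a map."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_alignment_loss(attn, target, eps=1e-8):
    """Cross-entropy -sum_j A*_ij log A^J_ij, averaged over query tokens i."""
    return float(-np.mean(np.sum(target * np.log(attn + eps), axis=-1)))


def total_loss(l_flow, l_attn, lambda_attn):
    """L = L_flow + lambda_attn * L_attn."""
    return l_flow + lambda_attn * l_attn
```

With a one-hot target row (a single correct correspondence per query), the loss is close to zero when the attention map puts all its mass on that key, and equals the log of the number of keys when attention is uniform.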

Empirical Validation / Results

Experiments are conducted on the Depth Anything 3 (DA3) benchmark under severe motion blur degradation. The feed-forward reconstructor used is DA3-GIANT-1.1. Comparisons are made against:

  • Single-view restoration baselines: Restormer, HI-Diff, InstructIR, MoCE-IR.
  • Multi-view restoration baselines: Video restoration models (VRT, FMA-Net) and a VAE-based multi-view denoiser (VAE_MVD).

1. Pose Estimation

GARD significantly outperforms all baselines in camera pose estimation accuracy (AUC5, AUC30).

Table 1: Quantitative Pose Estimation Results (AUC5 ↑ / AUC30 ↑)

| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 87.20 / 96.65 | 53.45 / 84.68 | 92.44 / 98.70 | 42.47 / 86.91 | 82.66 / 92.95 |
| LQ Input | 4.10 / 32.90 | 16.72 / 61.38 | 20.83 / 66.43 | 7.55 / 51.39 | 34.55 / 71.02 |
| GARD (Ours) | 12.00 / 67.22 | 35.75 / 74.68 | 62.24 / 92.37 | 35.55 / 84.73 | 56.44 / 87.45 |
| Best Baseline | 3.89 / 30.17 | 21.73 / 63.25 | 54.80 / 85.91 | 24.94 / 76.50 | 39.20 / 76.12 |

Qualitatively, GARD produces more accurate and consistent camera trajectories (Fig. 5).
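The summary does not define AUC5 and AUC30. As a hedged illustration, the sketch below follows one common convention for pose AUC: bin the per-pair angular errors into 1-degree bins up to the threshold, accumulate the fraction of pairs below each bin edge, and average. Both the binning scheme and the choice of per-pair error (often the larger of the rotation and translation angular errors) are assumptions here, not details confirmed by the source.

```python
import numpy as np


def pose_auc(errors_deg, max_threshold):
    """Area under the accuracy-vs-threshold curve, using 1-degree bins.

    errors_deg: per-pair angular pose errors in degrees (assumed input).
    max_threshold: integer upper threshold in degrees, e.g. 5 or 30.
    """
    errors = np.asarray(errors_deg, dtype=np.float64)
    bins = np.arange(max_threshold + 1)            # edges 0, 1, ..., max_threshold
    hist, _ = np.histogram(errors, bins=bins)      # errors above the threshold are dropped
    accuracy = np.cumsum(hist / len(errors))       # fraction of pairs under each edge
    return float(np.mean(accuracy))
```

Under this convention a value of 1.0 means every pair has an error below 1 degree, and the table entries correspond to this quantity scaled to a percentage.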

2. 3D Reconstruction

GARD achieves the best reconstruction quality (lower Overall error, higher F-score).

Table 2: Quantitative 3D Reconstruction Results (Overall ↓ / F-score ↑)

| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| HQ Input | 0.069 / 84.05 | 0.812 / 60.81 | 2.475 / - | 0.159 / 45.15 | 0.265 / 50.25 |
| LQ Input | 1.634 / 11.74 | 1.564 / 37.50 | 6.611 / - | 0.363 / 18.40 | 0.335 / 24.13 |
| GARD (Ours) | 0.293 / 18.25 | 1.136 / 45.79 | 4.760 / - | 0.190 / 36.08 | 0.277 / 35.77 |
| Best Baseline | 0.750 / 12.41 | 1.493 / 37.15 | 5.563 / - | 0.259 / 29.80 | 0.319 / 30.45 |

Qualitatively, GARD produces more complete and accurate 3D point clouds (Fig. 6).
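The two reconstruction metrics are not defined in this summary. The sketch below shows the definitions commonly used for point-cloud evaluation, stated as assumptions: Overall is the mean of accuracy (average distance from predicted points to the nearest ground-truth point) and completeness (the reverse direction), and F-score is the harmonic mean of precision and recall at a distance threshold `tau`. The threshold value, units, and any alignment or subsampling steps of the actual benchmark protocol are not given in the source. The brute-force distance matrix is only suitable for small clouds.

```python
import numpy as np


def nearest_distances(src, dst):
    """Distance from each point in src (n, 3) to its nearest point in dst (m, 3)."""
    pairwise = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return pairwise.min(axis=1)


def reconstruction_metrics(pred, gt, tau):
    """Return (overall, f_score) under the assumed definitions above."""
    d_pred = nearest_distances(pred, gt)           # accuracy direction
    d_gt = nearest_distances(gt, pred)             # completeness direction
    overall = 0.5 * (d_pred.mean() + d_gt.mean())
    precision = np.mean(d_pred < tau)
    recall = np.mean(d_gt < tau)
    if precision + recall == 0:
        return float(overall), 0.0
    return float(overall), float(2 * precision * recall / (precision + recall))
```

The F-score returned here lies in [0, 1]; the table reports it as a percentage.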

3. Image Restoration

GARD also achieves the best PSNR on all five datasets and the best LPIPS on four of them; on ETH3D the best baseline reaches a slightly lower LPIPS (0.611 vs. 0.635).

Table 3: Quantitative Image Restoration Results (PSNR ↑ / LPIPS ↓)

| Model | HiRoom | ETH3D | DTU | 7Scenes | ScanNet++ |
|---|---|---|---|---|---|
| GARD (Ours) | 21.89 / 0.362 | 21.88 / 0.635 | 21.25 / 0.418 | 22.67 / 0.249 | 22.19 / 0.345 |
| Best Baseline | 19.76 / 0.493 | 21.37 / 0.611 | 20.54 / 0.434 | 21.74 / 0.404 | 21.50 / 0.379 |

Qualitatively, GARD recovers finer details and sharper images (Fig. 7).
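For reference, PSNR is computed from the mean squared error between the restored and reference images as $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes images scaled to $[0, 1]$ (so `max_value = 1.0`) and averages the error over all pixels and channels; the exact evaluation protocol is not stated in this summary. LPIPS is not sketched because it depends on a pretrained perceptual network.

```python
import numpy as np


def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = np.asarray(restored, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")                        # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

As a sanity check on scale, a uniform error of 0.1 on a $[0, 1]$ image gives an MSE of 0.01 and a PSNR of 20 dB, which is close to the range reported in Table 3.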

4. Ablation Studies

Ablation on Training Components (Table 4): The full model (with both Interpolated Flow and Attention Alignment) performs best. The combination is crucial: interpolated flow provides a structural prior from the degraded input, which the attention alignment can then effectively leverage.

Ablation on Number of Input Views (Table 5): Performance in both pose estimation and 3D reconstruction monotonically improves as the number of input views increases (from 4 to 50), demonstrating that richer cross-view information substantially benefits the geometric restoration.

Theoretical and Practical Implications

  • Theoretical: Validates the core hypothesis that geometry-aware feature spaces from feed-forward 3D reconstructors are a superior domain for multi-view restoration compared to image space or VAE latents. These spaces inherently preserve cross-view consistency and structural details.
  • Practical: Provides a robust framework for real-world 3D reconstruction where image degradations are common. It enables a unified pipeline for joint geometry and image recovery without retraining the base reconstruction model, offering a practical solution for applications in robotics, autonomous systems, and 3D content creation from imperfect captures.

Conclusion

The GARD framework successfully addresses the challenge of robust multi-view 3D reconstruction under degradation by performing diffusion-based denoising directly in the geometry-aware feature space of a feed-forward model. This approach leverages the model's inherent cross-view reasoning capabilities to jointly recover accurate 3D geometry and high-quality RGB images. Extensive experiments demonstrate state-of-the-art performance across multiple tasks and benchmarks.

Limitation and Future Work: The iterative nature of the diffusion denoiser limits inference speed. Future directions include:

  • Designing more efficient denoiser architectures.
  • Exploring multi-layer denoising strategies to further reduce error accumulation and improve quality.