Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Summary (Overview)

Novel Denoising Space: Proposes Geometry-Aware Representation Denoising (GARD), a framework that performs multi-view restoration via diffusion directly in the geometry-aware feature space of a frozen feed-forward 3D reconstruction model, rather than in pixel space or compressed VAE latents.
Joint Recovery: Enables simultaneous recovery of accurate 3D scene geometry (depth, pose) and high-quality multi-view RGB images through dedicated decoders from the denoised representations.
Enhanced Robustness: Addresses the vulnerability of feed-forward 3D reconstructors to real-world degradations (e.g., motion blur) by denoising the intermediate features that are crucial for geometric reasoning.
Superior Performance: Demonstrates state-of-the-art results on the Depth Anything 3 (DA3) benchmark across pose estimation, 3D reconstruction, and image restoration tasks under severe degradation.
Key Components: Employs an interpolated flow matching loss (starting from noisy degraded features) and an attention alignment loss to enforce cross-view geometric consistency during denoising.

Introduction and Theoretical Foundation

Multi-view 3D reconstruction is a fundamental computer vision task with applications in navigation, robotics, and AR/VR. While recent feed-forward models (e.g., Depth Anything 3) have advanced the field by directly inferring scene geometry from multi-view images, their performance degrades significantly under real-world degradations like motion blur. These degradations obscure fine textures and structural cues, disrupting the geometric consistency learned by the models.

Existing approaches to improve robustness follow a "restore-then-reconstruct" paradigm, in which degraded images are restored before being fed to the reconstructor. These approaches have two key limitations:

Single-view restoration models cannot leverage complementary multi-view information or enforce cross-view consistency.
Multi-view restoration models (including recent diffusion-based ones) often operate in heavily compressed VAE latent spaces, which act as information bottlenecks, losing fine-grained details essential for geometry.

The theoretical motivation comes from advances in Representation Autoencoders (RAEs), which show that high-dimensional, semantically rich representations (like those from pretrained encoders) preserve structural and semantic information better than compressed VAE latents. Feed-forward 3D reconstructors naturally learn geometry-aware feature representations through cross-view transformer attention. The paper posits that this geometry-aware feature space is a more suitable domain for restoration when the goal is to recover both accurate geometry and high-quality imagery.

Methodology

The GARD framework denoises the intermediate features of a frozen feed-forward reconstructor $\mathcal{F}(\cdot)$, which consists of a multi-view encoder $\mathcal{E}(\cdot)$ and a geometry decoder $\mathcal{D}(\cdot)$.

1. Task Formulation

Given $V$ degraded multi-view images $\mathbf{I}^{\text{deg}} \in \mathbb{R}^{V \times H \times W \times 3}$, the goal is to recover:

Restored images $\mathbf{I}^{\text{res}} \in \mathbb{R}^{V \times H \times W \times 3}$
Underlying 3D scene geometry $\mathcal{G} = \{\mathcal{G}^{\text{depth}}, \mathcal{G}^{\text{pose}}\}$, where $\mathcal{G}^{\text{depth}} \in \mathbb{R}^{V \times H \times W \times 1}$ and $\mathcal{G}^{\text{pose}} \in \mathbb{R}^{V \times 9}$.

2. Framework Overview

The process is illustrated in Figure 3(a):

Degraded inputs $\mathbf{I}^{\text{deg}}$ are encoded by the $L$-layer encoder $\mathcal{E}(\cdot)$ to produce layer-wise features $\{\mathbf{z}^l_{\text{deg}}\}_{l=1}^L$.
At a specific layer $K$ (chosen as $K=18$), the GARD denoiser $S_\theta(\cdot)$ refines the degraded feature: $\mathbf{z}^K_{\text{res}} = S_\theta(\mathbf{z}^K_{\text{deg}})$.
This refined feature is propagated through the remaining encoder layers to produce restored features $\{\mathbf{z}^l_{\text{res}}\}_{l=K}^L$.
A set of four feature levels $\mathcal{Z}_{\text{res}} = \{\mathbf{z}^l_{\text{res}}\}_{l \in \mathcal{M}}$ (with $|\mathcal{M}|=4$) is extracted.
Two decoders process $\mathcal{Z}_{\text{res}}$:
- The original geometry decoder $\mathcal{D}(\cdot)$ predicts $\mathcal{G}$ .
- A separately trained RGB image decoder $\mathcal{D}_{\text{rgb}}(\cdot)$ (adapted from [17]) predicts $\mathbf{I}^{\text{res}}$ .
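
The data flow above can be sketched with toy stand-ins. This is only a shape-level illustration under stated assumptions: the encoder depth `L = 24`, the extracted layer indices `M`, the tensor sizes, and the function names `encoder_layer`, `toy_denoiser`, and `gard_forward` are all hypothetical, and the random linear layers do not model the cross-view attention of the real encoder. Only `K = 18` and the count of four feature levels come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

L, K = 24, 18            # K = 18 follows the text; L = 24 is an assumed depth
M = (18, 20, 22, 24)     # assumed indices of the four extracted levels
V, N, C = 3, 16, 8       # toy sizes: views, tokens per view, channels

# Stand-in for the frozen encoder: one fixed random linear map per layer.
weights = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(L)]

def encoder_layer(z, l):
    """Apply toy encoder layer l (1-indexed) to features of shape (V, N, C)."""
    return np.tanh(z @ weights[l - 1])

def toy_denoiser(z_deg):
    """Placeholder for S_theta; identity here, used for shape checking only."""
    return z_deg

def gard_forward(tokens_deg):
    z = tokens_deg
    for l in range(1, K + 1):          # layers 1..K on the degraded input
        z = encoder_layer(z, l)
    z = toy_denoiser(z)                # z^K_deg -> z^K_res
    feats = {K: z}
    for l in range(K + 1, L + 1):      # propagate through layers K+1..L
        z = encoder_layer(z, l)
        feats[l] = z
    return [feats[l] for l in M]       # Z_res, the input to both decoders

z_res = gard_forward(rng.standard_normal((V, N, C)))
```

Because the encoder weights are never updated, only the denoiser (and the separately trained RGB decoder) would carry trainable parameters in such a setup.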

3. GARD Denoiser Architecture & Training

The denoiser $S_\theta(\cdot)$ is a multi-view diffusion model based on the DiT$^\text{DH}$ design from RAE, augmented with global attention layers to enable cross-view context aggregation (Fig. 3(c)).

Interpolated Flow Matching Loss: Instead of starting denoising from pure Gaussian noise, the process begins from a noise-perturbed degraded feature, which retains some structural prior:

$$\tilde{\mathbf{z}}^K_{\text{deg}} = \mathbf{z}^K_{\text{deg}} + \alpha \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \alpha \in [0,1]$$

The training objective is:

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t, \mathbf{z}^K_{\text{deg}}, \mathbf{z}^K_{\text{clean}}} \left[ \| \mathbf{v}(\mathbf{z}_t, t) - \mathbf{v}^*(\mathbf{z}_t, t) \|_2^2 \right]$$

where $\mathbf{z}_t = (1-t)\tilde{\mathbf{z}}^K_{\text{deg}} + t \mathbf{z}^K_{\text{clean}}$, $t \sim \mathcal{U}(0,1)$, the predicted velocity is $\mathbf{v}(\mathbf{z}_t, t)=S_\theta(\mathbf{z}_t, t)$, and the ground-truth velocity is $\mathbf{v}^*(\mathbf{z}_t, t) = \mathbf{z}^K_{\text{clean}} - \tilde{\mathbf{z}}^K_{\text{deg}}$.
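
A minimal NumPy sketch of one Monte Carlo sample of this objective may help make the symbols concrete. The function name `interpolated_flow_loss`, the callable `velocity_fn` (standing in for $S_\theta$), and the reduction (sum over channels, mean over tokens) are assumptions for illustration; the source specifies only the squared L2 norm inside an expectation.

```python
import numpy as np

def interpolated_flow_loss(velocity_fn, z_deg, z_clean, alpha, rng):
    """One-sample estimate of L_flow for features of shape (..., C)."""
    eps = rng.standard_normal(z_deg.shape)
    z_tilde = z_deg + alpha * eps                 # noise-perturbed degraded feature
    t = rng.uniform(0.0, 1.0)                     # t ~ U(0, 1), in [0, 1)
    z_t = (1.0 - t) * z_tilde + t * z_clean       # point on the straight path
    v_target = z_clean - z_tilde                  # ground-truth velocity v*
    v_pred = velocity_fn(z_t, t)                  # stand-in for S_theta(z_t, t)
    # Assumed reduction: squared L2 norm over channels, averaged over tokens.
    return float(np.mean(np.sum((v_pred - v_target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
z_deg = rng.standard_normal((2, 4, 8))
z_clean = rng.standard_normal((2, 4, 8))

# On a straight path, (z_clean - z_t) / (1 - t) equals z_clean - z_tilde,
# so this oracle (which cheats by knowing z_clean) attains near-zero loss.
oracle = lambda z_t, t: (z_clean - z_t) / (1.0 - t)
zero = lambda z_t, t: np.zeros_like(z_t)

loss_oracle = interpolated_flow_loss(oracle, z_deg, z_clean, 0.5, rng)
loss_zero = interpolated_flow_loss(zero, z_deg, z_clean, 0.5, rng)
```

The oracle check reflects why the target velocity is constant along the path: $\mathbf{z}_t$ moves linearly from $\tilde{\mathbf{z}}^K_{\text{deg}}$ at $t=0$ to $\mathbf{z}^K_{\text{clean}}$ at $t=1$.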

Attention Alignment Loss: To explicitly encourage learning of cross-view correspondences, the global attention maps in the denoiser are aligned with geometrically consistent target correspondence maps $\mathbf{A}^*$ derived from clean input point clouds:

$$\mathcal{L}_{\text{attn}} = -\mathbb{E} \left[ \mathbf{A}^* \log \mathbf{A}^J \right]$$

where $\mathbf{A}^J$ is the global attention map at layer $J$ of the denoiser.

The total loss is: $\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_{\text{attn}} \mathcal{L}_{\text{attn}}$.
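
The attention alignment term reads as a cross-entropy between target correspondences and attention rows. The sketch below assumes row-stochastic maps of shape (queries, keys), a sum over keys followed by a mean over queries, and a small `eps` for numerical stability; these choices, the function names, and the value passed as `lambda_attn` are illustrative and not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_alignment_loss(attn, target, eps=1e-8):
    """Cross-entropy -sum_k A*[q, k] log A[q, k], averaged over queries q."""
    return float(-np.mean(np.sum(target * np.log(attn + eps), axis=-1)))

def total_loss(l_flow, l_attn, lambda_attn):
    """L = L_flow + lambda_attn * L_attn."""
    return l_flow + lambda_attn * l_attn

n = 4
target = np.eye(n)                               # toy one-to-one correspondences
aligned = softmax(10.0 * np.eye(n))              # attention peaked on the targets
uniform = np.full((n, n), 1.0 / n)               # attention with no preference

loss_aligned = attention_alignment_loss(aligned, target)
loss_uniform = attention_alignment_loss(uniform, target)
combined = total_loss(0.25, loss_uniform, lambda_attn=0.1)  # hypothetical weight
```

With one-hot targets, uniform attention over `n` keys costs about `log(n)`, while attention concentrated on the correct key drives the term toward zero.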

Empirical Validation / Results

Experiments are conducted on the Depth Anything 3 (DA3) benchmark under severe motion blur degradation. The feed-forward reconstructor used is DA3-GIANT-1.1. Comparisons are made against:

Single-view restoration baselines: Restormer, HI-Diff, InstructIR, MoCE-IR.
Multi-view restoration baselines: Video restoration models (VRT, FMA-Net) and a VAE-based multi-view denoiser (VAE_MVD).

1. Pose Estimation

GARD significantly outperforms all baselines in camera pose estimation accuracy (AUC5, AUC30).

Table 1: Quantitative Pose Estimation Results (AUC5 ↑ / AUC30 ↑)

Model	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
HQ Input	87.20 / 96.65	53.45 / 84.68	92.44 / 98.70	42.47 / 86.91	82.66 / 92.95
LQ Input	4.10 / 32.90	16.72 / 61.38	20.83 / 66.43	7.55 / 51.39	34.55 / 71.02
GARD (Ours)	12.00 / 67.22	35.75 / 74.68	62.24 / 92.37	35.55 / 84.73	56.44 / 87.45
Best Baseline	3.89 / 30.17	21.73 / 63.25	54.80 / 85.91	24.94 / 76.50	39.20 / 76.12

Qualitatively, GARD produces more accurate and consistent camera trajectories (Fig. 5).
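
One way to read Table 1 is to ask what fraction of the gap between degraded (LQ) and clean (HQ) inputs GARD closes. The short script below computes that ratio for AUC30 from the table values; the `gap_recovered` helper and the ratio itself are a derived reading for illustration, not a metric reported in the source.

```python
# AUC30 values copied from Table 1: (HQ input, LQ input, GARD).
auc30 = {
    "HiRoom":    (96.65, 32.90, 67.22),
    "ETH3D":     (84.68, 61.38, 74.68),
    "DTU":       (98.70, 66.43, 92.37),
    "7Scenes":   (86.91, 51.39, 84.73),
    "ScanNet++": (92.95, 71.02, 87.45),
}

def gap_recovered(hq, lq, method):
    """Fraction of the LQ-to-HQ gap closed by a method."""
    return (method - lq) / (hq - lq)

recovered = {name: gap_recovered(*vals) for name, vals in auc30.items()}
for name, frac in recovered.items():
    print(f"{name}: {frac:.2f}")
```

On these numbers the recovered fraction ranges from a little over half on HiRoom and ETH3D to over 90% on 7Scenes, so a gap to clean-input performance remains on every dataset.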

2. 3D Reconstruction

GARD achieves the best reconstruction quality (lower Overall error, higher F-score).

Table 2: Quantitative 3D Reconstruction Results (Overall ↓ / F-score ↑)

Model	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
HQ Input	0.069 / 84.05	0.812 / 60.81	2.475 / -	0.159 / 45.15	0.265 / 50.25
LQ Input	1.634 / 11.74	1.564 / 37.50	6.611 / -	0.363 / 18.40	0.335 / 24.13
GARD (Ours)	0.293 / 18.25	1.136 / 45.79	4.760 / -	0.190 / 36.08	0.277 / 35.77
Best Baseline	0.750 / 12.41	1.493 / 37.15	5.563 / -	0.259 / 29.80	0.319 / 30.45

Qualitatively, GARD produces more complete and accurate 3D point clouds (Fig. 6).

3. Image Restoration

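GARD also achieves the best image restoration quality on nearly every entry of Table 3 (higher PSNR, lower LPIPS); the one exception is LPIPS on ETH3D, where the best baseline scores slightly lower (0.611 vs. 0.635).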

Table 3: Quantitative Image Restoration Results (PSNR ↑ / LPIPS ↓)

Model	HiRoom	ETH3D	DTU	7Scenes	ScanNet++
GARD (Ours)	21.89 / 0.362	21.88 / 0.635	21.25 / 0.418	22.67 / 0.249	22.19 / 0.345
Best Baseline	19.76 / 0.493	21.37 / 0.611	20.54 / 0.434	21.74 / 0.404	21.50 / 0.379

Qualitatively, GARD recovers finer details and sharper images (Fig. 7).

4. Ablation Studies

Ablation on Training Components (Table 4): The full model (with both Interpolated Flow and Attention Alignment) performs best. The combination is crucial: interpolated flow provides a structural prior from the degraded input, which the attention alignment can then effectively leverage.

Ablation on Number of Input Views (Table 5): Performance in both pose estimation and 3D reconstruction monotonically improves as the number of input views increases (from 4 to 50), demonstrating that richer cross-view information substantially benefits the geometric restoration.

Theoretical and Practical Implications

Theoretical: Validates the core hypothesis that geometry-aware feature spaces from feed-forward 3D reconstructors are a superior domain for multi-view restoration compared to image space or VAE latents. These spaces inherently preserve cross-view consistency and structural details.
Practical: Provides a robust framework for real-world 3D reconstruction where image degradations are common. It enables a unified pipeline for joint geometry and image recovery without retraining the base reconstruction model, offering a practical solution for applications in robotics, autonomous systems, and 3D content creation from imperfect captures.

Conclusion

The GARD framework successfully addresses the challenge of robust multi-view 3D reconstruction under degradation by performing diffusion-based denoising directly in the geometry-aware feature space of a feed-forward model. This approach leverages the model's inherent cross-view reasoning capabilities to jointly recover accurate 3D geometry and high-quality RGB images. Extensive experiments demonstrate state-of-the-art performance across multiple tasks and benchmarks.

Limitation and Future Work: The iterative nature of the diffusion denoiser limits inference speed. Future directions include:

Designing more efficient denoiser architectures.
Exploring multi-layer denoising strategies to further reduce error accumulation and improve quality.
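
To make the inference cost concrete, the sketch below integrates a velocity field from the noise-perturbed degraded feature at $t=0$ to $t=1$ with fixed-step Euler updates, one denoiser call per step. The Euler scheme, the step count, and the names `euler_sample` and `oracle_velocity` are assumptions for illustration; the source does not state which sampler or how many steps are used.

```python
import numpy as np

def euler_sample(velocity_fn, z_deg, alpha, num_steps, rng):
    """Integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler."""
    z = z_deg + alpha * rng.standard_normal(z_deg.shape)  # start at tilde z^K_deg
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity_fn(z, i * dt)   # one denoiser call per step
    return z

rng = np.random.default_rng(0)
z_deg = rng.standard_normal((2, 4, 8))
z_clean = rng.standard_normal((2, 4, 8))
calls = []

def oracle_velocity(z, t):
    """Toy stand-in for S_theta that points straight at a known clean feature."""
    calls.append(t)
    return (z_clean - z) / (1.0 - t)

z_res = euler_sample(oracle_velocity, z_deg, alpha=0.5, num_steps=8, rng=rng)
```

The number of network evaluations grows linearly with `num_steps`, which is the cost that a more efficient denoiser architecture would aim to reduce.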