Summary of "DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction"
Summary (Overview)
- Novel Framework: Proposes DiffNR, a novel optimization framework that enhances Neural Representations (NRs) (e.g., neural fields, 3D Gaussians) for sparse-view CT reconstruction by integrating a conditional diffusion prior in a "repair-and-augment" strategy.
- Core Component: Introduces SliceFixer, a single-step diffusion model finetuned to correct artifacts in degraded 2D CT slices queried from NRs, conditioned on biplanar X-ray projections and a text prompt.
- Efficient Integration: The diffusion model is not queried at every optimization step. Instead, it periodically generates pseudo-reference volumes that provide auxiliary 3D perceptual supervision, avoiding the computational overhead of iterative denoising-based methods.
- Superior Performance: Achieves significant improvements over baselines (e.g., +5.79 dB PSNR for R²-Gaussian), demonstrates strong generalization to out-of-distribution (OOD) data, and maintains reasonable runtime compared to prior diffusion-based methods.
- Practical Validation: Shows improved performance on downstream tasks like lung segmentation, highlighting the method's practical utility.
Introduction and Theoretical Foundation
Sparse-view CT (SVCT) reconstruction aims to recover high-quality 3D volumes from few projections to reduce radiation exposure. Existing optimization-based methods fall into two categories:
- Neural Representation (NR) Methods: Model the volume as a learnable 3D field (e.g., neural fields like NAF, or 3D Gaussians like R²-Gaussian) and optimize via differentiable rendering. They are efficient but suffer from severe artifacts in underconstrained regions under sparse views.
- Neural Prior (NP) Methods: Use pretrained networks (e.g., unconditional 2D diffusion models) as data-driven priors and embed solvers into iterative denoising. While effective, they suffer from inter-slice jitter, hallucinations, and long processing times.
Motivation: The paper proposes to marry neural representations with diffusion models to get the best of both worlds: the volumetric consistency of a unified 3D representation and the powerful prior of pretrained 2D foundation models. The key challenge is developing an NR-aware diffusion model and integrating it efficiently.
Theoretical Foundation: The work is built upon:
- X-ray Imaging Physics (Beer-Lambert Law): For a ray $r$, the logarithmic projection value is $p(r) = \int_{r} \sigma(x)\,\mathrm{d}x$, where $\sigma$ is the density field to be recovered.
- Neural Representations:
- Neural Fields (NAF): An MLP $f_\theta$ outputs density $\sigma(x) = f_\theta(x)$. Rendering uses a discrete integral: $p(r) \approx \sum_{i} \sigma(x_i)\,\delta_i$, where $x_i$ are points sampled along the ray and $\delta_i$ the spacing between adjacent samples.
- 3D Gaussians (R²-Gaussian): Density is a mixture $\sigma(x) = \sum_{k} \rho_k\, G_k(x)$, where each Gaussian is defined as $G_k(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\big)$ with center $\mu_k$, covariance $\Sigma_k$, and density $\rho_k$.
- Diffusion Models: The paper utilizes single-step diffusion models (SD-Turbo), which distill the multi-step denoising process for fast inference. The training objective is score matching, $\mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2\big]$, where $c$ denotes conditioning information.
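The discrete rendering integral above can be sketched with a toy example: a ray marched through an isotropic Gaussian-mixture density, as in the R²-Gaussian formulation. Function names, shapes, and the isotropic simplification are illustrative, not the paper's implementation:

```python
import numpy as np

def gaussian_density(x, mus, sigmas, rhos):
    """Evaluate a mixture of isotropic 3D Gaussians at query points x of shape (N, 3)."""
    d2 = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    return (rhos * np.exp(-0.5 * d2 / sigmas ** 2)).sum(-1)

def render_ray(origin, direction, mus, sigmas, rhos, t_max=4.0, n_samples=256):
    """Discrete Beer-Lambert line integral: p(r) ~ sum_i sigma(x_i) * delta_i."""
    t = np.linspace(0.0, t_max, n_samples)
    delta = t[1] - t[0]                         # uniform sample spacing delta_i
    pts = origin[None, :] + t[:, None] * direction[None, :]
    return gaussian_density(pts, mus, sigmas, rhos).sum() * delta

# One unit-weight Gaussian centred on the ray path
mus = np.array([[2.0, 0.0, 0.0]])
sigmas = np.array([0.3])
rhos = np.array([1.0])
p = render_ray(np.zeros(3), np.array([1.0, 0.0, 0.0]), mus, sigmas, rhos)
# A line through an isotropic Gaussian integrates to rho * sigma * sqrt(2*pi) ~ 0.752
print(p)
```

Shrinking the sample spacing drives the discrete sum toward the continuous Beer-Lambert integral, which is the approximation both NAF and R²-Gaussian rely on.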
Methodology
The DiffNR framework consists of three main components:
1. SliceFixer: Diffusion Model for Slice Repairing
- Purpose: Correct artifacts in degraded axial slices $x$ queried from NRs, outputting refined slices $\hat{x}$.
- Architecture: Built upon SD-Turbo. A VAE encodes the corrupted slice, a U-Net predicts target latents, and the decoder reconstructs the refined slice.
- Conditioning: To preserve anatomical structures, the model is conditioned on:
- Biplanar X-ray projections $c_p$: Provide global structural cues, encoded using RAD-DINO.
- Text prompt $c_t$: Provides high-level semantic guidance (e.g., "Remove artifacts for this [Organ] CT slice."). The combined conditioning is $c = (c_p, c_t)$.
- Finetuning: The pretrained SD-Turbo is adapted using:
- LoRA adapters injected into VAE and U-Net.
- Zero-convolution skip connections between encoder and decoder.
- Training Loss: The total objective for finetuning SliceFixer combines a pixel-wise reconstruction term with an SSIM term between the refined slice and the ground-truth slice: $\mathcal{L}_{\text{fix}} = \mathcal{L}_{1} + \lambda_{\text{SSIM}}\,\mathcal{L}_{\text{SSIM}}$ (the SSIM term is ablated in Table 4).
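The zero-convolution skip connections above can be illustrated in a few lines: a 1×1 convolution whose weights start at zero lets finetuning open the skip path gradually, so at initialization the pretrained decoder's behaviour is unchanged. This is a generic numpy sketch of the idea (the `ZeroConvSkip` name and shapes are illustrative), not the paper's code:

```python
import numpy as np

class ZeroConvSkip:
    """1x1 'zero convolution': weights and bias start at zero, so the skip
    path initially contributes nothing and is opened up during finetuning."""
    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))  # zero-initialised 1x1 kernel
        self.bias = np.zeros(channels)

    def __call__(self, skip_feat):
        # For (H, W, C) features, a 1x1 conv is per-pixel channel mixing
        return skip_feat @ self.weight.T + self.bias

rng = np.random.default_rng(0)
encoder_feat = rng.normal(size=(8, 8, 16))   # skip features from the encoder
decoder_feat = rng.normal(size=(8, 8, 16))   # features inside the decoder

skip = ZeroConvSkip(16)
out = decoder_feat + skip(encoder_feat)
print(np.allclose(out, decoder_feat))  # zero-init leaves the pretrained path intact
```

The same trick is used in ControlNet-style finetuning: new pathways start as identity so the pretrained model's outputs are not disturbed before training begins.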
2. Data Curation for SliceFixer Training
Since no paired dataset exists, one is synthesized:
- View Distribution: Simulate sparse-view scenarios by sampling subsets (uniform/non-uniform) from dense synthetic projections, introducing diverse artifact patterns.
- Model Underfitting: Intentionally limit NR training iterations to produce volumes with pronounced artifacts.
- Mixed Neural Representation: Use reconstructions from both neural fields and 3D Gaussians (1:1 ratio) to encourage generalization.
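The view-distribution step above can be sketched as follows. The dense-view count, function name, and angular range are assumptions for illustration, not values from the paper:

```python
import numpy as np

def sample_view_subset(n_dense=360, n_sparse=36, uniform=True, rng=None):
    """Pick a sparse subset of projection angles from a dense synthetic scan.

    uniform=True  -> evenly spaced views (the classic sparse-view setting)
    uniform=False -> random, non-uniform views (more irregular artifact patterns)
    """
    rng = rng or np.random.default_rng()
    angles = np.linspace(0.0, 180.0, n_dense, endpoint=False)  # dense angle grid
    if uniform:
        idx = np.linspace(0, n_dense, n_sparse, endpoint=False).astype(int)
    else:
        idx = np.sort(rng.choice(n_dense, size=n_sparse, replace=False))
    return angles[idx]

u = sample_view_subset(uniform=True)
n = sample_view_subset(uniform=False, rng=np.random.default_rng(0))
print(len(u), len(n))                 # both subsets contain 36 views
print(np.allclose(np.diff(u), 5.0))  # uniform subset: constant 5-degree spacing
```

Mixing both regimes during data curation exposes SliceFixer to the diverse streaking patterns that different view distributions produce.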
3. DiffNR Optimization Pipeline (Algorithm 1)
The core "repair-and-augment" strategy integrates SliceFixer into NR optimization (see Figure 3):
- Stage 1 (Standard NR Optimization): Optimize the NR (NAF or R²-Gaussian) using:
- Image losses (L1 and SSIM) between rendered projections $\hat{p}$ and measured projections $p$.
- Low-level 3D regularization (Total Variation on a queried sub-volume $V_s$).
- Stage 2 (Diffusion-Enhanced Augmentation):
- Every $N_{\text{gen}}$ iterations, query a volume $V$ from the current NR.
- Upsample each slice, apply SliceFixer for artifact correction, and downsample back to the original resolution, forming a pseudo-reference volume $\hat{V}$.
- Every $N_{\text{sup}}$ steps, compute a 3D perceptual supervision loss between the currently queried volume $V$ and $\hat{V}$.
- The loss is a 3D SSIM (the average of 2D SSIM across axial, sagittal, and coronal planes), weighted by $\lambda_{\text{3D}}$. This promotes structural coherence over voxel-wise fitting to potentially hallucinated details.
This strategy provides auxiliary 3D supervision to fix underconstrained regions while avoiding frequent, expensive diffusion model queries.
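The 3D perceptual loss (averaging 2D SSIM over axial, sagittal, and coronal slices) might look roughly like the sketch below. Note the simplification: real SSIM uses local windows, while this version uses global per-slice statistics to stay short; names like `ssim3d` are hypothetical:

```python
import numpy as np

def ssim2d(a, b, c1=1e-4, c2=9e-4):
    """Simplified SSIM from global image statistics (practical SSIM is windowed)."""
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    return ((2 * ma * mb + c1) * (2 * cov + c2)) / ((ma**2 + mb**2 + c1) * (va + vb + c2))

def ssim3d(v, v_ref):
    """Average plane-wise 2D SSIM over the three orthogonal slicing directions."""
    scores = []
    for axis in range(3):                      # axial, sagittal, coronal
        for i in range(v.shape[axis]):
            a = np.take(v, i, axis=axis)
            b = np.take(v_ref, i, axis=axis)
            scores.append(ssim2d(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
vol = rng.random((16, 16, 16))
print(ssim3d(vol, vol))                 # ~1.0 for identical volumes
print(ssim3d(vol, 1.0 - vol) < 1.0)    # structurally different volumes score lower
```

Because SSIM compares local structure rather than exact intensities, using it (instead of a voxel-wise L1) as the supervision signal against $\hat{V}$ is what lets the NR absorb the prior's structure without overfitting to any hallucinated voxel values.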
Empirical Validation / Results
Experiments were conducted on ToothFairy (dental) and LUNA16 (chest) datasets, with sparse-view settings of 36, 24, and 12 projections.
Quantitative Results (In-Distribution)
Table 1 shows comprehensive comparisons:
| Methods | ToothFairy (36-view) PSNR↑ / SSIM↑ | LUNA16 (36-view) PSNR↑ / SSIM↑ | Time |
|---|---|---|---|
| **Traditional** | | | |
| SART | 27.41 / 0.581 | 22.34 / 0.438 | 1m25s |
| ASD-POCS | 29.65 / 0.775 | 23.93 / 0.661 | 48s |
| **Diffusion-Based Iterative** | | | |
| DiffusionMBIR | 33.29 / 0.856 | 29.35 / 0.781 | 11h15m |
| DDS | 32.56 / 0.817 | 26.21 / 0.554 | 16m17s |
| **Neural Representation** | | | |
| SAX-NeRF | 28.48 / 0.835 | 23.72 / 0.704 | 4h9m |
| NAF | 28.62 / 0.833 | 23.85 / 0.712 | 7m15s |
| + DiffNR (Ours) | 31.27 / 0.951 | 26.27 / 0.867 | 8m41s |
| R²-Gaussian | 28.56 / 0.695 | 24.11 / 0.577 | 5m52s |
| + DiffNR (Ours) | 33.52 / 0.900 | 28.82 / 0.822 | 11m35s |
- Key Findings:
- DiffNR consistently enhances NR baselines: +2.19 dB average PSNR for NAF, +5.79 dB for R²-Gaussian.
- It outperforms prior diffusion-based SOTA (DiffusionMBIR) in quality on LUNA16 while being orders of magnitude faster (minutes vs. hours).
- The SSIM gains are particularly significant, indicating superior structural recovery.
Out-of-Distribution (OOD) Generalization
Table 2 shows results on a diverse OOD dataset (human organs, specimens, artificial objects) using SliceFixer trained only on ToothFairy:
| Methods | 36-view PSNR / SSIM |
|---|---|
| R²-Gaussian | 35.64 / 0.904 |
| + DiffNR (Ours) | 35.99 / 0.918 |
| DiffusionMBIR | 33.26 / 0.839 |
DiffNR outperforms others, suppressing hallucinations and artifacts, demonstrating that SliceFixer learns generalizable artifact patterns.
Downstream Application: Lung Segmentation
Table 3 validates utility on a medical task (lung segmentation on LUNA16 volumes):
| Methods | 36-view Dice ↑ / ASD ↓ |
|---|---|
| R²-Gaussian | 90.41 / 5.19 |
| + DiffNR (Ours) | 93.74 / 3.85 |
| DiffusionMBIR | 90.33 / 6.13 |
DiffNR produces volumes that lead to segmentation masks more consistent with ground truth (higher Dice, lower Average Surface Distance).
Qualitative Results
Figures 4 and 5 visually demonstrate that DiffNR recovers finer anatomical details and more effectively suppresses the streaking and blurring artifacts present in baseline NRs and competing methods, across sparsity levels and datasets.
Ablation Studies
SliceFixer Design (Table 4)
Ablation on LUNA16 36-view case with R²-Gaussian backbone:
| ID | Resolution | SD-Turbo Pretrain | $\mathcal{L}_{\text{SSIM}}$ | Bip. Proj. | PSNR | SSIM |
|---|---|---|---|---|---|---|
| (1) | 256 | ✓ | | | 27.65 | 0.789 |
| (2) | 512 | ✓ | | | 27.91 | 0.807 |
| (3) | 512 | ✓ | ✓ | | 28.21 | 0.814 |
| (4) | 512 | ✓ | ✓ | ✓ | 28.82 | 0.822 |
Key findings:
- Finetuning on 512² images (with up/downsampling around SliceFixer) outperforms the native 256² resolution.
- Adding the SSIM loss term $\mathcal{L}_{\text{SSIM}}$ gives a +0.3 dB gain.
- Biplanar projection conditioning provides the largest boost (+0.6 dB).
DiffNR Design (Table 5)
| Methods | PSNR | SSIM |
|---|---|---|
| R²-Gaussian | 24.11 | 0.577 |
| + Difix3D+ (augment projection) | 23.23 | 0.579 |
| + SliceFixer (post-processing only) | 26.70 | 0.776 |
| + SliceFixer (with L1 loss) | 26.42 | 0.678 |
| + SliceFixer (with SSIM loss) (Ours) | 28.82 | 0.822 |
Key findings:
- Augmenting with novel-view images (as in RGB NeRF works) is ineffective for volumetric CT.
- Using SliceFixer as a standalone post-processor causes slice jitter (Figure 6c).
- Integrating it into optimization is necessary.
- The 3D SSIM perceptual loss is superior to voxel-wise L1, mitigating overfitting to diffusion hallucinations.
Hyperparameter Analysis (Table 6)
Analysis of the 3D SSIM loss weight $\lambda_{\text{3D}}$ and the supervision interval $N_{\text{sup}}$:
- An intermediate value of $\lambda_{\text{3D}}$ achieves the best balance.
- An intermediate interval $N_{\text{sup}}$ yields optimal performance; more frequent supervision increases cost, while less frequent supervision weakens guidance.
Conclusion
- Main Contribution: DiffNR presents a novel and effective framework that enhances neural representation optimization for sparse-view CT by integrating a conditional diffusion prior via a repair-and-augment strategy.
- Key Advantages: Achieves significant improvements in reconstruction quality (PSNR/SSIM), demonstrates strong generalization, and maintains computational efficiency compared to prior diffusion-based methods.
- Broader Impact: The integration of diffusion models with neural representation optimization opens a promising direction for addressing a wider class of inverse problems beyond tomographic reconstruction.
- Future Directions: The method's success suggests potential applications in other 3D imaging modalities and inverse problems where combining explicit 3D representations with powerful 2D priors is beneficial.