Summary of "DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction"
Summary (Overview)
- Novel Framework: Proposes DiffNR, a novel optimization framework that enhances Neural Representations (NRs) (e.g., neural fields, 3D Gaussians) for sparse-view CT reconstruction by integrating a conditional diffusion prior in a "repair-and-augment" strategy.
- Core Component: Introduces SliceFixer, a single-step diffusion model finetuned to correct artifacts in degraded 2D CT slices queried from NRs, conditioned on biplanar X-ray projections and a text prompt.
- Efficient Integration: The diffusion model is not queried at every optimization step. Instead, it periodically generates pseudo-reference volumes that provide auxiliary 3D perceptual supervision, avoiding the computational overhead of iterative denoising-based methods.
- Superior Performance: Achieves significant improvements over baselines (e.g., +5.79 dB PSNR for R²-Gaussian), demonstrates strong generalization to out-of-distribution (OOD) data, and maintains reasonable runtime compared to prior diffusion-based methods.
- Practical Validation: Shows improved performance on downstream tasks like lung segmentation, highlighting the method's practical utility.
Introduction and Theoretical Foundation
Sparse-view CT (SVCT) reconstruction aims to recover high-quality 3D volumes from few projections to reduce radiation exposure. Existing optimization-based methods fall into two categories:
- Neural Representation (NR) Methods: Model the volume as a learnable 3D field (e.g., neural fields like NAF, or 3D Gaussians like R²-Gaussian) and optimize via differentiable rendering. They are efficient but suffer from severe artifacts in underconstrained regions under sparse views.
- Neural Prior (NP) Methods: Use pretrained networks (e.g., unconditional 2D diffusion models) as data-driven priors and embed solvers into iterative denoising. While effective, they suffer from inter-slice jitter, hallucinations, and long processing times.
Motivation: The paper proposes to marry neural representations with diffusion models to get the best of both worlds: the volumetric consistency of a unified 3D representation and the powerful prior of pretrained 2D foundation models. The key challenge is developing an NR-aware diffusion model and integrating it efficiently.
Theoretical Foundation: The work is built upon:
- X-ray Imaging Physics (Beer-Lambert Law): For a ray $r$, the logarithmic projection value is $p(r) = \int_{r} \sigma(x)\,\mathrm{d}x$, where $\sigma$ is the density field to be recovered.
- Neural Representations:
- Neural Fields (NAF): An MLP $f_\theta$ outputs density $\sigma(x) = f_\theta(x)$. Rendering uses a discrete integral: $p(r) \approx \sum_{i} \sigma(x_i)\,\delta_i$, where $x_i$ are points sampled along the ray and $\delta_i$ the spacing between adjacent samples.
- 3D Gaussians (R²-Gaussian): Density is a mixture $\sigma(x) = \sum_{k} \rho_k\, G_k(x)$, where each Gaussian is defined as $G_k(x) = \exp\!\big(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\big)$ with center $\mu_k$, covariance $\Sigma_k$, and density $\rho_k$.
- Diffusion Models: The paper utilizes single-step diffusion models (SD-Turbo), which distill the multi-step denoising process for fast inference. The training objective is score matching, $\mathbb{E}_{x_0, \epsilon, t}\big[\|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2\big]$, where $c$ denotes conditioning information.
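The discrete rendering integral above can be sketched with a toy example: a ray marched through an isotropic Gaussian-mixture density, as in the R²-Gaussian formulation. Function names, shapes, and the isotropic simplification are illustrative, not the paper's implementation:

```python
import numpy as np

def gaussian_density(x, mus, sigmas, rhos):
    """Evaluate a mixture of isotropic 3D Gaussians at query points x of shape (N, 3)."""
    d2 = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    return (rhos * np.exp(-0.5 * d2 / sigmas ** 2)).sum(-1)

def render_ray(origin, direction, mus, sigmas, rhos, t_max=4.0, n_samples=256):
    """Discrete Beer-Lambert line integral: p(r) ~ sum_i sigma(x_i) * delta_i."""
    t = np.linspace(0.0, t_max, n_samples)
    delta = t[1] - t[0]                         # uniform sample spacing delta_i
    pts = origin[None, :] + t[:, None] * direction[None, :]
    return gaussian_density(pts, mus, sigmas, rhos).sum() * delta

# One unit-weight Gaussian centred on the ray path
mus = np.array([[2.0, 0.0, 0.0]])
sigmas = np.array([0.3])
rhos = np.array([1.0])
p = render_ray(np.zeros(3), np.array([1.0, 0.0, 0.0]), mus, sigmas, rhos)
# A line through an isotropic Gaussian integrates to rho * sigma * sqrt(2*pi) ~ 0.752
print(p)
```

Shrinking the sample spacing drives the discrete sum toward the continuous Beer-Lambert integral, which is the approximation both NAF and R²-Gaussian rely on.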
Methodology
The DiffNR framework consists of three main components:
1. SliceFixer: Diffusion Model for Slice Repairing
- Purpose: Correct artifacts in degraded axial slices $x$ queried from NRs, outputting refined slices $\hat{x}$.
- Architecture: Built upon SD-Turbo. A VAE encodes the corrupted slice, a U-Net predicts target latents, and the decoder reconstructs the refined slice.
- Conditioning: To preserve anatomical structures, the model is conditioned on:
- Biplanar X-ray projections $c_p$: Provide global structural cues, encoded using RAD-DINO.
- Text prompt $c_t$: Provides high-level semantic guidance (e.g., "Remove artifacts for this [Organ] CT slice."). The combined conditioning is $c = (c_p, c_t)$.
- Finetuning: The pretrained SD-Turbo is adapted using:
- LoRA adapters injected into VAE and U-Net.
- Zero-convolution skip connections between encoder and decoder.
- Training Loss: The total objective for finetuning SliceFixer combines a pixel-wise reconstruction term with an SSIM term between the refined slice and the ground-truth slice: $\mathcal{L}_{\text{fix}} = \mathcal{L}_{1} + \lambda_{\text{SSIM}}\,\mathcal{L}_{\text{SSIM}}$ (the SSIM term is ablated in Table 4).
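The zero-convolution skip connections above can be illustrated in a few lines: a 1×1 convolution whose weights start at zero lets finetuning open the skip path gradually, so at initialization the pretrained decoder's behaviour is unchanged. This is a generic numpy sketch of the idea (the `ZeroConvSkip` name and shapes are illustrative), not the paper's code:

```python
import numpy as np

class ZeroConvSkip:
    """1x1 'zero convolution': weights and bias start at zero, so the skip
    path initially contributes nothing and is opened up during finetuning."""
    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))  # zero-initialised 1x1 kernel
        self.bias = np.zeros(channels)

    def __call__(self, skip_feat):
        # For (H, W, C) features, a 1x1 conv is per-pixel channel mixing
        return skip_feat @ self.weight.T + self.bias

rng = np.random.default_rng(0)
encoder_feat = rng.normal(size=(8, 8, 16))   # skip features from the encoder
decoder_feat = rng.normal(size=(8, 8, 16))   # features inside the decoder

skip = ZeroConvSkip(16)
out = decoder_feat + skip(encoder_feat)
print(np.allclose(out, decoder_feat))  # zero-init leaves the pretrained path intact
```

The same trick is used in ControlNet-style finetuning: new pathways start as identity so the pretrained model's outputs are not disturbed before training begins.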
2. Data Curation for SliceFixer Training
Since no paired dataset exists, one is synthesized:
- View Distribution: Simulate sparse-view scenarios by sampling subsets (uniform/non-uniform) from dense synthetic projections, introducing diverse artifact patterns.
- Model Underfitting: Intentionally limit NR training iterations to produce volumes with pronounced artifacts.
- Mixed Neural Representation: Use reconstructions from both neural fields and 3D Gaussians (1:1 ratio) to encourage generalization.
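The view-distribution step above can be sketched as follows. The dense-view count, function name, and angular range are assumptions for illustration, not values from the paper:

```python
import numpy as np

def sample_view_subset(n_dense=360, n_sparse=36, uniform=True, rng=None):
    """Pick a sparse subset of projection angles from a dense synthetic scan.

    uniform=True  -> evenly spaced views (the classic sparse-view setting)
    uniform=False -> random, non-uniform views (more irregular artifact patterns)
    """
    rng = rng or np.random.default_rng()
    angles = np.linspace(0.0, 180.0, n_dense, endpoint=False)  # dense angle grid
    if uniform:
        idx = np.linspace(0, n_dense, n_sparse, endpoint=False).astype(int)
    else:
        idx = np.sort(rng.choice(n_dense, size=n_sparse, replace=False))
    return angles[idx]

u = sample_view_subset(uniform=True)
n = sample_view_subset(uniform=False, rng=np.random.default_rng(0))
print(len(u), len(n))                 # both subsets contain 36 views
print(np.allclose(np.diff(u), 5.0))  # uniform subset: constant 5-degree spacing
```

Mixing both regimes during data curation exposes SliceFixer to the diverse streaking patterns that different view distributions produce.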
3. DiffNR Optimization Pipeline (Algorithm 1)
The core "repair-and-augment" strategy integrates SliceFixer into NR optimization (see Figure 3):
- Stage 1 (Standard NR Optimization): Optimize the NR (NAF or R²-Gaussian) using:
- Image losses (L1 and SSIM) between rendered projections $\hat{p}$ and measured projections $p$.
- Low-level 3D regularization (Total Variation on a queried sub-volume $V_s$).
- Stage 2 (Diffusion-Enhanced Augmentation):
- Every $N_{\text{gen}}$ iterations, query a volume $V$ from the current NR.
- Upsample each slice, apply SliceFixer for artifact correction, and downsample back to the original resolution, forming a pseudo-reference volume $\hat{V}$.
- Every $N_{\text{sup}}$ steps, compute a 3D perceptual supervision loss between the currently queried volume $V$ and $\hat{V}$.
- The loss is a 3D SSIM (the average of 2D SSIM across axial, sagittal, and coronal planes), weighted by $\lambda_{\text{3D}}$. This promotes structural coherence over voxel-wise fitting to potentially hallucinated details.
This strategy provides auxiliary 3D supervision to fix underconstrained regions while avoiding frequent, expensive diffusion model queries.
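The 3D perceptual loss (averaging 2D SSIM over axial, sagittal, and coronal slices) might look roughly like the sketch below. Note the simplification: real SSIM uses local windows, while this version uses global per-slice statistics to stay short; names like `ssim3d` are hypothetical:

```python
import numpy as np

def ssim2d(a, b, c1=1e-4, c2=9e-4):
    """Simplified SSIM from global image statistics (practical SSIM is windowed)."""
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - ma) * (b - mb)).mean()
    return ((2 * ma * mb + c1) * (2 * cov + c2)) / ((ma**2 + mb**2 + c1) * (va + vb + c2))

def ssim3d(v, v_ref):
    """Average plane-wise 2D SSIM over the three orthogonal slicing directions."""
    scores = []
    for axis in range(3):                      # axial, sagittal, coronal
        for i in range(v.shape[axis]):
            a = np.take(v, i, axis=axis)
            b = np.take(v_ref, i, axis=axis)
            scores.append(ssim2d(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
vol = rng.random((16, 16, 16))
print(ssim3d(vol, vol))                 # ~1.0 for identical volumes
print(ssim3d(vol, 1.0 - vol) < 1.0)    # structurally different volumes score lower
```

Because SSIM compares local structure rather than exact intensities, using it (instead of a voxel-wise L1) as the supervision signal against $\hat{V}$ is what lets the NR absorb the prior's structure without overfitting to any hallucinated voxel values.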
Empirical Validation / Results
Experiments were conducted on ToothFairy (dental) and LUNA16 (chest) datasets, with sparse-view settings of 36, 24, and 12 projections.
Quantitative Results (In-Distribution)
Table 1 shows comprehensive comparisons:
| Methods | ToothFairy (36-view) PSNR↑ / SSIM↑ | LUNA16 (36-view) PSNR↑ / SSIM↑ | Time |
|---|---|---|---|
| **Traditional** | | | |
| SART | 27.41 / 0.581 | 22.34 / 0.438 | 1m25s |
| ASD-POCS | 29.65 / 0.775 | 23.93 / 0.661 | 48s |
| **Diffusion-Based Iterative** | | | |
| DiffusionMBIR | 33.29 / 0.856 | 29.35 / 0.781 | 11h15m |
| DDS | 32.56 / 0.817 | 26.21 / 0.554 | 16m17s |
| **Neural Representation** | | | |
| SAX-NeRF | 28.48 / 0.835 | 23.72 / 0.704 | 4h9m |
| NAF | 28.62 / 0.833 | 23.85 / 0.712 | 7m15s |
| + DiffNR (Ours) | 31.27 / 0.951 | 26.27 / 0.867 | 8m41s |
| R²-Gaussian | 28.56 / 0.695 | 24.11 / 0.577 | 5m52s |
| + DiffNR (Ours) | 33.52 / 0.900 | 28.82 / 0.822 | 11m35s |
- Key Findings:
- DiffNR consistently enhances NR baselines: +2.19 dB average PSNR for NAF, +5.79 dB for R²-Gaussian.
- It outperforms prior diffusion-based SOTA (DiffusionMBIR) in quality on LUNA16 while being orders of magnitude faster (minutes vs. hours).
- The SSIM gains are particularly significant, indicating superior structural recovery.
Out-of-Distribution (OOD) Generalization
Table 2 shows results on a diverse OOD dataset (human organs, specimens, artificial objects) using SliceFixer trained only on ToothFairy:
| Methods | 36-view PSNR / SSIM |
|---|---|
| R²-Gaussian | 35.64 / 0.904 |
| + DiffNR (Ours) | 35.99 / 0.918 |
| DiffusionMBIR | 33.26 / 0.839 |
DiffNR outperforms others, suppressing hallucinations and artifacts, demonstrating that SliceFixer learns generalizable artifact patterns.
Downstream Application: Lung Segmentation
Table 3 validates utility on a medical task (lung segmentation on LUNA16 volumes):
| Methods | 36-view Dice ↑ / ASD ↓ |
|---|---|
| R²-Gaussian | 90.41 / 5.19 |
| + DiffNR (Ours) | 93.74 / 3.85 |
| DiffusionMBIR | 90.33 / 6.13 |
DiffNR produces volumes that lead to segmentation masks more consistent with ground truth (higher Dice, lower Average Surface Distance).
Qualitative Results
Figures 4 and 5 visually demonstrate that DiffNR recovers finer anatomical details and more effectively suppresses the streaking and blurring artifacts present in baseline NRs and competing methods, across sparsity levels and datasets.
Ablation Studies
SliceFixer Design (Table 4)
Ablation on LUNA16 36-view case with R²-Gaussian backbone:
| ID | Resolution | SD-Turbo Pretrain | $\mathcal{L}_{\text{SSIM}}$ | Bip. Proj. | PSNR | SSIM |
|---|---|---|---|---|---|---|
| (1) | 256 | ✓ | | | 27.65 | 0.789 |
| (2) | 512 | ✓ | | | 27.91 | 0.807 |
| (3) | 512 | ✓ | ✓ | | 28.21 | 0.814 |
| (4) | 512 | ✓ | ✓ | ✓ | 28.82 | 0.822 |
Key findings:
- Finetuning on 512² images (with up/downsampling around SliceFixer) outperforms the native 256² resolution.
- Adding the SSIM loss term $\mathcal{L}_{\text{SSIM}}$ gives a +0.3 dB gain.
- Biplanar projection conditioning provides the largest boost (+0.6 dB).
DiffNR Design (Table 5)
| Methods | PSNR | SSIM |
|---|---|---|
| R²-Gaussian | 24.11 | 0.577 |
| + Difix3D+ (augment projection) | 23.23 | 0.579 |
| + SliceFixer (post-processing only) | 26.70 | 0.776 |
| + SliceFixer (with L1 loss) | 26.42 | 0.678 |
| + SliceFixer (with SSIM loss) (Ours) | 28.82 | 0.822 |
Key findings:
- Augmenting with novel-view images (as in RGB NeRF works) is ineffective for volumetric CT.
- Using SliceFixer as a standalone post-processor causes slice jitter (Figure 6c).
- Integrating it into optimization is necessary.
- The 3D SSIM perceptual loss is superior to voxel-wise L1, mitigating overfitting to diffusion hallucinations.
Hyperparameter Analysis (Table 6)
Analysis of the 3D SSIM loss weight $\lambda_{\text{3D}}$ and the supervision interval $N_{\text{sup}}$:
- An intermediate value of $\lambda_{\text{3D}}$ achieves the best balance.
- An intermediate interval $N_{\text{sup}}$ yields optimal performance; more frequent supervision increases cost, while less frequent supervision weakens guidance.
Conclusion
- Main Contribution: DiffNR presents a novel and effective framework that enhances neural representation optimization for sparse-view CT by integrating a conditional diffusion prior via a repair-and-augment strategy.
- Key Advantages: Achieves significant improvements in reconstruction quality (PSNR/SSIM), demonstrates strong generalization, and maintains computational efficiency compared to prior diffusion-based methods.
- Broader Impact: The integration of diffusion models with neural representation optimization opens a promising direction for addressing a wider class of inverse problems beyond tomographic reconstruction.
- Future Directions: The method's success suggests potential applications in other 3D imaging modalities and inverse problems where combining explicit 3D representations with powerful 2D priors is beneficial.