# DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

> DiffNR integrates a diffusion model to periodically repair artifacts in neural 3D representations, significantly improving sparse-view CT reconstruction quality and efficiency.

- **Source:** [arXiv](https://arxiv.org/abs/2604.21518)
- **Published:** 2026-04-28
- **Permalink:** https://picx.dev/p/18w6ac
- **Whiteboard:** https://picx.dev/p/18w6ac/image

## Summary (Overview)
*   **Novel Framework**: Proposes **DiffNR**, an optimization framework that enhances **Neural Representations (NRs)** (e.g., neural fields, 3D Gaussians) for sparse-view CT reconstruction by integrating a conditional diffusion prior through a "repair-and-augment" strategy.
*   **Core Component**: Introduces **SliceFixer**, a single-step diffusion model finetuned to correct artifacts in degraded 2D CT slices queried from NRs, conditioned on biplanar X-ray projections and a text prompt.
*   **Efficient Integration**: The diffusion model is not queried at every optimization step. Instead, it periodically generates **pseudo-reference volumes** that provide auxiliary 3D perceptual supervision, avoiding the computational overhead of iterative denoising-based methods.
*   **Superior Performance**: Improves average PSNR over the NR baselines (+2.19 dB for NAF, +5.79 dB for R²-Gaussian), generalizes to out-of-distribution (OOD) data, and runs in 8m41s to 11m35s at 36 views, compared with 16m17s for DDS and 11h15m for DiffusionMBIR.
*   **Practical Validation**: Shows improved performance on downstream tasks like lung segmentation, highlighting the method's practical utility.

## Introduction and Theoretical Foundation
Sparse-view CT (SVCT) reconstruction aims to recover high-quality 3D volumes from few projections to reduce radiation exposure. Existing optimization-based methods fall into two categories:
1.  **Neural Representation (NR) Methods**: Model the volume as a learnable 3D field (e.g., neural fields like NAF, or 3D Gaussians like R²-Gaussian) and optimize via differentiable rendering. They are efficient but suffer from severe artifacts in underconstrained regions under sparse views.
2.  **Neural Prior (NP) Methods**: Use pretrained networks (e.g., unconditional 2D diffusion models) as data-driven priors and embed solvers into the iterative denoising process. While effective, they suffer from inter-slice jitter, hallucinations, and long processing times.

**Motivation**: The paper proposes to **marry neural representations with diffusion models** to get the best of both worlds: the volumetric consistency of a unified 3D representation and the powerful prior of pretrained 2D foundation models. The key challenge is developing an NR-aware diffusion model and integrating it efficiently.

**Theoretical Foundation**: The work is built upon:
*   **X-ray Imaging Physics (Beer-Lambert Law)**: For a ray $r(s) = o + sd$, the logarithmic projection value is:
    $$ I(r) = \int_{s_n}^{s_f} \sigma(r(s)) \, ds $$
    where $\sigma(v)$ is the density field to be recovered.
*   **Neural Representations**:
    *   **Neural Fields (NAF)**: An MLP $f$ outputs density $\sigma_f(v)$. Rendering uses a discrete integral: $I_f(r) = \sum_{i=1}^{P} \sigma_f(r(s_i)) \, \delta_i$, where $\delta_i = \lVert r(s_{i+1}) - r(s_i) \rVert$ is the distance between adjacent samples along the ray.
    *   **3D Gaussians (R²-Gaussian)**: Density is a mixture: $\sigma_g(v) = \sum_{i=1}^{M} G_i^3(v)$, where each Gaussian is defined as:
        $$ G_i^3(v) = \rho_i \exp\left(-\frac{1}{2}(v - p_i)^\top \Sigma_i^{-1} (v - p_i)\right) $$
*   **Diffusion Models**: The paper utilizes **single-step diffusion models (SD-Turbo)**, which distill the multi-step denoising process for fast inference. The training objective is the denoising score-matching (noise-prediction) loss:
    $$ \mathbb{E}_{x \sim p_{\text{data}}, t \sim p_t, \epsilon \sim \mathcal{N}(0, \mathbf{I})} \left[ \lVert \epsilon - \epsilon_\theta(x_t; c, t) \rVert_2^2 \right] $$
    where $x_t$ is the noised sample at timestep $t$ and $c$ denotes conditioning information.
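
To make the rendering equations concrete, here is a minimal sketch (not the paper's implementation) that numerically integrates a Gaussian-mixture density along a ray $r(s) = o + sd$. The function names, the isotropic covariance $\Sigma_i = s_i^2 I$, and the midpoint sampling rule are assumptions chosen for brevity.

```python
import math


def gaussian_density(v, centers, rhos, sigmas):
    """Isotropic case of sigma_g(v) = sum_i rho_i * exp(-0.5 * |v - p_i|^2 / s_i^2)."""
    total = 0.0
    for p, rho, s in zip(centers, rhos, sigmas):
        d2 = sum((a - b) ** 2 for a, b in zip(v, p))
        total += rho * math.exp(-0.5 * d2 / (s * s))
    return total


def render_ray(density_fn, origin, direction, s_near, s_far, num_samples):
    """Discrete line integral sum_i sigma(r(s_i)) * delta_i along r(s) = o + s * d."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = [c / norm for c in direction]  # unit direction, so delta is a true length
    delta = (s_far - s_near) / num_samples
    total = 0.0
    for i in range(num_samples):
        s = s_near + (i + 0.5) * delta  # midpoint of the i-th interval (assumed rule)
        point = [o + s * c for o, c in zip(origin, d)]
        total += density_fn(point) * delta
    return total


# One Gaussian at the origin, hit head-on by a ray along the x-axis.
density = lambda v: gaussian_density(v, [(0.0, 0.0, 0.0)], [2.0], [0.5])
projection = render_ray(density, (-5.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0, 10.0, 1000)
```

For a single isotropic Gaussian crossed through its center, the line integral has the closed form $\rho \, s \sqrt{2\pi}$, which gives a convenient sanity check for the discrete sum.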

## Methodology
The **DiffNR** framework consists of three main components:

### 1. SliceFixer: Diffusion Model for Slice Repairing
*   **Purpose**: Correct artifacts in degraded axial slices $\tilde{S}$ queried from NRs, outputting refined slices $\hat{S}$.
*   **Architecture**: Built upon **SD-Turbo**. A VAE encodes the corrupted slice, a U-Net predicts target latents, and the decoder reconstructs the refined slice.
*   **Conditioning**: To preserve anatomical structures, the model is conditioned on:
    *   **Biplanar X-ray projections** $(I_a, I_b)$: Provide global structural cues, encoded using RAD-DINO.
    *   **Text prompt** $c_t$: Provides high-level semantic guidance (e.g., "Remove artifacts for this [Organ] CT slice.").
    The combined conditioning is $c = \text{Embed}(I_a, I_b, c_t)$.
*   **Finetuning**: The pretrained SD-Turbo is adapted using:
    *   **LoRA adapters** injected into VAE and U-Net.
    *   **Zero-convolution skip connections** between encoder and decoder.
*   **Training Loss**: The total objective for finetuning SliceFixer is:
    $$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{L2}} + \mathcal{L}_{\text{LPIPS}} + \lambda_{\text{CLIP}} \mathcal{L}_{\text{CLIP}} + \lambda_{\text{GAN}} \mathcal{L}_{\text{GAN}} + \lambda_{\text{SSIM}} \mathcal{L}_{\text{SSIM}} $$
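
The LoRA adapters mentioned above follow the standard low-rank update $W x + \frac{\alpha}{r} B A x$. The sketch below is a toy illustration in plain Python, not SliceFixer's code; the matrix values, the helper names, and the scaling `alpha` are all hypothetical. It shows why a zero-initialized $B$ leaves the pretrained output untouched at the start of finetuning, which is the same idea behind zero-initialized skip connections.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(weight, lora_a, lora_b, x, alpha=1.0):
    """Frozen weight plus low-rank update: W x + (alpha / r) * B (A x)."""
    rank = len(lora_a)  # A has shape (r, d_in), B has shape (d_out, r)
    base = matvec(weight, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


# Toy 2x2 layer with a rank-1 adapter; all values are illustrative.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]
B_zero = [[0.0], [0.0]]      # zero-initialized: the adapter starts as a no-op
B_trained = [[1.0], [2.0]]   # after finetuning, B is nonzero
x = [1.0, 3.0]

out_init = lora_forward(W, A, B_zero, x)      # equals the pretrained output W x
out_tuned = lora_forward(W, A, B_trained, x)  # shifted by the low-rank update
```

Only `A` and `B` would be trained in such a setup, so the number of trainable parameters scales with the rank rather than with the full weight matrix.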

### 2. Data Curation for SliceFixer Training
Since no paired dataset exists, one is synthesized:
*   **View Distribution**: Simulate sparse-view scenarios by sampling subsets (uniform/non-uniform) from dense synthetic projections, introducing diverse artifact patterns.
*   **Model Underfitting**: Intentionally limit NR training iterations to produce volumes with pronounced artifacts.
*   **Mixed Neural Representation**: Use reconstructions from both neural fields and 3D Gaussians (1:1 ratio) to encourage generalization.
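
The view-distribution and mixing steps above could be sketched as follows. This is a hypothetical illustration: the function names, the 360-view dense scan in the usage lines, and the alternating rule for the 1:1 mix are assumptions, not details from the paper.

```python
import random


def sample_view_indices(num_dense, num_sparse, uniform=True, rng=None):
    """Pick `num_sparse` projection indices out of `num_dense` dense views."""
    if not 0 < num_sparse <= num_dense:
        raise ValueError("need 0 < num_sparse <= num_dense")
    if uniform:
        # Evenly spaced angles, e.g. every 10th view for 36 of 360.
        return [(i * num_dense) // num_sparse for i in range(num_sparse)]
    rng = rng or random.Random()
    # Non-uniform: random subset without replacement, kept in angular order.
    return sorted(rng.sample(range(num_dense), num_sparse))


def pick_representation(sample_id):
    """Alternate backbones to obtain a 1:1 neural-field / Gaussian mix."""
    return "neural_field" if sample_id % 2 == 0 else "gaussians"


uniform_views = sample_view_indices(360, 36)
random_views = sample_view_indices(360, 12, uniform=False, rng=random.Random(0))
```

Each sampled view set would then drive a deliberately short NR optimization, and the resulting artifact-laden slices would be paired with the clean reference slices for finetuning.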

### 3. DiffNR Optimization Pipeline (Algorithm 1)
The core "repair-and-augment" strategy integrates SliceFixer into NR optimization (see Figure 3):
1.  **Stage 1 (Standard NR Optimization)**: Optimize the NR (NAF or R²-Gaussian) using:
    *   **Image losses** (L1 and SSIM) between rendered projections $\tilde{I}_i$ and measured projections $I_i$.
    *   **Low-level 3D regularization** (Total Variation on a queried sub-volume $\tilde{V}_{tv}$).
2.  **Stage 2 (Diffusion-Enhanced Augmentation)**:
    *   Every $\ell$ iterations, query a volume $\tilde{V}_\ell$ from the current NR.
    *   **Upsample** each slice, apply **SliceFixer** for artifact correction, and **downsample** back to original resolution, forming a pseudo-reference volume $\hat{V}_\ell$.
    *   Every $\tau$ steps, compute a **3D perceptual supervision loss** between the currently queried volume $\tilde{V}$ and $\hat{V}_\ell$.
    *   The loss is a **3D SSIM** (average of 2D SSIM across axial, sagittal, coronal planes), weighted by $\lambda_{\text{diff}}$. This promotes structural coherence over voxel-wise fitting to potentially hallucinated details.

This strategy provides auxiliary 3D supervision to fix underconstrained regions while avoiding frequent, expensive diffusion model queries.
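
The control flow of the two stages can be sketched as a loop that takes the expensive pieces as callables. Everything here is a hedged outline rather than Algorithm 1 itself: `run_diffnr`, the `stage1_iters` switch point, and the three callables are hypothetical names, and the source only specifies the refresh interval $\ell$ and the supervision interval $\tau$.

```python
def run_diffnr(num_iters, stage1_iters, ell, tau, nr_step, query_volume, repair_volume):
    """Run a DiffNR-style schedule and count how often the diffusion prior is used.

    nr_step(pseudo_ref): one NR update; pseudo_ref is None when only the
        projection and TV losses apply, otherwise the 3D SSIM term is added.
    query_volume(): samples the current volume from the NR.
    repair_volume(vol): upsample slices, apply the slice-repair model, downsample.
    """
    pseudo_ref = None
    log = {"refresh": 0, "supervised": 0}
    for it in range(num_iters):
        in_stage2 = it >= stage1_iters  # assumed switch point between the stages
        if in_stage2 and (it - stage1_iters) % ell == 0:
            # Expensive step, done only every `ell` iterations.
            pseudo_ref = repair_volume(query_volume())
            log["refresh"] += 1
        use_ref = in_stage2 and pseudo_ref is not None and it % tau == 0
        nr_step(pseudo_ref if use_ref else None)
        log["supervised"] += int(use_ref)
    return log


# Stub callables stand in for the real NR, renderer, and diffusion model.
log = run_diffnr(
    num_iters=100, stage1_iters=40, ell=20, tau=10,
    nr_step=lambda pseudo_ref: None,
    query_volume=lambda: [[[0.0]]],
    repair_volume=lambda vol: vol,
)
```

With these illustrative settings the diffusion model runs 3 times in 100 iterations, while the cached pseudo-reference volume is reused for 6 supervised updates.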

## Empirical Validation / Results
Experiments were conducted on **ToothFairy** (dental) and **LUNA16** (chest) datasets, with sparse-view settings of 36, 24, and 12 projections.

### Quantitative Results (In-Distribution)
**Table 1** shows comprehensive comparisons:

| Methods | ToothFairy, 36-view (PSNR dB / SSIM) | LUNA16, 36-view (PSNR dB / SSIM) | Time |
| :--- | :--- | :--- | :--- |
| **Traditional** | | | |
| SART | 27.41 / 0.581 | 22.34 / 0.438 | 1m25s |
| ASD-POCS | 29.65 / 0.775 | 23.93 / 0.661 | 48s |
| **Diffusion-Based Iterative** | | | |
| DiffusionMBIR | 33.29 / **0.856** | 29.35 / 0.781 | 11h15m |
| DDS | 32.56 / 0.817 | 26.21 / 0.554 | 16m17s |
| **Neural Representation** | | | |
| SAX-NeRF | 28.48 / 0.835 | 23.72 / 0.704 | 4h9m |
| NAF | 28.62 / 0.833 | 23.85 / 0.712 | 7m15s |
| **+ DiffNR (Ours)** | **31.27** / **0.951** | **26.27** / **0.867** | 8m41s |
| R²-Gaussian | 28.56 / 0.695 | 24.11 / 0.577 | 5m52s |
| **+ DiffNR (Ours)** | **33.52** / **0.900** | **28.82** / **0.822** | 11m35s |

*   **Key Findings**:
    *   DiffNR consistently enhances NR baselines: **+2.19 dB average PSNR for NAF**, **+5.79 dB for R²-Gaussian**.
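    *   Against the prior diffusion-based SOTA (DiffusionMBIR), R²-Gaussian + DiffNR is better on ToothFairy (33.52 vs. 33.29 dB PSNR, 0.900 vs. 0.856 SSIM) and in SSIM on LUNA16 (0.822 vs. 0.781), while trailing in LUNA16 PSNR (28.82 vs. 29.35 dB). It is also roughly **58× faster** (11m35s vs. 11h15m).
    *   The SSIM gains are especially large, indicating better structural recovery: R²-Gaussian rises from 0.695 to 0.900 on ToothFairy and from 0.577 to 0.822 on LUNA16.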
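
The 36-view gains and runtime ratios implied by Table 1 can be recomputed with a few lines of Python. The helper names are hypothetical and the numbers are copied from the table; note that the 36-view gains differ from the reported averages (+2.19 dB and +5.79 dB), whose breakdown is not given in this summary.

```python
import re


def to_seconds(text):
    """Parse runtimes written like '11h15m', '8m41s', or '48s'."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized runtime: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return 3600 * hours + 60 * minutes + seconds


# 36-view values from Table 1: (ToothFairy PSNR, LUNA16 PSNR, runtime).
table1 = {
    "DiffusionMBIR": (33.29, 29.35, "11h15m"),
    "NAF": (28.62, 23.85, "7m15s"),
    "NAF + DiffNR": (31.27, 26.27, "8m41s"),
    "R2-Gaussian": (28.56, 24.11, "5m52s"),
    "R2-Gaussian + DiffNR": (33.52, 28.82, "11m35s"),
}


def psnr_gain(base, enhanced):
    """Per-dataset PSNR gain in dB of `enhanced` over `base`."""
    return tuple(round(e - b, 2) for b, e in zip(table1[base][:2], table1[enhanced][:2]))


def speedup(slow, fast):
    """Runtime ratio slow / fast."""
    return to_seconds(table1[slow][2]) / to_seconds(table1[fast][2])


naf_gain = psnr_gain("NAF", "NAF + DiffNR")
r2_gain = psnr_gain("R2-Gaussian", "R2-Gaussian + DiffNR")
ratio = speedup("DiffusionMBIR", "R2-Gaussian + DiffNR")
```

The same helpers show the overhead DiffNR adds to its own backbones: `speedup("R2-Gaussian + DiffNR", "R2-Gaussian")` is a little under 2.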

### Out-of-Distribution (OOD) Generalization
**Table 2** shows results on a diverse OOD dataset (human organs, specimens, artificial objects) using SliceFixer trained only on ToothFairy:

| Methods | 36-view PSNR / SSIM |
| :--- | :--- |
| R²-Gaussian | 35.64 / 0.904 |
| **+ DiffNR (Ours)** | **35.99 / 0.918** |
| DiffusionMBIR | 33.26 / 0.839 |

DiffNR gives a modest gain over its R²-Gaussian backbone (+0.35 dB PSNR, +0.014 SSIM) and leads DiffusionMBIR by 2.73 dB, suppressing hallucinations and artifacts. This suggests that SliceFixer learns **generalizable artifact patterns**.

### Downstream Application: Lung Segmentation
**Table 3** validates utility on a medical task (lung segmentation on LUNA16 volumes):

| Methods | 36-view Dice ↑ / ASD ↓ |
| :--- | :--- |
| R²-Gaussian | 90.41 / 5.19 |
| **+ DiffNR (Ours)** | **93.74 / 3.85** |
| DiffusionMBIR | 90.33 / 6.13 |

DiffNR produces volumes that lead to segmentation masks more consistent with ground truth (higher Dice, lower Average Surface Distance).
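
For reference, the Dice score used in Table 3 measures mask overlap as $2|P \cap T| / (|P| + |T|)$. The snippet below is a minimal sketch on flattened binary masks; the function name and the convention of returning 1.0 for two empty masks are assumptions, and the table reports the value as a percentage.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks with 0/1 entries."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
dice_percent = 100.0 * dice_score(predicted, reference)
```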

### Qualitative Results
**Figures 4 and 5** show that DiffNR recovers finer anatomical details and suppresses the streaking and blurring artifacts present in the baseline NRs and other methods, across sparsity levels and datasets.

## Ablation Studies

### SliceFixer Design (Table 4)
Ablation on LUNA16 36-view case with R²-Gaussian backbone:

| ID | Resolution | SD-Turbo Pretrain | $\mathcal{L}_{\text{SSIM}}$ | Biplanar Proj. | PSNR (dB) | SSIM |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| (1) | 256 | ✓ | | | 27.65 | 0.789 |
| (2) | 512 | ✓ | | | 27.91 | 0.807 |
| (3) | 512 | ✓ | ✓ | | 28.21 | 0.814 |
| **(4)** | **512** | **✓** | **✓** | **✓** | **28.82** | **0.822** |

Key findings:
*   Finetuning on **512²** images with up/downsampling is better than native 256² (27.91 vs. 27.65 dB).
*   Adding **SSIM loss** ($\mathcal{L}_{\text{SSIM}}$) gives a +0.30 dB gain.
*   **Biplanar projection conditioning** provides the largest boost (+0.61 dB).

### DiffNR Design (Table 5)
| Methods | PSNR | SSIM |
| :--- | :--- | :--- |
| R²-Gaussian | 24.11 | 0.577 |
| + Difix3D+ (augment projection) | 23.23 | 0.579 |
| + SliceFixer (post-processing only) | 26.70 | 0.776 |
| + SliceFixer (with L1 loss) | 26.42 | 0.678 |
| **+ SliceFixer (with SSIM loss) (Ours)** | **28.82** | **0.822** |

Key findings:
*   Augmenting with **novel-view images** (as in RGB NeRF works) is ineffective for volumetric CT: PSNR drops from 24.11 to 23.23 dB.
*   Using SliceFixer as a **standalone post-processor** raises PSNR to 26.70 dB but causes slice jitter (Figure 6c).
*   **Integrating it into optimization** is therefore necessary.
*   The **3D SSIM perceptual loss** is superior to voxel-wise **L1** (28.82 vs. 26.42 dB), mitigating overfitting to diffusion hallucinations.
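
The 3D SSIM term compared here (the mean of 2D SSIM over axial, coronal, and sagittal slices) can be sketched in plain Python. Several simplifications are assumptions for illustration only: SSIM is computed from whole-slice statistics rather than a sliding window, the loss takes the form $\lambda_{\text{diff}} (1 - \text{SSIM}_{3D})$, and volumes are nested lists indexed as `vol[z][y][x]`.

```python
from statistics import fmean


def ssim_global(a, b, data_range=1.0):
    """SSIM from whole-slice statistics (no sliding window) for two flat pixel lists."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((x - mu_a) ** 2 for x in a)
    var_b = fmean((y - mu_b) ** 2 for y in b)
    cov = fmean((x - mu_a) * (y - mu_b) for x, y in zip(a, b))
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def slices_along(vol, axis):
    """Yield every 2D slice of vol[z][y][x] along `axis` as a flat list."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    if axis == 0:    # axial: fix z
        for z in range(nz):
            yield [vol[z][y][x] for y in range(ny) for x in range(nx)]
    elif axis == 1:  # coronal: fix y
        for y in range(ny):
            yield [vol[z][y][x] for z in range(nz) for x in range(nx)]
    else:            # sagittal: fix x
        for x in range(nx):
            yield [vol[z][y][x] for z in range(nz) for y in range(ny)]


def ssim_3d(vol_a, vol_b):
    """Average the per-slice SSIM within each plane, then across the three planes."""
    per_axis = []
    for axis in range(3):
        pairs = zip(slices_along(vol_a, axis), slices_along(vol_b, axis))
        per_axis.append(fmean(ssim_global(a, b) for a, b in pairs))
    return fmean(per_axis)


def diff_loss(volume, pseudo_ref, lam=0.5):
    """Assumed loss form: lam * (1 - SSIM_3D); lam = 0.5 is the value favored in Table 6."""
    return lam * (1.0 - ssim_3d(volume, pseudo_ref))


# Tiny 4x4x4 volumes: a smooth ramp and a featureless constant block.
ramp = [[[(z + 2 * y + 3 * x) / 20 for x in range(4)] for y in range(4)] for z in range(4)]
flat = [[[0.5] * 4 for _ in range(4)] for _ in range(4)]
loss_same = diff_loss(ramp, ramp)
loss_flat = diff_loss(ramp, flat)
```

Because SSIM compares local means, variances, and covariances instead of raw voxel values, the loss rewards matching structure in all three planes without forcing the NR to copy every intensity the diffusion model produces.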

### Hyperparameter Analysis (Table 6)
Analysis of 3D SSIM loss weight $\lambda_{\text{diff}}$ and supervision frequency $\tau$:
*   $\lambda_{\text{diff}} = 0.5$ achieves the best balance.
*   $\tau = 10$ yields optimal performance; more frequent supervision increases cost, less frequent weakens guidance.

## Conclusion
*   **Main Contribution**: DiffNR presents a novel and effective framework that enhances neural representation optimization for sparse-view CT by integrating a conditional diffusion prior via a repair-and-augment strategy.
*   **Key Advantages**: Achieves significant improvements in reconstruction quality (PSNR/SSIM), demonstrates strong generalization, and maintains computational efficiency compared to prior diffusion-based methods.
*   **Broader Impact**: The integration of diffusion models with neural representation optimization opens a promising direction for addressing a wider class of inverse problems beyond tomographic reconstruction.
*   **Future Directions**: The method's success suggests potential applications in other 3D imaging modalities and inverse problems where combining explicit 3D representations with powerful 2D priors is beneficial.
