# F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

> F4Splat enables feed-forward 3D Gaussian Splatting with explicit Gaussian budget control and spatially adaptive allocation, achieving higher quality with fewer Gaussians than prior methods.

- **Source:** [arXiv](https://arxiv.org/abs/2603.21304)
- **Published:** 2026-03-25
- **Permalink:** https://picx.dev/p/4G14qx
- **Whiteboard:** https://picx.dev/p/4G14qx/image

## Summary (Overview)
*   **Problem:** Existing feed-forward 3D Gaussian Splatting (3DGS) methods inefficiently allocate Gaussians, using uniform (pixel/voxel-based) pipelines that lead to redundant Gaussians across views and lack control over the final Gaussian count.
*   **Core Solution:** F4Splat introduces a **densification-score-guided allocation** strategy that predicts per-region scores to estimate required Gaussian density, enabling spatially adaptive, non-uniform allocation.
*   **Key Features:**
    *   Enables **explicit control** over the final Gaussian budget without retraining.
    *   Reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views.
    *   Produces **compact, high-fidelity** 3D representations from sparse, uncalibrated input images.
*   **Result:** Achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods while using **significantly fewer Gaussians**.

## Introduction and Theoretical Foundation
3D Gaussian Splatting (3DGS) has emerged as a highly efficient method for 3D scene reconstruction and real-time rendering, using explicit 3D Gaussian primitives. A key strength of optimization-based 3DGS is its **Adaptive Density Control (ADC)**, which iteratively adds or removes Gaussians during training to allocate them efficiently based on scene complexity.

However, conventional 3DGS requires costly per-scene iterative optimization and dense input views. **Feed-forward 3DGS** methods address this by learning strong 3D priors from large datasets, enabling single-pass reconstruction from sparse views. Yet, these methods eliminate the iterative ADC process. Most adopt rigid **pixel-to-Gaussian** or **voxel-to-Gaussian** pipelines, which:
1.  **Uniformly allocate** Gaussians, leading to redundancy in simple regions and insufficient detail in complex ones.
2.  **Fix or inflexibly couple** the total number of Gaussians to input resolution or voxel size, preventing explicit budget control.

F4Splat addresses these limitations by reintroducing the concept of adaptive densification into the feed-forward paradigm. It formulates Gaussian densification as a **learnable prediction problem**, where a network estimates a "densification score" for each spatial region, indicating how many Gaussians are needed.

## Methodology
The goal is to develop a feed-forward network $F_\theta$ that, given $N_{ctx}$ uncalibrated context images $\{I^{ctx}_i\}_{i=1}^{N_{ctx}}$ and a user-specified target Gaussian budget $\bar{N}_G$, predicts a set of 3D Gaussians $\mathcal{G} = \{g_g\}_{g=1}^{N_G}$ and camera parameters $\{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}}$:
$$
\{g_g\}_{g=1}^{N_G}, \{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}} = F_\theta(\{I^{ctx}_i\}_{i=1}^{N_{ctx}}, \bar{N}_G)
$$
Each Gaussian $g_g \in \mathbb{R}^{d_G}$ is parameterized by its center $\mu_g$, opacity $\sigma_g$, rotation (quaternion) $q_g$, scale $s_g$, and spherical harmonics $h_g$.
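To make the parameterization concrete, here is a minimal sketch of one Gaussian as a flat vector. The field names, the 3-channel spherical-harmonics layout, and the `sh_degree` argument are assumptions for illustration. The summary does not state which SH degree the method uses, so $d_G$ is left as a function of it.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Gaussian:
    mu: np.ndarray      # center, shape (3,)
    opacity: float      # sigma_g
    q: np.ndarray       # rotation quaternion, shape (4,)
    s: np.ndarray       # per-axis scale, shape (3,)
    h: np.ndarray       # SH coefficients, shape (3 * (deg + 1) ** 2,) (assumed RGB layout)

    def flat(self) -> np.ndarray:
        """Concatenate all parameters into a single vector of length d_G."""
        return np.concatenate([self.mu, [self.opacity], self.q, self.s, self.h])


def gaussian_dim(sh_degree: int) -> int:
    """d_G under the assumed layout: 3 + 1 + 4 + 3 + 3 * (deg + 1)^2."""
    return 3 + 1 + 4 + 3 + 3 * (sh_degree + 1) ** 2


g = Gaussian(
    mu=np.zeros(3),
    opacity=0.5,
    q=np.array([1.0, 0.0, 0.0, 0.0]),
    s=np.ones(3),
    h=np.zeros(3),  # degree-0 SH: one coefficient per color channel
)
```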

### Network Architecture
The framework has three main components (Fig. 2):
1.  **Geometry Backbone:** Based on VGGT, it uses a pretrained DINOv2 encoder and alternating self-attention layers to extract geometric information and predict camera parameters from the input images.
2.  **Multi-Scale Prediction Heads:** A modified DPT-based decoder with two parallel heads predicts multi-scale Gaussian parameter maps $\{\mathcal{G}^l_i\}_{l=1}^L$ and densification score maps $\{\hat{D}^l_i\}_{l=1}^{L-1}$ from the encoded image tokens.
3.  **Spatially Adaptive Gaussian Allocation:** Uses the predicted densification scores to decide the appropriate representation level for each region via a thresholding rule, enabling non-uniform allocation.

### Spatially Adaptive Gaussian Allocation
The allocation uses binary masks $M^l_{\tau,i} \in \{0, 1\}^{H_l \times W_l}$ to select Gaussians from different scales based on a threshold $\tau$:
$$
M^l_{\tau,i} =
\begin{cases}
\mathbb{1}_{\{\hat{D}^l_i < \tau\}} & \text{if } l = 1, \\
\mathbb{1}_{\{\hat{D}^l_i < \tau\}} \odot \left( \mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) \right) & \text{if } 1 < l < L, \\
\mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) & \text{if } l = L,
\end{cases}
$$
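where $\mathbb{1}_{\{\cdot\}}$ is the indicator function, $\odot$ is the element-wise product, $\text{Up}(\cdot; 2^k)$ is nearest-neighbor upsampling by a factor of $2^k$, and $\mathbf{1}$ is a matrix of ones. Since masks from levels $l-k$ are upsampled to reach level $l$, level 1 is the coarsest and level $L$ the finest. A region whose score falls below $\tau$ is kept at the current level; otherwise it is deferred to finer levels, and level $L$ takes whatever remains. Each region is therefore represented at exactly one level.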
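The masking rule can be sketched in NumPy as follows. This is an illustrative reading of the formula, not the authors' code. It assumes level 1 is the coarsest map (implied by the upsampling direction) and that each level doubles the resolution of the previous one; the function names and the random score maps are hypothetical.

```python
import numpy as np


def upsample_nn(mask, factor):
    """Nearest-neighbor upsampling of a 2D array by an integer factor."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)


def allocation_masks(scores, tau, finest_shape):
    """scores[l] is the score map of level l + 1 (coarse to fine, L - 1 maps).

    Returns L binary masks, one per level, following the three-case rule.
    """
    num_levels = len(scores) + 1
    masks = []
    for l in range(num_levels):
        shape = scores[l].shape if l < num_levels - 1 else finest_shape
        # Regions already claimed by coarser levels, brought to this resolution.
        claimed = np.zeros(shape, dtype=np.int64)
        for k in range(1, l + 1):
            claimed += upsample_nn(masks[l - k], 2 ** k)
        free = 1 - claimed
        if l < num_levels - 1:
            masks.append((scores[l] < tau).astype(np.int64) * free)
        else:
            masks.append(free)  # the finest level takes everything left over
    return masks


rng = np.random.default_rng(0)
scores = [rng.random((2, 2)), rng.random((4, 4))]  # L = 3 levels
masks = allocation_masks(scores, tau=0.5, finest_shape=(8, 8))

# Project every mask to the finest grid: each pixel should be covered exactly once.
coverage = sum(upsample_nn(m, 2 ** (len(masks) - 1 - l)) for l, m in enumerate(masks))
```

Under these assumptions, `coverage` is all ones for any threshold, which is the exclusivity property described above.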

Given a target budget $\bar{N}_G$, a budget-matching algorithm finds the threshold $\tau_{\bar{N}_G}$ that satisfies:
$$
0 \le \bar{N}_G - N_{G_{\tau_{\bar{N}_G}}} < 4^{L-1} - 1.
$$
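The summary does not describe the budget-matching procedure itself. One way to approximate it, sketched below, is a binary search over candidate thresholds. This works because raising $\tau$ can only replace a group of fine Gaussians with a single coarser one, so the count never increases with $\tau$. The helper names are hypothetical, and this sketch does not attempt to reproduce the paper's bound on the remaining gap.

```python
import numpy as np


def count_gaussians(scores, tau):
    """Number of selected Gaussians at threshold tau (coarse-to-fine, exclusive levels)."""
    total = 0
    free = np.ones_like(scores[0], dtype=bool)  # regions not yet claimed
    for d in scores:
        take = free & (d < tau)
        total += int(take.sum())
        # Unclaimed regions are handed to the next level at twice the resolution.
        free = np.repeat(np.repeat(free & ~take, 2, axis=0), 2, axis=1)
    return total + int(free.sum())  # the finest level takes what remains


def match_budget(scores, budget):
    """Smallest candidate tau whose Gaussian count does not exceed the budget."""
    candidates = np.unique(np.concatenate([d.ravel() for d in scores]))
    candidates = np.append(candidates, np.inf)  # tau = inf keeps only the coarsest level
    lo, hi = 0, len(candidates) - 1
    if count_gaussians(scores, candidates[hi]) > budget:
        raise ValueError("budget is below the coarsest-level Gaussian count")
    while lo < hi:
        mid = (lo + hi) // 2
        if count_gaussians(scores, candidates[mid]) <= budget:
            hi = mid
        else:
            lo = mid + 1
    tau = candidates[lo]
    return tau, count_gaussians(scores, tau)


rng = np.random.default_rng(0)
scores = [rng.random((2, 2)), rng.random((4, 4))]  # L = 3, finest grid is 8 x 8
tau, n = match_budget(scores, budget=40)
```

Only the observed score values (plus infinity) need to be tried, because the selection changes only when $\tau$ crosses one of them.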

### Training Strategy: Feed-Forward Predictive Densification
The densification score must be predictable at inference. Inspired by ADC in optimization-based 3DGS, the model learns to predict scores from **homodirectional view-space positional gradients** derived during training.

For a predicted Gaussian set $\mathcal{G}$, the rendering loss $L_{\text{render}}$ (MSE + LPIPS) between a rendered novel view $\hat{I}^{tgt}$ and ground truth $I^{tgt}$ is computed. The gradient for Gaussian $g_g$ is:
$$
v_g = \left( \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,x}}, \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,y}} \right),
$$
where $(\bar{\mu}_{g,x}, \bar{\mu}_{g,y})$ is the Gaussian's projected 2D center, and $m$ is the number of pixels it affects. A large $||v_g||_2$ indicates the region is underrepresented.
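As a rough illustration of the accumulation in this formula, the sketch below sums per-pixel gradient contributions into one 2D vector per Gaussian and takes its norm. In a real pipeline these gradients would come from the differentiable rasterizer's backward pass. The flat `(gaussian_ids, pixel_grads)` layout and the function name are assumptions made for the example.

```python
import numpy as np


def accumulate_view_gradients(gaussian_ids, pixel_grads, num_gaussians):
    """Sum per-pixel gradients of the projected center for each Gaussian.

    gaussian_ids: (P,) index of the Gaussian each contribution belongs to.
    pixel_grads:  (P, 2) contributions dL^j / d(mu_x, mu_y).
    Returns v with shape (G, 2) and the norms ||v_g||_2 with shape (G,).
    """
    v = np.zeros((num_gaussians, 2))
    np.add.at(v, gaussian_ids, pixel_grads)  # scatter-add that handles repeated ids
    return v, np.linalg.norm(v, axis=1)


ids = np.array([0, 0, 1])
grads = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
v, norms = accumulate_view_gradients(ids, grads, num_gaussians=3)
```

Here Gaussian 2 receives no contributions, so its norm stays at zero, while Gaussians 0 and 1 both end up with norm 3.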

The **supervision signal** for the densification score $\hat{d}_g$ is defined as:
$$
d_g = \log\left(1 + 10^4 \cdot ||v_g||_2\right).
$$
The score loss is:
$$
L^G_{\text{score}} = \mathbb{E}_{g_g \in \mathcal{G}}\left[ ||\hat{d}_g - d_g||_1 \right].
$$
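The target and loss above translate directly into a few lines of NumPy. This is a minimal sketch with hypothetical function names; it treats the gradient norms as given constants.

```python
import numpy as np


def densification_target(grad_norm):
    """d_g = log(1 + 1e4 * ||v_g||_2), computed with log1p for small norms."""
    return np.log1p(1e4 * np.asarray(grad_norm, dtype=float))


def score_loss(pred_scores, grad_norm):
    """Mean absolute error between predicted scores and the log-scaled targets."""
    return float(np.mean(np.abs(np.asarray(pred_scores) - densification_target(grad_norm))))


grad_norm = np.array([0.0, 1e-4, 1e-2])
targets = densification_target(grad_norm)
loss_at_optimum = score_loss(targets, grad_norm)
loss_at_zero = score_loss(np.zeros(3), grad_norm)
```

The $10^4$ factor and the logarithm compress gradient norms that span several orders of magnitude into a range that is easier to regress.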

**Novel View Training & Alignment:** To avoid overfitting to context views, the model is supervised using **novel target views**. Since the predicted camera coordinate system differs from ground truth, a similarity transformation matrix $A \in \text{Sim}(3)$ is estimated to align the ground-truth target pose $T^{tgt}$ to the predicted frame: $\hat{T}^{tgt} = A T^{tgt}$.
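The summary does not say how $A$ is estimated. One common choice, shown here purely as an assumption, is the closed-form Umeyama least-squares fit between corresponding 3D points, for example ground-truth and predicted context camera centers. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares similarity transform A (4 x 4) with dst ~= s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # flip one axis so that R is a rotation, not a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    A = np.eye(4)
    A[:3, :3] = s * R
    A[:3, 3] = t
    return A


# Synthetic check: build a known similarity transform and recover it.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0  # make Q a proper rotation
scale, shift = 2.5, np.array([0.3, -1.0, 4.0])
gt_centers = rng.normal(size=(6, 3))
pred_centers = scale * gt_centers @ Q.T + shift
A = umeyama_sim3(gt_centers, pred_centers)
```

With `A` in hand, a ground-truth target pose given as a 4 x 4 matrix can be mapped into the predicted frame by left-multiplication, matching $\hat{T}^{tgt} = A T^{tgt}$. At least three non-collinear correspondences are needed for the fit to be well defined.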

**Total Loss:** The overall training objective is a weighted sum of four terms (the weighting coefficients are omitted here):
$$
L_{\text{total}} = L_{\text{render}} + L_{\text{score}} + L_{\text{camera}} + L_{\text{scene}},
$$
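where $L_{\text{score}}$ is built from the densification-score loss above, $L_{\text{camera}}$ is a camera loss that this summary does not detail, and $L_{\text{scene}}$ is a scene-scale regularization that normalizes the average distance of Gaussian centers from the origin to 1.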
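The exact form of $L_{\text{scene}}$ and the loss weights are not given in this summary. The sketch below shows one plausible form, a squared penalty on the deviation of the mean center distance from 1, together with a weighted sum using placeholder weights. Both the penalty form and the weights are assumptions, not values from the paper.

```python
import numpy as np


def scene_scale_penalty(centers):
    """Assumed form: squared deviation of the mean center-to-origin distance from 1."""
    mean_dist = np.linalg.norm(centers, axis=1).mean()
    return float((mean_dist - 1.0) ** 2)


def total_loss(l_render, l_score, l_camera, l_scene, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the default weights are placeholders."""
    w_render, w_score, w_camera, w_scene = weights
    return w_render * l_render + w_score * l_score + w_camera * l_camera + w_scene * l_scene


unit_centers = np.array([[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
penalty_unit = scene_scale_penalty(unit_centers)         # centers already at distance 1
penalty_scaled = scene_scale_penalty(3.0 * unit_centers)  # scene three times too large
```

A term of this kind pins down the global scale, which is otherwise ambiguous when camera poses are predicted rather than given.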

## Empirical Validation / Results
Models were trained on RealEstate10K (RE10K) and ACID datasets and evaluated against state-of-the-art pose-free and uncalibrated feed-forward 3DGS methods.

### Quantitative Results
**Multi-View Evaluation (RE10K):** F4Splat achieves competitive or superior performance while using far fewer Gaussians. With a high threshold ($\tau^+$), it uses roughly 20-28% of the baselines' Gaussians yet beats both baselines on all three metrics in Table 1. With a low threshold ($\tau^-$), matched to AnySplat's Gaussian count, it improves further on every metric.

**Table 1: Novel view synthesis performance on RE10K under different numbers of input views.**
| Method | 8 views: #GS ↓ | 8 views: LPIPS ↓ | 8 views: SSIM ↑ | 8 views: PSNR ↑ | 16 views: #GS ↓ | 16 views: LPIPS ↓ | 16 views: SSIM ↑ | 16 views: PSNR ↑ | 24 views: #GS ↓ | 24 views: LPIPS ↓ | 24 views: SSIM ↑ | 24 views: PSNR ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Uncalibrated Baselines** | | | | | | | | | | | | |
| VicaSplat | 524K | 0.258 | 0.686 | 20.77 | 1049K | 0.417 | 0.556 | 16.78 | 1573K | 0.470 | 0.517 | 15.58 |
| AnySplat | 447K | 0.167 | 0.819 | 24.07 | 820K | 0.148 | 0.842 | 25.10 | 1142K | 0.143 | 0.849 | 25.40 |
| **F4Splat $\tau^+$** | **105K** | **0.142** | 0.847 | **25.26** | **210K** | **0.130** | 0.860 | **25.75** | **315K** | **0.128** | 0.862 | **25.85** |
| **F4Splat $\tau^-$** | 447K | **0.131** | **0.859** | **25.64** | 820K | **0.120** | **0.869** | **26.10** | 1142K | **0.119** | **0.870** | **26.18** |
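The Gaussian-count savings in Table 1 can be checked with a few lines of arithmetic. The snippet below uses only the #GS values from the table (in thousands); the dictionary layout and key names are illustrative.

```python
# #GS values from Table 1, in thousands of Gaussians.
gaussians_k = {
    8: {"VicaSplat": 524, "AnySplat": 447, "F4Splat tau+": 105},
    16: {"VicaSplat": 1049, "AnySplat": 820, "F4Splat tau+": 210},
    24: {"VicaSplat": 1573, "AnySplat": 1142, "F4Splat tau+": 315},
}

# Fraction of each baseline's Gaussian count used by F4Splat at the high threshold.
ratios = {
    views: {name: row["F4Splat tau+"] / row[name] for name in ("VicaSplat", "AnySplat")}
    for views, row in gaussians_k.items()
}

for views, r in ratios.items():
    print(f"{views} views: {r['VicaSplat']:.1%} of VicaSplat, {r['AnySplat']:.1%} of AnySplat")
```

The $\tau^+$ setting uses about 20% of VicaSplat's count at every view setting and between roughly 23% and 28% of AnySplat's.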

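**Generalization to Unseen Data (ACID):** When evaluated on ACID as an unseen dataset (Table 2), F4Splat maintains strong performance: both the $\tau^+$ and $\tau^-$ settings beat AnySplat on every metric at every view count, and $\tau^+$ does so with far fewer Gaussians.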

**Table 2: Generalization to unseen datasets (ACID).**
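| Method | 8 views: #GS ↓ | 8 views: LPIPS ↓ | 8 views: SSIM ↑ | 8 views: PSNR ↑ | 16 views: #GS ↓ | 16 views: LPIPS ↓ | 16 views: SSIM ↑ | 16 views: PSNR ↑ | 24 views: #GS ↓ | 24 views: LPIPS ↓ | 24 views: SSIM ↑ | 24 views: PSNR ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Uncalibrated Baselines** | | | | | | | | | | | | |
| AnySplat | 481K | 0.248 | 0.696 | 23.30 | 906K | 0.236 | 0.720 | 23.88 | 1289K | 0.234 | 0.727 | 24.04 |
| **F4Splat $\tau^+$** | **52K** | **0.239** | 0.713 | **24.28** | **105K** | **0.230** | 0.726 | **24.54** | **315K** | **0.216** | 0.741 | **24.72** |
| **F4Splat $\tau^-$** | 481K | **0.204** | **0.744** | **24.83** | 906K | **0.201** | **0.753** | **25.01** | 1289K | **0.203** | 0.752 | 24.88 |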

**Two-View Evaluation (ACID):** F4Splat achieves superior performance among uncalibrated methods and remains competitive with pose-required and pose-free approaches.

**Table 3: Novel view synthesis performance under 2-view setting on ACID.**
| Method | Avg. #GS ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
| :--- | :--- | :--- | :--- | :--- |
| **Uncalibrated** | | | | |
| VicaSplat | 131K | 0.218 | 0.726 | 24.548 |
| **F4Splat $\tau^+$** | **52K** | **0.188** | **0.784** | **26.028** |
| **F4Splat $\tau^-$** | 131K | **0.176** | **0.794** | **26.282** |

### Qualitative Results
As shown in Fig. 6, F4Splat produces sharper details and more faithful reconstructions than baselines, even when using substantially fewer Gaussians (e.g., 24-29% of the baseline count).

### Ablation Studies
Key components of F4Splat are validated through ablations (24 views, fixed 20% Gaussian budget):

**Table 4: Ablation Studies.**
| Variant | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
| :--- | :--- | :--- | :--- |
| (a) Random-based allocation | 0.194 | 0.828 | 24.68 |
| (b) Frequency-based allocation | 0.160 | 0.841 | 25.36 |
| (c) w/o level-wise GS train | 0.192 | 0.813 | 24.25 |
| (d) w/o scene scale reg. | 0.712 | 0.006 | 4.82 |
| **(e) Ours (Full)** | **0.143** | **0.854** | **25.47** |

*   **(a) & (b):** Replacing the learned densification score with random or frequency-based heuristics degrades every metric (LPIPS of 0.194 and 0.160, respectively, vs. 0.143 for the full model), which supports the value of the learned score.
*   **(c):** Removing level-wise Gaussian supervision during training lowers PSNR from 25.47 to 24.25, consistent with its role in stabilizing optimization.
*   **(d):** Removing the scene-scale regularization causes training to fail (PSNR of 4.82), highlighting its critical role for stability in the uncalibrated setting.

## Theoretical and Practical Implications
*   **Theoretical:** F4Splat successfully bridges a key gap between optimization-based and feed-forward 3DGS by making **adaptive density control a learnable, feed-forward prediction**. It demonstrates that gradient-based densification signals can be effectively distilled into a network.
*   **Practical:** The method provides:
    1.  **Explicit Budget Control:** Users can directly specify the desired number of Gaussians for a scene, enabling trade-offs between quality and storage/rendering cost without retraining.
    2.  **Compact, High-Quality Representations:** By concentrating Gaussians on complex details and avoiding redundancy elsewhere, it achieves better quality per Gaussian, reducing memory footprint and potentially accelerating rendering.
    3.  **Robustness:** It works with sparse, uncalibrated input images, making it applicable to real-world scenarios where camera parameters are unknown.

## Conclusion
F4Splat introduces a feed-forward predictive densification framework for 3D Gaussian Splatting. Its core innovation is a **densification-score-guided allocation strategy** that enables spatially adaptive Gaussian distribution from sparse, uncalibrated inputs. This allows explicit control over the Gaussian budget and produces compact yet high-fidelity 3D representations. Extensive experiments show F4Splat achieves state-of-the-art or competitive novel-view synthesis quality while using significantly fewer Gaussians than prior methods, validating the effectiveness of its adaptive allocation approach for efficient feed-forward 3DGS.
