F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

Summary (Overview)

  • Problem: Existing feed-forward 3D Gaussian Splatting (3DGS) methods inefficiently allocate Gaussians, using uniform (pixel/voxel-based) pipelines that lead to redundant Gaussians across views and lack control over the final Gaussian count.
  • Core Solution: F4Splat introduces a densification-score-guided allocation strategy that predicts per-region scores to estimate required Gaussian density, enabling spatially adaptive, non-uniform allocation.
  • Key Features:
    • Enables explicit control over the final Gaussian budget without retraining.
    • Reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views.
    • Produces compact, high-fidelity 3D representations from sparse, uncalibrated input images.
  • Result: Achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods while using significantly fewer Gaussians.

Introduction and Theoretical Foundation

3D Gaussian Splatting (3DGS) has emerged as a highly efficient method for 3D scene reconstruction and real-time rendering, using explicit 3D Gaussian primitives. A key strength of optimization-based 3DGS is its Adaptive Density Control (ADC), which iteratively adds or removes Gaussians during training to allocate them efficiently based on scene complexity.

However, conventional 3DGS requires costly per-scene iterative optimization and dense input views. Feed-forward 3DGS methods address this by learning strong 3D priors from large datasets, enabling single-pass reconstruction from sparse views. Yet, these methods eliminate the iterative ADC process. Most adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines, which:

  1. Uniformly allocate Gaussians, leading to redundancy in simple regions and insufficient detail in complex ones.
  2. Fix or inflexibly couple the total number of Gaussians to input resolution or voxel size, preventing explicit budget control.

F4Splat addresses these limitations by reintroducing the concept of adaptive densification into the feed-forward paradigm. It formulates Gaussian densification as a learnable prediction problem, where a network estimates a "densification score" for each spatial region, indicating how many Gaussians are needed.

Methodology

The goal is to develop a feed-forward network $F_\theta$ that, given $N_{ctx}$ uncalibrated context images $\{I^{ctx}_i\}_{i=1}^{N_{ctx}}$ and a user-specified target Gaussian budget $\bar{N}_G$, predicts a set of 3D Gaussians $\mathcal{G} = \{g_g\}_{g=1}^{N_G}$ and camera parameters $\{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}}$:

$$\{g_g\}_{g=1}^{N_G},\ \{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}} = F_\theta\left(\{I^{ctx}_i\}_{i=1}^{N_{ctx}},\ \bar{N}_G\right)$$

Each Gaussian $g_g \in \mathbb{R}^{d_G}$ is parameterized by its center $\mu_g$, opacity $\sigma_g$, rotation (quaternion) $q_g$, scale $s_g$, and spherical harmonics $h_g$.
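This parameterization can be sketched as a small container; the field names, the SH layout, and the `Gaussian`/`dim` helpers are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    """One 3D Gaussian primitive g_g with the attributes listed in the text.
    Field names and the SH layout are illustrative, not the paper's code."""
    mu: np.ndarray     # (3,) center mu_g
    sigma: float       # opacity sigma_g
    q: np.ndarray      # (4,) unit-quaternion rotation q_g
    s: np.ndarray      # (3,) per-axis scale s_g
    h: np.ndarray      # (K, 3) spherical-harmonic coefficients h_g, per RGB channel

    def dim(self) -> int:
        # total per-Gaussian parameter count d_G
        return 3 + 1 + 4 + 3 + self.h.size

# degree-1 SH: 4 coefficients per color channel
g = Gaussian(mu=np.zeros(3), sigma=1.0, q=np.array([1.0, 0.0, 0.0, 0.0]),
             s=np.ones(3), h=np.zeros((4, 3)))
```

Under this layout, a degree-1 SH Gaussian has $d_G = 3 + 1 + 4 + 3 + 12 = 23$ parameters.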

Network Architecture

The framework has three main components (Fig. 2):

  1. Geometry Backbone: Based on VGGT, it uses a pretrained DINOv2 encoder and alternating self-attention layers to extract geometric information and predict camera parameters from the input images.
  2. Multi-Scale Prediction Heads: A modified DPT-based decoder with two parallel heads predicts multi-scale Gaussian parameter maps $\{\mathcal{G}^l_i\}_{l=1}^L$ and densification score maps $\{\hat{D}^l_i\}_{l=1}^{L-1}$ from the encoded image tokens.
  3. Spatially Adaptive Gaussian Allocation: Uses the predicted densification scores to decide the appropriate representation level for each region via a thresholding rule, enabling non-uniform allocation.

Spatially Adaptive Gaussian Allocation

The allocation uses binary masks $M^l_{\tau,i} \in \{0, 1\}^{H_l \times W_l}$ to select Gaussians from different scales based on a threshold $\tau$:

$$M^l_{\tau,i} = \begin{cases} \mathbb{1}_{\{\hat{D}^l_i < \tau\}} & \text{if } l = 1, \\ \mathbb{1}_{\{\hat{D}^l_i < \tau\}} \odot \left( \mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) \right) & \text{if } 1 < l < L, \\ \mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) & \text{if } l = L, \end{cases}$$

where $\mathbb{1}_{\{\cdot\}}$ is an indicator function, $\odot$ is the element-wise product, $\text{Up}(\cdot; 2^k)$ is nearest-neighbor upsampling by a factor of $2^k$, and $\mathbf{1}$ is a matrix of ones. This ensures Gaussians are selected exclusively across levels.
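The cascade of masks can be sketched as follows (a minimal numpy version, 0-indexed with the coarsest level first; `allocation_masks` is a hypothetical helper, and nearest-neighbor upsampling is done with `np.kron`):

```python
import numpy as np

def allocation_masks(score_maps, tau):
    """Level-selection masks for the thresholding rule in the text.

    score_maps: list of L-1 arrays, score_maps[l] of shape (H0*2**l, W0*2**l),
    coarsest level first. The finest level L-1 has no score map and acts as
    the catch-all. Returns L binary (0/1) masks, one per level.
    """
    L = len(score_maps) + 1
    h0, w0 = score_maps[0].shape
    masks = []
    for l in range(L):
        # sum of earlier-level selections, upsampled to this level's resolution
        covered = np.zeros((h0 * 2**l, w0 * 2**l), dtype=int)
        for k, m in enumerate(masks):
            covered += np.kron(m, np.ones((2**(l - k), 2**(l - k)), dtype=int))
        if l < L - 1:
            # select this level where the score is below tau and the region
            # was not already claimed by a coarser level
            masks.append((score_maps[l] < tau).astype(int) * (1 - covered))
        else:
            masks.append(1 - covered)  # finest level covers everything left
    return masks
```

Each finest-resolution pixel is claimed by exactly one level, which is the exclusivity property the equation enforces.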

Given a target budget $\bar{N}_G$, a budget-matching algorithm finds the threshold $\tau_{\bar{N}_G}$ that satisfies:

$$0 \le \bar{N}_G - N_{G_{\tau_{\bar{N}_G}}} < 4^{L-1} - 1.$$
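The paper does not spell out the search procedure, but since raising $\tau$ keeps more regions at coarse levels (fewer Gaussians), the count is monotone in $\tau$ and a simple bisection suffices. A sketch under that assumption, with a hypothetical `count_fn` that runs the allocation rule and counts selected Gaussians:

```python
def match_budget(count_fn, target, tol, lo=0.0, hi=1.0, iters=64):
    """Bisect for a threshold tau with 0 <= target - count_fn(tau) < tol.

    Assumes count_fn(tau) is monotonically non-increasing in tau. The
    tolerance tol corresponds to 4**(L-1) - 1 in the text: one coarse cell
    can displace at most that many finer-level Gaussians, so the count
    cannot generally hit the budget exactly.
    """
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        n = count_fn(tau)
        if n > target:            # too many Gaussians -> raise the threshold
            lo = tau
        elif target - n >= tol:   # too few -> lower the threshold
            hi = tau
        else:
            return tau
    return 0.5 * (lo + hi)
```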

Training Strategy: Feed-Forward Predictive Densification

The densification score must be predictable at inference. Inspired by ADC in optimization-based 3DGS, the model learns to predict scores from homodirectional view-space positional gradients derived during training.

For a predicted Gaussian set $\mathcal{G}$, the rendering loss $L_{\text{render}}$ (MSE + LPIPS) between a rendered novel view $\hat{I}^{tgt}$ and the ground truth $I^{tgt}$ is computed. The gradient for Gaussian $g_g$ is:

$$v_g = \left( \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,x}},\ \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,y}} \right),$$

where $(\bar{\mu}_{g,x}, \bar{\mu}_{g,y})$ is the Gaussian's projected 2D center and $m$ is the number of pixels it affects. A large $\|v_g\|_2$ indicates the region is underrepresented.

The supervision signal for the densification score $\hat{d}_g$ is defined as:

$$d_g = \log\left(1 + 10^4 \cdot \|v_g\|_2\right).$$

The score loss is:

$$L^{\mathcal{G}}_{\text{score}} = \mathbb{E}_{g_g \in \mathcal{G}}\left[ \|\hat{d}_g - d_g\|_1 \right].$$
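Turning the accumulated view-space gradients into targets and the score loss is a small computation; a numpy sketch (function names are illustrative):

```python
import numpy as np

def densification_target(v):
    """d_g = log(1 + 1e4 * ||v_g||_2) for gradients v of shape (N, 2).

    The log compresses the wide dynamic range of positional-gradient
    magnitudes into a range a network can regress."""
    return np.log1p(1e4 * np.linalg.norm(v, axis=-1))

def score_loss(d_hat, v):
    # L1 loss between predicted scores and the gradient-derived targets
    # (the targets act as constants, i.e. detached, during training)
    return float(np.mean(np.abs(d_hat - densification_target(v))))
```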

Novel View Training & Alignment: To avoid overfitting to the context views, the model is supervised on novel target views. Since the predicted camera coordinate system differs from the ground truth, a similarity transformation $A \in \text{Sim}(3)$ is estimated to align the ground-truth target pose $T^{tgt}$ to the predicted frame: $\hat{T}^{tgt} = A T^{tgt}$.
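In homogeneous coordinates the alignment is a single matrix product. A sketch, assuming $A$ is represented as a scale, rotation, and translation (how $A$ is estimated, e.g. Umeyama alignment of camera centers, is not specified in the text):

```python
import numpy as np

def sim3_matrix(s, R, t):
    """4x4 homogeneous matrix for A = (s, R, t) in Sim(3)."""
    A = np.eye(4)
    A[:3, :3] = s * np.asarray(R)   # scaled rotation
    A[:3, 3] = np.asarray(t)        # translation
    return A

def align_target_pose(A, T_tgt):
    # \hat{T}^{tgt} = A T^{tgt}: express the ground-truth target pose in the
    # predicted coordinate frame before rendering-based supervision
    return A @ T_tgt
```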

Total Loss: The overall training objective is a weighted sum:

$$L_{\text{total}} = L_{\text{render}} + L_{\text{score}} + L_{\text{camera}} + L_{\text{scene}},$$

where $L_{\text{scene}}$ is a scene-scale regularization that normalizes the average distance of the Gaussian centers from the origin to 1.

Empirical Validation / Results

Models were trained on RealEstate10K (RE10K) and ACID datasets and evaluated against state-of-the-art pose-free and uncalibrated feed-forward 3DGS methods.

Quantitative Results

Multi-View Evaluation (RE10K): F4Splat achieves competitive or superior performance while using far fewer Gaussians. With a high threshold ($\tau^+$, using roughly 20-30% of the baselines' Gaussian count), it matches or surpasses the baselines; with a low threshold ($\tau^-$, using a comparable count), it consistently outperforms them.

Table 1: Novel view synthesis performance on RE10K under different numbers of input views (metrics shown for the 8-view setting; #GS also listed for 16 views).

  Method              #GS (8v) ↓   LPIPS ↓   SSIM ↑   PSNR ↑   #GS (16v) ↓
  Uncalibrated Baselines
  VicaSplat           524K         0.258     0.686    20.77    1049K
  AnySplat            447K         0.167     0.819    24.07    820K
  F4Splat $\tau^+$    105K         0.142     0.847    25.26    210K
  F4Splat $\tau^-$    447K         0.131     0.859    25.64    820K

Generalization to Unseen Data (ACID): F4Splat generalizes well, maintaining strong performance on the unseen ACID dataset.

Table 2: Generalization to unseen datasets (ACID; metrics shown for the 8-view setting, #GS also listed for 16 views).

  Method              #GS (8v) ↓   LPIPS ↓   SSIM ↑   PSNR ↑   #GS (16v) ↓
  Uncalibrated Baselines
  AnySplat            481K         0.248     0.696    23.30    906K
  F4Splat $\tau^+$    52K          0.239     0.713    24.28    105K
  F4Splat $\tau^-$    481K         0.204     0.744    24.83    906K

Two-View Evaluation (ACID): F4Splat achieves superior performance among uncalibrated methods and remains competitive with pose-required and pose-free approaches.

Table 3: Novel view synthesis performance under the 2-view setting on ACID.

  Method              Avg. #GS ↓   LPIPS ↓   SSIM ↑   PSNR ↑
  Uncalibrated
  VicaSplat           131K         0.218     0.726    24.548
  F4Splat $\tau^+$    52K          0.188     0.784    26.028
  F4Splat $\tau^-$    131K         0.176     0.794    26.282

Qualitative Results

As shown in Fig. 6, F4Splat produces sharper details and more faithful reconstructions than baselines, even when using substantially fewer Gaussians (e.g., 24-29% of the baseline count).

Ablation Studies

Key components of F4Splat are validated through ablations (24 views, fixed 20% Gaussian budget):

Table 4: Ablation Studies.

  Variant                           LPIPS ↓   SSIM ↑   PSNR ↑
  (a) Random-based allocation       0.194     0.828    24.68
  (b) Frequency-based allocation    0.160     0.841    25.36
  (c) w/o level-wise GS train       0.192     0.813    24.25
  (d) w/o scene scale reg.          0.712     0.006    4.82
  (e) Ours (Full)                   0.143     0.854    25.47

  • (a) & (b): Replacing the learned densification score with random or simple frequency-based heuristics significantly degrades performance, demonstrating the learned score's effectiveness.
  • (c): Removing level-wise Gaussian supervision during training hurts performance, confirming its necessity for stable optimization.
  • (d): Removing the scene-scale regularization causes training to fail, highlighting its critical role for stability in the uncalibrated setting.

Theoretical and Practical Implications

  • Theoretical: F4Splat successfully bridges a key gap between optimization-based and feed-forward 3DGS by making adaptive density control a learnable, feed-forward prediction. It demonstrates that gradient-based densification signals can be effectively distilled into a network.
  • Practical: The method provides:
    1. Explicit Budget Control: Users can directly specify the desired number of Gaussians for a scene, enabling trade-offs between quality and storage/rendering cost without retraining.
    2. Compact, High-Quality Representations: By allocating Gaussians efficiently—concentrating on complex details and avoiding redundancy—it achieves superior quality-per-Gaussian, reducing memory footprint and potentially accelerating rendering.
    3. Robustness: It works with sparse, uncalibrated input images, making it applicable to real-world scenarios where camera parameters are unknown.

Conclusion

F4Splat introduces a feed-forward predictive densification framework for 3D Gaussian Splatting. Its core innovation is a densification-score-guided allocation strategy that enables spatially adaptive Gaussian distribution from sparse, uncalibrated inputs. This allows explicit control over the Gaussian budget and produces compact yet high-fidelity 3D representations. Extensive experiments show F4Splat achieves state-of-the-art or competitive novel-view synthesis quality while using significantly fewer Gaussians than prior methods, validating the effectiveness of its adaptive allocation approach for efficient feed-forward 3DGS.