F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting

Summary (Overview)

  • Problem: Existing feed-forward 3D Gaussian Splatting (3DGS) methods inefficiently allocate Gaussians, using uniform (pixel/voxel-based) pipelines that lead to redundant Gaussians across views and lack control over the final Gaussian count.
  • Core Solution: F4Splat introduces a densification-score-guided allocation strategy that predicts per-region scores to estimate required Gaussian density, enabling spatially adaptive, non-uniform allocation.
  • Key Features:
    • Enables explicit control over the final Gaussian budget without retraining.
    • Reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views.
    • Produces compact, high-fidelity 3D representations from sparse, uncalibrated input images.
  • Result: Achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods while using significantly fewer Gaussians.

Introduction and Theoretical Foundation

3D Gaussian Splatting (3DGS) has emerged as a highly efficient method for 3D scene reconstruction and real-time rendering, using explicit 3D Gaussian primitives. A key strength of optimization-based 3DGS is its Adaptive Density Control (ADC), which iteratively adds or removes Gaussians during training to allocate them efficiently based on scene complexity.

However, conventional 3DGS requires costly per-scene iterative optimization and dense input views. Feed-forward 3DGS methods address this by learning strong 3D priors from large datasets, enabling single-pass reconstruction from sparse views. Yet, these methods eliminate the iterative ADC process. Most adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines, which:

  1. Uniformly allocate Gaussians, leading to redundancy in simple regions and insufficient detail in complex ones.
  2. Fix or inflexibly couple the total number of Gaussians to input resolution or voxel size, preventing explicit budget control.

F4Splat addresses these limitations by reintroducing the concept of adaptive densification into the feed-forward paradigm. It formulates Gaussian densification as a learnable prediction problem, where a network estimates a "densification score" for each spatial region, indicating how many Gaussians are needed.

Methodology

The goal is to develop a feed-forward network $F_\theta$ that, given $N_{ctx}$ uncalibrated context images $\{I^{ctx}_i\}_{i=1}^{N_{ctx}}$ and a user-specified target Gaussian budget $\bar{N}_G$, predicts a set of 3D Gaussians $\mathcal{G} = \{g_g\}_{g=1}^{N_G}$ and camera parameters $\{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}}$:

$$\{g_g\}_{g=1}^{N_G},\ \{\hat{P}^{ctx}_i\}_{i=1}^{N_{ctx}} = F_\theta\left(\{I^{ctx}_i\}_{i=1}^{N_{ctx}},\ \bar{N}_G\right)$$

Each Gaussian $g_g \in \mathbb{R}^{d_G}$ is parameterized by its center $\mu_g$, opacity $\sigma_g$, rotation (quaternion) $q_g$, scale $s_g$, and spherical harmonics $h_g$.
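This parameterization can be sketched as a small container; the field names, the SH layout, and the `Gaussian`/`dim` helpers are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    """One 3D Gaussian primitive g_g with the attributes listed in the text.
    Field names and the SH layout are illustrative, not the paper's code."""
    mu: np.ndarray     # (3,) center mu_g
    sigma: float       # opacity sigma_g
    q: np.ndarray      # (4,) unit-quaternion rotation q_g
    s: np.ndarray      # (3,) per-axis scale s_g
    h: np.ndarray      # (K, 3) spherical-harmonic coefficients h_g, per RGB channel

    def dim(self) -> int:
        # total per-Gaussian parameter count d_G
        return 3 + 1 + 4 + 3 + self.h.size

# degree-1 SH: 4 coefficients per color channel
g = Gaussian(mu=np.zeros(3), sigma=1.0, q=np.array([1.0, 0.0, 0.0, 0.0]),
             s=np.ones(3), h=np.zeros((4, 3)))
```

Under this layout, a degree-1 SH Gaussian has $d_G = 3 + 1 + 4 + 3 + 12 = 23$ parameters.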

Network Architecture

The framework has three main components (Fig. 2):

  1. Geometry Backbone: Based on VGGT, it uses a pretrained DINOv2 encoder and alternating self-attention layers to extract geometric information and predict camera parameters from the input images.
  2. Multi-Scale Prediction Heads: A modified DPT-based decoder with two parallel heads predicts multi-scale Gaussian parameter maps $\{\mathcal{G}^l_i\}_{l=1}^L$ and densification score maps $\{\hat{D}^l_i\}_{l=1}^{L-1}$ from the encoded image tokens.
  3. Spatially Adaptive Gaussian Allocation: Uses the predicted densification scores to decide the appropriate representation level for each region via a thresholding rule, enabling non-uniform allocation.

Spatially Adaptive Gaussian Allocation

The allocation uses binary masks $M^l_{\tau,i} \in \{0, 1\}^{H_l \times W_l}$ to select Gaussians from different scales based on a threshold $\tau$:

$$M^l_{\tau,i} = \begin{cases} \mathbb{1}_{\{\hat{D}^l_i < \tau\}} & \text{if } l = 1, \\ \mathbb{1}_{\{\hat{D}^l_i < \tau\}} \odot \left( \mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) \right) & \text{if } 1 < l < L, \\ \mathbf{1} - \sum_{k=1}^{l-1} \text{Up}(M^{l-k}_{\tau,i}; 2^k) & \text{if } l = L, \end{cases}$$

where $\mathbb{1}_{\{\cdot\}}$ is an indicator function, $\odot$ is the element-wise product, $\text{Up}(\cdot; 2^k)$ is nearest-neighbor upsampling by a factor of $2^k$, and $\mathbf{1}$ is a matrix of ones. This ensures Gaussians are selected exclusively across levels.
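The cascade of masks can be sketched as follows (a minimal numpy version, 0-indexed with the coarsest level first; `allocation_masks` is a hypothetical helper, and nearest-neighbor upsampling is done with `np.kron`):

```python
import numpy as np

def allocation_masks(score_maps, tau):
    """Level-selection masks for the thresholding rule in the text.

    score_maps: list of L-1 arrays, score_maps[l] of shape (H0*2**l, W0*2**l),
    coarsest level first. The finest level L-1 has no score map and acts as
    the catch-all. Returns L binary (0/1) masks, one per level.
    """
    L = len(score_maps) + 1
    h0, w0 = score_maps[0].shape
    masks = []
    for l in range(L):
        # sum of earlier-level selections, upsampled to this level's resolution
        covered = np.zeros((h0 * 2**l, w0 * 2**l), dtype=int)
        for k, m in enumerate(masks):
            covered += np.kron(m, np.ones((2**(l - k), 2**(l - k)), dtype=int))
        if l < L - 1:
            # select this level where the score is below tau and the region
            # was not already claimed by a coarser level
            masks.append((score_maps[l] < tau).astype(int) * (1 - covered))
        else:
            masks.append(1 - covered)  # finest level covers everything left
    return masks
```

Each finest-resolution pixel is claimed by exactly one level, which is the exclusivity property the equation enforces.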

Given a target budget $\bar{N}_G$, a budget-matching algorithm finds the threshold $\tau_{\bar{N}_G}$ that satisfies:

$$0 \le \bar{N}_G - N_{G_{\tau_{\bar{N}_G}}} < 4^{L-1} - 1.$$
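The paper does not spell out the search procedure, but since raising $\tau$ keeps more regions at coarse levels (fewer Gaussians), the count is monotone in $\tau$ and a simple bisection suffices. A sketch under that assumption, with a hypothetical `count_fn` that runs the allocation rule and counts selected Gaussians:

```python
def match_budget(count_fn, target, tol, lo=0.0, hi=1.0, iters=64):
    """Bisect for a threshold tau with 0 <= target - count_fn(tau) < tol.

    Assumes count_fn(tau) is monotonically non-increasing in tau. The
    tolerance tol corresponds to 4**(L-1) - 1 in the text: one coarse cell
    can displace at most that many finer-level Gaussians, so the count
    cannot generally hit the budget exactly.
    """
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        n = count_fn(tau)
        if n > target:            # too many Gaussians -> raise the threshold
            lo = tau
        elif target - n >= tol:   # too few -> lower the threshold
            hi = tau
        else:
            return tau
    return 0.5 * (lo + hi)
```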

Training Strategy: Feed-Forward Predictive Densification

The densification score must be predictable at inference. Inspired by ADC in optimization-based 3DGS, the model learns to predict scores from homodirectional view-space positional gradients derived during training.

For a predicted Gaussian set $\mathcal{G}$, the rendering loss $L_{\text{render}}$ (MSE + LPIPS) between a rendered novel view $\hat{I}^{tgt}$ and the ground truth $I^{tgt}$ is computed. The gradient for Gaussian $g_g$ is:

$$v_g = \left( \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,x}},\ \sum_{j=1}^m \frac{\partial L_{\text{render}}^j}{\partial \bar{\mu}_{g,y}} \right),$$

where $(\bar{\mu}_{g,x}, \bar{\mu}_{g,y})$ is the Gaussian's projected 2D center and $m$ is the number of pixels it affects. A large $\|v_g\|_2$ indicates the region is underrepresented.

The supervision signal for the densification score $\hat{d}_g$ is defined as:

$$d_g = \log\left(1 + 10^4 \cdot \|v_g\|_2\right).$$

The score loss is:

$$L^{\mathcal{G}}_{\text{score}} = \mathbb{E}_{g_g \in \mathcal{G}}\left[ \|\hat{d}_g - d_g\|_1 \right].$$
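Turning the accumulated view-space gradients into targets and the score loss is a small computation; a numpy sketch (function names are illustrative):

```python
import numpy as np

def densification_target(v):
    """d_g = log(1 + 1e4 * ||v_g||_2) for gradients v of shape (N, 2).

    The log compresses the wide dynamic range of positional-gradient
    magnitudes into a range a network can regress."""
    return np.log1p(1e4 * np.linalg.norm(v, axis=-1))

def score_loss(d_hat, v):
    # L1 loss between predicted scores and the gradient-derived targets
    # (the targets act as constants, i.e. detached, during training)
    return float(np.mean(np.abs(d_hat - densification_target(v))))
```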

Novel View Training & Alignment: To avoid overfitting to the context views, the model is supervised on novel target views. Since the predicted camera coordinate system differs from the ground truth, a similarity transformation $A \in \text{Sim}(3)$ is estimated to align the ground-truth target pose $T^{tgt}$ to the predicted frame: $\hat{T}^{tgt} = A T^{tgt}$.
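In homogeneous coordinates the alignment is a single matrix product. A sketch, assuming $A$ is represented as a scale, rotation, and translation (how $A$ is estimated, e.g. Umeyama alignment of camera centers, is not specified in the text):

```python
import numpy as np

def sim3_matrix(s, R, t):
    """4x4 homogeneous matrix for A = (s, R, t) in Sim(3)."""
    A = np.eye(4)
    A[:3, :3] = s * np.asarray(R)   # scaled rotation
    A[:3, 3] = np.asarray(t)        # translation
    return A

def align_target_pose(A, T_tgt):
    # \hat{T}^{tgt} = A T^{tgt}: express the ground-truth target pose in the
    # predicted coordinate frame before rendering-based supervision
    return A @ T_tgt
```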

Total Loss: The overall training objective is a weighted sum:

$$L_{\text{total}} = L_{\text{render}} + L_{\text{score}} + L_{\text{camera}} + L_{\text{scene}},$$

where $L_{\text{scene}}$ is a scene-scale regularization that normalizes the average distance of the Gaussian centers from the origin to 1.

Empirical Validation / Results

Models were trained on RealEstate10K (RE10K) and ACID datasets and evaluated against state-of-the-art pose-free and uncalibrated feed-forward 3DGS methods.

Quantitative Results

Multi-View Evaluation (RE10K): F4Splat achieves competitive or superior performance while using far fewer Gaussians. With a high threshold ($\tau^+$, using roughly 20-30% of the baselines' Gaussian count), it matches or surpasses the baselines; with a low threshold ($\tau^-$, using a comparable count), it consistently outperforms them.

Table 1: Novel view synthesis performance on RE10K under different numbers of input views (metrics shown for the 8-view setting; #GS also listed for 16 views).

  Method              #GS (8v) ↓   LPIPS ↓   SSIM ↑   PSNR ↑   #GS (16v) ↓
  Uncalibrated Baselines
  VicaSplat           524K         0.258     0.686    20.77    1049K
  AnySplat            447K         0.167     0.819    24.07    820K
  F4Splat $\tau^+$    105K         0.142     0.847    25.26    210K
  F4Splat $\tau^-$    447K         0.131     0.859    25.64    820K

Generalization to Unseen Data (ACID): F4Splat generalizes well, maintaining strong performance on the unseen ACID dataset.

Table 2: Generalization to unseen datasets (ACID; metrics shown for the 8-view setting, #GS also listed for 16 views).

  Method              #GS (8v) ↓   LPIPS ↓   SSIM ↑   PSNR ↑   #GS (16v) ↓
  Uncalibrated Baselines
  AnySplat            481K         0.248     0.696    23.30    906K
  F4Splat $\tau^+$    52K          0.239     0.713    24.28    105K
  F4Splat $\tau^-$    481K         0.204     0.744    24.83    906K

Two-View Evaluation (ACID): F4Splat achieves superior performance among uncalibrated methods and remains competitive with pose-required and pose-free approaches.

Table 3: Novel view synthesis performance under the 2-view setting on ACID.

  Method              Avg. #GS ↓   LPIPS ↓   SSIM ↑   PSNR ↑
  Uncalibrated
  VicaSplat           131K         0.218     0.726    24.548
  F4Splat $\tau^+$    52K          0.188     0.784    26.028
  F4Splat $\tau^-$    131K         0.176     0.794    26.282

Qualitative Results

As shown in Fig. 6, F4Splat produces sharper details and more faithful reconstructions than baselines, even when using substantially fewer Gaussians (e.g., 24-29% of the baseline count).

Ablation Studies

Key components of F4Splat are validated through ablations (24 views, fixed 20% Gaussian budget):

Table 4: Ablation Studies.

  Variant                           LPIPS ↓   SSIM ↑   PSNR ↑
  (a) Random-based allocation       0.194     0.828    24.68
  (b) Frequency-based allocation    0.160     0.841    25.36
  (c) w/o level-wise GS train       0.192     0.813    24.25
  (d) w/o scene scale reg.          0.712     0.006    4.82
  (e) Ours (Full)                   0.143     0.854    25.47

  • (a) & (b): Replacing the learned densification score with random or simple frequency-based heuristics significantly degrades performance, demonstrating the learned score's effectiveness.
  • (c): Removing level-wise Gaussian supervision during training hurts performance, confirming its necessity for stable optimization.
  • (d): Removing the scene-scale regularization causes training to fail, highlighting its critical role for stability in the uncalibrated setting.

Theoretical and Practical Implications

  • Theoretical: F4Splat successfully bridges a key gap between optimization-based and feed-forward 3DGS by making adaptive density control a learnable, feed-forward prediction. It demonstrates that gradient-based densification signals can be effectively distilled into a network.
  • Practical: The method provides:
    1. Explicit Budget Control: Users can directly specify the desired number of Gaussians for a scene, enabling trade-offs between quality and storage/rendering cost without retraining.
    2. Compact, High-Quality Representations: By allocating Gaussians efficiently—concentrating on complex details and avoiding redundancy—it achieves superior quality-per-Gaussian, reducing memory footprint and potentially accelerating rendering.
    3. Robustness: It works with sparse, uncalibrated input images, making it applicable to real-world scenarios where camera parameters are unknown.

Conclusion

F4Splat introduces a feed-forward predictive densification framework for 3D Gaussian Splatting. Its core innovation is a densification-score-guided allocation strategy that enables spatially adaptive Gaussian distribution from sparse, uncalibrated inputs. This allows explicit control over the Gaussian budget and produces compact yet high-fidelity 3D representations. Extensive experiments show F4Splat achieves state-of-the-art or competitive novel-view synthesis quality while using significantly fewer Gaussians than prior methods, validating the effectiveness of its adaptive allocation approach for efficient feed-forward 3DGS.