F4Splat: Feed-Forward Predictive Densification for Feed-Forward 3D Gaussian Splatting
Summary (Overview)
- Problem: Existing feed-forward 3D Gaussian Splatting (3DGS) methods inefficiently allocate Gaussians, using uniform (pixel/voxel-based) pipelines that lead to redundant Gaussians across views and lack control over the final Gaussian count.
- Core Solution: F4Splat introduces a densification-score-guided allocation strategy that predicts per-region scores to estimate required Gaussian density, enabling spatially adaptive, non-uniform allocation.
- Key Features:
- Enables explicit control over the final Gaussian budget without retraining.
- Reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views.
- Produces compact, high-fidelity 3D representations from sparse, uncalibrated input images.
- Result: Achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods while using significantly fewer Gaussians.
Introduction and Theoretical Foundation
3D Gaussian Splatting (3DGS) has emerged as a highly efficient method for 3D scene reconstruction and real-time rendering, using explicit 3D Gaussian primitives. A key strength of optimization-based 3DGS is its Adaptive Density Control (ADC), which iteratively adds or removes Gaussians during training to allocate them efficiently based on scene complexity.
However, conventional 3DGS requires costly per-scene iterative optimization and dense input views. Feed-forward 3DGS methods address this by learning strong 3D priors from large datasets, enabling single-pass reconstruction from sparse views. Yet, these methods eliminate the iterative ADC process. Most adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines, which:
- Uniformly allocate Gaussians, leading to redundancy in simple regions and insufficient detail in complex ones.
- Fix or inflexibly couple the total number of Gaussians to input resolution or voxel size, preventing explicit budget control.
F4Splat addresses these limitations by reintroducing the concept of adaptive densification into the feed-forward paradigm. It formulates Gaussian densification as a learnable prediction problem, where a network estimates a "densification score" for each spatial region, indicating how many Gaussians are needed.
Methodology
The goal is to develop a feed-forward network $f_\theta$ that, given a set of uncalibrated context images $\{I_i\}_{i=1}^{N}$ and a user-specified target Gaussian budget $B$, predicts a set of 3D Gaussians $\mathcal{G}$ and camera parameters $\{\pi_i\}_{i=1}^{N}$:

$$\left(\mathcal{G},\, \{\pi_i\}_{i=1}^{N}\right) = f_\theta\left(\{I_i\}_{i=1}^{N},\, B\right)$$
Each Gaussian $g \in \mathcal{G}$ is parameterized by its center $\mu \in \mathbb{R}^3$, opacity $\alpha \in [0,1]$, rotation (quaternion) $q \in \mathbb{R}^4$, scale $s \in \mathbb{R}^3$, and spherical-harmonics color coefficients $c$.
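Concretely, this parameterization can be held in a small container. The sketch below is illustrative only: the field names, shapes, and SH degree are assumptions, not the paper's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """One 3D Gaussian primitive (illustrative field names, not the paper's API)."""
    center: np.ndarray    # mu in R^3
    opacity: float        # alpha in [0, 1]
    rotation: np.ndarray  # unit quaternion q in R^4
    scale: np.ndarray     # per-axis scale s in R^3
    sh_coeffs: np.ndarray # spherical-harmonics color coefficients

# example instance; degree-3 SH (16 coefficients per RGB channel) is a
# common choice in 3DGS implementations, assumed here for illustration
g = Gaussian3D(
    center=np.zeros(3),
    opacity=0.9,
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity rotation
    scale=np.full(3, 0.01),
    sh_coeffs=np.zeros((16, 3)),
)
```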
Network Architecture
The framework has three main components (Fig. 2):
- Geometry Backbone: Based on VGGT, it uses a pretrained DINOv2 encoder and alternating self-attention layers to extract geometric information and predict camera parameters from the input images.
- Multi-Scale Prediction Heads: A modified DPT-based decoder with two parallel heads predicts multi-scale Gaussian parameter maps and densification score maps from the encoded image tokens.
- Spatially Adaptive Gaussian Allocation: Uses the predicted densification scores to decide the appropriate representation level for each region via a thresholding rule, enabling non-uniform allocation.
Spatially Adaptive Gaussian Allocation
The allocation uses binary masks $M_\ell$ to select Gaussians from the prediction maps at each scale $\ell$ based on a threshold $\tau$. Each mask is built from an indicator function $\mathbb{1}[\cdot]$ applied to the level's densification scores, combined via element-wise product $\odot$ with the complement $\mathbf{1} - M$ (where $\mathbf{1}$ is a matrix of ones) of the neighboring level's mask after nearest-neighbor upsampling. This construction guarantees that each region is represented at exactly one level, so Gaussians are selected exclusively across levels.
Given a target budget $B$, a budget-matching algorithm searches for the threshold $\tau^*$ whose allocation best matches the budget:

$$\tau^* = \arg\min_{\tau}\, \Bigl|\, \textstyle\sum_\ell \|M_\ell(\tau)\|_1 - B \,\Bigr|$$

where $\|M_\ell(\tau)\|_1$ counts the Gaussians selected at level $\ell$ (the masks are binary).
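The allocation-plus-budget-matching loop can be sketched in a minimal two-level form. This is an illustration of the idea only, not the paper's exact multi-scale rule: `allocate`, `count`, and `match_budget` are hypothetical names, and bisection is one simple way to realize the budget search.

```python
import numpy as np

def allocate(score_fine, tau):
    """Two-level threshold allocation (illustrative sketch, not the paper's
    exact multi-scale rule). Cells whose fine-scale densification score
    passes tau are represented at the fine level; 2x2 blocks with no
    selected fine cell fall back to a single coarse Gaussian."""
    h, w = score_fine.shape
    fine = score_fine >= tau
    # which 2x2 blocks contain at least one selected fine cell
    blocks_with_fine = fine.reshape(h // 2, 2, w // 2, 2).any(axis=(1, 3))
    coarse = ~blocks_with_fine
    # blocks that use the fine level keep all four fine cells, so the fine
    # mask and the (upsampled) coarse mask tile the image exclusively
    fine = fine | blocks_with_fine.repeat(2, axis=0).repeat(2, axis=1)
    return fine, coarse

def count(score_fine, tau):
    fine, coarse = allocate(score_fine, tau)
    return int(fine.sum() + coarse.sum())

def match_budget(score_fine, budget, iters=50):
    """Bisect tau toward a user-specified budget; count() is
    non-increasing in tau, so plain bisection converges."""
    lo = float(score_fine.min()) - 1e-6  # count(lo) is maximal
    hi = float(score_fine.max()) + 1e-6  # count(hi) is minimal
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count(score_fine, mid) > budget:
            lo = mid  # too many Gaussians -> raise the threshold
        else:
            hi = mid
    return hi

# toy 4x4 score map: high scores cluster in the top-left block
demo = np.array([[0.9, 0.8, 0.1, 0.1],
                 [0.7, 0.6, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.6, 0.1]])
tau_star = match_budget(demo, budget=10)
```

Raising the threshold pushes more of the image onto the coarse level, which matches the paper's observation that a high threshold yields only a fraction (~20-30%) of the baseline Gaussian count.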
Training Strategy: Feed-Forward Predictive Densification
The densification score must be predictable at inference. Inspired by ADC in optimization-based 3DGS, the model learns to predict scores from homodirectional view-space positional gradients derived during training.
For a predicted Gaussian set $\mathcal{G}$, the rendering loss $\mathcal{L}_{\text{render}}$ (MSE + LPIPS) between a rendered novel view and its ground truth is computed. The homodirectional gradient for Gaussian $g_i$ is:

$$\nabla g_i = \frac{1}{m_i} \left\| \sum_{k=1}^{m_i} \left| \frac{\partial \mathcal{L}_{\text{render}}}{\partial \mu'_{i}} \right|_k \right\|$$

where $\mu'_i$ is the Gaussian's projected 2D center, the sum runs over the $m_i$ pixels the Gaussian affects, and the absolute value is taken element-wise so that opposite-signed per-pixel contributions do not cancel. A large $\nabla g_i$ indicates that the region is underrepresented and needs more Gaussians.
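The homodirectional aggregation can be illustrated numerically. This is a sketch of the common formulation in which per-pixel gradients are made element-wise absolute before summation; the paper's exact aggregation may differ.

```python
import numpy as np

def homodirectional_grad(per_pixel_grads):
    """Homodirectional view-space positional gradient for one Gaussian.

    per_pixel_grads: (m, 2) array of per-pixel contributions to
    dL/d(mu_2d). Element-wise absolute values are taken before summing,
    so contributions pointing in opposite directions cannot cancel."""
    m = len(per_pixel_grads)
    return float(np.linalg.norm(np.abs(per_pixel_grads).sum(axis=0)) / m)

# two pixels pull the Gaussian's 2D center in opposite directions
g = np.array([[0.5, 0.0],
              [-0.5, 0.0]])
plain = float(np.linalg.norm(g.sum(axis=0)) / len(g))  # signed average cancels
homo = homodirectional_grad(g)                         # magnitude survives
```

A Gaussian straddling fine detail receives opposing pulls from its pixels; the plain signed average reports nothing, while the homodirectional form still flags the region for densification.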
The supervision signal $\hat{S}$ for the densification score is derived from these gradient magnitudes: regions whose Gaussians carry large homodirectional gradients receive high target scores. The score loss $\mathcal{L}_{\text{score}}$ is a regression loss between the predicted score map $S$ and the target $\hat{S}$.
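One possible shape for the supervision and loss is sketched below, assuming per-Gaussian gradient magnitudes are splatted to a score map and regressed with an L1 loss. Both the max-aggregation and the L1 form are assumptions; the summary does not pin them down.

```python
import numpy as np

def score_target(grad_mags, centers_px, map_shape):
    """Build a target densification-score map from per-Gaussian gradient
    magnitudes (hypothetical aggregation). Each Gaussian deposits its
    magnitude at its projected 2D cell, keeping the maximum where Gaussians
    overlap; the map is then normalized to [0, 1]."""
    target = np.zeros(map_shape)
    for mag, (r, c) in zip(grad_mags, centers_px):
        target[r, c] = max(target[r, c], mag)
    peak = target.max()
    return target / peak if peak > 0 else target

def score_loss(pred, target):
    """L1 regression between predicted and target score maps (assumed form)."""
    return float(np.abs(pred - target).mean())
```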
Novel View Training & Alignment: To avoid overfitting to the context views, the model is supervised on held-out novel target views. Since the predicted camera coordinate system differs from the ground-truth one, a similarity transformation $T = (s, R, t)$ is estimated to align the ground-truth target pose $P_{\text{gt}}$ to the predicted frame: $\tilde{P} = T \cdot P_{\text{gt}}$.
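A standard way to estimate such a similarity transform from corresponding 3D points is the Umeyama closed form, sketched below: it recovers scale $s$, rotation $R$, and translation $t$ with $\text{dst} \approx s\,R\,\text{src} + t$. The paper's estimator may differ in detail.

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity alignment (Umeyama): returns (s, R, t)
    such that dst ~ s * R @ src + t. A standard routine, shown here as
    one way to realize the pose alignment described in the text."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance of the point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # guard against reflections
    R = U @ S @ Vt
    var_src = (xs ** 2).sum() / len(src)  # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```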
Total Loss: The overall training objective is a weighted sum:

$$\mathcal{L} = \mathcal{L}_{\text{render}} + \lambda_{\text{score}}\, \mathcal{L}_{\text{score}} + \lambda_{\text{reg}}\, \mathcal{L}_{\text{reg}}$$

where $\mathcal{L}_{\text{reg}}$ is a scene-scale regularization that normalizes the average distance of the Gaussian centers from the origin to 1, and $\lambda_{\text{score}}$, $\lambda_{\text{reg}}$ are weighting coefficients.
Empirical Validation / Results
Models were trained on RealEstate10K (RE10K) and ACID datasets and evaluated against state-of-the-art pose-free and uncalibrated feed-forward 3DGS methods.
Quantitative Results
Multi-View Evaluation (RE10K): F4Splat achieves competitive or superior performance while using far fewer Gaussians. With a high threshold (using only ~20-30% of the baselines' Gaussian count), it matches or beats the baselines; with a low threshold (using a similar Gaussian count), it consistently outperforms them.
Table 1: Novel view synthesis performance on RE10K. Metrics are reported for the 8-view setting; Gaussian counts are also listed for 16 input views.

| Method | #GS (8 views) ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | #GS (16 views) ↓ |
|---|---|---|---|---|---|
| *Uncalibrated baselines* | | | | | |
| VicaSplat | 524K | 0.258 | 0.686 | 20.77 | 1049K |
| AnySplat | 447K | 0.167 | 0.819 | 24.07 | 820K |
| F4Splat (reduced budget) | 105K | 0.142 | 0.847 | 25.26 | 210K |
| F4Splat (matched budget) | 447K | 0.131 | 0.859 | 25.64 | 820K |
Generalization to Unseen Data (ACID): F4Splat generalizes well, maintaining strong performance on the unseen ACID dataset.
Table 2: Generalization to the unseen ACID dataset. Metrics are reported for the 8-view setting; Gaussian counts are also listed for 16 input views.

| Method | #GS (8 views) ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ | #GS (16 views) ↓ |
|---|---|---|---|---|---|
| *Uncalibrated baselines* | | | | | |
| AnySplat | 481K | 0.248 | 0.696 | 23.30 | 906K |
| F4Splat (reduced budget) | 52K | 0.239 | 0.713 | 24.28 | 105K |
| F4Splat (matched budget) | 481K | 0.204 | 0.744 | 24.83 | 906K |
Two-View Evaluation (ACID): F4Splat achieves superior performance among uncalibrated methods and remains competitive with pose-required and pose-free approaches.
Table 3: Novel view synthesis performance under 2-view setting on ACID.
| Method | Avg. #GS ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|---|
| *Uncalibrated* | | | | |
| VicaSplat | 131K | 0.218 | 0.726 | 24.548 |
| F4Splat (reduced budget) | 52K | 0.188 | 0.784 | 26.028 |
| F4Splat (matched budget) | 131K | 0.176 | 0.794 | 26.282 |
Qualitative Results
As shown in Fig. 6, F4Splat produces sharper details and more faithful reconstructions than baselines, even when using substantially fewer Gaussians (e.g., 24-29% of the baseline count).
Ablation Studies
Key components of F4Splat are validated through ablations (24 views, fixed 20% Gaussian budget):
Table 4: Ablation Studies.
| Variant | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|
| (a) Random-based allocation | 0.194 | 0.828 | 24.68 |
| (b) Frequency-based allocation | 0.160 | 0.841 | 25.36 |
| (c) w/o level-wise GS train | 0.192 | 0.813 | 24.25 |
| (d) w/o scene scale reg. | 0.712 | 0.006 | 4.82 |
| (e) Ours (Full) | 0.143 | 0.854 | 25.47 |
- (a) & (b): Replacing the learned densification score with random or simple frequency-based heuristics significantly degrades performance, proving the learned score's effectiveness.
- (c): Removing level-wise Gaussian supervision during training hurts performance, confirming its necessity for stable optimization.
- (d): Removing the scene-scale regularization causes training to fail, highlighting its critical role for stability in the uncalibrated setting.
Theoretical and Practical Implications
- Theoretical: F4Splat successfully bridges a key gap between optimization-based and feed-forward 3DGS by making adaptive density control a learnable, feed-forward prediction. It demonstrates that gradient-based densification signals can be effectively distilled into a network.
- Practical: The method provides:
- Explicit Budget Control: Users can directly specify the desired number of Gaussians for a scene, enabling trade-offs between quality and storage/rendering cost without retraining.
- Compact, High-Quality Representations: By allocating Gaussians efficiently—concentrating on complex details and avoiding redundancy—it achieves superior quality-per-Gaussian, reducing memory footprint and potentially accelerating rendering.
- Robustness: It works with sparse, uncalibrated input images, making it applicable to real-world scenarios where camera parameters are unknown.
Conclusion
F4Splat introduces a feed-forward predictive densification framework for 3D Gaussian Splatting. Its core innovation is a densification-score-guided allocation strategy that enables spatially adaptive Gaussian distribution from sparse, uncalibrated inputs. This allows explicit control over the Gaussian budget and produces compact yet high-fidelity 3D representations. Extensive experiments show F4Splat achieves state-of-the-art or competitive novel-view synthesis quality while using significantly fewer Gaussians than prior methods, validating the effectiveness of its adaptive allocation approach for efficient feed-forward 3DGS.