TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Summary (Overview)

Simulation-Ready Mesh Output: TriSplat is a feed-forward model that directly reconstructs 3D scenes as oriented triangle primitives from sparse, unposed images. The output is an explicit triangle mesh that can be used immediately in physics engines (e.g., Unity, NVIDIA Isaac Sim) for simulation, collision detection, and robotic tasks, without any post-processing like TSDF fusion.
Geometry-Anchored Triangle Orientation: Instead of learning triangle orientation as an unconstrained variable, the model constructs it from predicted point-map geometry, refines it with an image-conditioned network, and stabilizes training with a monocular normal bootstrap schedule. This provides a strong geometric prior, leading to more faithful surfaces.
Progressive Sharpening Curriculum: The model uses scheduled opacity and blur parameters to transition from soft, forgiving primitives during early training to crisp, hard-edged surface triangles for final mesh export, ensuring stable optimization.
Superior Surface and Mesh-Rendering Quality: Experiments show TriSplat outperforms state-of-the-art Gaussian feed-forward baselines (e.g., YoNoSplat, MeshSplat) on surface accuracy metrics (Chamfer Distance, F1 score). Crucially, when exported meshes are rendered with a standard triangle rasterizer, TriSplat maintains high quality while Gaussian baselines suffer significant degradation due to lossy TSDF conversion.
High Efficiency: By eliminating the post-hoc mesh extraction step, TriSplat's end-to-end inference (from images to usable mesh) takes about 0.6 s for 6 views, far less than Gaussian baselines, which require costly TSDF fusion whose cost scales with scene volume.

Introduction and Theoretical Foundation

Reconstructing 3D scenes from images is fundamental for robotics, augmented reality, and embodied AI. For practical use in physics simulation, collision checking, and planning, the reconstruction must be an explicit triangle mesh, as this is the native format for engines like Unity, Unreal, and NVIDIA Isaac Sim.

While classical and learned multi-view pipelines can produce meshes, they rely on multi-stage, per-scene optimization and are sensitive to camera calibration and sparse views. Recent feed-forward models predict 3D representations (like 3D Gaussian Splatting primitives) directly from images, bypassing per-scene optimization. However, these methods use Gaussian primitives with only implicit surfaces. Extracting a usable mesh requires expensive post-hoc steps like TSDF fusion or Poisson reconstruction, which breaks the feed-forward promise and often degrades quality.

TriSplat addresses this gap by making the rendering primitive itself a surface element—an oriented triangle. This design is based on three key observations:

For simulation readiness, the rendering primitive must be a surface element (triangle) by construction.
Triangle orientation should be anchored to predicted local geometry rather than learned as an unconstrained variable, improving surface fidelity.
Triangles are sensitive to orientation errors, requiring explicit normal bootstrapping and validity-aware training for stability.

Methodology

Given a sparse set of $V$ unposed images $\{ I_v \}_{v=1}^V$, TriSplat predicts oriented triangle primitives, camera poses, and optional intrinsics in a single forward pass.

3.1 From Images to Triangle Primitives

Backbone: A DINOv2 backbone followed by a custom transformer decoder with alternating intra-view (local) and cross-view (global) attention blocks.
Prediction Heads: Three parallel heads predict:
1. Point Maps: A dense local 3D point map $P \in \mathbb{R}^{H \times W \times 3}$ per view. For each pixel $(u, v)$, the depth is $z = \exp(z')$, where $z'$ is the raw predicted value (the exponential keeps $z$ positive), and the 3D point is $p = z \cdot (u, v, 1)^\top$.
2. Camera Poses: One SE(3) camera-to-world pose per view, relative to the first view.
3. Primitive Attributes: Per-pixel attributes: density logit, scale logits, quaternion, spherical harmonics (appearance), and blur parameter.
Triangle Instantiation: Each triangle is instantiated from a canonical equilateral template $T \in \mathbb{R}^{3 \times 3}$. The $k$-th vertex is $v_k = R_c R_n ( T_k \odot s ) + c, \quad k \in \{1,2,3\}$, where $c$ is the center from the point map, $s$ is the scale vector, $R_c$ is the camera-to-world rotation, and $R_n$ is the tangent-frame rotation that orients the triangle (derived next).
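The two formulas above (point unprojection and vertex placement) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function names, the unit-circumradius template, and the treatment of $(u, v)$ as already-normalized image coordinates are all assumptions.

```python
import math

def unproject(u, v, z_raw):
    """Pixel (u, v) and raw depth output z' -> point p = exp(z') * (u, v, 1)."""
    z = math.exp(z_raw)  # exponential keeps the depth positive
    return (z * u, z * v, z)

def matvec(R, x):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(R[i][j] * x[j] for j in range(3)) for i in range(3))

# Canonical equilateral template in the local xy-plane; each row is one T_k.
# The specific template (unit circumradius, centered at the origin) is assumed.
TEMPLATE = tuple(
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3), 0.0)
    for k in range(3)
)

def instantiate_triangle(center, scale, R_c, R_n):
    """v_k = R_c R_n (T_k * s) + c, with * taken elementwise, for k = 1, 2, 3."""
    verts = []
    for T_k in TEMPLATE:
        local = tuple(t * s for t, s in zip(T_k, scale))
        rotated = matvec(R_c, matvec(R_n, local))
        verts.append(tuple(r + c for r, c in zip(rotated, center)))
    return verts

IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
center = unproject(0.5, -0.25, math.log(2.0))  # depth exp(log 2) = 2
triangle = instantiate_triangle(center, (0.1, 0.1, 0.1), IDENTITY, IDENTITY)
```

Because the assumed template is centered at the origin, the triangle's centroid coincides with the point-map center $c$, so the scale and the two rotations only change the triangle's size and orientation around that point.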

3.2 Anchoring Triangle Orientation to Geometry

To avoid unstable unconstrained orientation learning, triangle orientation is derived from predicted geometry.

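Geometry Normals: Compute raw normals from the point map $P$ using finite differences: $n_{\text{geo}} = \text{normalize}(\Delta_x \times \Delta_y)$, where $\Delta_x$ and $\Delta_y$ are the differences between horizontally and vertically adjacent points of $P$. A validity mask $m$ excludes border/degenerate pixels.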
Learned Refinement: A lightweight U-Net $f_\theta$ refines the normal using appearance ($I_v$), depth ($D_v$), and the validity mask. The refined normal is $n_{\text{ref}} = \text{normalize} ( n_{\text{sm}} + f_\theta ( n_{\text{geo}}, n_{\text{sm}}, I_v, D_v, m ) )$. The network is zero-initialized, so at the start of training the residual vanishes and $n_{\text{ref}} = n_{\text{sm}}$; the refinement therefore begins as an identity map, which stabilizes training.
Mono-Normal Bootstrap: To warm-start training, teacher normals $n_{\text{tch}}$ from a pretrained monocular estimator [49] are blended with the model's normals via a time-varying coefficient $\alpha(t)$: $n_{\text{fwd}} = \text{normalize} ( \alpha(t) n_{\text{tch}} + (1 - \alpha(t)) n_{\text{ref}} )$. The schedule has three phases: takeover ($\alpha=1$), blending (cosine decay), and release ($\alpha=0$).
Tangent Frame Construction: The final normal $n_{\text{fwd}}$ is used to construct an orthonormal frame $[t, b, n_{\text{fwd}}]$, which becomes the rotation matrix $R_n$ for triangle orientation.
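The orientation pipeline above can be illustrated with a minimal plain-Python sketch covering the finite-difference normal, the three-phase $\alpha(t)$ schedule, the teacher blend, and the tangent frame. Several details are assumptions: the forward-difference stencil, the step boundaries `takeover_end` and `blend_end`, and the helper-axis method for choosing the tangent $t$ are not specified in the text, and the learned refinement $f_\theta$ is omitted entirely.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length (assumes a non-zero input)."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def geometry_normal(p, p_right, p_down):
    """n_geo = normalize(dx x dy) using forward differences of the point map."""
    dx = tuple(b - a for a, b in zip(p, p_right))
    dy = tuple(b - a for a, b in zip(p, p_down))
    return normalize(cross(dx, dy))

def alpha(step, takeover_end, blend_end):
    """Takeover (alpha = 1), cosine decay, then release (alpha = 0)."""
    if step <= takeover_end:
        return 1.0
    if step >= blend_end:
        return 0.0
    frac = (step - takeover_end) / (blend_end - takeover_end)
    return 0.5 * (1.0 + math.cos(math.pi * frac))

def blend_normals(n_tch, n_ref, a):
    """n_fwd = normalize(a * n_tch + (1 - a) * n_ref)."""
    return normalize(tuple(a * t + (1.0 - a) * r for t, r in zip(n_tch, n_ref)))

def tangent_frame(n):
    """Return R_n as a tuple of rows whose columns are [t, b, n]."""
    # Pick a helper axis that is not nearly parallel to n (assumed heuristic).
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(helper, n))
    b = cross(n, t)  # unit length because n and t are orthonormal
    return tuple((t[i], b[i], n[i]) for i in range(3))

n_geo = geometry_normal((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
n_tch = normalize((0.3, 0.0, 1.0))  # stand-in for a monocular teacher normal
n_fwd = blend_normals(n_tch, n_geo, alpha(200, 100, 300))
R_n = tangent_frame(n_fwd)
```

Since $t \times b = n$ in this construction, the resulting frame is right-handed and `R_n` is a proper rotation whose third column is the blended normal.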

3.3 Progressive Surface Sharpening

To stabilize early training when predictions are coarse, two parameters are scheduled:

Opacity Scheduling: The density $p$ is mapped to opacity $o$ with an exponent $e(t)$ that ramps up, pushing values toward 0 or 1 (binarizing): $o = \frac{1}{2}\left(1 - (1-p)^{e(t)} + p^{e(t)}\right)$
Blur Scheduling: The blur parameter $\sigma$ decays from an initial soft value to a final crisp value: $\sigma = \text{sigmoid}(\hat{\sigma}) \cdot \beta(t)$. This transitions the representation from blurred, gradient-friendly primitives to sharp surface elements.
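The two scheduled mappings can be evaluated directly. The sketch below implements the opacity and blur formulas exactly as printed above; the linear ramp used for both $e(t)$ and $\beta(t)$, and all numeric start and end values, are hypothetical, since the text does not specify the schedule shapes.

```python
import math

def opacity(p, e):
    """o = 0.5 * (1 - (1 - p)^e + p^e), as printed in the text."""
    return 0.5 * (1.0 - (1.0 - p) ** e + p ** e)

def blur(sigma_hat, beta):
    """sigma = sigmoid(sigma_hat) * beta(t)."""
    return beta / (1.0 + math.exp(-sigma_hat))

def linear_ramp(step, total_steps, start, end):
    """Hypothetical schedule: interpolate linearly from start to end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

TOTAL_STEPS = 1000  # assumed training length for illustration
for step in (0, 500, 1000):
    e_t = linear_ramp(step, TOTAL_STEPS, 1.0, 4.0)      # assumed exponent range
    beta_t = linear_ramp(step, TOTAL_STEPS, 1.0, 0.05)  # assumed blur range
    print(step, round(opacity(0.8, e_t), 4), round(blur(0.0, beta_t), 4))
```

For any positive exponent, the printed opacity mapping leaves $p = 0$, $p = 0.5$, and $p = 1$ unchanged, and the blur value scales in direct proportion to $\beta(t)$.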

3.4 Training Objectives and Mesh Extraction

Training Loss: The model is trained end-to-end with $\mathcal{L} = \mathcal{L}_{\text{photo}} + \mathcal{L}_{\text{cam}} + \mathcal{L}_{\text{normal}}$, which combines a photometric term (RGB + LPIPS), a pairwise relative camera-pose term, and a normal-alignment term.
Mesh Extraction: Export needs no surface-fitting step. After a forward pass, low-opacity triangles are filtered, winding order is corrected, and nearby vertices are merged. The result is a standard triangle mesh ready for use.
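The three export steps (opacity filtering, winding correction, vertex merging) might look roughly like the following plain-Python sketch. It is not the authors' implementation: the opacity threshold, the use of the predicted normal as the winding reference, and grid-quantized welding with cell size `weld_eps` are all assumptions chosen for brevity.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def extract_mesh(triangles, opacities, normals, min_opacity=0.5, weld_eps=1e-3):
    """Return (vertices, faces) from per-triangle vertices, opacity, and normal."""
    vertices, faces, index_of = [], [], {}
    for (a, b, c), o, n in zip(triangles, opacities, normals):
        if o < min_opacity:  # step 1: drop low-opacity triangles
            continue
        # Step 2: flip the winding if the face normal opposes the predicted one.
        if dot(cross(sub(b, a), sub(c, a)), n) < 0:
            b, c = c, b
        # Step 3: weld vertices that round to the same grid cell. Points that
        # are close but straddle a cell boundary are not merged by this scheme.
        face = []
        for v in (a, b, c):
            key = tuple(round(x / weld_eps) for x in v)
            if key not in index_of:
                index_of[key] = len(vertices)
                vertices.append(v)
            face.append(index_of[key])
        if len(set(face)) == 3:  # skip triangles collapsed by welding
            faces.append(tuple(face))
    return vertices, faces

tris = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((1.0002, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)),  # wound backwards
    ((5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)),     # nearly transparent
]
up = (0.0, 0.0, 1.0)
vertices, faces = extract_mesh(tris, [0.9, 0.8, 0.1], [up, up, up])
```

In this example the third triangle is discarded, the second is re-wound to face the same way as the first, and the two surviving triangles share two welded vertices, giving four vertices and two faces.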

Empirical Validation / Results

Experiments were conducted on RealEstate10K (RE10K), DL3DV, and ScanNet (zero-shot). Baselines included Gaussian feed-forward methods (MVSplat, DepthSplat, AnySplat, YoNoSplat) and surface-aware variants (MeshSplat, SurfelSplat). Mesh rendering (using a standard triangle rasterizer on the exported mesh) is the primary evaluation, as it reflects simulation-ready utility.

Surface Reconstruction and Mesh Rendering Quality

Table 1: Surface quality on DL3DV (lower CD and higher F1 are better)

Method	CD ↓ (6 views)	F1 ↑ (6 views)	CD ↓ (12 views)	F1 ↑ (12 views)	CD ↓ (24 views)	F1 ↑ (24 views)
MVSplat	1.143	0.118	0.802	0.135	0.695	0.156
DepthSplat	1.116	0.145	0.907	0.152	0.786	0.152
AnySplat	1.012	0.093	0.731	0.096	0.699	0.100
YoNoSplat	0.920	0.106	0.664	0.092	0.687	0.088
TriSplat (Ours)	0.613	0.287	0.323	0.279	0.310	0.277

Table 3: Quantitative comparison on RE10K (6 views)

Method	CD ↓	F1 ↑	PSNR ↑	LPIPS ↓
MVSplat	0.340	0.358	13.97	0.378
DepthSplat	0.294	0.429	21.23	0.271
AnySplat	0.540	0.110	18.23	0.365
YoNoSplat	0.267	0.443	21.94	0.238
MeshSplat	0.349	0.340	19.97	0.294
SurfelSplat	0.747	0.154	11.18	0.738
TriSplat (Ours)	0.190	0.622	24.69	0.269

Surface Geometry: TriSplat achieves the best surface accuracy (lowest Chamfer Distance, highest F1 score) on both datasets and at every view count. On RE10K, for example, CD drops from 0.267 for the strongest baseline (YoNoSplat) to 0.190, and F1 rises from 0.443 to 0.622. This indicates more complete and faithful geometry.
Mesh Rendering: When the exported mesh is rendered with a triangle rasterizer, TriSplat also achieves the highest PSNR (24.69 vs. 21.94 for YoNoSplat in Table 3), although YoNoSplat retains a slightly lower LPIPS (0.238 vs. 0.269). Gaussian baselines suffer a quality drop due to the lossy TSDF conversion step, while TriSplat's rendering primitives are the mesh, so no information is lost.
Qualitative Results: Visualizations (Figs. 3-6) show TriSplat produces cleaner, more coherent textured meshes with preserved thin structures, while TSDF-fused baselines exhibit blurred boundaries, missing geometry, and fragmentation.
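For readers unfamiliar with the surface metrics in Tables 1 and 3, the following sketch computes a Chamfer Distance and an F1 score between two point sets sampled from the predicted and ground-truth surfaces. The exact convention used in the evaluation is not stated here, so the choices below (unsquared distances, averaging the two directions, and the threshold `tau`) are assumptions; the brute-force nearest-neighbor search is only suitable for small inputs.

```python
import math

def nearest_distances(src, dst):
    """For each point in src, the distance to its nearest neighbor in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_and_f1(pred, gt, tau):
    """Symmetric Chamfer Distance and F1 at distance threshold tau."""
    d_pred = nearest_distances(pred, gt)  # accuracy: prediction -> ground truth
    d_gt = nearest_distances(gt, pred)    # completeness: ground truth -> prediction
    chamfer = 0.5 * (sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt))
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 2 * precision * recall / (precision + recall)

pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd, f1 = chamfer_and_f1(pred_points, gt_points, tau=0.1)
```

Chamfer Distance penalizes both spurious and missing geometry through its two directional terms, while F1 summarizes how much of each surface lies within `tau` of the other, which is why the tables report lower CD and higher F1 as better.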

Depth and Normal Quality (Zero-Shot on ScanNet)

Table 4: Zero-shot depth and normal evaluation on ScanNet

Method	AbsRel ↓	AbsDiff ↓	Mean Normal Error (°) ↓	<30° (%) ↑
MVSplat	0.708	1.206	102.247	17.204
DepthSplat	0.279	0.595	54.861	29.403
AnySplat	0.453	0.283	55.557	25.375
YoNoSplat	0.270	0.516	54.110	41.047
MeshSplat	0.534	0.999	59.803	31.862
SurfelSplat	0.716	1.264	75.300	16.484
TriSplat (Ours)	0.188	0.341	27.901	71.708

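TriSplat demonstrates strong cross-dataset generalization: it achieves the lowest AbsRel (0.188), the second-lowest AbsDiff (0.341, behind AnySplat's 0.283), and markedly better normals than any baseline (mean error 27.901 vs. 54.110 for the next-best method, and 71.708 vs. 41.047 for the proportion within 30°). The normal-quality gain is a direct benefit of its geometry-anchored normal pipeline.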
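The depth and normal metrics in Table 4 follow common definitions, sketched below in plain Python. This is an assumed formulation rather than the evaluation code: inputs are taken to be flat lists of already-valid pixels, and the function names are illustrative.

```python
import math

def depth_metrics(pred, gt):
    """AbsRel = mean(|d - d*| / d*), AbsDiff = mean(|d - d*|)."""
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
    abs_diff = sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)
    return abs_rel, abs_diff

def normal_metrics(pred, gt, threshold_deg=30.0):
    """Mean angular error in degrees and percentage of pixels below threshold."""
    errors = []
    for a, b in zip(pred, gt):
        cos = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        errors.append(math.degrees(math.acos(cos)))
    mean_error = sum(errors) / len(errors)
    within = 100.0 * sum(e < threshold_deg for e in errors) / len(errors)
    return mean_error, within

abs_rel, abs_diff = depth_metrics([1.1, 2.0], [1.0, 2.5])
mean_err, pct_within = normal_metrics(
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
)
```

AbsRel normalizes each error by the true depth while AbsDiff does not, so the two can rank methods differently when errors concentrate at particular depth ranges.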

Efficiency

Figure 8: End-to-end time-to-mesh comparison shows TriSplat is dramatically faster because it requires no post-processing. For 6 input views, TriSplat takes ~0.57 seconds, while the fastest Gaussian baseline (AnySplat) takes ~18.7 seconds (plus TSDF fusion time). This speed advantage grows with the number of views.

Ablation Study

Table 5: Ablation study on RE10K (6 views)

Configuration	CD ↓	F1 ↑	PSNR ↑	LPIPS ↓
Full model	0.190	0.708	23.25	0.318
w/o normal anchoring	0.190	0.651	22.14	0.396
w/o mono-normal bootstrap	0.198	0.643	22.17	0.397
w/o normal refinement	0.193	0.649	21.67	0.429
w/o progressive sharpening	0.191	0.646	21.81	0.416

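Each component contributes to final performance: removing any one lowers F1 by 0.057 to 0.065 and PSNR by more than 1 dB, while CD changes by at most 0.008. Removing the mono-normal bootstrap causes the largest surface degradation (CD 0.198, F1 0.643), while disabling normal refinement hurts rendering quality the most (PSNR 21.67, LPIPS 0.429).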

Theoretical and Practical Implications

Representation Defines Simulation-Readiness: TriSplat demonstrates that choosing a triangle-native representation makes the output simulation-ready by construction. The output is directly consumable by physics and rendering engines, eliminating a major pipeline bottleneck (mesh extraction) that affects methods built on volumetric primitives.
Geometry as a Prior for Rendering: The method successfully shows that strongly anchoring rendering primitives (triangle orientation) to predicted scene geometry leads to superior surface fidelity without sacrificing novel-view synthesis quality. This challenges the notion that unconstrained, learned primitives are optimal for feed-forward reconstruction.
Curriculum Learning for Hard Primitives: The progressive sharpening schedule provides a generalizable strategy for training feed-forward models with hard-edged primitives, which are otherwise prone to gradient starvation during early optimization.

Conclusion

TriSplat presents a feed-forward 3D reconstruction model that natively outputs oriented triangle primitives, making it simulation-ready by design. By anchoring orientation to geometry, using a bootstrap schedule, and employing a progressive sharpening curriculum, it achieves state-of-the-art surface accuracy and mesh-rendering quality while being significantly more efficient than Gaussian-based methods that require post-hoc mesh extraction. The directly exported meshes work seamlessly with physics engines, bridging the gap between feed-forward reconstruction and practical embodied AI applications.

Limitations and Future Work: The exported mesh is a non-manifold "triangle soup" suitable for rendering and physics but not for applications requiring watertight meshes (e.g., finite-element analysis). Future work could explore topology-aware export and adaptive tessellation to control triangle density independent of input resolution.