Visual Summary | Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Summary (Overview)

Instance-structured 3D tokenization: Proposes a feed-forward framework that directly decomposes unposed multi-view images into instance-structured 3D token groups, making object instances first-class elements of the representation.
Two-level factorization: Each token group pairs an instance token (entity-level identity) with anchor tokens (local geometry/appearance), which are decoded into 3D Gaussians. This decouples object identity from local appearance.
Joint training from 2D supervision: The model is trained end-to-end using differentiable rendering with RGB reconstruction losses and 2D instance mask losses, requiring no 3D annotations. A linear warm-up on the segmentation loss stabilizes training.
State-of-the-art instance segmentation: Surpasses per-scene optimization baselines (Gaussian Grouping, ObjectGS) and feed-forward methods (IGGT+LUDVIG) in class-agnostic instance segmentation (AP 0.235 vs. next best 0.178) while remaining competitive in novel view synthesis.
Entity-level interface for downstream tasks: The token groups directly enable instance-level scene editing (removal, translation, insertion) and efficient open-vocabulary 3D instance retrieval (complexity scales with ~100 instances rather than 131K Gaussians).

Introduction and Theoretical Foundation

Background: Current feed-forward 3D reconstruction methods [28, 27, 5, 6, 11, 1] produce dense, unstructured sets of points or Gaussians. To add semantics, prior work attaches 2D foundation model features to each primitive, but this does not change the fundamental unit of representation — object-level information remains scattered, and operations like querying, editing, or reasoning still require post-hoc grouping or aggregation [31, 35, 26, 23, 4, 18].

Motivation: The authors argue that a primitive is fundamentally a local geometric fragment; regardless of what feature is attached, it cannot supply entity-level context. For high-level 3D understanding, the representation should make semantic entities first-class units while preserving access to fine-grained details.

Theoretical basis: The paper introduces a two-level factorization:

Instance tokens: Capture entity-level identity and extent.
Anchor tokens: Encode local geometry and appearance, each decoding into multiple 3D Gaussians.

This factorization separates what belongs together (instance structure) from how each part looks (local details), making object instances an explicit, manipulable interface.

Related work connections: The work builds on 3D Gaussian Splatting [12] and feed-forward Gaussian prediction methods [5, 6, 30, 11, 25, 1], but shifts the primary semantic units from primitives to entities. It draws inspiration from object-centric representation learning [17, 3, 7, 16, 10] and instance-aware 3DGS methods [31, 35], but unlike those, the instance structure is learned natively in a feed-forward manner without per-scene optimization.

Methodology

Multi-view feature encoding

Given (V) unposed RGB images, a frozen 3D foundation model (VGGT [27]) extracts multi-view features (F_i \in \mathbb{R}^{H' \times W' \times C}) and pointmaps (P_i \in \mathbb{R}^{H \times W \times 3}). After downsampling pointmaps and fusing with RGB patch features, we obtain context features (X = \{x_j\}_{j=1}^{VH'W'}) that serve as multi-view context for the token decoder.

Token group initialization

Anchor tokens are initialized via farthest point sampling over patch-aligned 3D coordinates: [ A_k^{(0)} = x_{a_k} + \phi_{\text{pos}}(a_k) ] where (x_{a_k}) is the context feature at the selected anchor position, and (\phi_{\text{pos}}) is an MLP projecting the 3D coordinate to the feature dimension. The (L=100) group tokens (G^{(0)}), which act as the instance tokens of the factorization, are initialized as learnable embeddings.
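The anchor initialization can be illustrated with a minimal sketch in plain Python. It assumes a greedy farthest point sampling variant and replaces the positional MLP with a hypothetical fixed projection (toy_pos_embed); neither detail is specified in the source, and the function names are illustrative only.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the selected set."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

def toy_pos_embed(coord):
    """Hypothetical stand-in for the positional MLP: a fixed linear map."""
    return [0.1 * c for c in coord]

def init_anchor_tokens(features, coords, k, pos_embed=toy_pos_embed):
    """Build A_k^(0) = x_{a_k} + phi_pos(a_k) for each sampled index a_k."""
    idx = farthest_point_sampling(coords, k)
    tokens = [[f + e for f, e in zip(features[i], pos_embed(coords[i]))]
              for i in idx]
    return idx, tokens

# Toy data: four patch-aligned 3D coordinates with 3-dimensional features.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
features = [[1.0, 1.0, 1.0] for _ in coords]
anchor_idx, anchor_tokens = init_anchor_tokens(features, coords, k=2)
```

Sampling by distance spreads anchors over the scene geometry instead of over the pixel grid, which is presumably why the coordinates come from the pointmaps.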

Token group decoding

Two cross-attention transformers: [ A = D_{\text{anchor}}(A^{(0)}, X), \quad G = D_{\text{group}}(G^{(0)}, A) ]

(D_{\text{anchor}}) grounds anchor tokens in multi-view context.
(D_{\text{group}}) updates group tokens by attending to decoded anchors, aggregating object-level information.

Anchor-to-group assignment

Each decoded anchor's assignment probability over (L) groups: [ \pi_{k,\ell} = \text{softmax}\left(\left\{\langle A_k, G_{\ell'} \rangle\right\}_{\ell'=1}^{L}\right)\Big|_{\ell} ] This softmax induces competition among group tokens for anchor ownership, analogous to slot competition [17].
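A small sketch of this assignment step, assuming anchors and group tokens are plain lists of floats of equal dimension. The softmax runs over the group axis, so each anchor's probabilities sum to one and the groups compete for it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_anchors(anchors, groups):
    """Return pi, where pi[k][l] is the softmax over l of <A_k, G_l>."""
    return [
        softmax([sum(a * g for a, g in zip(A, G)) for G in groups])
        for A in anchors
    ]

# Toy example: two anchors and two group tokens in a 2-D feature space.
anchors = [[1.0, 0.0], [0.0, 1.0]]
groups = [[5.0, 0.0], [0.0, 5.0]]
pi = assign_anchors(anchors, groups)
```

Because the normalization is per anchor, raising one group's score for an anchor necessarily lowers the others', which is the competition the text refers to.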

Gaussian reconstruction

Each anchor (A_k) is mapped to (N_g = 32) 3D Gaussians by a 2-layer MLP predicting offsets, scale, rotation, opacity, and spherical harmonics. Each Gaussian inherits the assignment score (\pi) of its parent anchor.

Training via joint reconstruction and segmentation

Rendering loss: [ \mathcal{L}_{\text{render}} = \mathcal{L}_{\text{mse}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} ]

Segmentation loss: Render assignment probabilities through alpha compositing to get instance probability maps (\{M_\ell\}). Perform Hungarian matching with ground-truth instance masks (\{\hat{M}_n\}), then apply: [ \mathcal{L}_{\text{seg}} = \lambda_{\text{bce}} \mathcal{L}_{\text{bce}} + \lambda_{\text{dice}} \mathcal{L}_{\text{dice}} ]
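The matching and loss computation might look roughly like the sketch below. Several choices here are assumptions rather than details from the source: the matching cost is taken to be the same weighted BCE plus Dice sum as the loss, both weights default to 1.0, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same optimal assignment on tiny inputs but is factorial in cost; a real implementation would use a proper solver such as scipy.optimize.linear_sum_assignment). Maps are flattened to lists of per-pixel values.

```python
import math
from itertools import permutations

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between a soft map and a binary mask."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with predictions clamped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def match_and_score(pred_maps, gt_masks, w_bce=1.0, w_dice=1.0):
    """Optimal one-to-one matching by exhaustive search (assumes L >= N)."""
    def cost(l, n):
        return (w_bce * bce_loss(pred_maps[l], gt_masks[n])
                + w_dice * dice_loss(pred_maps[l], gt_masks[n]))
    # perm[n] is the index of the predicted map assigned to ground-truth mask n.
    best = min(permutations(range(len(pred_maps)), len(gt_masks)),
               key=lambda perm: sum(cost(l, n) for n, l in enumerate(perm)))
    loss = sum(cost(l, n) for n, l in enumerate(best)) / len(gt_masks)
    return list(best), loss

# Three rendered probability maps (4 pixels each) and two ground-truth masks.
pred_maps = [[0.1, 0.1, 0.9, 0.9],
             [0.9, 0.9, 0.1, 0.1],
             [0.05, 0.05, 0.05, 0.05]]
gt_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
matching, seg_loss = match_and_score(pred_maps, gt_masks)
```

Matching is needed because group indices carry no fixed meaning: any of the L groups may end up representing a given ground-truth instance.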

Full objective: [ \mathcal{L} = \mathcal{L}_{\text{render}} + \lambda_{\text{seg}} \mathcal{L}_{\text{seg}} ] A linear warm-up on (\lambda_{\text{seg}}) over the first 1,500 steps stabilizes early training.
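One plausible reading of the warm-up is a segmentation weight that ramps linearly from zero to its final value over 1,500 steps and then stays constant. The sketch below assumes exactly that; the starting value of zero and the final weight (target=1.0) are placeholders, since the source gives neither.

```python
def seg_weight(step, target=1.0, warmup_steps=1500):
    """Linearly ramp the segmentation weight from 0 to `target`."""
    return target * min(1.0, step / warmup_steps)

def total_loss(render_loss, seg_loss, step, target=1.0):
    """Full objective: render loss plus the warmed-up segmentation term."""
    return render_loss + seg_weight(step, target) * seg_loss

# The segmentation term contributes nothing at step 0 and fully from step 1500.
schedule = [seg_weight(s) for s in (0, 750, 1500, 3000)]
```

Under this reading, early updates are driven mostly by reconstruction, so geometry and appearance can settle before the assignment probabilities are pushed toward the instance masks.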

Decomposed semantic feature distillation

Instead of attaching high-dimensional features to every Gaussian, use a decomposed representation:

Group-level embedding (s_\ell \in \mathbb{R}^D) (shared per instance)
Anchor-level residual (r_k \in \mathbb{R}^d) with (d \ll D)

At rendering, the full per-pixel semantic feature is reconstructed as: [ F_v(u) = \sum_\ell \hat{S}_{v,\ell}(u) s_\ell + W_r \hat{R}_v(u) ] where (\hat{S}_{v,\ell}) and (\hat{R}_v) are rendered group assignment and residual maps.
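For a single pixel, the reconstruction is a weighted sum of shared group embeddings plus a projected residual. The sketch below assumes W_r is a D-by-d matrix that lifts the low-dimensional residual to the full feature dimension; that shape follows from the dimensions given above but is not stated explicitly in the source.

```python
def reconstruct_feature(assign, group_embeds, residual, W_r):
    """F(u) = sum_l S_l(u) * s_l + W_r @ R(u) for one pixel u.

    assign:       rendered group probabilities at the pixel, length L
    group_embeds: L shared embeddings, each of length D
    residual:     rendered anchor residual at the pixel, length d
    W_r:          D x d projection matrix (assumed shape)
    """
    D = len(group_embeds[0])
    shared = [sum(assign[l] * group_embeds[l][i] for l in range(len(assign)))
              for i in range(D)]
    detail = [sum(W_r[i][j] * residual[j] for j in range(len(residual)))
              for i in range(D)]
    return [s + d for s, d in zip(shared, detail)]

# Toy case: L = 2 groups, D = 4 feature dimensions, d = 1 residual dimension.
feature = reconstruct_feature(
    assign=[0.75, 0.25],
    group_embeds=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    residual=[0.5],
    W_r=[[0.0], [0.0], [1.0], [0.0]],
)
```

Only the L group embeddings are stored at full width; everything per-anchor stays at width d, which is where the storage saving comes from.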

Distillation loss: [ \mathcal{L}_{\text{distill}} = \sum_v \sum_u \left[ 1 - \cos(F_v(u), \Phi_v(u)) \right] + \sum_{v,\ell} \left[ 1 - \cos\left( s_\ell, \text{avg}_{\hat{M}_v^\ell}(\Phi_v) \right) \right] ] The first term matches the full reconstructed feature to the foundation model output; the second directly supervises (s_\ell) to capture an object-level semantic summary.
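A single-view sketch of the two terms follows. It assumes the group embeddings have already been put in correspondence with the ground-truth masks (so embedding i pairs with mask i), skips empty masks, and represents features as flat lists of per-pixel vectors; these simplifications are mine, not the paper's.

```python
import math

def cosine(a, b, eps=1e-8):
    """Cosine similarity, guarded against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)

def masked_average(features, mask):
    """Average the teacher features over pixels where the mask is set."""
    picked = [f for f, m in zip(features, mask) if m]
    return [sum(col) / len(picked) for col in zip(*picked)]

def distill_loss(recon, teacher, group_embeds, gt_masks):
    """Per-pixel cosine term plus per-group cosine term, for one view."""
    pixel_term = sum(1.0 - cosine(f, t) for f, t in zip(recon, teacher))
    group_term = sum(
        1.0 - cosine(s, masked_average(teacher, m))
        for s, m in zip(group_embeds, gt_masks)
        if any(m)  # an instance invisible in this view contributes nothing
    )
    return pixel_term + group_term

# Three pixels with 2-D teacher features; two instances.
teacher = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
masks = [[1, 1, 0], [0, 0, 1]]
aligned = distill_loss(teacher, teacher, [[2.0, 0.0], [0.0, 3.0]], masks)
misaligned = distill_loss(teacher, teacher, [[0.0, 1.0], [0.0, 3.0]], masks)
```

Cosine similarity ignores magnitude, so a group embedding only has to point in the direction of the instance's average teacher feature.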

Empirical Validation / Results

Experimental setup: Evaluation on ScanNet [8] with two configurations: 2 context views and 8 context views. Metrics include PSNR, SSIM, LPIPS for reconstruction; mIoU and pixel accuracy for feature lifting; AP, AP50, AP25 for class-agnostic instance segmentation.

Reconstruction and feature lifting (2 context views)

Table 1: Quantitative reconstruction and feature lifting results

Method	Target view mIoU ↑	Target view Acc. ↑	PSNR ↑	SSIM ↑	LPIPS ↓	#Sem. units	Feat. size
LSM [9]	0.512	0.795	24.24	0.821	0.222	131,072	67.1 M
Uni3R [25]	0.558	0.827	25.53	0.873	0.138	131,072	8.4 M
C3G [1]	0.513	0.783	23.89	0.770	0.285	2,048	1.0 M
Ours	0.657	0.789	25.28	0.771	0.238	<100	59.4 K

Our model achieves the best mIoU on both source (0.661) and target (0.657) views by a clear margin, although its target-view accuracy (0.789) trails Uni3R (0.827).
Semantic storage reduced from 8.4M scalars (Uni3R) to 59.4K — a >140× reduction.
Reconstruction quality is competitive; the gap to pixel-aligned baselines (Uni3R) narrows in zero-shot transfer to MipNeRF360.
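The storage comparison can be checked with a few lines of arithmetic on the Table 1 figures. The per-unit widths printed below are obtained by dividing total scalars by the number of semantic units; they are an inference from the table, not numbers stated in the source.

```python
# Semantic units and total feature scalars for the baselines, from Table 1.
table = {
    "LSM": {"units": 131_072, "scalars": 67.1e6},
    "Uni3R": {"units": 131_072, "scalars": 8.4e6},
    "C3G": {"units": 2_048, "scalars": 1.0e6},
}
ours_scalars = 59.4e3  # "Ours" row of Table 1

def reduction_factor(baseline_scalars, ours):
    """How many times smaller our semantic storage is than a baseline's."""
    return baseline_scalars / ours

def scalars_per_unit(entry):
    """Implied feature width per semantic unit (inferred, not reported)."""
    return entry["scalars"] / entry["units"]

for name, entry in table.items():
    factor = reduction_factor(entry["scalars"], ours_scalars)
    width = scalars_per_unit(entry)
    print(f"{name}: {factor:.1f}x larger, about {width:.0f} scalars per unit")
```

The Uni3R ratio comes out at roughly 141, consistent with the ">140×" figure quoted above.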

Class-agnostic instance segmentation (8 context views)

Table 2: Instance segmentation and reconstruction results

Type	Method	AP ↑	AP50 ↑	AP25 ↑	PSNR ↑	SSIM ↑	LPIPS ↓
Per-scene	Gaussian Grouping [31]	0.139	0.288	0.440	23.20	0.715	0.325
Per-scene	ObjectGS [35]	0.178	0.337	0.489	24.34	0.733	0.310
Feed-forward+opt	IGGT+LUDVIG [18]	0.122	0.265	0.442	22.75	0.712	0.323
Feed-forward	Ours	0.235	0.438	0.564	22.41	0.709	0.355

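Our fully feed-forward model surpasses all baselines across all AP metrics, although its reconstruction scores in this setting (PSNR 22.41, SSIM 0.709, LPIPS 0.355) are slightly below those of the baselines.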
Qualitative results show clean, consistent instance boundaries compared to fragmented, noisy outputs from baselines (Figure 5).

Ablations (2 context views)

Table 3: Joint training ablation

Training	PSNR ↑	SSIM ↑	LPIPS ↓	AP ↑	AP50 ↑	AP25 ↑
Sequential	23.65	0.737	0.348	0.032	0.097	0.315
w/o warm-up	23.09	0.732	0.329	0.081	0.186	0.415
Ours	25.11	0.769	0.240	0.193	0.377	0.529

Sequential training and training without warm-up both degrade performance significantly: AP drops from 0.193 to 0.032 and 0.081, respectively, and PSNR falls by about 1.5 and 2.0 dB.

Table 4: Feature lifting decomposition ablation

Variant	mIoU ↑	Acc ↑
Anchor only (residuals)	0.524	0.713
Group only (shared)	0.635	0.767
Group + Anchor (full)	0.657	0.789

The full decomposition outperforms both variants, confirming the division of labor between group-level semantics and anchor-level detail.

Applications

Instance-level manipulation: Figure 6 shows removal, translation, and insertion of objects by operating directly on token groups without post-processing.
Open-vocabulary 3D instance retrieval: Figure 7 demonstrates retrieving "sofa", "toilet", "chair" by matching text queries against group-level embeddings (complexity scales with ~100 instances, not 131K Gaussians).
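To make the editing interface concrete, here is a sketch of removal and translation. It assumes each Gaussian is a dictionary holding a center and the assignment vector inherited from its parent anchor, and that a Gaussian belongs to the group with the highest probability; the data layout and the hard argmax are illustrative assumptions. Insertion is not sketched.

```python
def hard_group(gaussian):
    """Index of the most probable group for this Gaussian (assumed argmax)."""
    pi = gaussian["pi"]
    return max(range(len(pi)), key=pi.__getitem__)

def remove_instance(gaussians, group_id):
    """Drop every Gaussian owned by the given group."""
    return [g for g in gaussians if hard_group(g) != group_id]

def translate_instance(gaussians, group_id, delta):
    """Shift the centers of one group's Gaussians; leave the rest untouched."""
    moved = []
    for g in gaussians:
        if hard_group(g) == group_id:
            center = tuple(c + d for c, d in zip(g["center"], delta))
            g = dict(g, center=center)  # copy, so the input scene is not mutated
        moved.append(g)
    return moved

# Toy scene: two Gaussians belonging to two different groups.
scene = [
    {"center": (0.0, 0.0, 0.0), "pi": [0.9, 0.1]},
    {"center": (1.0, 0.0, 0.0), "pi": [0.2, 0.8]},
]
without_first = remove_instance(scene, group_id=0)
shifted = translate_instance(scene, group_id=1, delta=(0.0, 2.0, 0.0))
```

Both edits are a single pass over the Gaussians keyed on group membership, with no clustering or mask lifting step in between.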

Theoretical and Practical Implications

Representation paradigm shift: The work suggests that prior feed-forward reconstruction methods suffer from a representation mismatch — they treat scenes as unstructured bags of primitives, then attempt to recover object-level structure post-hoc. Instead, the paper shows that building object instances into the representation from the start (via token groups) leads to better instance segmentation and enables direct manipulation, all while requiring far fewer semantic units and storage.

Scalability and efficiency: By concentrating semantics at the instance level (group-level embedding + low-dimensional residuals), the representation reduces semantic storage from millions of scalars to tens of thousands. Retrieval complexity scales with the number of instances (≤100) rather than the number of primitives (131K), which is crucial for real-time applications.
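Retrieval over group embeddings amounts to one similarity comparison per group. The sketch below assumes the text query has already been embedded into the same space as the group embeddings by some text encoder; the toy vectors and the labels in the comments are made up for illustration.

```python
import math

def cosine(a, b, eps=1e-8):
    """Cosine similarity, guarded against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)

def retrieve(query_embed, group_embeds, top_k=1):
    """Rank groups by similarity to the query: one comparison per group."""
    scores = [cosine(query_embed, s) for s in group_embeds]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:top_k]

# Hypothetical 3-D embeddings for three instance groups and one text query.
group_embeds = [
    [1.0, 0.1, 0.0],  # e.g. a sofa-like instance
    [0.0, 1.0, 0.2],  # e.g. a chair-like instance
    [0.1, 0.0, 1.0],  # e.g. a toilet-like instance
]
query = [0.0, 0.9, 0.1]
best = retrieve(query, group_embeds, top_k=2)
```

With at most 100 groups the loop runs at most 100 times per query, whereas a per-primitive search would run once for each of the 131K Gaussians.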

Potential for compositional reasoning: The token group representation could connect 3D scenes to large language models and generative models. Since groups are organized around instances, a language model could treat a scene as a small set of entities, and a generative model could synthesize groups independently and compose them.

Robotics applications: The framework provides a natural interface for robotic perception and planning: instance-level handles for grounding language instructions ("pick up the chair"), a manipulation interface for mental simulation of actions, and efficient forward prediction (operating over <100 tokens rather than the roughly 131K primitives of a dense representation).

Broader scope: The method currently handles bounded indoor scenes with static objects. Extending to outdoor scenes, dynamic scenes, and real-time inference would enable object-centric world models for robotics.

Conclusion

Main contributions:

A feed-forward framework that reconstructs 3D scenes as instance-structured token groups from unposed multi-view images.
Two-level factorization: instance tokens capture entity-level identity, anchor tokens encode local geometry/appearance.
Joint training from 2D RGB and instance mask supervision without 3D annotations.
Decomposed semantic feature distillation reduces storage by orders of magnitude.
State-of-the-art class-agnostic instance segmentation (AP 0.235 vs. 0.178 next best), competitive reconstruction.
Direct instance-level editing and efficient open-vocabulary retrieval without post-processing.

Future directions:

Scaling to outdoor and large-scale scenes (which requires lifting the fixed upper bound of 100 groups).
Using multiple shared semantic tokens per group for complex instances.
Extending to dynamic scenes for robotic applications.
Connecting to large language models for compositional reasoning and generation.

Limitations:

Currently evaluated on bounded indoor scenes.
Static scenes only; dynamic object manipulation requires extension.
Single shared group-level token may be insufficient for highly varied instances.