Map2World: Segment Map Conditioned Text to 3D World Generation

Summary (Overview)

  • Flexible Segment Map Conditioning: Introduces a novel framework for generating 3D worlds from user-defined segment maps of arbitrary shapes and scales, moving beyond restrictive grid-based layouts.
  • Latent Fusion for Global Coherence: Proposes a multi-window latent fusion strategy within a structured latent space to ensure seamless connections and global-scale consistency across expansive environments.
  • Detail Enhancement Network: Designs a detail enhancer network that adds fine-grained details to the generated world while preserving the overall global structure and scene coherence.
  • Domain-Generalized Generation: Leverages strong priors from a pre-trained 3D asset generator (TRELLIS) to achieve robust performance across diverse domains, even with limited world-scale training data.
  • Superior Performance: Demonstrates significant improvements over existing methods in user controllability, scale consistency, and content coherence through extensive qualitative and quantitative evaluations.

Introduction and Theoretical Foundation

Three-dimensional world generation is crucial for applications like immersive content creation and autonomous driving simulation. While recent advances in 3D asset generation have been promising, scaling to the world level remains challenging due to the lack of high-quality world-scale datasets. Existing methods often rely on grid-based layouts (e.g., SynCity), leading to inconsistencies in object scale and weak contextual connections between adjacent tiles. They also struggle with arbitrary-shaped regions, which are common in real-world scenarios.

This paper introduces Map2World, a text-conditioned 3D world generation framework that overcomes these limitations. The core idea is to leverage the powerful prior of a state-of-the-art 3D asset generator, TRELLIS, which uses a structured latent (SLAT) representation. The structured latent s encodes geometry and appearance as a set of local features on a sparse 3D grid:

    s = \{ (z_i, p_i) \}_{i=1}^L,

where p_i \in \{0, 1, ..., N-1\}^3 is the positional index of an active voxel, z_i \in \mathbb{R}^C is the corresponding latent feature vector, and L is the number of active voxels.
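A minimal sketch of this set-of-pairs representation may help make it concrete; the grid size N, channel count C, and voxel count L below are toy values, not the paper's:

```python
# Toy illustration of the structured latent (SLAT): a set of (z_i, p_i) pairs,
# i.e. local feature vectors attached to active voxels on a sparse N^3 grid.
import numpy as np

N, C = 64, 8          # grid resolution and latent channels (assumed toy values)
L = 5                 # number of active voxels in this example

rng = np.random.default_rng(0)
# p_i: integer positions of active voxels, each in {0, ..., N-1}^3
positions = rng.integers(0, N, size=(L, 3))
# z_i: the latent feature vector attached to each active voxel, z_i in R^C
features = rng.standard_normal((L, C))

# The structured latent s is the set of (z_i, p_i) pairs:
slat = list(zip(features, positions))
print(len(slat))  # → 5 active voxels
```

Only the L active voxels are stored, which is what lets the representation scale to large, mostly-empty worlds.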

Map2World builds upon this representation with two key innovations: 1) a latent fusion strategy to expand generation to large, user-defined areas while maintaining coherence, and 2) a detail enhancer to upscale the resolution and quality of the generated world.

Methodology

The overall pipeline consists of two main stages (visualized in Fig. 2 of the paper): 1) generating a coarse structured latent for the entire world conditioned on a segment map, and 2) enhancing the details of this latent.

4.1 Expanding Spatial Regions in 3D Latent Space

To generate worlds larger than TRELLIS's native 64³ cube, Map2World employs a latent fusion strategy inspired by 2D multi-window diffusion.

  • Latent Fusion for Rectified Flow Models: The space is split into overlapping 3D cube windows \{ \Omega_j \}. For a position x, the fused velocity field v_t(x|y) is computed by aggregating predictions from all covering windows A(x) = \{ j \mid x \in \Omega_j \} using a Gaussian kernel W(\cdot):

    v_t(x|y) = \frac{\sum_{j \in A(x)} W(x - c_j) \, v_{t,j}(x|y)}{\sum_{j \in A(x)} W(x - c_j)},

    where c_j is the center of \Omega_j. The latent is updated via the rectified flow formulation: s_{t-1}(x|y) = s_t(x|y) - \Delta t \cdot v_t(x|y).

  • Segment-Map-Guided Latent Fusion: To condition generation on a segment map with K regions (binary masks M_k and text prompts y_k), the velocity is a weighted sum:

    \tilde{v}_t(x) = \frac{\sum_{k=1}^K (M_k(x) \odot G(\sigma_t)) \cdot v_t(x|y_k)}{\sum_{k=1}^K (M_k(x) \odot G(\sigma_t))}.

    G(\sigma_t) is a normalized 3D Gaussian kernel with time-dependent standard deviation \sigma_t, ensuring smooth transitions between segments.

  • Optimization for Scale-Aware Initial Latent: To guide the global scale of the generated world, the initial noisy latent S_T for the sparse structure is optimized. Using a linear approximation of the denoising trajectory, S(t) \approx S_T + (1 - t/T)[G_S(S_T) - S_T]_{sg}, the loss is:

    \mathcal{L}_{linear} = \| y - \mathcal{M}([G_S(S_T) - S_T]_{sg} + S_T) \|_2^2,

    where \mathcal{M} is a target mask and y a target constraint. Parameterizing S in the spectral domain (via a 3D FFT) stabilizes this optimization.
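The multi-window fusion rule and the rectified-flow Euler update above can be sketched numerically. This is an illustrative toy, not the authors' code: the grid size, window layout, Gaussian width, and the stand-in per-window velocity model are all assumptions (the real model is the pretrained TRELLIS flow transformer), and the latent is dense here rather than sparse:

```python
# Toy sketch of multi-window latent fusion for a rectified-flow sampler:
# per-window velocity predictions are blended with a Gaussian kernel centered
# on each window c_j, then one Euler step s_{t-1} = s_t - dt * v_t is taken.
import numpy as np

GRID, WIN, STRIDE = 8, 4, 2        # world grid, window size, overlap stride (toy)

def window_origins(grid, win, stride):
    """Origins of overlapping cube windows Omega_j tiling the grid."""
    r = range(0, grid - win + 1, stride)
    return [(i, j, k) for i in r for j in r for k in r]

def gaussian_weight(win, sigma=1.5):
    """W(x - c_j): weight of each voxel by its distance to the window center."""
    ax = np.arange(win) - (win - 1) / 2.0
    d2 = ax[:, None, None] ** 2 + ax[None, :, None] ** 2 + ax[None, None, :] ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def dummy_velocity(window_latent, t):
    # Stand-in for the flow model's per-window prediction v_{t,j}; returning the
    # latent itself makes the Euler step below contract it toward zero.
    return window_latent

def fused_euler_step(s_t, t, dt):
    num = np.zeros_like(s_t)
    den = np.zeros_like(s_t)
    w = gaussian_weight(WIN)
    for (i, j, k) in window_origins(GRID, WIN, STRIDE):
        sl = (slice(i, i + WIN), slice(j, j + WIN), slice(k, k + WIN))
        num[sl] += w * dummy_velocity(s_t[sl], t)   # Gaussian-weighted votes
        den[sl] += w                                # normalization per voxel
    v = num / np.maximum(den, 1e-8)                 # fused velocity field v_t
    return s_t - dt * v                             # rectified-flow update

s0 = np.random.default_rng(0).standard_normal((GRID, GRID, GRID))
s1 = fused_euler_step(s0, t=1.0, dt=0.1)
```

The segment-map-guided variant follows the same pattern, with per-voxel weights M_k(x) ⊙ G(σ_t) selecting which prompt's velocity dominates at each position.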

4.2 Enriching Details in 3D Latent Space

A detail enhancer network is proposed to upscale the resolution and add fine details to the coarse world latent.

  • Data & Training: Pairs of "large cube" latents (s_O) and their eight subdivided "small cube" latents (s_j, j = 0, ..., 7) are extracted from scene datasets. The enhancer is trained to predict each s_j conditioned on s_O.
  • Network Architecture: The design integrates a new MLP layer (\mathcal{F}_\theta) with the frozen TRELLIS flow transformers (G_{S/L}). Two conditions are used:
    1. Truncated large cube latent (s_{O|j}): provides coarse scene information.
    2. Adjacent cube latents (s_{Adj(j)}): ensure seamless connections.
    Both conditions are concatenated with the noisy latent and processed by \mathcal{F}_\theta before being fed to G_{S/L}:

    v_\theta(s^j_t) = G_{S/L}\left( \mathcal{F}_\theta\left( s^j_t, s_{O|j}, s_{Adj(j)} \right), t \right),

    where s^j_t = (1-t)s^j + t\epsilon. The model is fine-tuned with a flow matching loss \mathcal{L}_\theta = \mathbb{E}_{s^j, t} \| v_\theta(s^j_t) - (\epsilon - s^j) \|_2^2. Sampling is auto-regressive across the 8 small cubes.
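The conditioning scheme and flow-matching objective can be sketched numerically. The latent shapes and the stand-in networks below are illustrative assumptions, not the paper's architecture (in particular, `F_theta` here is just a random linear projection standing in for the trained MLP, and `G_frozen` an identity standing in for the frozen transformer):

```python
# Toy sketch of the detail enhancer's flow-matching training objective:
# noise the small-cube latent, condition on the truncated large-cube latent and
# adjacent cubes via concatenation + MLP, and regress the velocity (eps - s^j).
import numpy as np

rng = np.random.default_rng(0)
C, L = 8, 16                            # latent channels / active voxels (toy)

s_j   = rng.standard_normal((L, C))     # clean small-cube latent s^j
s_O_j = rng.standard_normal((L, C))     # truncated large-cube condition s_{O|j}
s_adj = rng.standard_normal((L, C))     # adjacent-cube condition s_{Adj(j)}
eps   = rng.standard_normal((L, C))     # Gaussian noise
t     = 0.3                             # sampled flow time

s_t = (1 - t) * s_j + t * eps           # noisy latent s^j_t

def F_theta(x, cond_large, cond_adj, W):
    # Stand-in for the new MLP: concatenate noise and conditions channel-wise,
    # then project back to C dimensions before the frozen backbone.
    return np.concatenate([x, cond_large, cond_adj], axis=-1) @ W

def G_frozen(h, t):
    # Stand-in for the frozen TRELLIS flow transformer G_{S/L}.
    return h

W = rng.standard_normal((3 * C, C)) / np.sqrt(3 * C)
v_pred   = G_frozen(F_theta(s_t, s_O_j, s_adj, W), t)
v_target = eps - s_j                    # flow-matching target
loss = np.mean((v_pred - v_target) ** 2)
```

In training, gradients would flow only through \mathcal{F}_\theta's parameters (W here), keeping the pretrained backbone frozen.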

4.3 Fine-tuning SLAT Decoder

The TRELLIS decoder (D_L) is fine-tuned on small cubes from scene data to better reconstruct partial scene geometry and textures, improving the final 3D representation quality (e.g., 3D Gaussian Splatting).

Empirical Validation / Results

Dataset: 35 high-quality scene meshes from Objaverse, filtered via NuiScene labels. 17,500 cube pairs were extracted for training the detail enhancer.

Qualitative Comparison:

  • Arbitrary-Shaped Maps: Map2World successfully generates coherent worlds from free-form segment maps, a task SynCity cannot handle (Fig. 3).
  • Grid Maps: Compared to SynCity on grid-type maps, Map2World produces larger connected structures, denser scenes, and seamless transitions between tiles, avoiding empty gaps and disconnected assets (Fig. 4).

Quantitative Evaluation: A new composite metric, World Quality (WQ), is proposed to evaluate structural quality:

WQ = 0.15S + 0.45W + 0.25C + 0.15R,

where S = sharpness, W = world completeness, C = coherence, and R = realism (all scored by GPT).
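The weighting can be transcribed directly; the component scores in the example below are hypothetical, not taken from the paper's tables:

```python
def world_quality(S, W, C, R):
    """Composite World Quality score: WQ = 0.15*S + 0.45*W + 0.25*C + 0.15*R.

    World completeness (W) carries the largest weight, reflecting the focus
    on coherent world-scale structure. Inputs are GPT scores on a 0-10 scale.
    """
    return 0.15 * S + 0.45 * W + 0.25 * C + 0.15 * R

# Hypothetical component scores:
print(world_quality(8.0, 7.0, 7.5, 7.0))  # ≈ 7.275
```

Note that the four weights sum to 1, so WQ stays on the same 0-10 scale as its components.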

Table 1: Evaluation using the proposed World Quality (WQ) metric

| Model | S (sharpness) | W (world completeness) | C (coherence) | R (realism) | WQ |
|---|---|---|---|---|---|
| GaussianCube | 6.8 | 4.5 | 5.0 | 5.1 | 5.08 |
| SynCity | 8.2 | 6.8 | 7.6 | 7.3 | 7.25 |
| Map2World | 8.0 | 7.8 | 7.9 | 7.6 | 7.76 |

Map2World achieves the highest WQ score, outperforming baselines in world completeness and coherence. GPT-score evaluation on 35 scenes also shows Map2World (7.93/10) outperforming SynCity (7.48/10).

Ablation Studies:

  • Spectral Parameterization: Enables stable and rapid convergence in initial latent optimization for scale control (Fig. 6).
  • Detail Enhancer Design: Ablations (Fig. 7, Table 2) confirm the superiority of the proposed concatenation-based MLP architecture over the IP-Adapter alternative, the detriment of using Classifier-Free Guidance (CFG), and the benefit of decoder fine-tuning.

Table 2: Quantitative comparison for the detail enhancer design

| Architecture | CFG | D_L F.T. | PSNR ↑ | LPIPS ↓ | FID (Incep.v3) ↓ |
|---|---|---|---|---|---|
| (a) Concatenation (Ours) | No | Yes | 22.53 | 0.2137 | 16.98 |
| (b) IP-Adapter | No | Yes | 20.28 | 0.2499 | 29.62 |
| (c) Concatenation | Yes | Yes | 21.95 | 0.2174 | 19.06 |
| (d) Concatenation | No | No | 22.08 | 0.2165 | 17.89 |

Theoretical and Practical Implications

Theoretical Implications:

  • Demonstrates the effective extension of 2D multi-window diffusion paradigms to 3D volumetric latent spaces.
  • Shows that powerful asset-generator priors can be successfully leveraged and controlled for large-scale scene generation through latent-space manipulation techniques (fusion, conditioning, optimization).
  • Introduces a novel training paradigm for detail enhancement that operates in the latent space of a generative model, circumventing the need for paired high-low resolution 3D data.

Practical Implications:

  • Enhanced Controllability: Provides artists and developers with an intuitive tool (segment maps) to design complex 3D worlds with precise semantic and spatial control.
  • Scalable Content Creation: Enables the generation of expansive, coherent virtual environments for gaming, simulation, and VR/AR applications.
  • Data-Efficient Pipeline: Reduces reliance on massive world-scale datasets by bootstrapping from existing asset generators.

Conclusion

Map2World presents a significant advancement in controllable, large-scale 3D world generation. By innovating in latent space fusion for arbitrary segment map conditioning and detail enhancement, it achieves superior global coherence, flexibility, and quality compared to existing grid-based methods. The framework effectively bridges the gap between high-quality 3D asset generation and the creation of expansive, consistent 3D worlds.

Future Directions: Include improving the model's handling of absolute positional encodings, training the detail enhancer on more diverse and photorealistic data, and exploring the integration of relative positional encodings in the base model for better merging behavior.