HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Summary (Overview)

  • Unified Multi-Modal Framework: HY-World 2.0 is the first open-source, systematic world model that seamlessly unifies 3D world generation (from text or single images) and 3D world reconstruction (from multi-view images or videos) within a single offline paradigm.
  • Four-Stage Generation Pipeline: For generation, it introduces a novel pipeline: 1) HY-Pano 2.0 for high-fidelity panorama generation, 2) WorldNav for semantic-aware trajectory planning, 3) WorldStereo 2.0 for memory-driven world expansion in a keyframe latent space, and 4) World Composition using the upgraded WorldMirror 2.0 for 3D Gaussian Splatting (3DGS) asset creation.
  • Key Technical Innovations: Major upgrades include an implicit geometry-free panorama generator, a keyframe-based video diffusion model with robust memory mechanisms (GGM & SSM++), and a feed-forward 3D reconstruction model with normalized position encoding for flexible resolution inference.
  • State-of-the-Art Performance: Extensive experiments show HY-World 2.0 achieves SOTA performance among open-source approaches and delivers results competitive with the closed-source commercial model Marble, while being significantly faster (10 minutes per world).
  • Open Release: The authors release all model weights, code, and technical details to support further research in 3D world modeling.

Introduction and Theoretical Foundation

World models are a transformative AI paradigm for simulating and interacting with complex 3D environments, with applications in VR, robotics, and gaming. Prior work, including the authors' HY-World 1.0 (offline 3D generation) and HY-World 1.5 (online video generation), advanced the field but remained bifurcated. Generative approaches synthesize explorable scenes from sparse inputs but lack reconstruction accuracy, while reconstruction methods recover precise 3D structures from dense views but cannot hallucinate unseen regions. Closed-source models like Marble have shown unification is possible, but an open-source, comprehensive multi-modal foundational world model was lacking.

HY-World 2.0 addresses this gap by designing a framework that dynamically adapts to diverse input modalities (text, single-view, multi-view images, video). Its core motivation is to bridge the "imaginative generation" and "accurate physical reconstruction" capabilities, leveraging the geometric rigor of 3D representations and the rich priors of video generation models to create high-fidelity, navigable 3D worlds.

Methodology

The framework is a four-stage pipeline (Fig. 2) for transforming multi-modal inputs into immersive 3D Gaussian Splatting (3DGS) worlds.

1. World Generation Stage I: Panorama Generation (HY-Pano 2.0)

  • Goal: Synthesize a high-fidelity 360° panorama from text or a single-view image to initialize the world.
  • Data: A hybrid dataset of high-resolution real-world panoramas and synthetic assets from Unreal Engine, filtered for quality.
  • Model: Uses a Multi-Modal Diffusion Transformer (MMDiT). Instead of explicit geometric warping (which requires camera metadata), it learns an implicit, adaptive mapping between the conditional image latent and the panoramic noise latent in a unified token sequence via self-attention.
  • Seamless Output: Employs circular padding in the latent space and linear pixel blending at the ERP edges to eliminate boundary discontinuities.
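The seamless-output step can be sketched concretely. Below is a minimal NumPy illustration of circular padding in the latent space and linear cross-fading at the ERP left/right seam; this is an assumption-laden sketch (function names, blend width, and the exact cross-fade schedule are not from the report), not the HY-Pano 2.0 implementation.

```python
import numpy as np

def circular_pad_latent(latent, pad):
    # latent: (C, H, W) ERP latent. Wrap the left/right borders so that
    # convolutions and local attention see the panorama as horizontally periodic.
    return np.concatenate([latent[..., -pad:], latent, latent[..., :pad]], axis=-1)

def blend_erp_seam(img, blend_w=16):
    # img: (H, W, C) decoded panorama. Linearly cross-fade the two border
    # strips toward each other so column 0 and column W-1 agree at the seam.
    out = img.astype(np.float64).copy()
    ramp = np.linspace(0.5, 0.0, blend_w)[None, :, None]  # weight of the opposite edge
    left = out[:, :blend_w].copy()
    right = out[:, -blend_w:].copy()
    out[:, :blend_w] = (1 - ramp) * left + ramp * right[:, ::-1]
    out[:, -blend_w:] = (1 - ramp[:, ::-1]) * right + ramp[:, ::-1] * left[:, ::-1]
    return out
```

With this schedule both border columns converge to the average of the original edges, so the 0°/360° boundary is continuous after wrap-around.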

2. World Generation Stage II: Trajectory Planning (WorldNav)

  • Goal: Parse the generated panorama and derive optimal, collision-free camera trajectories for subsequent world expansion.
  • Scene Parsing: The panorama is processed to obtain:
    • A panoramic point cloud $P_{pan}$ via MoGe2 with increased view sampling.
    • A panoramic mesh.
    • 2D/3D semantic masks using Qwen3-VL and SAM3.
    • A Navigation Mesh (NavMesh) using Recast Navigation, refined for physical plausibility.
  • Trajectory Modes: WorldNav heuristically designs five complementary trajectory types (Table 1, Fig. 5):
    • Regular: General orbital expansion from the panorama center.
    • Surrounding: Circles around significant foreground objects.
    • Reconstruct-Aware: Iteratively targets under-observed, geometrically degenerate regions.
    • Wandering: Explores farthest reachable points to cover boundaries.
    • Aerial: Augments other trajectories with upward pitch for bird's-eye views.
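Scene parsing begins from a panoramic point cloud. As a minimal sketch of the underlying geometry — how an equirectangular depth map unprojects to 3D, one point per pixel — assuming depth is radial distance from the panorama center (the report obtains $P_{pan}$ from MoGe2 predictions, not from this toy unprojection):

```python
import numpy as np

def erp_depth_to_pointcloud(depth):
    """Unproject an equirectangular (ERP) depth map of shape (H, W) into
    an (H*W, 3) point cloud centered at the panorama origin."""
    H, W = depth.shape
    # pixel-center longitude in [-pi, pi), latitude in (pi/2, -pi/2)
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)          # both (H, W)
    x = depth * np.cos(lat) * np.sin(lon)
    y = depth * np.sin(lat)
    z = depth * np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Every returned point lies at exactly its depth value from the origin, which is the property the NavMesh and trajectory heuristics rely on.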

3. World Generation Stage III: World Expansion (WorldStereo 2.0)

  • Goal: Generate extensive, consistent novel views along the planned trajectories.
  • Key Innovation - Keyframe-VAE: Moves from standard Video-VAE (spatio-temporal compression) to a Keyframe-VAE (spatial-only compression). This preserves high-fidelity details and reduces artifacts under large camera motion (Fig. 8, 9). Keyframes $\{V_i\}$ are encoded independently into latents $\{F_i\}$.
  • Explicit Camera Control: Built on a pre-trained video DiT with a camera adapter. Guidance uses both Plücker rays and point clouds warped from a reference view: $P^{tar}_i(x) \simeq R^{c\to w}_i D(x) K^{-1}_i \hat{x}$, where $R^{c\to w}_i$ and $K_i$ are the target view's extrinsics and intrinsics, $D(\cdot)$ is the reference depth, and $\hat{x}$ is the homogeneous pixel coordinate.
  • Memory Mechanism (Mid-Training): Ensures consistency across generated trajectories.
    • Global-Geometric Memory (GGM): Uses an extended global point cloud $P_{glo} = [P_{ref}, \hat{P}]$ rendered into videos as a strong 3D prior, forcing the model to adhere to geometric structure.
    • Improved Spatial-Stereo Memory (SSM++): For fine-grained consistency. Retrieves relevant keyframes from a memory bank and horizontally stitches each with its target frame (Fig. 11). Uses implicit camera embeddings (encoded from normalized pose vectors) instead of explicit pointmaps. Employs full fine-tuning.
  • Training Stages: Domain adaptation (camera control), mid-training (memory), post-distillation (speed).
  • Post-Distillation: Applies Distribution Matching Distillation (DMD) to accelerate inference to 4 steps. The gradient is: $\nabla \mathcal{L}_{DMD} = -\mathbb{E}_t \left[ \int \left( s_{real}(x_t, t) - s_{fake}(x_t, t) \right) \frac{dx_t}{d\theta} \, dz \right]$
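The point-cloud guidance above is, at its core, a standard unproject-then-project warp. A hedged NumPy sketch of that geometry, assuming a pinhole model and row-vector convention; the actual warping and rendering in WorldStereo 2.0 is more involved than this.

```python
import numpy as np

def unproject_to_world(depth, K, R_c2w, t_c2w):
    """Lift an (H, W) depth map to world-space points:
    P = R^{c->w} (D(x) K^{-1} x_hat) + t, one point per pixel."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x_hat = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = x_hat @ np.linalg.inv(K).T          # K^{-1} x_hat (rows)
    pts_cam = rays * depth.reshape(-1, 1)      # D(x) K^{-1} x_hat
    return pts_cam @ R_c2w.T + t_c2w           # camera -> world

def project_to_view(pts_w, K, R_c2w, t_c2w):
    """Project world points into a target camera, yielding pixel
    coordinates and depths for rendering warped guidance."""
    pts_cam = (pts_w - t_c2w) @ R_c2w          # world -> camera (uses R^T)
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], uvw[:, 2]
```

Unprojecting with one camera and projecting with another produces the warped point-cloud renderings used as conditioning; with identical cameras, the warp is an exact identity.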

4. World Reconstruction & Composition: WorldMirror 2.0

  • Role: A unified feed-forward 3D reconstruction model (Fig. 12) used for both standalone reconstruction and as the core geometry extractor in the world composition stage.
  • Key Improvements over WorldMirror 1.0 (Table 3):
    • Normalized Position Encoding: Replaces absolute RoPE indices with normalized coordinates $[\hat{x}_i, \hat{y}_j] \in [-1, 1]$ (Eq. 4), converting resolution extrapolation into interpolation and enabling flexible high-resolution inference (Fig. 13).
    • Explicit Normal Supervision for Depth: Introduces a depth-to-normal loss $\mathcal{L}_{d2n}$ (Eq. 6) that couples depth and normal predictions, using either GT depth (synthetic) or pseudo-normals (real) as supervision.
    • Depth Mask Prediction: Adds a head to predict per-pixel validity, trained with BCE loss (Eq. 7).
    • Data: Adds UE synthetic data and uses monocular normal predictions as robust pseudo-labels.
    • Training Strategy: Uses token-budget dynamic batch sizing to maximize GPU utilization and a three-stage curriculum (geometry, geometry + $\mathcal{L}_{d2n}$, 3DGS).
    • Inference Efficiency: Employs token/frame Sequence Parallelism (SP), BF16 mixed-precision, and Fully Sharded Data Parallelism (FSDP) for scalable multi-GPU deployment.
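The normalized position encoding is easy to state concretely: map each token's grid index to a coordinate in $[-1, 1]$, so any test resolution lands inside the range seen during training. A sketch assuming pixel-center sampling; the exact coupling with RoPE in WorldMirror 2.0 (Eq. 4) may differ.

```python
import numpy as np

def normalized_positions(h, w):
    """Return an (h, w, 2) grid of (x, y) coordinates in (-1, 1),
    sampled at cell centers. Changing h or w only changes how densely
    the fixed range is sampled (interpolation), never its extent."""
    ys = 2.0 * (np.arange(h) + 0.5) / h - 1.0
    xs = 2.0 * (np.arange(w) + 0.5) / w - 1.0
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gx, gy], axis=-1)
```

With absolute indices, a higher-resolution input pushes positions past the training range; here the range is fixed by construction.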

5. World Generation Stage IV: World Composition

  • Input: Panorama $I_{pan}$, its point cloud $P_{pan}$, and $T_{ex}$ generated keyframes $\{V_i, C_i\}$.
  • Step 1 - Point Cloud Expansion:
    • Reconstruction: A subset of frames is fed into WorldMirror 2.0 with camera pose priors to get depth $\{D^m_i\}$ and normal $\{N^m_i\}$ maps.
    • Depth Alignment: Aligns WorldMirror depth to the world coordinates of $P_{pan}$ via a RANSAC-based linear transform $D^a_i = \gamma_i D^m_i + \beta_i$ (Fig. 14). Uses a composite reliability mask $M_i$ (Eq. 12) and an outlier detection strategy based on the global coefficient distribution (Eq. 13).
  • Step 2 - 3D Gaussian Splatting:
    • Initialization: From the expanded, downsampled point cloud $\tilde{P}$.
    • Optimization: Uses view-independent RGB colors. Integrates MaskGaussian for adaptive pruning. The masked rasterization is: $c(x) = \sum_{k=1}^{N} M_k c_k \sigma_k T_k, \quad T_{k+1} = T_k (1 - M_k \sigma_k)$, with a sparsity loss $\mathcal{L}_{mask} = \lambda_m \left( \frac{1}{N} \sum_{k=1}^{N} M_k \right)^2$.
    • Losses: The total loss is $\mathcal{L}_{GS} = \mathcal{L}_{color} + \mathcal{L}_{geo} + \mathcal{L}_{reg} + \mathcal{L}_{mask}$, where $\mathcal{L}_{color}$ combines L1, SSIM, and LPIPS, and $\mathcal{L}_{geo}$ uses aligned depth and MoGe2 normals.
    • Mesh Extraction: Renders depth from all views, integrates into a TSDF volume, and extracts a mesh via marching cubes for collision detection.
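The masked rasterization and sparsity loss can be illustrated per pixel. A minimal sketch of the two formulas above, assuming Gaussians already sorted front-to-back with per-sample opacities $\sigma_k$ and masks $M_k$; real 3DGS rasterizers run this tiled on the GPU.

```python
import numpy as np

def masked_composite(colors, sigmas, masks):
    """Front-to-back alpha compositing with per-Gaussian masks:
    c = sum_k M_k c_k sigma_k T_k,  with  T_{k+1} = T_k (1 - M_k sigma_k)."""
    T = 1.0                            # remaining transmittance
    c = np.zeros_like(colors[0])
    for c_k, s_k, m_k in zip(colors, sigmas, masks):
        c = c + m_k * c_k * s_k * T    # masked contribution of Gaussian k
        T = T * (1.0 - m_k * s_k)      # masked-out Gaussians leave T unchanged
    return c, T

def sparsity_loss(masks, lam=1.0):
    # L_mask = lambda_m * (mean over masks)^2, encouraging adaptive pruning
    m = np.asarray(masks, dtype=float)
    return lam * m.mean() ** 2
```

Setting a mask to 0 is exactly equivalent to removing that Gaussian from the sum, which is what makes mask-driven pruning lossless at $M_k = 0$.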

Empirical Validation / Results

8.1 World Generation from Text or Single Image

  • HY-Pano 2.0: Achieves best scores on CLIP and Q-Align metrics for both T2P and I2P tasks (Table 4). Qualitatively, it produces more coherent layouts, finer details, and better aesthetics than competitors (Fig. 17, 18).
  • WorldNav: Ablation shows each trajectory mode progressively improves scene completeness and eliminates blind spots (Fig. 19).
  • WorldStereo 2.0:
    • Single-View Reconstruction: Achieves highest point cloud F1 and AUC scores on Tanks-and-Temples and MipNeRF360 (Table 5).
    • Camera Control: Outperforms all video-based competitors in camera metrics (RotErr, TransErr) and visual quality (Table 6).
    • Ablations: Keyframe-VAE with frozen cross-attention & FFN layers gives the best balance of camera control and visual quality (Table 7). Memory training (GGM+SSM++) significantly improves photometric metrics and consistency (Table 8).
  • World Composition:
    • Reconstruction: WorldMirror 2.0 with depth alignment produces higher-quality point clouds faster (2 min vs 5 hours) than video2world [21] (Fig. 20).
    • 3DGS: The full pipeline (voxel downsample + adaptive densification (non-sky) + MaskGaussian) reduces Gaussian count by 77% while maintaining near-baseline quality (PSNR 25.023 vs 25.176) (Table 9).
  • Full System Comparison with Marble: HY-World 2.0 generates 3DGS worlds with higher fidelity to input conditions, sharper textures, and better geometric consistency across novel views (Fig. 23, 24).
  • Runtime: The end-to-end pipeline generates a complete 3D world in ~712 seconds (Table 10).

8.2 World Reconstruction from Multi-View Images or Video

  • WorldMirror 2.0 Benchmarks: Evaluated across resolutions (Low/Medium/High).
    • Point Map Reconstruction: Outperforms all baselines. At high resolution, WM 2.0's accuracy error on 7-Scenes (0.037, lower is better) is less than half that of WM 1.0 (0.079) (Table 11).
    • Camera Pose & Depth: WM 2.0 maintains high AUC@30 (~86.9) at high resolution, while WM 1.0 collapses to 66.3 (Table 12).
    • Novel View Synthesis: WM 2.0 maintains stable PSNR (~20.0) across resolutions, while WM 1.0 drops from 21.34 to 17.78 (Table 12).
    • Surface Normal Estimation: Achieves best results across ScanNet, NYUv2, iBims-1 at medium resolution (mean error 12.3) and generalizes well to high resolution (12.5) (Table 13).
  • Qualitative: WM 2.0 produces sharper normals, more consistent point clouds (Fig. 25), and stable reconstructions at high resolution where WM 1.0 fails (Fig. 26).
  • Prior Injection: WM 2.0 demonstrates stronger use of geometric priors (camera, intrinsics, depth) than competitors like MapAnything, especially at high resolution (Fig. 27).
  • Inference Efficiency: Combining SP, BF16, and FSDP on 4 GPUs enables processing 256 views at 518×378 resolution in 17.52s with 78.78 GB/GPU, a 3.2x speedup over the baseline (Table 14).

Theoretical and Practical Implications

  • Theoretical: HY-World 2.0 provides a proven, open-source blueprint for unifying generative and reconstructive paradigms in 3D world modeling. It demonstrates the effectiveness of key technical innovations: implicit panorama generation, keyframe-based diffusion with geometric memories, and normalized encoding for resolution-agnostic inference.
  • Practical: The framework enables the rapid creation (10 minutes) of high-fidelity, navigable 3D assets from versatile inputs, directly applicable to game development, virtual reality, robotics simulation, and environment mapping. The release of all components democratizes access to state-of-the-art world modeling technology, fostering further research and application development. The included WorldLens rendering platform supports interactive exploration with characters, lighting, and collision detection.

Conclusion

HY-World 2.0 presents a comprehensive, open-source multi-modal world model that successfully bridges 3D world generation and reconstruction. Through systematic advancements in its four-stage pipeline—panorama generation, trajectory planning, world expansion, and composition—it achieves state-of-the-art performance, competitive with leading closed-source models. The work's main takeaways are the viability of a unified framework, the critical importance of memory and geometric consistency in generative expansion, and the value of resolution-flexible reconstruction. By releasing all models and code, the authors aim to establish a robust foundation for future research in spatial intelligence and 3D world modeling.