Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Summary (Overview)

  • Core Contribution: Code-as-Room (CaR) is a novel Multimodal Large Language Model (MLLM)-based agentic framework that generates executable Blender code to synthesize complete, editable 3D indoor scenes from a single top-down view image.
  • Structured Execution: To overcome the instability and infinite looping of existing agents, CaR introduces a structured execution harness that decomposes the complex task into a principled, multi-stage pipeline: scene structuring, layout generation, object profiling, object-level code generation, and interior decoration.
  • Cross-Stage Memory: A key innovation is a cross-stage memory module that stores intermediate outputs (e.g., scene graphs, object profiles) and shares them across stages, effectively mitigating the pervasive "context forgetting" problem in long agentic workflows.
  • Comprehensive Benchmark: The authors introduce a dedicated benchmark for evaluating code-based 3D room synthesis, assessing models across visual understanding, spatial reasoning, code generation, and holistic scene quality.
  • Superior Performance: Extensive experiments show that CaR's harness significantly improves the stability and quality of generation for various MLLM backbones (Gemini, GPT), outperforming direct generation and existing agent-based methods like VIGA in terms of layout alignment, spatial consistency, and scene usability.

Introduction and Theoretical Foundation

Designing realistic 3D indoor rooms is essential for interior design, virtual reality, gaming, and embodied AI. Manual creation is labor-intensive, and while procedural or rule-based methods exist, they are limited by hand-crafted rules. Recent MLLM-based approaches show promise but have key limitations:

  • Text-based methods struggle to capture precise spatial information from descriptions.
  • Existing image-conditioned agents (e.g., VIGA) suffer from instability and infinite loops when tasked with holistic room generation from top-down views, failing to recover fine-grained details.

The paper proposes a new paradigm: using a top-down layout image as a global spatial prior to guide complete 3D room generation. This mirrors real-world design workflows where floor plans are common starting points. The core idea is to represent the final 3D scene as executable Blender code, which offers interpretability and editability. The challenge is to stably translate a complex visual input into a coherent, multi-faceted code program.

Methodology

The CaR framework is a multi-stage agentic pipeline orchestrated by a structured execution harness. The overall process is formulated as generating executable Blender code $C$ from an input image $I$ via an agent $A$:

$$C = A(I)$$

Due to complexity, the process is decomposed into coarse ($CU$) and fine ($FU$) stages:

$$D_{CU} = U_{CU}(I, M), \quad C_{layout} = G_{CG}(I, D_{CU}, M)$$

$$D_{FU} = U_{FU}(I, C_{layout}, M), \quad C = G_{FG}(I, C_{layout}, D_{FU}, M)$$

where $D$ denotes understanding results, $G$ denotes generators, and $M$ is the cross-stage memory.
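The coarse-to-fine decomposition can be read as four chained calls that all share one memory object. The sketch below is only an illustration of that data flow: `run_car`, the `toy_*` stubs, and the `add_proxy` string are hypothetical names, and a plain dict stands in for the memory $M$.

```python
from typing import Any, Callable, Dict

Memory = Dict[str, Any]  # stand-in for the cross-stage memory M (assumption)


def run_car(image: Any, u_cu: Callable, g_cg: Callable,
            u_fu: Callable, g_fg: Callable, memory: Memory) -> Any:
    """Chain the coarse and fine stages; every call also sees the memory."""
    d_cu = u_cu(image, memory)                   # D_CU = U_CU(I, M)
    memory["D_CU"] = d_cu
    c_layout = g_cg(image, d_cu, memory)         # C_layout = G_CG(I, D_CU, M)
    memory["C_layout"] = c_layout
    d_fu = u_fu(image, c_layout, memory)         # D_FU = U_FU(I, C_layout, M)
    memory["D_FU"] = d_fu
    return g_fg(image, c_layout, d_fu, memory)   # C = G_FG(I, C_layout, D_FU, M)


# Toy stand-ins for the model calls (hypothetical, for illustration only).
def toy_u_cu(image, memory):
    return {"objects": ["bed", "desk"]}


def toy_g_cg(image, d_cu, memory):
    return [f"add_proxy('{name}')" for name in d_cu["objects"]]


def toy_u_fu(image, c_layout, memory):
    # The fine stage can read coarse results back from memory.
    return {name: "wood" for name in memory["D_CU"]["objects"]}


def toy_g_fg(image, c_layout, d_fu, memory):
    lines = [f"{line}  # material: {mat}"
             for line, mat in zip(c_layout, d_fu.values())]
    return "\n".join(lines)


if __name__ == "__main__":
    print(run_car("plan.png", toy_u_cu, toy_g_cg, toy_u_fu, toy_g_fg, {}))
```

Passing the stage functions as arguments keeps the orchestration independent of any particular MLLM backbone, which matches how the harness is described as wrapping different models.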

1. Cross-Stage Memory System

A shared memory $M$ stores typed artifacts $e_s = \langle s, \tau_s, O_s, \eta_s \rangle$ from each stage $s$. Memory is updated via:

$$M_s = M_{s-1} \oplus e_s$$

Downstream stages read predefined views of $M$, ensuring consistency and reducing prompt noise.
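A minimal sketch of such a memory is an append-only log of artifacts with filtered views. The summary does not spell out the tuple fields, so reading $\tau_s$ as a type tag, $O_s$ as the stage output, and $\eta_s$ as metadata is an assumption here, as are the names `Artifact`, `CrossStageMemory`, and `view`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Artifact:
    stage: int                                           # s
    type_tag: str                                        # tau_s (assumed meaning)
    output: Any                                          # O_s (assumed meaning)
    meta: Dict[str, Any] = field(default_factory=dict)   # eta_s (assumed meaning)


class CrossStageMemory:
    """Append-only memory: M_s = M_{s-1} (+) e_s."""

    def __init__(self, artifacts: Tuple[Artifact, ...] = ()) -> None:
        self._artifacts = tuple(artifacts)

    def __len__(self) -> int:
        return len(self._artifacts)

    def append(self, artifact: Artifact) -> "CrossStageMemory":
        # Return a new memory so earlier states stay untouched.
        return CrossStageMemory(self._artifacts + (artifact,))

    def view(self, *type_tags: str) -> Dict[str, Any]:
        """Predefined view: latest output for each requested artifact type."""
        latest: Dict[str, Any] = {}
        for artifact in self._artifacts:
            if artifact.type_tag in type_tags:
                latest[artifact.type_tag] = artifact.output
        return latest


if __name__ == "__main__":
    m = CrossStageMemory()
    m = m.append(Artifact(1, "scene_description", {"zones": ["sleeping"]}))
    m = m.append(Artifact(2, "scene_graph", {"nodes": ["bed"]}))
    # A layout stage would request only the artifacts it needs.
    print(m.view("scene_graph"))
```

Restricting each stage to a named view, rather than passing the full history, is one way to get the reduced prompt noise the text mentions.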

2. Image-based Scene Structuring (Stages 1-2)

  • Stage 1: Spatial Semantic Analysis. The VLM parses $I$ to extract a schema-constrained description $D_1$ of functional zones, object hierarchies (with identifiers, categories, placement types), and architectural elements (walls, doors, windows).
  • Stage 2: Object-centric Scene Graph Construction. Reads $\{I, D_1\}$ to build a deterministic skeleton scene graph $S = \{V_{arch}, V_{major}, E_{parent}, M_{minor}\}$. The VLM completes attributes and forward relations. The final graph $G = (V, E)$ and minor-object sidecar $M_{minor}$ are written to memory.
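The skeleton graph of Stage 2 can be pictured as a small data structure filled deterministically from the Stage 1 description, before any model call completes attributes. The field names in the input dict (`architecture`, `objects`, `id`, `placement`, `parent`) and the `"minor"` placement value are hypothetical; the summary only says that $D_1$ carries identifiers, categories, and placement types.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class SceneGraph:
    arch_nodes: List[str] = field(default_factory=list)                # V_arch
    major_nodes: List[str] = field(default_factory=list)               # V_major
    parent_edges: List[Tuple[str, str]] = field(default_factory=list)  # E_parent as (child, parent)
    minor_sidecar: Dict[str, List[str]] = field(default_factory=dict)  # M_minor: host -> minor objects


def build_skeleton(description: Dict[str, Any]) -> SceneGraph:
    """Build the skeleton without any model call, so the result is reproducible."""
    graph = SceneGraph()
    for element in description["architecture"]:
        graph.arch_nodes.append(element["id"])
    for obj in description["objects"]:
        if obj["placement"] == "minor":
            # Minor objects go to the sidecar, keyed by the object hosting them.
            graph.minor_sidecar.setdefault(obj["parent"], []).append(obj["id"])
        else:
            graph.major_nodes.append(obj["id"])
            graph.parent_edges.append((obj["id"], obj["parent"]))
    return graph


if __name__ == "__main__":
    d1 = {
        "architecture": [{"id": "wall_north"}, {"id": "floor"}],
        "objects": [
            {"id": "bed_1", "placement": "floor", "parent": "floor"},
            {"id": "nightstand_1", "placement": "floor", "parent": "floor"},
            {"id": "lamp_1", "placement": "minor", "parent": "nightstand_1"},
        ],
    }
    print(build_skeleton(d1))
```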

3. Layout Code Generation (Stages 3-4)

Generates a coarse layout program $C_{layout}$ in which objects are instantiated as bounding-box proxies.

  • Stage 3: Major Layout with Visual Feedback. Implements a render–critique–revise loop:
    • Initial code: $C^{(0)} = \text{Generate}(I, D_1, G)$
    • Render: $R^{(t)} = \text{Render}(C^{(t-1)})$
    • Critique: $(A^{(t)}, s_t) = \text{Critique}(I, R^{(t)}, G)$ (the VLM provides feedback $A^{(t)}$ and score $s_t$)
    • Sanitize: $\tilde{A}^{(t)} = \text{Sanitize}(A^{(t)}, D_1, G)$
    • Revise: $C^{(t)} = \text{Revise}(C^{(t-1)}, \tilde{A}^{(t)})$
    The loop runs for a maximum of $T_{max} = 5$ iterations or until $s_t \ge s^\star$. Output is $C^{major}_{layout}$.
  • Stage 4: Auxiliary Layout. Appends wall-mounted objects and visually salient minor objects $M^{\star}_{minor}$ (e.g., rugs, large plants) to $C^{major}_{layout}$, resulting in the final $C_{layout}$.
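The render, critique, sanitize, revise loop of Stage 3 can be sketched as a bounded iteration with early stopping. Everything below is a toy under stated assumptions: the layout "code" is a dict of 2D positions, the `toy_*` functions replace the renderer and the VLM, and the threshold `s_star=0.9` is an arbitrary value because the summary does not report $s^\star$.

```python
from typing import Callable, Dict, List, Tuple

Layout = Dict[str, Tuple[float, float]]   # toy stand-in for layout code
Fix = Tuple[str, Tuple[float, float]]     # (object id, suggested position)


def refine_layout(code: Layout, render: Callable, critique: Callable,
                  sanitize: Callable, revise: Callable,
                  t_max: int = 5, s_star: float = 0.9) -> Tuple[Layout, List[float]]:
    """Run at most t_max rounds, stopping early once the score reaches s_star."""
    history: List[float] = []
    for _ in range(t_max):
        rendering = render(code)                   # R^(t) from C^(t-1)
        feedback, score = critique(rendering)      # (A^(t), s_t)
        history.append(score)
        if score >= s_star:
            break
        code = revise(code, sanitize(feedback))    # C^(t)
    return code, history


# Toy components (hypothetical): the critic knows the target positions.
TARGET: Layout = {"bed": (1.0, 2.0), "desk": (3.0, 0.5), "sofa": (0.0, 4.0)}
KNOWN_IDS = set(TARGET)


def toy_render(code: Layout) -> Layout:
    return dict(code)


def toy_critique(rendering: Layout) -> Tuple[List[Fix], float]:
    fixes = [(k, v) for k, v in TARGET.items() if rendering.get(k) != v]
    score = 1.0 - len(fixes) / len(TARGET)
    # Prepend a suggestion about an object that does not exist in the scene.
    return [("ghost", (9.0, 9.0))] + fixes, score


def toy_sanitize(feedback: List[Fix]) -> List[Fix]:
    # Drop feedback that refers to objects outside the known scene.
    return [fix for fix in feedback if fix[0] in KNOWN_IDS]


def toy_revise(code: Layout, fixes: List[Fix]) -> Layout:
    revised = dict(code)
    if fixes:                       # apply one fix per round
        name, position = fixes[0]
        revised[name] = position
    return revised


if __name__ == "__main__":
    start = {"bed": (0.0, 0.0), "desk": (3.0, 0.5), "sofa": (1.0, 1.0)}
    final, scores = refine_layout(start, toy_render, toy_critique,
                                  toy_sanitize, toy_revise)
    print(final, scores)
```

The sanitize step matters in this toy because the critic's first suggestion targets a non-existent object; without filtering, the revision would add it to the layout.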

4. Object-level Code Generation (Stages 5-7)

  • Stage 5: Layout-grounded Object Profiling. Parses $C_{layout}$ and uses the VLM, conditioned on $I$ and memory, to infer fine-grained object descriptions $D_{FU}$ (color, material, function, style).
  • Stage 6: Object Geometry Replacement. For each object $o_i$, a geometry agent predicts a semantic 3D primitive decomposition: $P_i = \Phi_{geo}(o_i, d_i) = \{p_{i,j}\}_{j=1}^{K_i}$, where $d_i \in D_{FU}$. Proxy constructors in $C_{layout}$ are replaced with part-based constructors to create $C_{geom}$.
  • Stage 7: Asset Retrieval for Tiny Objects. For complex small items, simple placeholders are created and then replaced by retrieved assets $b^\star$ from a library $B$, selected via: $b^\star = \arg\max_{b \in B} \text{match}(b; \text{label}, \text{description}, \text{placeholder size})$
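The Stage 7 selection rule is an argmax over a library under some matching score. The summary does not define `match`, so the scoring below (exact label match, word overlap with the description, and a size-ratio term) is purely an assumed example, and the `Asset` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple

Size = Tuple[float, float, float]   # width, depth, height in metres, all > 0


@dataclass(frozen=True)
class Asset:
    name: str
    label: str
    tags: FrozenSet[str]
    size: Size


def match(asset: Asset, label: str, description: str, size: Size) -> float:
    """Assumed scoring: label match + Jaccard word overlap + worst-axis size ratio."""
    label_score = 1.0 if asset.label == label else 0.0
    words = frozenset(description.lower().split())
    union = asset.tags | words
    text_score = len(asset.tags & words) / len(union) if union else 0.0
    size_score = min(min(a, p) / max(a, p) for a, p in zip(asset.size, size))
    return label_score + text_score + size_score


def retrieve(library: Sequence[Asset], label: str, description: str, size: Size) -> Asset:
    # b* = argmax over the library of match(b; label, description, placeholder size)
    return max(library, key=lambda b: match(b, label, description, size))


if __name__ == "__main__":
    library = [
        Asset("mug_white", "mug", frozenset({"white", "ceramic", "mug"}), (0.08, 0.08, 0.10)),
        Asset("vase_tall", "vase", frozenset({"glass", "tall", "vase"}), (0.10, 0.10, 0.30)),
        Asset("mug_giant", "mug", frozenset({"white", "ceramic", "mug"}), (0.50, 0.50, 0.60)),
    ]
    best = retrieve(library, "mug", "white ceramic mug", (0.09, 0.09, 0.10))
    print(best.name)
```

Including the placeholder size in the score keeps a correctly labelled but badly scaled asset (here `mug_giant`) from winning over one that fits the reserved space.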

5. Interior Decoration Code Generation (Stages 8-10)

A geometry-preserving code rewriting pipeline: $C_{obj} \xrightarrow{ApplyMat} C_{mat} \xrightarrow{ApplyTex} C_{tex} \xrightarrow{RenderSetup} C_{raw}$.

  • Stage 8: Material Assignment. Assigns part-level PBR materials based on part dictionary and descriptions.
  • Stage 9: Texture and Decorative Surfaces. Uses an image generation model to synthesize texture maps for large surfaces (walls, floors) and injects them.
  • Stage 10: Lighting, Rendering, and Post-hoc Correction. The VLM infers a lighting style, which is translated into Blender lights. A deterministic correction pass fixes common issues and performs a local search for movable objects with boundary violations: $x_i^\star = \arg\min_{x \in N(\hat{x}_i)} \|x - \hat{x}_i\|_2 \quad \text{s.t.} \quad B(o_i, x) \subseteq B_{room}, \quad B(o_i, x) \cap B(o_j) = \emptyset$, where $\hat{x}_i$ is the generated position and $N(\hat{x}_i)$ is a local neighborhood. The final program is $C = \text{PostHoc}(C_{raw})$.
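The constrained local search in Stage 10 can be illustrated with a brute-force grid search over a square neighborhood. This sketch assumes 2D axis-aligned footprints and a grid-shaped $N(\hat{x}_i)$; the summary specifies neither the box representation nor the shape or resolution of the neighborhood, and all function names are hypothetical.

```python
from itertools import product
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]
EPS = 1e-9                                # tolerance for float comparisons


def box_at(center: Point, size: Point) -> Box:
    (cx, cy), (w, d) = center, size
    return (cx - w / 2, cy - d / 2, cx + w / 2, cy + d / 2)


def inside(box: Box, room: Box) -> bool:
    return (box[0] >= room[0] - EPS and box[1] >= room[1] - EPS
            and box[2] <= room[2] + EPS and box[3] <= room[3] + EPS)


def overlaps(a: Box, b: Box) -> bool:
    # Boxes that merely touch along an edge do not count as overlapping.
    return (a[0] < b[2] - EPS and b[0] < a[2] - EPS
            and a[1] < b[3] - EPS and b[1] < a[3] - EPS)


def correct_position(x_hat: Point, size: Point, room: Box, others: Sequence[Box],
                     radius: float = 1.0, step: float = 0.25) -> Optional[Point]:
    """Nearest grid point to x_hat that is inside the room and hits no other box."""
    n = int(round(radius / step))
    best, best_d2 = None, float("inf")
    for i, j in product(range(-n, n + 1), repeat=2):
        x = (x_hat[0] + i * step, x_hat[1] + j * step)
        box = box_at(x, size)
        if not inside(box, room) or any(overlaps(box, o) for o in others):
            continue
        d2 = (i * step) ** 2 + (j * step) ** 2
        if d2 < best_d2:
            best, best_d2 = x, d2
    return best   # None if the neighborhood has no feasible position


if __name__ == "__main__":
    room = (0.0, 0.0, 4.0, 4.0)
    neighbour = (1.0, 1.5, 2.0, 2.5)
    # A 1 m x 1 m object centred at x = 0.25 pokes through the left wall.
    print(correct_position((0.25, 2.0), (1.0, 1.0), room, [neighbour]))
```

Because the search minimizes distance to the generated position, an object that is already feasible stays exactly where the model placed it.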

Empirical Validation / Results

1. Benchmark for Top-down Image to 3D Room

A test suite of 41 diverse scenes (residential and specialized, simple to hard, various image styles) was created. Human annotators corrected coarse labels from an initial CaR run to establish annotations. The benchmark evaluates VLMs across four aspects:

Table 1: Benchmark results evaluating different VLMs with and without the Code-as-Room harness.

| VLM | Visual Understanding (Obj. Recall ↑) | Spatial Reasoning (Func. Acc. ↑) | Code Generation (Self Overlap ↓) | Scene Quality (Layout IoU ↑) |
|---|---|---|---|---|
| Gemini3.1-pro [9] | 17.8% | 15.3% | 8.4% | 16.8% |
| GPT-5.5 [18] | 42.2% | 71.7% | 14.5% | 46.2% |
| Gemini3-flash w/CaR [8] | 58.9% | 88.42% | 2.57% | 72.0% |
| Gemini3.1-pro w/CaR [9] | 55.5% | 84.3% | 3.3% | 73.2% |
| GPT-5.5 w/CaR [18] | 67.5% | 72.54% | 10.5% | 66.7% |

Key Findings:

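  • The CaR harness improves every reported metric over direct single-pass generation for both backbones that have a single-pass baseline. Gains are largest for Gemini3.1-pro (e.g., Layout IoU rises from 16.8% to 73.2%) and smallest for GPT-5.5 functional accuracy (71.7% to 72.54%).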
  • Gemini models become the most stable when equipped with CaR, achieving near-perfect agent completion and execution rates.
  • GPT-5.5 with CaR achieves the highest object recall but has lower completion/execution rates than Gemini variants.
  • Qualitative results (Figs. 3, 4, 5) show CaR produces more complete, structurally sound, and spatially consistent rooms compared to direct generation.
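The summary reports Object Recall and Layout IoU without defining them. One plausible reading, shown here only as an assumption, treats each object as a top-down axis-aligned footprint, counts an object as recalled when its identifier appears in the prediction, and averages per-object IoU over the ground-truth objects (missing objects score zero). The function names and the matching-by-identifier rule are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned footprints."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def object_recall(pred: Dict[str, Box], gt: Dict[str, Box]) -> float:
    # Fraction of ground-truth objects that appear in the prediction.
    return len(set(pred) & set(gt)) / len(gt)


def layout_iou(pred: Dict[str, Box], gt: Dict[str, Box]) -> float:
    # Mean IoU over ground-truth objects; a missing object contributes 0.
    total = sum(box_iou(pred[k], gt[k]) if k in pred else 0.0 for k in gt)
    return total / len(gt)


if __name__ == "__main__":
    gt = {"bed": (0.0, 0.0, 2.0, 2.0), "desk": (3.0, 0.0, 4.0, 1.0)}
    pred = {"bed": (1.0, 0.0, 3.0, 2.0)}   # bed shifted by 1 m, desk missing
    print(object_recall(pred, gt), layout_iou(pred, gt))
```

Under this reading a method can have high recall but low Layout IoU if it finds the right objects and places them poorly, which is why the two columns are worth reading together.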

2. Human Evaluation

Twenty experts rated scenes on similarity, usability, lighting alignment, and acceptability (need for only minor corrections).

Table 2: Human evaluation of overall scene quality.

| Method | Sim. ↑ | Use. ↑ | Light ↑ | Accept. ↑ |
|---|---|---|---|---|
| (a) Direct Generation Baselines | | | | |
| Gemini3.1-Pro / Single-pass [9] | 2.0 | 0.0 | 4.0 | 1.0 |
| GPT-5.5 / Single-pass [18] | 7.0 | 6.0 | 6.5 | 5.0 |
| VIGA [32] | 5.5 | 4.5 | 8.0 | 4.0 |
| (b) Code-as-Room Variants | | | | |
| CaR w/ GPT-5.5 [18] | 7.5 | 7.0 | 8.0 | 6.5 |
| CaR w/ Gemini3-Flash [8] | 8.5 | 8.0 | 8.0 | 7.5 |
| CaR w/ Gemini3.1-Pro [9] | 9.0 | 8.0 | 8.0 | 7.5 |

Key Findings:

  • CaR variants consistently outperform baselines in similarity, usability, and acceptability.
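  • CaR with Gemini3.1-Pro achieves the highest similarity (9.0) and ties CaR with Gemini3-Flash on usability (8.0), lighting (8.0), and acceptability (7.5), indicating its scenes are most aligned with the input and usable with minimal correction.
  • VIGA matches the CaR variants on lighting (8.0), but its lower similarity (5.5) and usability (4.5) indicate weaker layout preservation.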

3. Scene Re-rendering

The generated 3D scenes (with primitive-based geometry) provide strong structural priors. Using GPT-5.5 for image-level re-rendering of these scenes yields more realistic materials and lighting while preserving the original layout and multi-view consistency (Fig. 7), demonstrating the utility of CaR's output as a prior for visual refinement.

4. Ablation Studies

Ablations analyze the memory mechanism and visual feedback loop using Gemini3.1-Pro.

Table 3: Ablation study on different components of Code-as-Room.

| Configuration | Obj. Recall ↑ | Layout IoU ↑ | Rotation Acc. ↑ |
|---|---|---|---|
| (a) Effect of Memory Mechanism | | | |
| w/o Memory | 48.2% | 58.0% | 88.4% |
| Full Model (Ours) | 55.5% | 73.2% | 93.6% |
| (b) Effect of Visual Feedback Iterations | | | |
| w/o Visual Feedback (0 iter.) | 33.8% | 64.0% | 71.9% |
| Feedback × 3 | 35.6% | 65.7% | 73.2% |
| Feedback × 5 (Ours) | 38.4% | 66.2% | 75.4% |
| Feedback × 10 | 39.1% | 64.2% | 72.6% |

Key Findings:

  • Removing memory degrades all metrics, especially Layout IoU (73.2% to 58.0%, a drop of 15.2 percentage points), showing its critical role in maintaining cross-stage spatial consistency.
  • Visual feedback improves all three metrics up to 5 iterations, correcting omissions and errors. At 10 iterations object recall still rises slightly (38.4% to 39.1%), but Layout IoU (66.2% to 64.2%) and rotation accuracy (75.4% to 72.6%) fall, suggesting layout drift or over-correction.

Theoretical and Practical Implications

  • Paradigm Shift: Establishes a robust, image-guided paradigm for 3D scene synthesis, moving beyond text-only or unstable agent-based methods.
  • Agentic System Design: Demonstrates the importance of a structured execution harness and persistent memory for complex, multi-step generation tasks, providing a blueprint for building reliable MLLM-based creative agents.
  • Code as a 3D Representation: Validates executable code as a powerful, editable, and interpretable representation for complex 3D scenes, bridging high-level vision understanding with low-level graphics implementation.
  • Practical Applications: The generated editable Blender scenes are directly usable in interior design, VR/AR content creation, game development, and as structured environments for training and testing embodied AI agents.

Conclusion

Code-as-Room presents an effective MLLM-based agentic framework for generating executable 3D room code from top-down images. Its structured multi-stage pipeline, cross-stage memory, and visual feedback loops address the instability of prior methods. The introduced benchmark and comprehensive experiments validate that the proposed harness significantly enhances generation quality and reliability across different MLLM backbones.

Limitations and Future Work:

  1. Currently optimized for top-down views; extending to arbitrary-view inputs would increase general applicability.
  2. Procedural code generation has limits for geometrically complex objects; improved asset retrieval or generation is needed for higher fidelity.
  3. While re-rendering improves visuals, current video models struggle with long, temporally consistent re-rendering.
  4. Future Direction: Exploring video generation models as neural renderers for CaR to produce more realistic and coherent scene visualizations.