CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Summary (Overview)

  • Code-as-CoT Framework: Introduces CoCo, a novel text-to-image generation framework that uses executable code as an explicit Chain-of-Thought (CoT) intermediate representation. This replaces abstract natural-language planning with deterministic, verifiable code specifying spatial layouts and structural elements.
  • Three-Stage Pipeline: The generation process is decomposed into: 1) Code Generation from the text prompt, 2) Draft Image Rendering by executing the code in a sandbox, and 3) Draft-Guided Refinement to enhance visual fidelity while preserving the draft's semantic structure.
  • CoCo-10K Dataset: Constructs a curated dataset of over 10K samples with Text-Code pairs and Text-Draft Image-Final Image triplets. This dataset teaches the model both structured draft construction (via code) and corrective visual refinement.
  • Superior Performance: Empirical evaluations show CoCo substantially outperforms direct generation and other CoT-based methods on structured and text-intensive benchmarks. It achieves improvements of +68.83% on StructT2IBench, +54.8% on OneIG-Bench, and +41.23% on LongText-Bench.
  • Enhanced Controllability: The code-based reasoning enables precise control over layouts, object placement, and text rendering, leading to high accuracy in generating charts, graphs, mathematical figures, and tables.

Introduction and Theoretical Foundation

Recent Unified Multimodal Models (UMMs) have advanced text-to-image (T2I) generation, often integrating Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements (e.g., scientific diagrams), and dense textual content.

Natural language is inherently ambiguous and cannot explicitly encode precise coordinates, constraints, or relationships. This leads to models struggling with prompts like "a 2D plot of $y = x^2$", producing incorrect structures or illegible text. In contrast, executable code offers a deterministic and verifiable form of planning. Code can explicitly specify layouts, constraints, and placements. Executing this code produces a concrete draft image that instantiates the reasoning outcome, making the planned structure directly observable. This draft then serves as a visual scaffold for targeted refinement.
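To make the contrast concrete, here is a minimal, dependency-free sketch of what such a deterministic draft program might look like for "a 2D plot of $y = x^2$" (the paper does not specify its rendering toolkit; this stand-in emits SVG via the standard library, with every coordinate computed explicitly):

```python
# A draft for "a 2D plot of y = x^2" written as an explicit, verifiable program.
# Every coordinate is computed deterministically; nothing is left ambiguous.

def parabola_svg(width=200, height=200, x_range=(-3.0, 3.0)):
    xmin, xmax = x_range
    pts = []
    for i in range(101):
        x = xmin + (xmax - xmin) * i / 100
        y = x * x
        # Map math coordinates to pixel coordinates (SVG's y-axis grows downward).
        px = (x - xmin) / (xmax - xmin) * width
        py = height - y / (xmax * xmax) * height
        pts.append(f"{px:.1f},{py:.1f}")
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            f'<polyline fill="none" stroke="black" points="{" ".join(pts)}"/>'
            f'</svg>')
```

Unlike a natural-language plan ("draw a U-shaped curve"), this program can be executed, checked, and re-run to produce the identical draft every time.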

The paper proposes that executable code can serve as a more precise and verifiable form of CoT, bridging the gap between semantic intent and visual realization for structured generation. This motivates the development of CoCo (Code-as-CoT).

Methodology

The framework builds upon the UMM Bagel (Deng et al., 2025), which uses a SigLIP ViT encoder for understanding, a VAE encoder for generation, and a Mixture-of-Transformer-Experts (MoT) with specialized branches for VAE tokens (generation) and ViT/text tokens (understanding). CoCo's pipeline, illustrated in Figure 2, consists of three stages:

  1. Code Generation: Given a user prompt $p$, the model generates executable code $c$ that deterministically specifies the core semantic structure of the target image (layouts, object relationships, text, canvas configuration). It focuses on essential structure, deferring fine-grained details to refinement.
  2. Draft Image Rendering: The code $c$ is executed in a sandbox environment to produce a draft image $I_d$. This step concretely visualizes the programmatic reasoning.
  3. Draft-Guided Refinement: The draft $I_d$ provides an accurate structural scaffold but may be visually simplistic. To produce the final high-fidelity image, $I_d$ is encoded by both the ViT encoder (for high-level semantic features) and the VAE encoder (for low-level details) and fed back into the UMM. The model then performs fine-grained image editing to enhance realism while preserving the draft's structural semantics.
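The three stages can be sketched end to end as follows. All function bodies are hypothetical stand-ins (CoCo's stages 1 and 3 are the UMM itself), but the control flow mirrors the pipeline: generate code, execute it in an isolated process, then refine the resulting draft:

```python
import os
import subprocess
import sys
import tempfile

def generate_code(prompt: str) -> str:
    # Stage 1 stand-in: in CoCo this is the UMM's autoregressive code output.
    # Here we return a trivial program that writes a placeholder draft file.
    return 'open("draft.png", "w").write("draft for: %s")' % prompt

def render_draft(code: str, out_path: str = "draft.png", timeout: int = 30) -> bool:
    # Stage 2: execute the generated code in a separate interpreter process
    # (a crude stand-in for the paper's sandbox) and check a draft was produced.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script = f.name
    try:
        proc = subprocess.run([sys.executable, script],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0 and os.path.exists(out_path)
    finally:
        os.remove(script)

def refine(draft_path: str) -> str:
    # Stage 3 stand-in: the UMM edits the draft into the high-fidelity final image.
    return draft_path.replace("draft", "final")
```

Running the code out-of-process also gives the verifiability the paper emphasizes: a draft either renders or it fails loudly, before any pixels are committed.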

CoCo-10K Dataset Construction

Existing datasets lack supervision for code generation and fine-grained correction signals. CoCo-10K is constructed to address this, targeting three atomic correction capabilities. Its construction pipeline is shown in Figure 4.

  • Editing Dataset (2.5K samples): Built from StructVisuals (Zhuo et al., 2025). Uses original (A-Image) and corrected (B-Image) chart pairs to teach the model to preserve structured perception and perform precise corrections without disrupting the underlying layout.
  • Synthesis Dataset (7.5K samples): Focuses on generating complex structured visuals (scientific diagrams, dense text images). A diverse set of prompts is synthesized. Gemini-3-Pro generates corresponding code, which is executed in a sandbox to render the A-Image (draft). Nano Banana then refines this draft, conditioned on the prompt and A-Image, to produce the high-fidelity B-Image (final). This creates paired supervision: Prompt → Code → Draft (A-Image) → Final (B-Image).

The final training set is organized into two formats:

  1. Text–Code pairs
  2. Text–Draft Image–Final Image triplets
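A record in each of the two formats might be laid out as below; the field names are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TextCodePair:
    # Format 1: teaches structured draft construction via code.
    prompt: str
    code: str            # executable program specifying layout, text, canvas

@dataclass
class TextDraftFinalTriplet:
    # Format 2: teaches corrective, draft-guided refinement.
    prompt: str
    draft_image: str     # A-Image: rendered from code in the sandbox
    final_image: str     # B-Image: high-fidelity refinement of the draft
```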

Training Loss

The model is fine-tuned from Bagel. Each training sequence is assembled in order from: prompt tokens, ViT features of the draft image, VAE features of the draft image, and noisy VAE tokens of the final image. Two loss functions are used:

  • Code Loss: Token-level cross-entropy loss on the generated code.

    $$L_{\text{code}} = -\frac{1}{|v|} \sum_{i=1}^{|v|} \log(v_i)$$

    where $v$ is the sequence of code tokens and $v_i$ denotes the model's predicted probability of the $i$-th ground-truth token.

  • Final Image Loss: Mean Squared Error (MSE) on the VAE tokens, following Rectified Flow.

    $$L_{\text{final image}} = \mathbb{E}_{t,\,x_0,\,x_1} \left[ \left\| m(t, x_t) - (x_1 - x_0) \right\|^2 \right]$$
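Assuming the standard Rectified Flow interpolation $x_t = (1-t)\,x_0 + t\,x_1$ with velocity target $x_1 - x_0$, the two objectives can be sketched in plain Python (function and argument names are illustrative):

```python
import math

def code_loss(token_probs):
    # L_code: mean negative log-probability assigned to the ground-truth
    # code tokens v_1 .. v_|v|.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def rectified_flow_loss(model, x0, x1, t):
    # L_final: MSE between the predicted velocity m(t, x_t) and the
    # straight-line target x1 - x0, where x_t = (1 - t) * x0 + t * x1.
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    pred = model(t, xt)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

A model that perfectly predicts the constant straight-line velocity drives the flow loss to zero, which is exactly the property Rectified Flow exploits for few-step sampling.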

Training involves a preliminary text-to-code fine-tuning stage to equip the model with basic code generation capability, followed by full-parameter fine-tuning for 16K steps.

Empirical Validation / Results

Evaluations are conducted on three benchmarks: StructT2IBench (structured visuals), OneIG-Bench (text rendering), and LongText-Bench (long-form text generation).

Quantitative Results on StructT2IBench

Table 1 compares CoCo with closed-source and open-source models, including generation-only models, UMMs, and UMMs with CoT planning.

| Model | Chart ↑ | Graph ↑ | Math ↑ | Puzzle ↑ | Science ↑ | Table ↑ | Overall ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Closed-source models | | | | | | | |
| Seedream 4.0 | 35.79 | 54.08 | 63.33 | 50.89 | 62.59 | 68.94 | 47.52 |
| Nano Banana | 35.55 | 58.96 | 64.81 | 63.87 | 60.75 | 67.20 | 48.45 |
| GPT-Image | 37.09 | 57.00 | 63.25 | 59.42 | 60.94 | 83.31 | 49.58 |
| Open-source models | | | | | | | |
| UniWorld-V1 | 1.71 | 5.52 | 4.72 | 1.58 | 8.82 | 5.25 | 3.20 |
| Bagel | 4.66 | 3.61 | 4.02 | 4.46 | 8.60 | 5.74 | 4.69 |
| Bagel-Think | 4.81 | 15.33 | 13.89 | 15.22 | 19.05 | 8.97 | 9.03 |
| HiDream-I1-Full | 9.47 | 20.84 | 19.20 | 18.00 | 26.77 | 27.05 | 14.77 |
| OmniGen2 | 10.67 | 22.51 | 22.89 | 18.63 | 28.00 | 22.61 | 16.24 |
| FLUX.1 Dev | 12.35 | 20.09 | 19.86 | 20.63 | 25.25 | 27.00 | 16.51 |
| FLUX.1 Kontext | 17.22 | 24.64 | 21.42 | 24.06 | 30.97 | 29.16 | 20.36 |
| Ovis-U1 | 24.75 | 16.08 | 19.45 | 21.23 | 26.03 | 12.70 | 22.83 |
| Qwen-Image | 32.23 | 48.05 | 46.98 | 48.90 | 53.51 | 73.65 | 41.03 |
| CoCo (Ours) | 79.44 | 62.58 | 69.12 | 49.10 | 58.81 | 79.15 | 73.52 |

Key Findings:

  • CoCo achieves an overall accuracy of 73.52%, significantly surpassing the best baseline (GPT-Image, 49.58%) by +68.83%.
  • It achieves top performance on structurally demanding tasks: Chart (79.44%), Graph (62.58%), Math (69.12%), and Table (79.15%), demonstrating the effectiveness of code-based reasoning for precise layouts.

Quantitative Results on Text Rendering Benchmarks

Table 2 compares text rendering capabilities on OneIG-Bench and LongText-Bench.

| Method | OneIG-Bench | | | LongText-Bench | | |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| | English | Chinese | Overall | English | Chinese | Overall |
| Gen. Only Models | | | | | | |
| FLUX.1-dev | 0.523 | - | - | 0.607 | 0.005 | 0.306 |
| HiDream-I1 | 0.707 | 0.205 | 0.456 | 0.543 | 0.024 | 0.284 |
| Kolors 2.0 | 0.427 | 0.502 | 0.465 | 0.258 | 0.329 | 0.294 |
| Unified Models | | | | | | |
| Janus-Pro | 0.019 | 0.015 | 0.017 | 0.019 | 0.006 | 0.013 |
| BLIP3-o | 0.013 | 0.092 | 0.053 | 0.021 | 0.018 | 0.020 |
| OmniGen2 | 0.680 | - | - | 0.561 | 0.059 | 0.310 |
| Show-o2 | 0.002 | - | - | 0.006 | 0.002 | 0.004 |
| BAGEL | 0.244 | 0.365 | 0.305 | 0.373 | 0.310 | 0.342 |
| GPT-4o | 0.857 | 0.650 | 0.754 | 0.956 | 0.619 | 0.788 |
| Unified Model w/ CoT | | | | | | |
| BAGEL-thinking | 0.020 | 0.127 | 0.074 | 0.068 | 0.105 | 0.087 |
| CoCo (Ours) | 0.895 | 0.811 | 0.853 | 0.755 | 0.753 | 0.754 |

Key Findings:

  • On OneIG-Bench, CoCo scores 0.853 overall, outperforming all compared methods.
  • On LongText-Bench, CoCo scores 0.754 overall, close behind GPT-4o (0.788) and far ahead of every open-source baseline, indicating that code-based representations enable reliable handling of long, complex textual instructions.

Qualitative Results

Figure 6 provides a contrastive visualization. CoCo generates accurate layouts and high-fidelity text (e.g., correct mathematical plots, legible menu text, proper advertisement layout), outperforming baselines like Bagel and Bagel-Thinking. It also supports adaptive aspect ratios (e.g., wider layouts for posters, square canvases for charts), indicating the model learns to dynamically adjust canvas parameters based on prompt semantics, not just memorizing a fixed configuration.

Ablation Study

Effect of Training Mixture Ratios ($r_c$): Table 3 examines the proportion of Text–Code supervision ($r_c$) versus Text–Draft–Final supervision.

| Method | $r_c$ | English | Chinese |
| :--- | :---: | :---: | :---: |
| Bagel | - | 0.373 | 0.310 |
| CoCo | 0.20 | 0.724 | 0.667 |
| CoCo | 0.10 | 0.733 | 0.671 |
| CoCo | 0.05 | 0.755 | 0.753 |

Key Findings:

  • Performance improves as $r_c$ decreases. The best results are at $r_c = 0.05$.
  • This indicates only a small amount of code supervision is needed to induce structured reasoning, while the dominant training signal should come from draft-to-final refinement data to support faithful and semantically accurate rendering.

Is Text-Code Supervision Necessary? Figure 7 compares code executability between the off-the-shelf Bagel model and CoCo on LongText-Bench.

  • Bagel: Without Text–Code training, only 9.06% (29/320) of generated programs compile successfully.
  • CoCo: After fine-tuning with the CoCo pipeline, achieves a 100% compilation success rate.

This confirms that Text–Code supervision is essential for teaching the model to generate executable code, which is a prerequisite for stable rendering.
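The executability metric can be approximated with Python's built-in `compile` (whether the paper checks compilation only or full sandbox execution is not specified; this is a sketch of the former):

```python
def compile_rate(programs):
    """Fraction of generated programs that parse as valid Python."""
    ok = 0
    for src in programs:
        try:
            # mode="exec" compiles a whole module; SyntaxError means the
            # generated program could never have rendered a draft.
            compile(src, "<generated>", "exec")
            ok += 1
        except SyntaxError:
            pass
    return ok / len(programs)
```

On a batch of 320 generated programs, 29 successes would yield the 9.06% rate reported for the untrained Bagel baseline.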

Theoretical and Practical Implications

  • Precision and Controllability: Code-as-CoT provides a deterministic, verifiable intermediate representation, enabling explicit control over spatial layouts, object relationships, and text placement. This addresses the inherent ambiguity of natural-language CoT.
  • Interpretable Reasoning: The executable code makes the reasoning process transparent and inspectable. The draft image serves as a concrete, observable outcome of the planning stage.
  • Structured Generation Paradigm: CoCo establishes a new paradigm for generating complex structured visuals (charts, diagrams, text-heavy images) by decomposing the problem into structured planning (code) and detail refinement (editing).
  • Dataset Design Principle: The construction of CoCo-10K demonstrates the importance of paired supervision (code-draft-final) for teaching models both layout planning and selective editing, correcting errors while preserving correct structures.
  • Generalization and Flexibility: The model learns to adapt canvas size and layout parameters dynamically based on prompt semantics, indicating it internalizes the programmatic nature of the reasoning rather than memorizing fixed outputs.

Conclusion

CoCo introduces a code-driven reasoning framework that uses executable code as Chain-of-Thought for text-to-image generation. It replaces abstract natural-language planning with structured code that explicitly specifies layouts and elements. This code is executed to produce a draft image, which is then refined via fine-grained editing into the final high-fidelity result.

The supporting CoCo-10K dataset provides the necessary supervision for learning both code generation and draft-guided refinement. Extensive experiments demonstrate substantial improvements on challenging structured and text-intensive benchmarks, outperforming direct generation and other CoT-based methods.

The results highlight executable code as a reliable intermediate reasoning representation for precise, controllable, and structured text-to-image generation, offering a promising direction for enhancing the capabilities of Unified Multimodal Models.