Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Summary (Overview)

  • Novel Agentic Paradigm: Unify-Agent reframes text-to-image (T2I) generation as an active, inference-time sequential decision process (Think, Research, Recaption, Generate) to address the limitation of relying solely on frozen parametric knowledge for generating long-tail and knowledge-intensive concepts.
  • Architectural Synergy: The unified multimodal model (UMM) architecture creates a mutually reinforcing synergy between understanding and generation. The joint availability of low-level generative latents (from the VAE) and high-level semantic tokens (from the ViT) enables superior multimodal reasoning during evidence recaptioning.
  • Superior Performance: Unify-Agent significantly outperforms its base model (Bagel) and other open-source unified models across multiple factual benchmarks (FactIP, WiSE, KiTTEN, T2I-FactualBench), demonstrating strong world knowledge capabilities approaching leading commercial models.
  • Comprehensive Benchmark: The paper introduces FactIP, a curated benchmark of 2,462 prompts covering 12 categories of culturally significant and long-tail factual concepts, explicitly designed to evaluate identity consistency and factual faithfulness in image generation.
  • Data-Driven Training: A tailored multimodal data pipeline is constructed, curating 143K high-quality agent trajectories for supervised fine-tuning, enabling effective supervision over the full agentic generation process.

Introduction and Theoretical Foundation

Recent advances in T2I generation have improved visual realism but struggle with faithfully depicting entities grounded in the real world (e.g., real people, cultural symbols, rare IPs). This requires factual and visual fidelity beyond visual plausibility. Unified Multimodal Models (UMMs) unify visual understanding and image generation but are limited by their static parametric knowledge, making them prone to hallucination on rare, long-tail concepts.

The core challenge is that failures on out-of-distribution concepts stem from missing world knowledge, not insufficient visual fidelity. Therefore, the paper argues for moving from closed-book generation (relying on parametric memory) to open-book, agentic generation (accessing external knowledge at inference time).

Existing agentic T2I systems are brittle, multi-stage pipelines that loosely connect an LLM planner, retrieval tools, and a standalone image generator. This decoupling makes effective evidence integration difficult, as retrieved text rarely specifies fine-grained visual attributes, and reference images may conflict with user intent.

The key insight is that generative priors may enhance multimodal understanding. In a unified model, the Vision Transformer (ViT) captures high-level semantics, while the Variational Autoencoder (VAE) provides low-level perceptual latents. Together, they enable better interpretation of visual references and conversion into precise textual specifications for generation.

Preliminary Study: A training-free study on 200 FactIP examples using the base model Bagel compared four inference settings: prompt-only, text injection, visual injection, and text+visual injection. Results (Figure 2) showed:

  • Both textual and visual knowledge improved over the baseline.
  • Visual injection yielded substantially larger gains, confirming external knowledge is beneficial.
  • Combining text and visual inputs was slightly weaker than visual injection alone, indicating naive concatenation is suboptimal.

This motivated the recaption paradigm to transform raw multimodal evidence into a unified, structured, generation-oriented description.

Problem Formulation: Standard T2I generation models the conditional $p_\theta(y \mid x)$. For world-grounded synthesis, estimating this directly is intractable due to knowledge deficits. Naively expanding the condition to $p_\theta(y \mid x, K_{\text{text}}, K_{\text{vis}})$ with raw retrieved knowledge $K$ leads to suboptimal alignment.

The paper formulates world-grounded image synthesis as an interleaved generative trajectory over an augmented state space, introducing intermediate variables: a cognitive gap assessment $g$, a textual evidence trace $\tau_t$, a visual evidence trace $\tau_v$, and an evidence-grounded recaption $c$. The joint distribution defines the holistic process:

$$p_\theta(y, c, \tau_t, \tau_v, g \mid x) = \underbrace{p_\theta(g \mid x)}_{\text{Gap Detection}} \cdot \underbrace{p_\theta(\tau_t, \tau_v \mid x, g)}_{\text{Evidence Acquisition}} \cdot \underbrace{p_\theta(c \mid x, g, \tau_t, \tau_v)}_{\text{Evidence-Grounded Recaptioning}} \cdot \underbrace{p_\theta(y \mid c, \tau_v)}_{\text{Visual Synthesis}} \tag{3}$$

This factorization rigorously defines the four cognitive phases of Unify-Agent.

Methodology

Base Model: Bagel

Unify-Agent is built upon Bagel, a UMM with a Mixture-of-Transformers (MoT) architecture integrating a ViT encoder. It disentangles capabilities through dedicated experts:

  • Multimodal Understanding: Formulated as autoregressive next-token prediction. The training objective minimizes the negative log-likelihood $\mathcal{L}_{\text{text}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, C) \tag{1}$, where $x_t$ is the target text token, $x_{<t}$ is the preceding sequence, and $C$ is the multimodal context.
  • Multimodal Generation: Formulated as a rectified flow operating in the latent space of a continuous VAE. The model learns a time-conditioned velocity field $u_\theta$ by minimizing the latent flow-matching objective $\mathcal{L}_{\text{image}} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, z_t} \| u_\theta(z_t, t; C) - u^\star(z_t, t) \|_2^2 \tag{2}$, where $t$ is the continuous timestep, $z_t$ is the latent state, and $u^\star$ is the target vector field.
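The flow-matching objective above can be made concrete with a toy numeric sketch: interpolate between a noise latent and a data latent along the standard rectified-flow path $z_t = (1-t)z_0 + t z_1$, and regress a velocity field onto the constant target $u^\star = z_1 - z_0$. The zero-velocity "network" below is a placeholder, not Bagel's actual parameterization.

```python
import numpy as np

# Toy sketch of the latent flow-matching objective (Eq. 2). The linear path
# with target u* = z1 - z0 is the standard rectified-flow setup; the
# zero-velocity "network" below is a placeholder, not a real model.

rng = np.random.default_rng(0)
z1 = rng.normal(size=(16,))        # stand-in for a VAE latent of a training image
z0 = rng.normal(size=(16,))        # Gaussian noise sample
t = rng.uniform()                  # timestep t ~ U(0, 1)

z_t = (1 - t) * z0 + t * z1        # point on the linear interpolation path
u_star = z1 - z0                   # constant target velocity field

def u_theta(z, t):                 # placeholder model: always predicts zero
    return np.zeros_like(z)

loss = float(np.mean((u_theta(z_t, t) - u_star) ** 2))   # flow-matching MSE
```

With a zero predictor the loss reduces to the mean squared norm of the target velocity, which is the quantity a trained $u_\theta$ drives down.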

Data Pipeline

Training Data Construction

The training corpus $D_{\text{SFT}} = \{(x, \tau_t, \tau_v, c)\}$ is constructed in three stages:

  1. Task Source and Prompt Collection:

    • Collected 456K knowledge-intensive IPs from 12 domains (Celebrity, Animation, Game, Comic, Mythology, Mascot, Animal, Food, Art, Toy, Landmark, Festival).
    • For each concept, retrieved webpage info, two representative seed images (ground truth), and used GPT-4o to summarize into structured metadata.
    • Generated diverse user instructions using Gemini 3 Pro, covering a spectrum of difficulty requiring both factual consistency and compositional flexibility.
  2. Multimodal Research Trace Construction:

    • Used Claude Opus 4.6 as a teacher agent to synthesize supervised agent traces.
    • Textual research trace $\tau_t$: The agent formulates a textual query $q_t \sim p_\theta(q_t \mid x, g)$, issued to an external system to obtain textual evidence $E_t = \text{Retrieve}_{\text{text}}(q_t)$.
    • Visual research trace $\tau_v$: Conditioned on $(x, g, \tau_t)$, the agent generates a visual query $q_v \sim p_\theta(q_v \mid x, g, \tau_t)$ to retrieve an initial candidate set $\tilde{E}_v = \text{Retrieve}_{\text{image}}(q_v) = \{v_1, \ldots, v_n\}$.
    • Visual Selection: Gemini 3 Flash is used as a visual evaluator to score each candidate $v_i$ along four dimensions (identity consistency, subject salience, image clarity, watermark cleanliness). The overall score is computed as $s(v_i) = \sum_{k=1}^{4} \lambda_k\, s_k(v_i \mid x, E_t) \tag{9}$. The top two images are selected as final visual evidence: $E_v = \operatorname{arg\,top2}_{v_i \in \tilde{E}_v}\, s(v_i)$.
    • The final trace is represented as $(\tau_t, \tau_v) = \langle q_t, E_t, q_v, E_v \rangle$.
  3. Evidence-Grounded Recaption Annotation:

    • A recaption $c = \mathcal{C}(x, E_t, E_v)$ is constructed to consolidate the original instruction with retrieved evidence into a generation-oriented description.
    • Generation-based Validation: The recaption $c$ and reference images $E_v$ are fed into Nano Banana Pro to synthesize an image $\hat{y} \sim p_\phi(y \mid c, E_v)$.
    • Reject-Sampling: GPT-4o judges $\hat{y}$ against the ground-truth image. If it fails identity consistency, the process is re-run (up to 5 trials). Unreliable trajectories are discarded.
    • The final set contains 143K high-quality trajectory-image pairs.
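The visual-selection step of stage 2 can be sketched as a weighted scoring pass followed by a top-2 pick. The dimension weights $\lambda_k$ (identity, salience, clarity, watermark) and the per-candidate scores $s_k$ below are invented for illustration; the paper does not report its actual weights.

```python
# Sketch of Eq. (9) and the top-2 evidence pick. All numbers here are
# illustrative assumptions, not values from the paper.

weights = [0.4, 0.2, 0.2, 0.2]       # lambda_k: identity, salience, clarity, watermark

candidates = {                        # s_k(v_i | x, E_t) per candidate image
    "v1": [0.90, 0.80, 0.70, 1.00],
    "v2": [0.50, 0.90, 0.90, 1.00],
    "v3": [0.95, 0.60, 0.80, 0.90],
}

def score(dims):                      # s(v_i) = sum_k lambda_k * s_k(v_i)
    return sum(w * s for w, s in zip(weights, dims))

# E_v = argtop2: keep the two highest-scoring candidates as visual evidence
top2 = sorted(candidates, key=lambda v: score(candidates[v]), reverse=True)[:2]
```

Weighting identity consistency most heavily reflects the benchmark's emphasis on identity fidelity, but the exact balance is a design choice the paper leaves unspecified.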

FactIP Evaluation Benchmark

  • Derived from the task pool but strictly separated from training data.
  • Contains 2,462 samples (with a lightweight 500-sample test split) selected to be long-tail, knowledge-intensive, visually grounded, and difficult for memorization-only models.
  • Evaluates generated images on four dimensions using Seed2.0 as an expert evaluator: Clarity, Content, Aesthetics, and Relevance. Relevance is prioritized, measuring identity consistency with the target IP.
  • The overall score is a weighted combination: $\text{Overall Score} = \alpha_1 \cdot \text{Clarity} + \alpha_2 \cdot \text{Content} + \alpha_3 \cdot \text{Aesthetics} + \alpha_4 \cdot \text{Relevance} \tag{24}$ with $\alpha_1 = 0.05$, $\alpha_2 = 0.10$, $\alpha_3 = 0.10$, $\alpha_4 = 0.75$, normalized to $[0, 100]$.
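For concreteness, the weighted combination in Eq. (24) can be computed directly. The $\alpha$ weights are the paper's; the per-dimension scores below are example values only, not reported results.

```python
# FactIP overall score (Eq. 24). The alpha weights are from the paper;
# the per-dimension scores are made-up example inputs.

alphas = {"clarity": 0.05, "content": 0.10, "aesthetics": 0.10, "relevance": 0.75}
scores = {"clarity": 90.0, "content": 70.0, "aesthetics": 80.0, "relevance": 60.0}

overall = sum(alphas[k] * scores[k] for k in alphas)   # already on a [0, 100] scale
```

Because relevance carries 75% of the weight, identity consistency with the target IP dominates the overall score by design.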

Unified Fine-Tuning

The model is adapted through supervised fine-tuning on interleaved multimodal trajectories. The fine-tuning objective combines the language modeling loss and the latent-space regression loss:

$$\mathcal{L}_{\text{SFT}} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{image}} \tag{14}$$

A hybrid attention masking strategy is employed to regulate information flow, preserving sequential reasoning structure while preventing noisy historical traces from interfering with final image synthesis.
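One plausible reading of this hybrid masking is sketched below: all reasoning tokens keep ordinary causal attention, while the final image-synthesis tokens are additionally restricted to the visual-evidence and recaption segments. The segment layout, lengths, and exact visibility rules are assumptions for illustration; the paper does not specify the mask construction.

```python
import numpy as np

# Hypothetical hybrid attention mask: causal everywhere, but image tokens
# may only attend to visual evidence, the recaption, and themselves.
# Segment layout and lengths are illustrative assumptions.

segments = ["prompt", "think", "text_evid", "vis_evid", "recaption", "image"]
lengths  = [4, 4, 4, 4, 4, 4]
starts = np.cumsum([0] + lengths[:-1])
total = sum(lengths)

mask = np.tril(np.ones((total, total), dtype=bool))   # causal base mask

allowed = np.zeros(total, dtype=bool)                 # columns image tokens may see
for seg in ("vis_evid", "recaption", "image"):
    i = segments.index(seg)
    allowed[starts[i]:starts[i] + lengths[i]] = True

img_s = starts[segments.index("image")]
# Image rows: keep causal ordering, but mask out think/text-evidence history
mask[img_s:] &= allowed[None, :]
```

Under this construction, noisy intermediate traces cannot leak into image synthesis, while the reasoning prefix still sees the full trajectory.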

Unify-Agent Inference Pipeline

As illustrated in Figure 4, the inference-time pipeline consists of four stages:

  1. THINK (Prompt Understanding & Gap Detection): The model interprets the prompt $x$, performs structured decomposition, and estimates a latent gap variable $g \sim p_\theta(g \mid x)$. It characterizes missing knowledge as a set of units $M(x) = \{m_1, \ldots, m_K\}$ (e.g., facial traits, signature details).

  2. RESEARCH (Sequential Multimodal Evidence Acquisition): Conditioned on $g$, the agent acquires evidence $(\tau_t, \tau_v) \sim p_\theta(\tau_t, \tau_v \mid x, g)$. It follows a sequential strategy: textual search first for semantic grounding, then visual search informed by $\tau_t$ for precise, context-compatible reference images.

  3. RECAPTION (Multimodal Grounding into Executable Specifications): Retrieved evidence is transformed into grounded constraints, not passed raw. Two complementary constraint types are derived:

    • Identity-preserving constraints: Capture visual attributes faithful to the target identity.
    • Scene-compositional constraints: Encode prompt-specified factors (pose, environment, mood). These are integrated into an evidence-grounded recaption $c \sim p_\theta(c \mid x, g, \tau_t, \tau_v)$, serving as the final executable specification.
  4. GENERATE (Evidence-Grounded Image Synthesis): The final image is synthesized as $y \sim p_\theta(y \mid c, \tau_v)$. Generation depends only on the refined recaption $c$ and visual anchors $\tau_v$, not the full reasoning history, preventing noise interference.
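The four-stage trajectory can be summarized as composed sampling. In the sketch below, every function body is a hypothetical stand-in (no real retrieval or diffusion happens); only the data flow between stages mirrors the factorization, including the fact that the final synthesis step sees only $c$ and $\tau_v$.

```python
# Each sampler below is a hypothetical stand-in for one factor of the
# trajectory; only the stage-to-stage data flow reflects the pipeline.

def detect_gap(x):                          # THINK: p(g | x)
    return {"missing": ["facial traits", "signature details"]}

def acquire_evidence(x, g):                 # RESEARCH: p(tau_t, tau_v | x, g)
    tau_t = "textual evidence"              # textual search first ...
    tau_v = ["ref_image_1", "ref_image_2"]  # ... then visual search
    return tau_t, tau_v

def recaption(x, g, tau_t, tau_v):          # RECAPTION: p(c | x, g, tau_t, tau_v)
    return f"{x}, grounded by: {tau_t}"

def synthesize(c, tau_v):                   # GENERATE: p(y | c, tau_v) only --
    return {"caption": c, "anchors": tau_v}  # no access to g or tau_t

def world_grounded_t2i(x):
    g = detect_gap(x)
    tau_t, tau_v = acquire_evidence(x, g)
    c = recaption(x, g, tau_t, tau_v)
    return synthesize(c, tau_v)

y = world_grounded_t2i("a portrait of a long-tail mascot")
```

Note that `synthesize` never receives `g` or `tau_t`: restricting the final stage to the recaption and visual anchors is what keeps noisy intermediate traces out of image synthesis.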

Empirical Validation / Results

Experimental Setup

  • Baselines: Compared against three categories: leading commercial models (Seedream, Nano Banana-2, DALLE-3, GPT-Image-1.5), generation-only models (FLUX.1-dev, SD-3.5-large, Playground-v2.5, Z-Image, Qwen-Image), and unified MLLMs (Janus-Pro-7B, Emu3.5, Echo-4o, Hunyuan-Image-3.0, Bagel).
  • Evaluation: Used MLLM-as-a-Judge paradigm. GPT-4o for WiSE, T2I-FactualBench, KiTTEN. Seed2.0 for FactIP.

Main Results

Table 1: FactIP Benchmark Results. Unify-Agent achieves the highest Overall score of 73.2 among unified MLLMs, significantly surpassing its base model Bagel (50.9) and strong generation-only baselines like FLUX.1-dev (28.9). It exhibits exceptional performance in the critical Relevance dimension (67.3 Character, 71.8 Object, 78.2 Scene), demonstrating superior identity fidelity.

| Model | Clarity | Content | Aesthetics | Relevance | Overall |
|---|---|---|---|---|---|
| **Unify-Agent (Ours)** | 92.4 | 75.8 / 76.1 / 73.6 | 83.3 / 86.0 / 86.4 | 67.3 / 71.8 / 78.2 | **73.2** |
| Bagel | 91.4 | 68.5 / 59.1 / 65.0 | 81.6 / 83.3 / 87.2 | 39.9 / 44.0 / 50.7 | 50.9 |
| FLUX.1-dev | 92.5 | 51.9 / 40.4 / 37.7 | 78.4 / 75.8 / 80.3 | 17.0 / 8.2 / 18.2 | 28.9 |
| SD-3.5-large | 85.4 | 44.3 / 23.2 / 32.1 | 75.3 / 68.1 / 74.5 | 18.0 / 5.9 / 20.2 | 27.5 |
| Nano Banana-2 | 96.6 | 86.3 / 96.0 / 86.6 | 92.2 / 92.2 / 95.3 | 85.5 / 93.9 / 89.5 | 88.5 |
Note: Content/Aesthetics/Relevance scores shown as Character/Object/Scene averages. Bold denotes best in Unified MLLM category.

Table 2: WiSE Benchmark Results. Unify-Agent attains the best Overall score (0.77) within the Unified MLLM category, exceeding Bagel+CoT (0.70). It excels in cultural (0.82), biological (0.72), and chemistry (0.70) knowledge domains.

Table 3: KiTTEN Benchmark Results. Unify-Agent sets a new state-of-the-art among open-source models with an overall score of 4.08, outperforming Imagen-3 (3.50). It achieves the highest text alignment (4.22) and entity alignment (3.93) scores.

Table 4: T2I-FactualBench Results. Within the unified MLLM group, Unify-Agent achieves top scores in Single Knowledge Concept Instantiation (SKCI, 77.4) and Multiple Concept Composition with Interaction (MKCC, 71.5).

Analysis of Results

Ablation Study (Table 5):

| Variant | Clarity | Content | Aesthetics | Relevance | Overall |
|---|---|---|---|---|---|
| Baseline (Vanilla Bagel) | 91.3 | 64.2 | 84.0 | 44.9 | 50.9 |
| w/o Text-Search | 90.7 (-0.6) | 70.9 (+6.7) | 84.3 (+0.3) | 64.6 (+19.7) | 65.4 (+14.5) |
| w/o Image-Search | 92.1 (+0.8) | 73.1 (+8.9) | 85.0 (+1.0) | 50.8 (+5.9) | 56.2 (+5.3) |
| w/o Recaption | 83.0 (-8.3) | 69.0 (+4.8) | 74.5 (-9.5) | 60.2 (+15.3) | 62.9 (+12.0) |
| Recaption w/o VAE | 90.9 (-0.4) | 74.3 (+10.1) | 84.5 (+0.5) | 70.8 (+25.9) | 71.2 (+20.3) |
| Recaption w/o ViT | 88.6 (-2.7) | 68.4 (+4.2) | 81.1 (-2.9) | 58.7 (+13.8) | 61.4 (+10.5) |
| Unify-Agent (Full) | 91.2 (-0.1) | 75.2 (+11.0) | 85.2 (+1.2) | 72.4 (+27.5) | 73.2 (+22.3) |

Deltas in parentheses are measured relative to the Vanilla Bagel baseline.

Key findings:

  • The full model achieves the best performance, with the largest gain on Relevance (44.9 → 72.4).
  • Visual search is more critical than text search for identity fidelity (Relevance drops sharply without it).
  • Recaptioning is essential; using raw evidence is suboptimal.
  • Both VAE (low-level latents) and ViT (high-level semantics) contribute to the recaptioning quality, with ViT having a larger impact. This demonstrates that generation helps understanding in a unified model—the joint availability of generative and understanding representations enables superior multimodal reasoning.

Qualitative Results (Figure 5): Visual