Summary (Overview)

  • Novel multi-agent framework: Proposes InterleaveThinker, the first multi-agent pipeline that endows any fixed image generator with robust interleaved generation (text-image sequence) capabilities, addressing the fundamental limitations of Unified Multimodal Models (UMMs).
  • Addresses key UMM failures: The framework resolves two critical problems in long-horizon tasks—visual over-reliance (stopping at intermediate states that resemble the goal) and step-wise error accumulation (minor early degradations compounding over steps)—by decoupling planning and evaluation into separate agents.
  • Efficient single-step RL with dual-reward strategy: Introduces a novel dual-reward strategy (accuracy reward + step-wise reward) combined with GRPO to achieve trajectory-level alignment via efficient single-step RL, avoiding prohibitive computational costs of full-trajectory optimization (which can involve >25 generator calls).
  • State-of-the-art performance: Using 4-step FLUX.2-klein as the generator, InterleaveThinker surpasses existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM) and achieves performance comparable to proprietary models like Nano Banana and GPT-5, while also significantly improving reasoning-based benchmarks (WISE: 0.47 → 0.73; RISE: 13.3 → 28.9).
  • High-quality dedicated datasets: Constructs three specialized datasets—Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k—through a rigorous data pipeline covering 8+ main categories and 75+ sub-categories.

Introduction and Theoretical Foundation

Background & Motivation

Recent image generators (e.g., FLUX, Stable Diffusion) have achieved remarkable photorealism and instruction-following for single-image generation/editing. However, they are fundamentally constrained by their image-only output architectures and cannot perform interleaved generation—a workflow that takes an interleaved text-image sequence as input and outputs a coherent, multi-step sequence of text and images. This capability is crucial for real-world applications such as:

  • Visual narratives (e.g., step-by-step storytelling)
  • Guidance (e.g., how-to tutorials, procedural instructions)
  • Embodied manipulation (e.g., robotic task execution)

The UMM Problem

Unified Multimodal Models (UMMs) naturally support interleaved generation because they model text and visual tokens within a unified framework. However, they suffer from two critical problems in long-horizon tasks:

  1. Visual over-reliance: Because UMMs generate sequences step-by-step based on preceding images, they become overly conditioned on intermediate visual states. As shown in Figure 2(b), when generating a repetitive action sequence like a push-up, the model might stop at an intermediate state that visually resembles the final goal.

  2. Step-wise error accumulation: Current UMMs lack a stable self-correction mechanism. A slight degradation in early image quality compounds step-by-step, eventually ruining the final output (Figure 2(c)).

Theoretical Basis for Decoupling

The core insight is that if a single VLM alternates between planning and evaluating generated images, it becomes myopically reactive to local visual feedback, losing sight of the global objective. To fundamentally resolve this, InterleaveThinker employs:

  • A Planner agent that predicts the entire sequence of instructions upfront, completely bypassing visual over-reliance by blocking intermediate visual feedback.
  • A Critic agent that evaluates step outputs, identifies deviations from the initial instructions, and refines prompts for regeneration, ensuring strict adherence to the overall trajectory without updating the generator.

Methodology

Multi-Agent Pipeline

The framework consists of three core modules operating in a closed loop:

1. Planner: Analyzes the input sequence S and translates it into an N-step execution plan. For each step (i \in {1, \ldots, N}), the Planner generates a step instruction (u_i), a model-friendly initial prompt (p_i), and an auxiliary text (a_i):

\{ (u_i, p_i, a_i) \}_{i=1}^N = \text{Planner}(S)

2. Generator: At step (i) and refinement iteration (t \in {1, T_{\max}}), takes the current refined prompt (r_i^t) (where (r_i^0 = p_i)) and the image from the previous step (I_{i-1}) to produce the current image:

I_i^t = \text{Generator}\left(r_i^t, I_{i-1}\right)

For the initial step (i=1), (I_0 = \emptyset).

3. Critic: Evaluates the transition from the pre-execution image (I_{i-1}) to the post-execution image (I_i^t). It outputs a binary judgment (j_i^t), a newly refined prompt (r_i^{t+1}), and a reasoning process (R_i^t):

(j_i^t, r_i^{t+1}, R_i^t) = \text{Critic}\left(I_{i-1}, I_i^t, p_i, r_i^t\right)

This generation-evaluation loop iterates until a positive judgment (True) is obtained or (T_{\max}) iterations are reached.
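The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Step` dataclass, the `run_pipeline` function, the callable signatures, and the `bytes` image type are all hypothetical names introduced here, and keeping the last image when no attempt is accepted is an assumption the summary does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Image = bytes  # placeholder image type (assumption)


@dataclass
class Step:
    instruction: str  # u_i: step instruction
    prompt: str       # p_i: model-friendly initial prompt
    aux_text: str     # a_i: auxiliary text shown alongside the image


# Hypothetical signatures standing in for the three modules.
PlannerFn = Callable[[Any], List[Step]]
GeneratorFn = Callable[[str, Optional[Image]], Image]
# The critic returns (judgment, refined prompt, reasoning).
CriticFn = Callable[[Optional[Image], Image, str, str], Tuple[bool, str, str]]


def run_pipeline(
    sequence: Any,
    planner: PlannerFn,
    generator: GeneratorFn,
    critic: CriticFn,
    t_max: int = 5,
) -> List[Tuple[str, Image]]:
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    # The whole plan is fixed before any image exists, so the planner
    # never sees intermediate visual states.
    steps = planner(sequence)
    prev_image: Optional[Image] = None  # I_0 is empty
    outputs: List[Tuple[str, Image]] = []
    for step in steps:
        refined = step.prompt  # the first attempt uses the initial prompt p_i
        for _ in range(t_max):
            image = generator(refined, prev_image)
            accepted, refined, _reasoning = critic(
                prev_image, image, step.prompt, refined
            )
            if accepted:
                break
        # Assumption: if no attempt is accepted, the last image is kept.
        outputs.append((step.aux_text, image))
        prev_image = image
    return outputs
```

The generator is only ever called, never updated, which is what lets the same loop wrap any frozen image model.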

Dataset Construction Pipeline

A four-stage data pipeline is constructed:

  1. Text Prompt Construction: 8 main categories → ≈75 sub-categories → >30 domain-specific vocabulary banks → >100 instructional templates → ~40,000 diverse interleaved prompts.

  2. Multi-Agent Trajectory Generation: Uses Gemini 2.5 Pro and Nano Banana Pro to generate agentic trajectories with global plans, intermediate images, critiques, and refined prompts.

  3. Critic Data Filtering and Splitting: Rigorous filtering using VIEScore-based evaluation. Steps with high score variance are assigned to RL; stable high-quality steps go to SFT. Class imbalance in binary judgments is addressed via resampling.

  4. Interleaved Input Planner Data Construction: Combines self-synthesized truncated sequences with existing open-source interleaved datasets.
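Stage 3 above (routing steps to RL or SFT by score variance) can be sketched as follows. The function name `split_critic_steps`, the 0 to 10 score scale, both thresholds, and the choice to drop stable low-quality steps are assumptions made for illustration; the summary gives none of these values.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def split_critic_steps(
    step_scores: Dict[str, List[float]],
    var_threshold: float = 1.0,      # assumed value, not from the source
    quality_threshold: float = 7.0,  # assumed value on an assumed 0-10 scale
) -> Tuple[List[str], List[str]]:
    """Route each step to SFT or RL based on repeated evaluation scores."""
    sft_ids: List[str] = []
    rl_ids: List[str] = []
    for step_id, scores in step_scores.items():
        if pvariance(scores) > var_threshold:
            # Unstable scores: the step is hard to judge, so use it for RL.
            rl_ids.append(step_id)
        elif mean(scores) >= quality_threshold:
            # Stable and high quality: use it for SFT.
            sft_ids.append(step_id)
        # Assumption: stable but low-quality steps are discarded.
    return sft_ids, rl_ids
```

Under this reading, the RL set concentrates on borderline cases where a learned critic has the most to gain.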

Final datasets: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k.

Training Scheme

Planner-SFT: Initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Planner-SFT-80k. No RL is applied to the Planner because a full trajectory can involve more than 25 generator calls, which makes the reward signal highly sparse.

Critic-SFT: Initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Critic-SFT-112k to teach the basic evaluation format.

Critic-RL: Uses GRPO with a novel dual-reward strategy for efficient single-step RL:

  • Accuracy Reward ((R_{\text{acc}})): Penalizes the difference between the predicted judgment and the ground truth:

    R_{\text{acc}} = -|\text{Critic}(I_{i-1}, I_i^t, p_i, r_i^t) - J_i|
  • Step-wise Reward ((R_{\text{step}})): Computes the score difference between the new iteration result and the original:

    R_{\text{step}} = \text{Gemini}(I_{i-1}, I_i^{t+1}, p_i, r_i^{t+1}) - \text{Gemini}(I_{i-1}, I_i^t, p_i, r_i^t)
  • Final reward: (R = 0.5 \times R_{\text{format}} + 0.5 \times (\alpha R_{\text{acc}} + (1-\alpha)R_{\text{step}})), with (\alpha = 0.2).
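The reward terms above combine as in the following sketch. The function names are hypothetical, the judge scores are assumed to be plain floats supplied by an external scorer, and the group-relative advantage shown is the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); whether the paper uses exactly this variant is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def accuracy_reward(predicted: bool, ground_truth: bool) -> float:
    # R_acc = -|judgment - J_i|: 0 when the critic is right, -1 otherwise.
    return -float(abs(int(predicted) - int(ground_truth)))


def step_reward(score_after: float, score_before: float) -> float:
    # R_step: judge score of the regenerated image minus the original's.
    return score_after - score_before


def total_reward(
    r_format: float, r_acc: float, r_step: float, alpha: float = 0.2
) -> float:
    # R = 0.5 * R_format + 0.5 * (alpha * R_acc + (1 - alpha) * R_step)
    return 0.5 * r_format + 0.5 * (alpha * r_acc + (1.0 - alpha) * r_step)


def group_relative_advantages(
    rewards: List[float], eps: float = 1e-6
) -> List[float]:
    # Standard GRPO-style normalization within one sampled group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With (\alpha = 0.2), the step-wise term carries four times the weight of the accuracy term, so a critic is rewarded mainly for refinements that actually improve the regenerated image.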

Empirical Validation / Results

Experimental Setup

  • Base models: Qwen3-VL-8B-Instruct for Planner/Critic; FLUX.2-klein-9B (4-step, for in-domain) and Qwen-Image-Edit-2511 (for generalization) as generators.
  • Training: SFT for 2 epochs (lr=2×10⁻⁵, batch size 32); RL for 1 epoch (lr=2×10⁻⁶, N=8, KL coefficient=1×10⁻³).
  • Hardware: ~50 hours on eight H800 GPUs.
  • Inference: (T_{\max} = 5) maximum refinement iterations per step.
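As a rough worked bound on inference cost, each step triggers at most (T_{\max}) generator calls, so an N-step plan needs at most N times (T_{\max}) calls. The value N = 6 below is a hypothetical plan length chosen only to illustrate how a trajectory can exceed 25 generator calls.

```latex
\text{calls}(N) \;\le\; \sum_{i=1}^{N} T_{\max} \;=\; N \cdot T_{\max},
\qquad
N = 6,\; T_{\max} = 5 \;\Rightarrow\; \text{calls} \le 30
```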

Main Results

Table 1: Comparison on UEval (8 tasks)

Model                                 Avg Text+Image
Reference                             92.2
Gemini-2.0-Flash                      55.1
GPT-5-Instant                         65.2
GPT-5-Thinking                        66.4
Nano Banana Pro                       76.1
Emu3.5                                49.1
InterleaveThinker+FLUX.2-klein-9B     66.3
InterleaveThinker+Qwen-Image-Edit     67.2

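→ Significantly outperforms open-source UMMs (e.g., Emu3.5 at 49.1) and is on par with GPT-5-Thinking (66.4). Of the rows shown, only Nano Banana Pro (76.1) and the reference (92.2) score higher.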

Table 2: Comparison on CoMM (Tasks 3/4)

Model                                 Avg over 6 metrics (Task 3 / Task 4)
Emu2                                  ~8.0 / ~7.5
DuoGen                                - / ~9.2
InterleaveThinker+FLUX.2-klein-9B     9.3 / 9.6
InterleaveThinker+Qwen-Image-Edit     9.2 / 9.6

→ Surpasses all existing methods across all six metrics.

Table 3: Comparison on WISE (Reasoning-based image generation)

Model                           Overall
FLUX.2-klein-9B (base)          0.47
+ InterleaveThinker             0.73
Qwen-Image-Edit-2511 (base)     0.60
+ InterleaveThinker             0.72

→ Substantial gains from 0.47 to 0.73 on FLUX.2-klein.

Table 4: Comparison on RISE-Bench (Reasoning-based image editing)

Model                           Overall
FLUX.2-klein-9B (base)          13.3
+ InterleaveThinker             28.9
Qwen-Image-Edit-2511 (base)     19.4
+ InterleaveThinker             30.0

→ Major leap from 13.3 to 28.9.

Ablation Study (Table 5 on UEval)

Configuration                           Avg
FLUX.2-klein-9B (baseline)              18.2
+ Gemini-2.5-pro (oracle)               77.4
+ Qwen3-VL-8B (zero-shot baseline)      48.1
+ Planner-SFT                           60.5
+ Full-SFT                              64.5
+ RL w/o step reward                    65.2
+ RL w/o acc reward                     65.1
+ Full-RL                               66.3
One-Agent (entangled)                   54.5
Unfiltered data                         62.8
(T_{\max} = 1)                          60.2
(T_{\max} = 3)                          65.3
(T_{\max} = 5)                          66.3

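Key findings: The decoupled multi-agent system (66.3) clearly beats the One-Agent variant (54.5), confirming that entangling planning and evaluation in one model degrades performance. Data filtering matters (62.8 unfiltered vs. 66.3 filtered). Raising (T_{\max}) improves results with diminishing returns (60.2 → 65.3 → 66.3 for (T_{\max}) = 1, 3, 5).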

Theoretical and Practical Implications

Theoretical Implications

  1. Decoupling planning from visual evaluation provides a principled solution to the visual over-reliance problem. The Planner's upfront global planning (Equation 1) completely bypasses the myopic conditioning on intermediate visual states that plagues single-VLM approaches.

  2. Single-step RL with dual rewards demonstrates that trajectory-level alignment can be achieved without full-trajectory optimization. This is theoretically significant for long-horizon agentic tasks where reward signals become extremely sparse (>25 steps). The decomposition relies on the premise that success at each local iteration carries over to the success of the global trajectory.

  3. Emergent reasoning capabilities: Despite being trained exclusively on interleaved generation data (not reasoning tasks), the framework significantly improves performance on reasoning-based benchmarks (relative gains over the FLUX.2-klein base of +55% on WISE and +117% on RISE). This suggests that the plan-generate-critic paradigm inherently activates chain-of-thought-like reasoning in generation models.
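The relative gains quoted in item 3 follow directly from the WISE and RISE tables. The short check below recomputes them; the helper name `relative_gain` is introduced here for illustration, and the Qwen-Image-Edit figures are derived from the same tables rather than quoted in the source.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0


# (base score, score with InterleaveThinker), taken from Tables 3 and 4.
results = {
    "WISE / FLUX.2-klein-9B": (0.47, 0.73),
    "RISE / FLUX.2-klein-9B": (13.3, 28.9),
    "WISE / Qwen-Image-Edit-2511": (0.60, 0.72),
    "RISE / Qwen-Image-Edit-2511": (19.4, 30.0),
}

gains = {name: relative_gain(b, a) for name, (b, a) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.0f}%")
```

The gains are larger for FLUX.2-klein than for Qwen-Image-Edit, which starts from a stronger base on both benchmarks.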

Practical Implications

  1. Universal applicability: The framework is model-agnostic, as demonstrated by consistent performance gains across FLUX.2-klein and Qwen-Image-Edit, and can be integrated with any existing frozen image generator.

  2. Broad application domains: The framework supports visual narratives, how-to guides, robotic manipulation, embodied interaction, and other real-world interleaved tasks (as shown in Figure 1).

  3. Computational efficiency: The single-step RL formulation drastically reduces computational costs compared to full-trajectory RL (which would be prohibitive for >25 generator calls), making the approach practical for deployment.

  4. Limitation: The framework is constrained by the base generator's generative prior—it cannot generate concepts not present in the base model's training corpus (e.g., FLUX.2-klein exhibits color shifts for out-of-domain concepts).

Conclusion

InterleaveThinker is the first multi-agent framework that endows any fixed image generator with robust interleaved generation capabilities. By introducing a decoupled Planner-Gen-Critic workflow, it effectively resolves the visual over-reliance and step-wise error accumulation problems inherent in Unified Multimodal Models.

The framework is supported by three high-quality datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k) and a novel dual-reward strategy (accuracy reward + step-wise reward) that achieves trajectory-level alignment through efficient single-step RL via GRPO.

Extensive experiments demonstrate that InterleaveThinker:

  1. Significantly outperforms existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM)
  2. Achieves performance comparable to proprietary models (Nano Banana, GPT-5)
  3. Surprisingly boosts the base model on reasoning-based benchmarks (WISE: 0.47→0.73; RISE: 13.3→28.9)

Future directions: Addressing the limitation of base generator constraints for out-of-domain concepts, exploring RL for the Planner agent, and extending the framework to video generation and other sequential modalities.
