Summary (Overview)
- Novel multi-agent framework: Proposes InterleaveThinker, the first multi-agent pipeline that endows any fixed image generator with robust interleaved generation (text-image sequence) capabilities, addressing the fundamental limitations of Unified Multimodal Models (UMMs).
- Addresses key UMM failures: The framework resolves two critical problems in long-horizon tasks—visual over-reliance (stopping at intermediate states that resemble the goal) and step-wise error accumulation (minor early degradations compounding over steps)—by decoupling planning and evaluation into separate agents.
- Efficient single-step RL with dual-reward strategy: Introduces a novel dual-reward strategy (accuracy reward + step-wise reward) combined with GRPO to achieve trajectory-level alignment via efficient single-step RL, avoiding prohibitive computational costs of full-trajectory optimization (which can involve >25 generator calls).
- State-of-the-art performance: With 4-step FLUX.2-klein as the generator, InterleaveThinker surpasses existing open-source UMMs on the interleaved generation benchmarks UEval and CoMM and performs comparably to proprietary models such as Nano Banana and GPT-5. It also substantially improves the base generator's scores on reasoning-based benchmarks (WISE: 0.47 → 0.73; RISE: 13.3 → 28.9).
- High-quality dedicated datasets: Constructs three specialized datasets—Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k—through a rigorous data pipeline covering 8+ main categories and 75+ sub-categories.
Introduction and Theoretical Foundation
Background & Motivation
Recent image generators (e.g., FLUX, Stable Diffusion) have achieved remarkable photorealism and instruction-following for single-image generation/editing. However, they are fundamentally constrained by their image-only output architectures and cannot perform interleaved generation—a workflow that takes an interleaved text-image sequence as input and outputs a coherent, multi-step sequence of text and images. This capability is crucial for real-world applications such as:
- Visual narratives (e.g., step-by-step storytelling)
- Guidance (e.g., how-to tutorials, procedural instructions)
- Embodied manipulation (e.g., robotic task execution)
The UMM Problem
Unified Multimodal Models (UMMs) naturally support interleaved generation because they model text and visual tokens within a unified framework. However, they suffer from two critical problems in long-horizon tasks:
- Visual over-reliance: Because UMMs generate sequences step-by-step based on preceding images, they become overly conditioned on intermediate visual states. As shown in Figure 2(b), when generating a repetitive action sequence like a push-up, the model might stop at an intermediate state that visually resembles the final goal.
- Step-wise error accumulation: Current UMMs lack a stable self-correction mechanism. A slight degradation in early image quality compounds step-by-step, eventually ruining the final output (Figure 2(c)).
Theoretical Basis for Decoupling
The core insight is that if a single VLM alternates between planning and evaluating generated images, it becomes myopically reactive to local visual feedback, losing sight of the global objective. To fundamentally resolve this, InterleaveThinker employs:
- A Planner agent that predicts the entire sequence of instructions upfront, completely bypassing visual over-reliance by blocking intermediate visual feedback.
- A Critic agent that evaluates step outputs, identifies deviations from the initial instructions, and refines prompts for regeneration, ensuring strict adherence to the overall trajectory without updating the generator.
Methodology
Multi-Agent Pipeline
The framework consists of three core modules operating in a closed loop:
1. Planner: Analyzes the input sequence (S) and translates it into an (N)-step execution plan. For each step (i \in {1, \ldots, N}), the Planner generates a step instruction (u_i), a model-friendly initial prompt (p_i), and an auxiliary text (a_i), i.e. ({(u_i, p_i, a_i)}_{i=1}^{N} = \text{Planner}(S)).
2. Generator: At step (i) and refinement iteration (t \in {1, \ldots, T_{\max}}), takes the current refined prompt (r_i^t) (where (r_i^0 = p_i)) and the image from the previous step (I_{i-1}) to produce the current image, i.e. (I_i^t = \text{Generator}(r_i^t, I_{i-1})).
For the initial step (i=1), (I_0 = \emptyset).
3. Critic: Evaluates the transition from the pre-execution image (I_{i-1}) to the post-execution image (I_i^t). It outputs a binary judgment (j_i^t), a newly refined prompt (r_i^{t+1}), and a reasoning process (R_i^t).
This generation-evaluation loop iterates until a positive judgment (True) is obtained or (T_{\max}) iterations are reached.
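The control flow of the three modules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `StepPlan`, `run_pipeline`, and the toy callables are hypothetical names, the Critic's argument list is a guess at what it needs to see, and keeping the last image when (T_{\max}) is exhausted is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = str  # stand-in for a real image type (e.g. a tensor or PIL image)


@dataclass
class StepPlan:
    instruction: str  # u_i, the step instruction
    prompt: str       # p_i, the initial generator prompt
    aux_text: str     # a_i, the text shown alongside the image


def run_pipeline(
    sequence: str,
    planner: Callable[[str], List[StepPlan]],
    generator: Callable[[str, Optional[Image]], Image],
    critic: Callable[[StepPlan, Optional[Image], Image, str], Tuple[bool, str, str]],
    t_max: int = 5,
) -> List[Tuple[str, Image]]:
    """Plan once, then generate, critique, and refine each step in turn."""
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    plan = planner(sequence)            # whole plan is fixed before any image exists
    prev_image: Optional[Image] = None  # I_0 is empty
    outputs = []
    for step in plan:
        prompt = step.prompt            # r_i^0 = p_i
        for _ in range(t_max):
            image = generator(prompt, prev_image)
            accepted, refined, _reasoning = critic(step, prev_image, image, prompt)
            if accepted:
                break
            prompt = refined            # retry with the Critic's refined prompt
        # Assumption: if no attempt is accepted, the last image is kept.
        outputs.append((step.aux_text, image))
        prev_image = image
    return outputs


# Toy stand-ins so the sketch runs without any model.
def toy_planner(sequence: str) -> List[StepPlan]:
    return [StepPlan(f"step {k}", f"p{k}", f"caption {k}") for k in (1, 2)]


def toy_generator(prompt: str, prev_image: Optional[Image]) -> Image:
    return f"img[{prompt}]"


def toy_critic(step, prev_image, image, prompt):
    accepted = prompt.endswith("+fix")  # toy rule: accept only refined prompts
    return accepted, prompt + "+fix", "toy reasoning"


print(run_pipeline("input", toy_planner, toy_generator, toy_critic))
```

Note that the Planner never sees a generated image in this loop, which is the point of the decoupling: only the Critic reacts to visual feedback, and only within a single step.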
Dataset Construction Pipeline
A four-stage data pipeline is constructed:
- Text Prompt Construction: 8 main categories → ≈75 sub-categories → >30 domain-specific vocabulary banks → >100 instructional templates → ~40,000 diverse interleaved prompts.
- Multi-Agent Trajectory Generation: Uses Gemini 2.5 Pro and Nano Banana Pro to generate agentic trajectories with global plans, intermediate images, critiques, and refined prompts.
- Critic Data Filtering and Splitting: Rigorous filtering using VIEScore-based evaluation. Steps with high score variance are assigned to RL; stable high-quality steps go to SFT. Class imbalance in binary judgments is addressed via resampling.
- Interleaved Input Planner Data Construction: Combines self-synthesized truncated sequences with existing open-source interleaved datasets.
Final datasets: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k.
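The Critic data split described in the third stage can be illustrated as follows. This is a rough sketch under loud assumptions: the thresholds, the record fields (`scores`, `judgment`), dropping stable low-scoring steps, and the use of oversampling for "resampling" are all hypothetical choices made for illustration; the summary does not specify any of them.

```python
import random
from statistics import mean, pvariance


def split_steps(steps, var_threshold=1.0, quality_threshold=7.0):
    """Route each step to the SFT or RL pool based on its repeated scores."""
    sft, rl = [], []
    for step in steps:
        scores = step["scores"]  # hypothetical: repeated VIEScore-style ratings
        if pvariance(scores) > var_threshold:
            rl.append(step)      # unstable scores: kept for RL
        elif mean(scores) >= quality_threshold:
            sft.append(step)     # stable and high quality: kept for SFT
        # Assumption: stable but low-scoring steps are discarded.
    return sft, rl


def balance_by_judgment(samples, seed=0):
    """Oversample the minority judgment class until both classes match."""
    rng = random.Random(seed)
    pos = [s for s in samples if s["judgment"]]
    neg = [s for s in samples if not s["judgment"]]
    if not pos or not neg:
        return list(samples)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


steps = [
    {"id": 1, "scores": [8, 8, 9], "judgment": True},
    {"id": 2, "scores": [2, 9, 5], "judgment": False},
    {"id": 3, "scores": [3, 3, 3], "judgment": False},
]
sft_pool, rl_pool = split_steps(steps)
print([s["id"] for s in sft_pool], [s["id"] for s in rl_pool])
```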
Training Scheme
Planner-SFT: The Planner is initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Planner-SFT-80k. No RL is applied, because a full trajectory can involve more than 25 generator calls and the resulting reward signal is highly sparse.
Critic-SFT: The Critic is initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Critic-SFT-112k to learn the basic evaluation format.
Critic-RL: The Critic is then trained with GRPO and a novel dual-reward strategy for efficient single-step RL:
- Accuracy Reward ((R_{\text{acc}})): Penalizes the difference between the predicted judgment and the ground truth.
- Step-wise Reward ((R_{\text{step}})): Computes the score difference between the result of the new iteration and the original result.
- Final reward: (R = 0.5 \times R_{\text{format}} + 0.5 \times (\alpha R_{\text{acc}} + (1-\alpha)R_{\text{step}})), with (\alpha = 0.2).
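A small sketch may help make the reward combination concrete. Only `total_reward` follows the formula stated above. The 0/1 form of `accuracy_reward`, the rescaling in `step_reward`, the 10-point score scale, and the group normalization in `grpo_advantages` (the usual GRPO recipe of subtracting the group mean and dividing by the group standard deviation) are assumptions for illustration, as is the reading of (R_{\text{format}}) as a check on output format.

```python
from statistics import mean, pstdev


def accuracy_reward(pred: bool, truth: bool) -> float:
    # Assumed form: 1 when the binary judgment matches the label, else 0.
    return 1.0 if pred == truth else 0.0


def step_reward(new_score: float, old_score: float, max_score: float = 10.0) -> float:
    # Assumed form: score gain of the regenerated image, rescaled to [0, 1].
    gain = (new_score - old_score) / max_score
    return min(1.0, max(0.0, 0.5 + 0.5 * gain))


def total_reward(r_format: float, r_acc: float, r_step: float, alpha: float = 0.2) -> float:
    # R = 0.5 * R_format + 0.5 * (alpha * R_acc + (1 - alpha) * R_step)
    return 0.5 * r_format + 0.5 * (alpha * r_acc + (1 - alpha) * r_step)


def grpo_advantages(rewards, eps: float = 1e-6):
    # Group-relative advantage: normalize each reward within its rollout group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one critic sample: (format ok, judgment, new score).
truth, old_score = False, 4.0
rollouts = [(1.0, False, 8.0), (1.0, True, 4.0), (0.0, False, 6.0), (1.0, False, 5.0)]
rewards = [
    total_reward(fmt, accuracy_reward(j, truth), step_reward(new, old_score))
    for fmt, j, new in rollouts
]
print([round(r, 3) for r in rewards])
print([round(a, 3) for a in grpo_advantages(rewards)])
```

With (\alpha = 0.2), the effective weights are 0.5 on format, 0.1 on accuracy, and 0.4 on the step-wise term, so the reward mostly favors refinements that actually improve the regenerated image.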
Empirical Validation / Results
Experimental Setup
- Base models: Qwen3-VL-8B-Instruct for Planner/Critic; FLUX.2-klein-9B (4-step, for in-domain) and Qwen-Image-Edit-2511 (for generalization) as generators.
- Training: SFT for 2 epochs (lr=2×10⁻⁵, batch size 32); RL for 1 epoch (lr=2×10⁻⁶, N=8, KL coefficient=1×10⁻³).
- Hardware: ~50 hours on eight H800 GPUs.
- Inference: (T_{\max} = 5) maximum refinement iterations per step.
Main Results
Table 1: Comparison on UEval (8 tasks)
| Model | Avg Text+Image |
|---|---|
| Reference | 92.2 |
| Gemini-2.0-Flash | 55.1 |
| GPT-5-Instant | 65.2 |
| GPT-5-Thinking | 66.4 |
| Nano Banana Pro | 76.1 |
| Emu3.5 | 49.1 |
| InterleaveThinker+FLUX.2-klein-9B | 66.3 |
| InterleaveThinker+Qwen-Image-Edit | 67.2 |
→ Clearly outperforms the open-source UMM listed here (Emu3.5, 49.1) and is on par with GPT-5-Thinking (66.4); Nano Banana Pro (76.1) remains ahead.
Table 2: Comparison on CoMM (Tasks 3/4)
| Model | Avg over 6 metrics (Task 3 / Task 4) |
|---|---|
| Emu2 | ~8.0 / ~7.5 |
| DuoGen | - / ~9.2 |
| InterleaveThinker+FLUX.2-klein-9B | 9.3 / 9.6 |
| InterleaveThinker+Qwen-Image-Edit | 9.2 / 9.6 |
→ Surpasses all existing methods across all six metrics.
Table 3: Comparison on WISE (Reasoning-based image generation)
| Model | Overall |
|---|---|
| FLUX.2-klein-9B (base) | 0.47 |
| + InterleaveThinker | 0.73 |
| Qwen-Image-Edit-2511 (base) | 0.60 |
| + InterleaveThinker | 0.72 |
→ Substantial gains from 0.47 to 0.73 on FLUX.2-klein.
Table 4: Comparison on RISE-Bench (Reasoning-based image editing)
| Model | Overall |
|---|---|
| FLUX.2-klein-9B (base) | 13.3 |
| + InterleaveThinker | 28.9 |
| Qwen-Image-Edit-2511 (base) | 19.4 |
| + InterleaveThinker | 30.0 |
→ Major leap from 13.3 to 28.9.
Ablation Study (Table 5 on UEval)
| Configuration | Avg |
|---|---|
| FLUX.2-klein-9B (baseline) | 18.2 |
| + Gemini-2.5-pro (oracle) | 77.4 |
| + Qwen3-VL-8B (zero-shot baseline) | 48.1 |
| + Planner-SFT | 60.5 |
| + Full-SFT | 64.5 |
| + RL w/o step reward | 65.2 |
| + RL w/o acc reward | 65.1 |
| + Full-RL | 66.3 |
| One-Agent (entangled) | 54.5 |
| Unfiltered data | 62.8 |
| (T_{\max} = 1) | 60.2 |
| (T_{\max} = 3) | 65.3 |
| (T_{\max} = 5) | 66.3 |
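Key findings: The full multi-agent system outperforms the One-Agent variant (66.3 vs. 54.5), confirming that entangling planning and evaluation in one model degrades performance. Data filtering is crucial (62.8 unfiltered vs. 66.3 filtered). Both rewards contribute: removing the step-wise reward or the accuracy reward lowers the average from 66.3 to 65.2 or 65.1, respectively. Increasing (T_{\max}) improves results with diminishing returns (60.2 → 65.3 → 66.3 for (T_{\max}) = 1, 3, 5).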
Theoretical and Practical Implications
Theoretical Implications
- Decoupling planning from visual evaluation provides a principled solution to the visual over-reliance problem. The Planner's upfront global planning (Equation 1) completely bypasses the myopic conditioning on intermediate visual states that plagues single-VLM approaches.
- Single-step RL with dual rewards demonstrates that trajectory-level alignment can be achieved without full-trajectory optimization. This is theoretically significant for long-horizon agentic tasks where reward signals become extremely sparse (>25 generator calls). The decomposition relies on the insight that ensuring success at each local iteration guarantees the overall success of the global trajectory.
- Emergent reasoning capabilities: Despite being trained exclusively on interleaved generation data (not reasoning tasks), the framework significantly improves performance on reasoning-based benchmarks with FLUX.2-klein as the generator (WISE: 0.47 → 0.73, a relative gain of about 55%; RISE: 13.3 → 28.9, a relative gain of about 117%). This suggests that the plan-generate-critic paradigm inherently activates chain-of-thought-like reasoning in generation models.
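The percentage gains quoted for WISE and RISE are relative improvements over the base generator. The short check below recomputes them from the scores in Tables 3 and 4, for both generators; `relative_gain` is a helper name introduced here for illustration.

```python
def relative_gain(base: float, improved: float) -> float:
    """Relative improvement over the base score, in percent."""
    return 100.0 * (improved - base) / base


# (benchmark, generator) -> (base score, score with InterleaveThinker)
scores = {
    ("WISE", "FLUX.2-klein-9B"): (0.47, 0.73),
    ("WISE", "Qwen-Image-Edit-2511"): (0.60, 0.72),
    ("RISE", "FLUX.2-klein-9B"): (13.3, 28.9),
    ("RISE", "Qwen-Image-Edit-2511"): (19.4, 30.0),
}

for (bench, gen), (base, improved) in scores.items():
    print(f"{bench} / {gen}: {relative_gain(base, improved):+.1f}%")
# WISE / FLUX.2-klein-9B: +55.3%
# WISE / Qwen-Image-Edit-2511: +20.0%
# RISE / FLUX.2-klein-9B: +117.3%
# RISE / Qwen-Image-Edit-2511: +54.6%
```

The gains are larger for FLUX.2-klein than for Qwen-Image-Edit-2511 because its base scores are lower, while the scores with InterleaveThinker end up close for both generators.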
Practical Implications
- Universal applicability: The framework is model-agnostic, as demonstrated by consistent performance gains across FLUX.2-klein and Qwen-Image-Edit, and can be integrated with any existing frozen image generator.
- Broad application domains: The framework supports visual narratives, how-to guides, robotic manipulation, embodied interaction, and other real-world interleaved tasks (as shown in Figure 1).
- Computational efficiency: The single-step RL formulation drastically reduces computational costs compared to full-trajectory RL (which would be prohibitive for >25 generator calls), making the approach practical for deployment.
- Limitation: The framework is constrained by the base generator's generative prior: it cannot generate concepts not present in the base model's training corpus (e.g., FLUX.2-klein exhibits color shifts for out-of-domain concepts).
Conclusion
InterleaveThinker is the first multi-agent framework that endows any fixed image generator with robust interleaved generation capabilities. By introducing a decoupled Planner-Gen-Critic workflow, it effectively resolves the visual over-reliance and step-wise error accumulation problems inherent in Unified Multimodal Models.
The framework is supported by three high-quality datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k) and a novel dual-reward strategy (accuracy reward + step-wise reward) that achieves trajectory-level alignment through efficient single-step RL via GRPO.
Extensive experiments demonstrate that InterleaveThinker:
- Significantly outperforms existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM)
- Achieves performance comparable to proprietary models (Nano Banana, GPT-5)
- Surprisingly boosts the base model on reasoning-based benchmarks (WISE: 0.47→0.73; RISE: 13.3→28.9)
Future directions: Addressing the limitation of base generator constraints for out-of-domain concepts, exploring RL for the Planner agent, and extending the framework to video generation and other sequential modalities.