Visual Summary | InterleaveThinker: Reinforcing Agentic Interleaved Generation

Summary (Overview)

Novel multi-agent framework: Proposes InterleaveThinker, the first multi-agent pipeline that endows any fixed image generator with robust interleaved generation (text-image sequence) capabilities, addressing the fundamental limitations of Unified Multimodal Models (UMMs).
Addresses key UMM failures: The framework resolves two critical problems in long-horizon tasks—visual over-reliance (stopping at intermediate states that resemble the goal) and step-wise error accumulation (minor early degradations compounding over steps)—by decoupling planning and evaluation into separate agents.
Efficient single-step RL with dual-reward strategy: Introduces a novel dual-reward strategy (accuracy reward + step-wise reward) combined with GRPO to achieve trajectory-level alignment via efficient single-step RL, avoiding prohibitive computational costs of full-trajectory optimization (which can involve >25 generator calls).
State-of-the-art performance: Using 4-step FLUX.2-klein as the generator, InterleaveThinker surpasses existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM) and achieves performance comparable to proprietary models like Nano Banana and GPT-5, while also significantly improving reasoning-based benchmarks (WISE: 0.47 → 0.73; RISE: 13.3 → 28.9).
High-quality dedicated datasets: Constructs three specialized datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k) through a rigorous data pipeline covering 8 main categories and roughly 75 sub-categories.

Introduction and Theoretical Foundation

Background & Motivation

Recent image generators (e.g., FLUX, Stable Diffusion) have achieved remarkable photorealism and instruction-following for single-image generation/editing. However, they are fundamentally constrained by their image-only output architectures and cannot perform interleaved generation—a workflow that takes an interleaved text-image sequence as input and outputs a coherent, multi-step sequence of text and images. This capability is crucial for real-world applications such as:

Visual narratives (e.g., step-by-step storytelling)
Guidance (e.g., how-to tutorials, procedural instructions)
Embodied manipulation (e.g., robotic task execution)

The UMM Problem

Unified Multimodal Models (UMMs) naturally support interleaved generation because they model text and visual tokens within a unified framework. However, they suffer from two critical problems in long-horizon tasks:

Visual over-reliance: Because UMMs generate sequences step-by-step based on preceding images, they become overly conditioned on intermediate visual states. As shown in Figure 2(b), when generating a repetitive action sequence like a push-up, the model might stop at an intermediate state that visually resembles the final goal.
Step-wise error accumulation: Current UMMs lack a stable self-correction mechanism. A slight degradation in early image quality compounds step-by-step, eventually ruining the final output (Figure 2(c)).

Theoretical Basis for Decoupling

The core insight is that if a single VLM alternates between planning and evaluating generated images, it becomes myopically reactive to local visual feedback, losing sight of the global objective. To fundamentally resolve this, InterleaveThinker employs:

A Planner agent that predicts the entire sequence of instructions upfront, completely bypassing visual over-reliance by blocking intermediate visual feedback.
A Critic agent that evaluates step outputs, identifies deviations from the initial instructions, and refines prompts for regeneration, ensuring strict adherence to the overall trajectory without updating the generator.

Methodology

Multi-Agent Pipeline

The framework consists of three core modules operating in a closed loop:

1. Planner: Analyzes the input sequence S and translates it into an N-step execution plan. For each step (i \in {1, \ldots, N}), the Planner generates a step instruction (u_i), a model-friendly initial prompt (p_i), and an auxiliary text (a_i):

{ (u_i, p_i, a_i) }_{i=1}^N = \text{Planner}(S)

2. Generator: At step (i) and refinement iteration (t \in {1, \ldots, T_{\max}}), takes the current refined prompt (r_i^t) (where (r_i^0 = p_i)) and the image from the previous step (I_{i-1}) to produce the current image:

I_i^t = \text{Generator}\left(r_i^t, I_{i-1}\right)

For the initial step (i=1), (I_0 = \emptyset).

3. Critic: Evaluates the transition from the pre-execution image (I_{i-1}) to the post-execution image (I_i^t). It outputs a binary judgment (j_i^t), a newly refined prompt (r_i^{t+1}), and a reasoning process (R_i^t):

(j_i^t, r_i^{t+1}, R_i^t) = \text{Critic}\left(I_{i-1}, I_i^t, p_i, r_i^t\right)

This generation-evaluation loop iterates until a positive judgment (True) is obtained or (T_{\max}) iterations are reached.
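
The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Step` record, the three callables, and the `run_pipeline` name are hypothetical stand-ins, and keeping the last image when (T_{\max}) is reached without acceptance is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = bytes  # placeholder image type for this sketch


@dataclass
class Step:
    instruction: str  # u_i: step instruction
    prompt: str       # p_i: model-friendly initial prompt
    aux_text: str     # a_i: auxiliary text shown alongside the image


# Hypothetical callables standing in for the three modules.
Planner = Callable[[str], List[Step]]
Generator = Callable[[str, Optional[Image]], Image]
# The critic returns (judgment, refined prompt, reasoning).
Critic = Callable[[Optional[Image], Image, str, str], Tuple[bool, str, str]]


def run_pipeline(sequence: str, planner: Planner, generator: Generator,
                 critic: Critic, t_max: int = 5) -> List[Tuple[str, Image]]:
    steps = planner(sequence)      # the whole plan is produced upfront
    outputs = []
    prev = None                    # I_0 is empty for the first step
    for step in steps:
        refined = step.prompt      # the first refined prompt is p_i
        image = None
        for _ in range(t_max):     # at most t_max generate/evaluate rounds
            image = generator(refined, prev)
            accepted, next_prompt, _reasoning = critic(
                prev, image, step.prompt, refined)
            if accepted:
                break
            refined = next_prompt  # retry with the critic's refined prompt
        # Assumption: if no round is accepted, the last image is kept.
        outputs.append((step.aux_text, image))
        prev = image
    return outputs
```

Note that the planner is called once and never sees an intermediate image, while the critic only ever compares one transition at a time, which mirrors the decoupling argued for earlier.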

Dataset Construction Pipeline

A four-stage data pipeline is constructed:

Text Prompt Construction: 8 main categories → ≈75 sub-categories → >30 domain-specific vocabulary banks → >100 instructional templates → ~40,000 diverse interleaved prompts.
Multi-Agent Trajectory Generation: Uses Gemini 2.5 Pro and Nano Banana Pro to generate agentic trajectories with global plans, intermediate images, critiques, and refined prompts.
Critic Data Filtering and Splitting: Rigorous filtering using VIEScore-based evaluation. Steps with high score variance are assigned to RL; stable high-quality steps go to SFT. Class imbalance in binary judgments is addressed via resampling.
Interleaved Input Planner Data Construction: Combines self-synthesized truncated sequences with existing open-source interleaved datasets.

Final datasets: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k.
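
The variance-based split and the class rebalancing in stage three can be illustrated as follows. This is only a sketch under stated assumptions: the record layout (`scores`, `judgment`), the variance threshold, and the choice of downsampling the majority class are hypothetical, since the text above says only that high-variance steps go to RL, stable steps go to SFT, and imbalance is handled by resampling. Any additional quality criterion applied to SFT steps is not modeled.

```python
import random
from statistics import pvariance


def split_steps(steps, var_threshold=1.0):
    """Route each step to the RL or SFT pool by the variance of its scores.

    `steps` is a list of dicts with a "scores" list (repeated evaluation
    scores for the same step). The threshold value is a placeholder.
    """
    rl, sft = [], []
    for step in steps:
        if pvariance(step["scores"]) > var_threshold:
            rl.append(step)   # unstable scores: kept for RL
        else:
            sft.append(step)  # stable scores: kept for SFT
    return rl, sft


def balance_by_judgment(samples, seed=0):
    """Downsample the majority class so True/False judgments are equal.

    Downsampling is one possible resampling scheme, chosen here for brevity.
    """
    pos = [s for s in samples if s["judgment"]]
    neg = [s for s in samples if not s["judgment"]]
    k = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, k) + rng.sample(neg, k)
```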

Training Scheme

Planner-SFT: Initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Planner-SFT-80k. No RL is applied to the Planner, because a full trajectory can involve more than 25 generator calls and the resulting reward signal is highly sparse.

Critic-SFT: Initialized from Qwen3-VL-8B-Instruct and fine-tuned on Interleave-Critic-SFT-112k to teach the basic evaluation format.

Critic-RL: Uses GRPO with a novel dual-reward strategy for efficient single-step RL:

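Accuracy Reward ((R_{\text{acc}})): Penalizes any mismatch between the Critic's predicted binary judgment and the ground-truth judgment (J_i); here (\text{Critic}(\cdot)) denotes only the judgment component (j_i^t) of the Critic's output:
$R_{\text{acc}} = -|\text{Critic}(I_{i-1}, I_i^t, p_i, r_i^t) - J_i|$
Step-wise Reward ((R_{\text{step}})): Measures how much the Critic's refined prompt improves the result, as the difference between the Gemini-assigned score of the regenerated image (I_i^{t+1}) and that of the image it replaces (I_i^t):
$R_{\text{step}} = \text{Gemini}(I_{i-1}, I_i^{t+1}, p_i, r_i^{t+1}) - \text{Gemini}(I_{i-1}, I_i^t, p_i, r_i^t)$
Final reward: (R = 0.5 \times R_{\text{format}} + 0.5 \times (\alpha R_{\text{acc}} + (1-\alpha)R_{\text{step}})), with (\alpha = 0.2). (R_{\text{format}}) is not defined further here; it presumably rewards outputs that follow the required response format.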
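
To make the weighting concrete, the reward terms above can be written as three small Python functions. This is a sketch that follows the formulas as stated; the function names are hypothetical, the judge scores are passed in as plain floats (the source obtains them from Gemini), and the scale of those scores and of the format term is not specified here.

```python
def accuracy_reward(predicted: bool, ground_truth: bool) -> float:
    """R_acc = -|j - J|: 0 for a correct judgment, -1 for a wrong one."""
    return -abs(float(predicted) - float(ground_truth))


def stepwise_reward(score_after: float, score_before: float) -> float:
    """R_step: judge score of the regenerated image minus the earlier one."""
    return score_after - score_before


def total_reward(r_format: float, r_acc: float, r_step: float,
                 alpha: float = 0.2) -> float:
    """R = 0.5 * R_format + 0.5 * (alpha * R_acc + (1 - alpha) * R_step)."""
    return 0.5 * r_format + 0.5 * (alpha * r_acc + (1 - alpha) * r_step)
```

With (\alpha = 0.2), the step-wise term carries four times the weight of the accuracy term inside the second half of the sum, so a refined prompt that actually improves the image dominates a merely correct verdict.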

Empirical Validation / Results

Experimental Setup

Base models: Qwen3-VL-8B-Instruct for Planner/Critic; FLUX.2-klein-9B (4-step, for in-domain) and Qwen-Image-Edit-2511 (for generalization) as generators.
Training: SFT for 2 epochs (lr=2×10⁻⁵, batch size 32); RL for 1 epoch (lr=2×10⁻⁶, N=8, KL coefficient=1×10⁻³).
Hardware: ~50 hours on eight H800 GPUs.
Inference: (T_{\max} = 5) maximum refinement iterations per step.
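
GRPO replaces a learned value baseline with a comparison among responses sampled for the same input. The sketch below shows that group-relative normalization, assuming that N=8 in the RL settings above is the number of responses sampled per input (this is not stated explicitly). The function name and the `eps` term are illustrative, and implementations differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response for one input.
    `eps` avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example group of 8 rewards for a single Critic input.
advantages = group_relative_advantages(
    [0.7, 0.1, 0.1, 0.7, 0.1, 0.7, 0.1, 0.7])
```

Responses that score above the group mean receive positive advantages and are reinforced; a group in which every response earns the same reward contributes no learning signal.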

Main Results

Table 1: Comparison on UEval (8 tasks)

Model	Avg Text+Image
Reference	92.2
Gemini-2.0-Flash	55.1
GPT-5-Instant	65.2
GPT-5-Thinking	66.4
Nano Banana Pro	76.1
Emu3.5	49.1
InterleaveThinker+FLUX.2-klein-9B	66.3
InterleaveThinker+Qwen-Image-Edit	67.2

→ Significantly outperforms open-source UMMs (e.g., Emu3.5 at 49.1) and is on par with GPT-5-Thinking (66.4), though Nano Banana Pro (76.1) remains ahead.

Table 2: Comparison on CoMM (Tasks 3/4)

Model	Avg over 6 metrics (Task 3 / Task 4)
Emu2	~8.0 / ~7.5
DuoGen	- / ~9.2
InterleaveThinker+FLUX.2-klein-9B	9.3 / 9.6
InterleaveThinker+Qwen-Image-Edit	9.2 / 9.6

→ Surpasses all existing methods across all six metrics.

Table 3: Comparison on WISE (Reasoning-based image generation)

Model	Overall
FLUX.2-klein-9B (base)	0.47
+ InterleaveThinker	0.73
Qwen-Image-Edit-2511 (base)	0.60
+ InterleaveThinker	0.72

→ Substantial gains from 0.47 to 0.73 on FLUX.2-klein.

Table 4: Comparison on RISE-Bench (Reasoning-based image editing)

Model	Overall
FLUX.2-klein-9B (base)	13.3
+ InterleaveThinker	28.9
Qwen-Image-Edit-2511 (base)	19.4
+ InterleaveThinker	30.0

→ Major leap from 13.3 to 28.9.

Ablation Study (Table 5 on UEval)

Configuration	Avg
FLUX.2-klein-9B (baseline)	18.2
+ Gemini-2.5-pro (oracle)	77.4
+ Qwen3-VL-8B (zero-shot baseline)	48.1
+ Planner-SFT	60.5
+ Full-SFT	64.5
+ RL w/o step reward	65.2
+ RL w/o acc reward	65.1
+ Full-RL	66.3
One-Agent (entangled)	54.5
Unfiltered data	62.8
(T_{\max} = 1)	60.2
(T_{\max} = 3)	65.3
(T_{\max} = 5)	66.3

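Key findings: The decoupled multi-agent design (66.3) clearly beats the entangled One-Agent variant (54.5), confirming that combining planning and evaluation in one model degrades performance. Data filtering is crucial (62.8 unfiltered vs. 66.3 filtered). Raising (T_{\max}) from 1 to 3 to 5 improves the average at every setting (60.2 → 65.3 → 66.3), with diminishing returns. Both rewards contribute: removing the step-wise reward or the accuracy reward lowers the score to 65.2 or 65.1, respectively.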

Theoretical and Practical Implications

Theoretical Implications

Decoupling planning from visual evaluation provides a principled solution to the visual over-reliance problem. The Planner's upfront global planning (the Planner equation in the Methodology section) completely bypasses the myopic conditioning on intermediate visual states that plagues single-VLM approaches.
Single-step RL with dual rewards demonstrates that trajectory-level alignment can be achieved without full-trajectory optimization. This is theoretically significant for long-horizon agentic tasks where reward signals become extremely sparse (>25 steps). The decomposition relies on the insight that ensuring success at each local iteration guarantees the overall success of the global trajectory.
Emergent reasoning capabilities: Despite being trained exclusively on interleaved generation data (not reasoning tasks), the framework significantly improves performance on reasoning-based benchmarks, with relative gains of about 55% on WISE (0.47 → 0.73) and 117% on RISE (13.3 → 28.9) for FLUX.2-klein-9B. This suggests that the plan-generate-critic paradigm inherently activates chain-of-thought-like reasoning in generation models.
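
The relative gains quoted for the reasoning benchmarks follow directly from the scores in Tables 3 and 4. A short check, using only numbers reported above (the helper name is illustrative):

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# FLUX.2-klein-9B, base vs. with InterleaveThinker
print(round(relative_gain(0.47, 0.73)))  # WISE  -> 55
print(round(relative_gain(13.3, 28.9)))  # RISE  -> 117

# Qwen-Image-Edit-2511, base vs. with InterleaveThinker
print(round(relative_gain(0.60, 0.72)))  # WISE  -> 20
print(round(relative_gain(19.4, 30.0)))  # RISE  -> 55
```

The gains are larger for the weaker base generator on both benchmarks, while the final scores of the two generators end up close to each other.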

Practical Implications

Universal applicability: The framework is model-agnostic, as demonstrated by consistent performance gains across FLUX.2-klein and Qwen-Image-Edit, and can be integrated with any existing frozen image generator.
Broad application domains: The framework supports visual narratives, how-to guides, robotic manipulation, embodied interaction, and other real-world interleaved tasks (as shown in Figure 1).
Computational efficiency: The single-step RL formulation drastically reduces training cost compared to full-trajectory RL (which would be prohibitive, since a single trajectory can involve more than 25 generator calls), making RL training of the Critic practical.
Limitation: The framework is constrained by the base generator's generative prior—it cannot generate concepts not present in the base model's training corpus (e.g., FLUX.2-klein exhibits color shifts for out-of-domain concepts).

Conclusion

InterleaveThinker is the first multi-agent framework that endows any fixed image generator with robust interleaved generation capabilities. By introducing a decoupled Planner-Gen-Critic workflow, it effectively resolves the visual over-reliance and step-wise error accumulation problems inherent in Unified Multimodal Models.

The framework is supported by three high-quality datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k) and a novel dual-reward strategy (accuracy reward + step-wise reward) that achieves trajectory-level alignment through efficient single-step RL via GRPO.

Extensive experiments demonstrate that InterleaveThinker:

Significantly outperforms existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM)
Achieves performance comparable to proprietary models (Nano Banana, GPT-5)
Surprisingly boosts the base model on reasoning-based benchmarks (WISE: 0.47→0.73; RISE: 13.3→28.9)

Future directions: Addressing the limitation of base generator constraints for out-of-domain concepts, exploring RL for the Planner agent, and extending the framework to video generation and other sequential modalities.