# InterleaveThinker: Reinforcing Agentic Interleaved Generation

> InterleaveThinker uses decoupled planner-critic agents to enable any frozen image generator to achieve state-of-the-art interleaved generation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.13679)
- **Published:** 2026-06-13
- **Permalink:** https://picx.dev/p/lE4y4J
- **Whiteboard:** https://picx.dev/p/lE4y4J/image

## Summary (Overview)

- **Novel multi-agent framework**: Proposes InterleaveThinker, the first multi-agent pipeline that endows any fixed image generator with robust interleaved generation (text-image sequence) capabilities, addressing the fundamental limitations of Unified Multimodal Models (UMMs).
- **Addresses key UMM failures**: The framework resolves two critical problems in long-horizon tasks—visual over-reliance (stopping at intermediate states that resemble the goal) and step-wise error accumulation (minor early degradations compounding over steps)—by decoupling planning and evaluation into separate agents.
- **Efficient single-step RL with dual-reward strategy**: Introduces a novel dual-reward strategy (accuracy reward + step-wise reward) combined with GRPO to achieve trajectory-level alignment via efficient single-step RL, avoiding prohibitive computational costs of full-trajectory optimization (which can involve >25 generator calls).
- **State-of-the-art performance**: Using 4-step FLUX.2-klein as the generator, InterleaveThinker surpasses existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM) and achieves performance comparable to proprietary models like Nano Banana and GPT-5, while also significantly improving reasoning-based benchmarks (WISE: 0.47 → 0.73; RISE: 13.3 → 28.9).
- **High-quality dedicated datasets**: Constructs three specialized datasets—Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k—through a rigorous data pipeline covering 8+ main categories and 75+ sub-categories.

## Introduction and Theoretical Foundation

### Background & Motivation

Recent image generators (e.g., FLUX, Stable Diffusion) have achieved remarkable photorealism and instruction-following for **single-image generation/editing**. However, they are fundamentally constrained by their image-only output architectures and cannot perform **interleaved generation**—a workflow that takes an interleaved text-image sequence as input and outputs a coherent, multi-step sequence of text and images. This capability is crucial for real-world applications such as:

- Visual narratives (e.g., step-by-step storytelling)
- Guidance (e.g., how-to tutorials, procedural instructions)
- Embodied manipulation (e.g., robotic task execution)

### The UMM Problem

Unified Multimodal Models (UMMs) naturally support interleaved generation because they model text and visual tokens within a unified framework. However, they suffer from two critical problems in long-horizon tasks:

1. **Visual over-reliance**: Because UMMs generate sequences step-by-step based on preceding images, they become overly conditioned on intermediate visual states. As shown in Figure 2(b), when generating a repetitive action sequence like a push-up, the model might stop at an intermediate state that visually resembles the final goal.

2. **Step-wise error accumulation**: Current UMMs lack a stable self-correction mechanism. A slight degradation in early image quality compounds step-by-step, eventually ruining the final output (Figure 2(c)).

### Theoretical Basis for Decoupling

The core insight is that if a single VLM alternates between planning and evaluating generated images, it becomes myopically reactive to local visual feedback, losing sight of the global objective. To fundamentally resolve this, InterleaveThinker employs:

- A **Planner agent** that predicts the *entire* sequence of instructions upfront, completely bypassing visual over-reliance by blocking intermediate visual feedback.
- A **Critic agent** that evaluates step outputs, identifies deviations from the initial instructions, and refines prompts for regeneration, ensuring strict adherence to the overall trajectory without updating the generator.

## Methodology

### Multi-Agent Pipeline

The framework consists of three core modules operating in a closed-loop:

**1. Planner**: Analyzes the input sequence S and translates it into an N-step execution plan. For each step \(i \in \{1, \ldots, N\}\), the Planner generates a step instruction \(u_i\), a model-friendly initial prompt \(p_i\), and an auxiliary text \(a_i\):

$$\{(u_i, p_i, a_i)\}_{i=1}^{N} = \text{Planner}(S)$$

**2. Generator**: At step \(i\) and refinement iteration \(t \in \{1, \ldots, T_{\max}\}\), takes the current refined prompt \(r_i^t\) (where \(r_i^0 = p_i\)) and the image from the previous step \(I_{i-1}\) to produce the current image:

$$I_i^t = \text{Generator}\left(r_i^t, I_{i-1}\right)$$

For the initial step (\(i = 1\)), \(I_0 = \emptyset\).

**3. Critic**: Evaluates the transition from the pre-execution image \(I_{i-1}\) to the post-execution image \(I_i^t\). It outputs a binary judgment \(j_i^t\), a newly refined prompt \(r_i^{t+1}\), and a reasoning process \(R_i^t\):

$$(j_i^t, r_i^{t+1}, R_i^t) = \text{Critic}\left(I_{i-1}, I_i^t, p_i, r_i^t\right)$$

This generation-evaluation loop iterates until a positive judgment (True) is obtained or \(T_{\max}\) iterations are reached.
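The closed loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `PlanStep`, `run_interleaved`, and the three toy callables are hypothetical names, images are represented as plain strings, and keeping the last generated image when the iteration budget runs out is an assumption the source does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = str  # placeholder type; a real pipeline would pass actual image objects


@dataclass
class PlanStep:
    instruction: str  # u_i: step instruction
    prompt: str       # p_i: initial, model-friendly prompt
    aux_text: str     # a_i: auxiliary text emitted alongside the image


# Hypothetical signatures standing in for the three modules.
PlannerFn = Callable[[str], List[PlanStep]]
GeneratorFn = Callable[[str, Optional[Image]], Image]
# The critic returns (judgment, refined prompt, reasoning).
CriticFn = Callable[[Optional[Image], Image, str, str], Tuple[bool, str, str]]


def run_interleaved(sequence: str, planner: PlannerFn, generator: GeneratorFn,
                    critic: CriticFn, t_max: int = 5) -> List[Tuple[str, Image]]:
    """Plan all steps up front, then generate and refine each step up to t_max times."""
    outputs = []
    prev_image = None  # I_0 is empty for the first step
    for step in planner(sequence):  # the planner never sees intermediate images
        prompt = step.prompt        # refinement starts from p_i
        image = None
        for _ in range(t_max):
            image = generator(prompt, prev_image)
            accepted, refined, _reasoning = critic(prev_image, image, step.prompt, prompt)
            if accepted:
                break
            prompt = refined
        # Assumption: if no attempt is accepted, the last image is kept.
        outputs.append((step.aux_text, image))
        prev_image = image
    return outputs


# Toy stand-ins so the control flow can be exercised without any model.
def toy_planner(seq: str) -> List[PlanStep]:
    return [PlanStep(f"step {k}", f"draw {seq} part {k}", f"Text for part {k}")
            for k in (1, 2)]


generator_calls: List[str] = []


def toy_generator(prompt: str, prev: Optional[Image]) -> Image:
    generator_calls.append(prompt)
    return f"img<{prompt}>"


def toy_critic(prev, image, initial_prompt, current_prompt):
    # Rejects the first attempt of every step and accepts the refined one.
    accepted = current_prompt.endswith("[refined]")
    return accepted, current_prompt + " [refined]", "toy reasoning"


result = run_interleaved("a cat", toy_planner, toy_generator, toy_critic)
```

With the toy critic rejecting each first attempt, the two planned steps cost four generator calls in total, which shows how the per-step refinement budget multiplies the number of generator invocations.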

### Dataset Construction Pipeline

A four-stage data pipeline is constructed:

1. **Text Prompt Construction**: 8 main categories → ≈75 sub-categories → >30 domain-specific vocabulary banks → >100 instructional templates → ~40,000 diverse interleaved prompts.

2. **Multi-Agent Trajectory Generation**: Uses Gemini 2.5 Pro and Nano Banana Pro to generate agentic trajectories with global plans, intermediate images, critiques, and refined prompts.

3. **Critic Data Filtering and Splitting**: Rigorous filtering using VIEScore-based evaluation. Steps with high score variance are assigned to RL; stable high-quality steps go to SFT. Class imbalance in binary judgments is addressed via resampling.

4. **Interleaved Input Planner Data Construction**: Combines self-synthesized truncated sequences with existing open-source interleaved datasets.

**Final datasets**: Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k.
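The variance-based split and the class rebalancing from stage 3 could look roughly like the sketch below. It is illustrative only: the record layout, the `VAR_THRESHOLD` cutoff, and oversampling as the resampling method are assumptions, and the additional VIEScore-based quality filtering is not modelled.

```python
import random
from statistics import pvariance

VAR_THRESHOLD = 1.0  # hypothetical cutoff; the source does not report a value


def split_by_variance(steps):
    """Send high-variance steps to the RL pool and stable steps to the SFT pool."""
    sft, rl = [], []
    for step in steps:
        if pvariance(step["scores"]) > VAR_THRESHOLD:
            rl.append(step)
        else:
            sft.append(step)
    return sft, rl


def balance_judgments(samples, seed=0):
    """Oversample the minority judgment class until both classes have equal size."""
    rng = random.Random(seed)
    positives = [s for s in samples if s["judgment"]]
    negatives = [s for s in samples if not s["judgment"]]
    if not positives or not negatives:
        return list(samples)  # nothing to balance against
    small, large = sorted((positives, negatives), key=len)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return large + small + extra


# Made-up records: repeated evaluator scores plus a binary judgment per step.
steps = [
    {"id": 1, "scores": [8, 8, 9], "judgment": True},
    {"id": 2, "scores": [2, 9, 5], "judgment": False},
    {"id": 3, "scores": [9, 9, 9], "judgment": True},
    {"id": 4, "scores": [3, 3, 4], "judgment": False},
    {"id": 5, "scores": [7, 7, 7], "judgment": True},
]
sft_pool, rl_pool = split_by_variance(steps)
balanced_sft = balance_judgments(sft_pool)
```

Routing the unstable steps to RL is consistent with the idea that samples on which the evaluator disagrees with itself carry the most learning signal for a reward-driven stage, while stable samples are safer supervision targets.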

### Training Scheme

**Planner-SFT**: Initialized from Qwen3-VL-8B-Instruct, fine-tuned on Interleave-Planner-SFT-80k. No RL is applied to the Planner, because a full trajectory can involve more than 25 generator calls, which makes the reward signal highly sparse.

**Critic-SFT**: Initialized from Qwen3-VL-8B-Instruct, teaches the basic format of evaluation using Interleave-Critic-SFT-112k.

**Critic-RL**: Uses GRPO with a novel **dual-reward strategy** for efficient single-step RL:

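- **Accuracy Reward (\(R_{\text{acc}}\))**: Penalizes the difference between the Critic's predicted binary judgment and the ground-truth judgment \(J_i\). With both treated as 0/1 values, the reward is 0 for a correct judgment and −1 for an incorrect one:
  $$R_{\text{acc}} = -|\text{Critic}(I_{i-1}, I_i^t, p_i, r_i^t) - J_i|$$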

- **Step-wise Reward (\(R_{\text{step}}\))**: Computes the score difference between the new iteration result and the original:
  $$R_{\text{step}} = \text{Gemini}(I_{i-1}, I_i^{t+1}, p_i, r_i^{t+1}) - \text{Gemini}(I_{i-1}, I_i^t, p_i, r_i^t)$$

- **Final reward**: \(R = 0.5 \times R_{\text{format}} + 0.5 \times (\alpha R_{\text{acc}} + (1-\alpha)R_{\text{step}})\), with \(\alpha = 0.2\), where \(R_{\text{format}}\) is a format reward on the Critic's output.
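As a worked example, the three reward terms above can be combined as follows. This is a sketch under assumptions: the function names are hypothetical, judgments are encoded as booleans, and the evaluator scores and the format reward are taken to lie in [0, 1], which the source does not state.

```python
def accuracy_reward(predicted: bool, ground_truth: bool) -> float:
    """R_acc: 0 when the judgment matches the label, -1 otherwise."""
    return -float(abs(int(predicted) - int(ground_truth)))


def step_reward(score_after: float, score_before: float) -> float:
    """R_step: evaluator score of the refined result minus the score of the original."""
    return score_after - score_before


def total_reward(r_format: float, r_acc: float, r_step: float, alpha: float = 0.2) -> float:
    """R = 0.5 * R_format + 0.5 * (alpha * R_acc + (1 - alpha) * R_step)."""
    return 0.5 * r_format + 0.5 * (alpha * r_acc + (1 - alpha) * r_step)


# Correct judgment, refinement lifts the (assumed) score from 0.4 to 0.7.
good = total_reward(1.0, accuracy_reward(True, True), step_reward(0.7, 0.4))

# Same refinement quality, but the binary judgment is wrong.
bad = total_reward(1.0, accuracy_reward(True, False), step_reward(0.7, 0.4))
```

Here `good` evaluates to roughly 0.62 and `bad` to roughly 0.52. With \(\alpha = 0.2\), most of the non-format weight sits on the step-wise term, so a refinement that actually improves the image matters more than the binary verdict alone.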

## Empirical Validation / Results

### Experimental Setup

- **Base models**: Qwen3-VL-8B-Instruct for Planner/Critic; FLUX.2-klein-9B (4-step, for in-domain) and Qwen-Image-Edit-2511 (for generalization) as generators.
- **Training**: SFT for 2 epochs (lr=2×10⁻⁵, batch size 32); RL for 1 epoch (lr=2×10⁻⁶, N=8, KL coefficient=1×10⁻³).
- **Hardware**: ~50 hours on eight H800 GPUs.
- **Inference**: \(T_{\max} = 5\) maximum refinement iterations per step.
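For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage normalization it is commonly described with: several responses are sampled for the same input, and each reward is standardized against the group's mean and standard deviation. Reading "N=8" as the number of sampled responses per input is an assumption, as are the function name, the use of the population standard deviation, and the example reward values.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight made-up rewards for eight critic responses to the same (image pair, prompt) input.
group_rewards = [0.62, 0.52, 0.10, 0.62, 0.30, 0.52, 0.10, 0.62]
advantages = group_relative_advantages(group_rewards)
```

Because advantages are computed within a group, no separate value model is needed, and a group whose responses all earn the same reward contributes no gradient signal.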

### Main Results

**Table 1: Comparison on UEval (8 tasks)**

| Model | Avg Text+Image |
|-------|----------------|
| Reference | 92.2 |
| Gemini-2.0-Flash | 55.1 |
| GPT-5-Instant | 65.2 |
| GPT-5-Thinking | 66.4 |
| Nano Banana Pro | 76.1 |
| Emu3.5 | 49.1 |
| **InterleaveThinker+FLUX.2-klein-9B** | **66.3** |
| **InterleaveThinker+Qwen-Image-Edit** | **67.2** |

→ Significantly outperforms open-source UMMs (e.g., Emu3.5 at 49.1) and is on par with proprietary GPT-5-Thinking (66.4), though Nano Banana Pro (76.1) remains ahead.

**Table 2: Comparison on CoMM (Tasks 3/4)**

| Model | Avg over 6 metrics (Task 3 / Task 4) |
|-------|----------------|
| Emu2 | ~8.0 / ~7.5 |
| DuoGen | - / ~9.2 |
| **InterleaveThinker+FLUX.2-klein-9B** | **9.3 / 9.6** |
| **InterleaveThinker+Qwen-Image-Edit** | **9.2 / 9.6** |

→ Surpasses all existing methods across all six metrics.

**Table 3: Comparison on WISE (Reasoning-based image generation)**

| Model | Overall |
|-------|---------|
| FLUX.2-klein-9B (base) | 0.47 |
| + **InterleaveThinker** | **0.73** |
| Qwen-Image-Edit-2511 (base) | 0.60 |
| + **InterleaveThinker** | **0.72** |

→ Substantial gains from 0.47 to 0.73 on FLUX.2-klein.

**Table 4: Comparison on RISE-Bench (Reasoning-based image editing)**

| Model | Overall |
|-------|---------|
| FLUX.2-klein-9B (base) | 13.3 |
| + **InterleaveThinker** | **28.9** |
| Qwen-Image-Edit-2511 (base) | 19.4 |
| + **InterleaveThinker** | **30.0** |

→ Major leap from 13.3 to 28.9.

### Ablation Study (Table 5 on UEval)

| Configuration | Avg |
|---------------|-----|
| FLUX.2-klein-9B (baseline) | 18.2 |
| + Gemini-2.5-pro (oracle) | 77.4 |
| + Qwen3-VL-8B (zero-shot baseline) | 48.1 |
| + Planner-SFT | 60.5 |
| + Full-SFT | 64.5 |
| + RL w/o step reward | 65.2 |
| + RL w/o acc reward | 65.1 |
| + **Full-RL** | **66.3** |
| One-Agent (entangled) | 54.5 |
| Unfiltered data | 62.8 |
| \(T_{\max} = 1\) | 60.2 |
| \(T_{\max} = 3\) | 65.3 |
| \(T_{\max} = 5\) | 66.3 |

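Key findings: The decoupled multi-agent setup (66.3) clearly beats the One-Agent variant (54.5), confirming that entangling planning and evaluation degrades performance. Data filtering matters (62.8 unfiltered vs. 66.3 filtered). Increasing \(T_{\max}\) improves results consistently but with diminishing returns (60.2 → 65.3 → 66.3 for \(T_{\max}\) = 1, 3, 5).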
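The contribution of each training stage can be read off Table 5 with simple subtraction. The snippet below does that arithmetic, assuming the "+" rows are cumulative configurations built on top of one another; the variable names are illustrative.

```python
# UEval averages copied from Table 5 (assumed cumulative, in training order).
stages = [
    ("Qwen3-VL-8B zero-shot", 48.1),
    ("Planner-SFT", 60.5),
    ("Full-SFT", 64.5),
    ("Full-RL", 66.3),
]

# Gain of each stage over the one before it, in UEval points.
stage_gains = {
    name: round(score - prev_score, 1)
    for (_, prev_score), (name, score) in zip(stages, stages[1:])
}

full_rl = 66.3
decoupling_gap = round(full_rl - 54.5, 1)  # vs. One-Agent (entangled)
filtering_gap = round(full_rl - 62.8, 1)   # vs. unfiltered data
```

Under that reading, Planner-SFT accounts for the largest single jump (12.4 points), Critic SFT adds 4.0, and RL adds a further 1.8, while decoupling the agents is worth 11.8 points and data filtering 3.5.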

## Theoretical and Practical Implications

### Theoretical Implications

1. **Decoupling planning from visual evaluation** provides a principled solution to the visual over-reliance problem. The Planner's upfront global planning (Equation 1) completely bypasses the myopic conditioning on intermediate visual states that plagues single-VLM approaches.

2. **Single-step RL with dual rewards** demonstrates that trajectory-level alignment can be achieved without full-trajectory optimization. This is theoretically significant for long-horizon agentic tasks where reward signals become extremely sparse (>25 steps). The decomposition relies on the insight that ensuring success at each local iteration guarantees the overall success of the global trajectory.

3. **Emergent reasoning capabilities**: Despite being trained exclusively on interleaved generation data (not reasoning tasks), the framework significantly improves performance on reasoning-based benchmarks, with relative gains of about 55% on WISE and 117% on RISE-Bench when paired with FLUX.2-klein. This suggests that the plan-generate-critic paradigm inherently activates chain-of-thought-like reasoning in generation models.
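The relative gains quoted for the reasoning benchmarks follow directly from the scores in Tables 3 and 4. A short check, with `relative_gain` as an illustrative helper name:

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before


# FLUX.2-klein-9B with and without InterleaveThinker.
wise_flux = relative_gain(0.47, 0.73)
rise_flux = relative_gain(13.3, 28.9)

# Qwen-Image-Edit-2511 with and without InterleaveThinker.
wise_qwen = relative_gain(0.60, 0.72)
rise_qwen = relative_gain(19.4, 30.0)
```

The FLUX.2-klein values come out at roughly 55% (WISE) and 117% (RISE-Bench). The same calculation for Qwen-Image-Edit-2511 gives roughly 20% and 55%, so the stronger base model gains less in relative terms.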

### Practical Implications

1. **Universal applicability**: The framework is model-agnostic, as demonstrated by consistent performance gains across FLUX.2-klein and Qwen-Image-Edit, and can be integrated with any existing frozen image generator.

2. **Broad application domains**: The framework supports visual narratives, how-to guides, robotic manipulation, embodied interaction, and other real-world interleaved tasks (as shown in Figure 1).

3. **Computational efficiency**: The single-step RL formulation drastically reduces computational costs compared to full-trajectory RL (which would be prohibitive for >25 generator calls), making the approach practical for deployment.

4. **Limitation**: The framework is constrained by the base generator's generative prior—it cannot generate concepts not present in the base model's training corpus (e.g., FLUX.2-klein exhibits color shifts for out-of-domain concepts).

## Conclusion

InterleaveThinker is the first multi-agent framework that endows any fixed image generator with robust interleaved generation capabilities. By introducing a decoupled Planner-Generator-Critic workflow, it effectively resolves the visual over-reliance and step-wise error accumulation problems inherent in Unified Multimodal Models.

The framework is supported by three high-quality datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, Interleave-Critic-RL-13k) and a novel dual-reward strategy (accuracy reward + step-wise reward) that achieves trajectory-level alignment through efficient single-step RL via GRPO.

Extensive experiments demonstrate that InterleaveThinker:
1. Significantly outperforms existing open-source UMMs on interleaved generation benchmarks (UEval, CoMM)
2. Achieves performance comparable to proprietary models (Nano Banana, GPT-5)
3. Surprisingly boosts the base model on reasoning-based benchmarks (WISE: 0.47→0.73; RISE: 13.3→28.9)

**Future directions**: Addressing the limitation of base generator constraints for out-of-domain concepts, exploring RL for the Planner agent, and extending the framework to video generation and other sequential modalities.

---

_Markdown view of https://picx.dev/p/lE4y4J, served by PicX — AI-generated visual whiteboard summaries of research papers._
