Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Summary (Overview)

  • Comprehensive Benchmark Suite: Introduces Edit-Compass (2,388 instances) for image editing models and EditReward-Compass (2,251 preference pairs) for reward models, addressing limitations in existing benchmarks (insufficient difficulty, coarse-grained evaluation, and unrealistic reward modeling scenarios).
  • Fine-Grained, Human-Aligned Evaluation: Proposes a multi-dimensional evaluation framework based on structured reasoning and scoring rubrics across Instruction Awareness, Visual Consistency, and Visual Quality, achieving higher alignment with human judgment than prior benchmarks.
  • Reveals Significant Performance Gaps: Extensive evaluation of 29 image editing models shows a substantial gap between proprietary models (e.g., Nano Banana Pro: 3.99) and open-source models (best: Qwen-Image-Edit-2511: 2.69), with persistent weaknesses in World Knowledge Reasoning, Algorithmic Visual Reasoning, and Multi-Image tasks.
  • Novel Finding on Reward Models: Evaluation of 21 reward models reveals that native multimodal large language models (MLLMs) (e.g., Qwen3.5/3.6 series) often outperform open-source reward models trained on preference data, suggesting strong inherent capability for visual assessment.
  • Realistic Reward Modeling Benchmark: EditReward-Compass constructs preference pairs using a FlowGRPO-inspired sampling strategy to better simulate decision-making scenarios encountered during RL optimization, moving beyond simple cross-model comparisons.

Introduction and Theoretical Foundation

Recent image editing models have evolved towards advanced capabilities involving multimodal understanding, complex reasoning, and multi-image editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to:

  1. Limited Task Difficulty: Benchmarks do not adequately challenge models on reasoning-intensive tasks.
  2. Coarse-Grained Evaluation Protocols: Reliance on automated metrics (CLIP-I, DINO-I) or simple MLLM judging prompts leads to unstable assessments.

In parallel, reward models are crucial for Reinforcement Learning (RL)-based image editing optimization (e.g., FlowGRPO). Yet, existing reward model benchmarks suffer from a distribution mismatch between evaluation samples and the edited images encountered during practical RL training.

These limitations hinder reliable assessment of both image editing models and their corresponding reward models. To address these challenges, this paper introduces Edit-Compass and EditReward-Compass, a unified evaluation suite designed to provide a comprehensive, human-aligned, and realistic framework for evaluating frontier systems.

Methodology

1. Edit-Compass Benchmark Construction

Edit-Compass contains 2,388 carefully annotated instances spanning 36 progressively challenging task categories across six main groups:

| Task Category | Instances | Sub-Tasks (Examples) | Core Capability Evaluated |
|---|---|---|---|
| General Tasks | 740 | Subject Addition/Remove/Replace, Change Color/Size/Material, Style Transfer, Text Editing, etc. | Fundamental instruction understanding and accurate execution. |
| Dynamic Manipulation | 314 | Action, Emotion Change, Object Movement/Swap/Interaction. | Dynamic scene understanding and object-level interaction modeling. |
| World Knowledge Reasoning | 350 | Temporal, Causal, Game, Math, Chemical Reasoning. | Leveraging real-world knowledge to infer and execute intended edits. |
| Algorithmic Visual Reasoning | 560 | Optimal Path, Convex Hull, Knapsack, Longest Word, etc. (10 subtasks). | Interpreting visual inputs and performing multi-step algorithmic reasoning. |
| Multi-Image Tasks | 187 | Multi-Image Awareness, Composition, Virtual Try-On. | Understanding and integrating information from multiple input images. |
| Complex Tasks | 237 | Complex Instruction, Complex Paint (EN/CN). | Handling compound instructions and multimodal (text+visual) guidance. |

Data Construction Strategies:

  • General & Complex Tasks: Original images collected from online resources; instructions generated by Gemini 3 Pro/GPT-5.1 and verified by humans.
  • Dynamic Manipulation, World Knowledge, Multi-Image: Experts design scenarios; source images generated from enhanced prompts.
  • Algorithmic Visual Reasoning: Source images programmatically generated using Python with ground-truth annotations derived from algorithmic solutions (e.g., Dijkstra's algorithm, dynamic programming).
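
The programmatic construction described for Algorithmic Visual Reasoning can be illustrated with a minimal sketch. It assumes a hypothetical Optimal Path instance represented as a grid of free and blocked cells, and derives the ground-truth path with Dijkstra's algorithm. The function names `make_grid` and `shortest_path`, the grid encoding, and the instance fields are illustrative assumptions, not the authors' generator.

```python
import heapq
import random


def make_grid(rows, cols, wall_prob, seed):
    """Build a hypothetical grid: 0 = free cell, 1 = wall. Corners stay free."""
    rng = random.Random(seed)
    grid = [[1 if rng.random() < wall_prob else 0 for _ in range(cols)]
            for _ in range(rows)]
    grid[0][0] = 0
    grid[rows - 1][cols - 1] = 0
    return grid


def shortest_path(grid, start, goal):
    """Dijkstra over 4-connected free cells with unit step cost.

    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Assemble one hypothetical benchmark instance with its ground-truth annotation.
grid = make_grid(rows=6, cols=6, wall_prob=0.2, seed=7)
instance = {
    "grid": grid,
    "instruction": "Draw the shortest path from the top-left to the bottom-right cell.",
    "ground_truth_path": shortest_path(grid, (0, 0), (5, 5)),
}
```

Because the annotation comes from the solver rather than from a human or a model, a generator of this kind can reject unsolvable layouts (where the path is `None`) and scale difficulty by changing grid size or wall density.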

2. Edit-Compass Evaluation Pipeline

Edit-Compass is scored with a fine-grained, multi-dimensional evaluation framework that uses an MLLM-as-judge approach (Gemini 3.1 Pro) with structured prompts and scoring rubrics.

Three Core Dimensions:

  1. Instruction Awareness (IA): Evaluates correct instruction following and incorporation of relevant world knowledge. $\text{IA} = \frac{1}{|\mathcal{M}_{\text{IA}}|} \sum_{m \in \mathcal{M}_{\text{IA}}} m, \quad \mathcal{M}_{\text{IA}} \subseteq \{\text{IF}, \text{WA}\}$, where IF = Instruction Following and WA = World Knowledge Awareness.
  2. Visual Consistency (VC): Measures preservation of unedited content and object identity. $\text{VC} = \frac{1}{|\mathcal{M}_{\text{VC}}|} \sum_{m \in \mathcal{M}_{\text{VC}}} m, \quad \mathcal{M}_{\text{VC}} \subseteq \{\text{URC}, \text{IC}\}$, where URC = Unedited Region Consistency and IC = Identity Consistency.
  3. Visual Quality (VQ): Evaluates visual plausibility, coherence, and artifact-free nature.
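
The IA and VC definitions above average only the sub-metrics that apply to a given instance. A minimal sketch of that averaging follows, assuming that a non-applicable sub-metric is marked with `None`; the function name `dimension_score` and the `None` convention are illustrative assumptions, not part of the benchmark's released code.

```python
def dimension_score(sub_scores):
    """Average the sub-metric scores that apply to an instance.

    sub_scores maps a sub-metric name (e.g. "IF", "WA") to a judge score,
    or to None when the sub-metric does not apply (an assumed convention).
    """
    applicable = [s for s in sub_scores.values() if s is not None]
    if not applicable:
        raise ValueError("at least one sub-metric must apply")
    return sum(applicable) / len(applicable)


# Both IA sub-metrics apply: the dimension score is their mean.
ia_reasoning = dimension_score({"IF": 4, "WA": 2})
# Only Instruction Following applies: IA reduces to IF.
ia_basic = dimension_score({"IF": 5, "WA": None})
```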

Overall Score Calculation: A weighted geometric mean is used, with category-specific weights:

  • General, Dynamic Manipulation, Multi-Image, Complex: $(w_{\text{IA}}, w_{\text{VC}}, w_{\text{VQ}}) = (0.4, 0.4, 0.2)$
  • World Knowledge Reasoning: $(0.5, 0.3, 0.2)$
  • Algorithmic Visual Reasoning: $(0.6, 0.2, 0.2)$
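
To make the weighting concrete, the sketch below computes a weighted geometric mean in its standard form, $S = \text{IA}^{w_{\text{IA}}} \cdot \text{VC}^{w_{\text{VC}}} \cdot \text{VQ}^{w_{\text{VQ}}}$ with weights summing to 1. The weights come from the list above; the function name `overall_score`, the dictionary layout, and the assumption that all dimension scores are strictly positive are illustrative choices.

```python
import math

# Category-specific weights (w_IA, w_VC, w_VQ) as listed above.
CATEGORY_WEIGHTS = {
    "General": (0.4, 0.4, 0.2),
    "Dynamic Manipulation": (0.4, 0.4, 0.2),
    "Multi-Image": (0.4, 0.4, 0.2),
    "Complex": (0.4, 0.4, 0.2),
    "World Knowledge Reasoning": (0.5, 0.3, 0.2),
    "Algorithmic Visual Reasoning": (0.6, 0.2, 0.2),
}


def overall_score(category, ia, vc, vq):
    """Weighted geometric mean of the three dimension scores.

    Assumes every score is strictly positive so the logarithm is defined.
    """
    if min(ia, vc, vq) <= 0:
        raise ValueError("scores must be strictly positive")
    w_ia, w_vc, w_vq = CATEGORY_WEIGHTS[category]
    log_mean = w_ia * math.log(ia) + w_vc * math.log(vc) + w_vq * math.log(vq)
    return math.exp(log_mean)


# The same dimension scores are penalized more for a weak IA
# in the reasoning-heavy category, where IA carries weight 0.6.
general = overall_score("General", ia=1.0, vc=4.0, vq=4.0)
reasoning = overall_score("Algorithmic Visual Reasoning", ia=1.0, vc=4.0, vq=4.0)
```

A geometric mean lets one weak dimension pull the overall score down more sharply than an arithmetic mean would, so an edit that ignores the instruction cannot be rescued by high visual quality alone. If scores are aggregated per instance, applying this function to category-level averages will not necessarily reproduce the reported category scores.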

3. EditReward-Compass Benchmark Construction

EditReward-Compass contains 2,251 preference pairs that simulate the reward modeling scenarios encountered during RL optimization.

Two-Stage Construction:

  1. Sampling Stage: Uses Edit-Compass as source data. Candidate outputs are sampled from diverse image editing models using a FlowGRPO-inspired strategy, introducing stochasticity to mimic RL optimization conditions.
  2. Human Annotation Stage: An eight-expert, two-stage pipeline ensures high-quality preference pairs with clear differences in instruction adherence, visual consistency, and quality.

Empirical Validation / Results

1. Image Editing Model Results (Edit-Compass)

Extensive evaluation of 29 models (25 open-source, 4 proprietary).

Key Quantitative Results (English Instructions):

Table 3: Main results on Edit-Compass under English instructions. Best results in bold for open- and closed-source models.

| Model | General (IA/VC/VQ) | Dynamic Manipulation (IA/VC/VQ) | World Knowledge (IA/VC/VQ) | Visual Reasoning (IA/VC/VQ) | Multi Image (IA/VC/VQ) | Complex (IA/VC/VQ) | Overall AVG |
|---|---|---|---|---|---|---|---|
| Qwen-Image-Edit-2511 (Open) | 3.81 / 3.93 / 3.16 | 2.33 / 3.56 / 3.25 | 1.26 / 2.53 / 4.09 | 3.27 / 3.55 / 2.92 | 1.90 / 2.84 / 3.80 | 2.80 / 3.81 / 3.02 | 2.69 |
| Nano Banana Pro (Closed) | 4.54 / 4.58 / 3.79 | 4.33 / 4.49 / 4.28 | 3.61 / 4.25 / 4.73 | 4.43 / 4.29 / 3.44 | 3.60 / 3.62 / 4.43 | 4.28 / 4.40 / 3.53 | 3.99 |

Main Findings:

  • Proprietary vs. Open-source Gap: The best proprietary model (Nano Banana Pro) achieves an overall score of 3.99, while the strongest open-source model (Qwen-Image-Edit-2511) reaches only 2.69, a gap of 1.30 points.
  • Task-Specific Weaknesses: Open-source models are competitive on basic tasks (General, Dynamic Manipulation) but show substantial gaps on more challenging categories such as World Knowledge Reasoning (Nano Banana Pro: 3.89 vs. Qwen-Image-Edit-2511: 1.74) and Algorithmic Visual Reasoning.
  • Cross-Lingual Performance: Some models perform noticeably better under English instructions, whereas advanced unified models show only marginal differences between languages, suggesting that robust multilingual understanding depends on balanced training.

2. Reward Model Results (EditReward-Compass)

Evaluation of 21 models across three categories: open-source general-purpose MLLMs, preference-trained reward models, and proprietary models.

Key Quantitative Results:

Table 5: Main results on EditReward-Compass. †/§ denote specific backbones; ‡ denotes thinking-enabled version.

| Method | Instruction Awareness | Visual Consistency | Visual Quality | AVG |
|---|---|---|---|---|
| Qwen3.5-9B (Open MLLM) | 0.6682 | 0.5075 | 0.4635 | 0.6016 |
| Qwen3.5-9B ‡ (Open MLLM w/ Thinking) | 0.7615 | 0.4860 | 0.4898 | 0.6681 |
| EditScore † (Preference-Trained) | 0.5092 | 0.4160 | 0.5890 | 0.4912 |
| Gemini 3.1 Pro (Proprietary) | 0.8324 | 0.6002 | 0.4459 | 0.7433 |
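
The summary does not state which metric Table 5 reports. One plausible reading, assumed here, is pairwise agreement: the fraction of preference pairs in which the reward model scores the human-preferred image higher. The sketch below shows that computation; the pair fields (`instruction`, `chosen`, `rejected`), the `score_fn` interface, and the toy scorer are hypothetical.

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of pairs where score_fn ranks the human-preferred edit higher.

    pairs: list of dicts with hypothetical keys "instruction", "chosen", "rejected".
    score_fn: callable (instruction, image) -> float, standing in for a reward model.
    Ties count as errors in this sketch.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for pair in pairs:
        chosen_score = score_fn(pair["instruction"], pair["chosen"])
        rejected_score = score_fn(pair["instruction"], pair["rejected"])
        if chosen_score > rejected_score:
            correct += 1
    return correct / len(pairs)


# Toy stand-in for a reward model: a lookup table of scores per image id.
toy_scores = {"a1": 0.9, "a2": 0.4, "b1": 0.3, "b2": 0.7, "c1": 0.8, "c2": 0.1}
toy_pairs = [
    {"instruction": "make the sky red", "chosen": "a1", "rejected": "a2"},
    {"instruction": "remove the car", "chosen": "b1", "rejected": "b2"},
    {"instruction": "swap the cups", "chosen": "c1", "rejected": "c2"},
]
accuracy = pairwise_accuracy(toy_pairs, lambda instruction, image: toy_scores[image])
```

Under this reading, a scorer that guesses at random would land near 0.5, which gives a reference point for the per-dimension columns.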

Main Findings:

  • Native MLLMs Outperform Trained Reward Models: Native multimodal models (e.g., Qwen3.5/3.6 series) achieve stronger overall performance than existing open-source reward models explicitly trained on preference data (e.g., EditScore, EditReward).
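  • Effect of Thinking-Enabled Inference: Enabling chain-of-thought reasoning consistently improves overall performance, with gains of up to roughly 10 points for some models. For example, Qwen3.5-9B rises from 0.6016 to 0.6681 AVG, although its Visual Consistency score dips from 0.5075 to 0.4860.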
  • Impact of System Prompts: The prompts designed for EditReward-Compass consistently outperform those from EditScore, with the largest gain of 12.93% on Qwen3-VL-8B.
  • Visual Consistency is Challenging: For all model types, Visual Consistency remains a more difficult dimension than Instruction Awareness.

3. Analysis and Additional Findings

  • Human-Aligned Evaluation: The proposed evaluation protocol shows a high Pearson correlation with human ratings, and human annotators prefer it over the protocols of existing benchmarks (ImgEdit-Bench, GEdit-Bench, RISE-Bench).
  • Visual Perception Ability: Open-source models perform well on basic tasks (e.g., Object Movement) but struggle with complex tasks like Object Swap and Complex Paint (which requires interpreting in-image visual annotations).
  • Algorithmic Visual Reasoning: This category remains a major challenge. Even closed-source models show limited overall performance, indicating current systems struggle to perform visual reasoning and faithfully execute derived edits.
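
The human-alignment claim above rests on the Pearson correlation between judge scores and human ratings. A minimal sketch of that statistic follows, written without external libraries; the function name `pearson` and the sample scores are illustrative and are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up judge scores and human ratings for five edited images.
judge_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
human_ratings = [4.0, 2.5, 3.0, 1.5, 4.5]
r = pearson(judge_scores, human_ratings)
```

A value near 1 means the judge orders and spaces the edits much as human raters do, while a value near 0 would mean the protocol carries little information about human judgment.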

Theoretical and Practical Implications

  • Standardized, Challenging Evaluation: Edit-Compass provides a comprehensive and progressively difficult benchmark that can reliably distinguish capability differences among advanced image editing models, guiding future research towards addressing identified weaknesses.
  • Realistic Reward Model Assessment: EditReward-Compass moves reward model evaluation closer to practical RL scenarios, enabling more faithful assessment of their quality and training effectiveness for image editing optimization.
  • Insight into Model Capabilities: The benchmark reveals that while current models perform reasonably well on shallow perception-level tasks, they still fundamentally struggle with deeper reasoning, world knowledge integration, and complex multi-image editing. This highlights a crucial direction for future model development.
  • Potential of Native MLLMs as Reward Models: The strong performance of native MLLMs suggests they possess inherent, strong capabilities for visual assessment, which could be leveraged more directly in RL pipelines or for developing more efficient reward models.

Conclusion

The paper introduces Edit-Compass and EditReward-Compass, a unified benchmark suite that enables systematic, human-aligned evaluation of frontier image editing systems and reward models. Key takeaways:

  • The benchmarks reveal a substantial performance gap between proprietary and open-source image editing models, with persistent weaknesses in reasoning-intensive and multi-image tasks.
  • Native multimodal large language models show strong potential as reward models, even outperforming some models explicitly trained on preference data.
  • The proposed fine-grained, rubric-based evaluation framework demonstrates higher alignment with human judgment compared to existing protocols.

Limitation & Future Work: The current evaluation relies on API-based MLLM judges, which may be affected by judge capabilities and version updates. Future work plans to develop a dedicated image-editing judge model for more stable and transparent evaluation.