# PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

> PerceptionRubrics reveals a reliability gap: models pass atomic checks but fail strict conjunctive constraints, exposing perceptual brittleness hidden by saturated benchmarks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.28322)
- **Published:** 2026-07-03
- **Permalink:** https://picx.dev/p/p7Ez40
- **Whiteboard:** https://picx.dev/p/p7Ez40/image

## Summary

- **PerceptionRubrics** introduces a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness in Multimodal Large Language Models (MLLMs).
- The framework pairs **1,038 information-dense images** with **over 10,000 instance-specific rubrics**, derived from golden captions constructed via a novel **Circular Peer-Review consensus pipeline**.
- Rubrics are organized into a dual-stream system: **Must-Right** (essential facts) and **Easy-Wrong** (fine-grained details), with a **Gated Scoring mechanism** that imposes sharp binary penalties for failure on mandatory visual facts.
- Key findings: (1) a **Reliability Gap**, in which models pass atomic checks but fail strict conjunctive constraints; (2) a persistent **perception deficit of over 8 percentage points** between the best open-source and proprietary models; and (3) **superior human alignment** compared to conventional benchmarks.
- The benchmark correlates with human preference far better than existing benchmarks (DOCCI, DetailCaps), reaching Pearson **0.916** and Spearman **1.000**.

## Introduction and Theoretical Foundation

### Background and Motivation

Current perception benchmarks suffer from an **evaluation paradox**: leaderboards are increasingly saturated in the high-score regime, yet models remain perceptually brittle in real-world deployment. Top-tier systems appear nearly tied on metrics yet exhibit drastically different failure modes, such as miscounting objects or inverting spatial relations, that are highly salient to users.

### Two Systemic Flaws in Existing Benchmarks

1. **Insufficient perceptual detail coverage**: Many benchmarks rely on information-poor images or narrow domains, often framing tasks as closed-form questions that allow models to "shortcut" through linguistic priors rather than genuine visual grounding.
2. **Uncalibrated reward signals**: Conventional metrics rely on linear averaging (e.g., CLIPScore, averaged multi-aspect schemes) that effectively "dilutes" fatal localized errors with general semantic overlap. In contrast, **human perception is strictly non-linear**: a single-digit hallucination in a financial table is a binary failure, not a permissible fluctuation.

### Design Principles

- **Enforcing Perceptual Persistence**: Prioritize complexity over scale using images with extreme information density to invalidate linguistic shortcuts.
- **Calibrating to Human Sensitivity**: Mirror the error-sensitive nature of human judgment where localized errors represent binary failures. Evaluation must be grounded in objective, fact-based checks (True/False) and rigorously penalize hallucinations.

## Methodology

### Image Curation

1,038 images are curated across **seven diverse categories**:
- Natural Scenes
- Document & OCR
- Digital UI & UX
- Structured Data (charts, plots, tables)
- STEM & Expert (scientific diagrams, medical imaging)
- Logic & Puzzle
- Creative & Cultural

A **density-aware filtering** process using Step3-VL-10B scores images on visual complexity and informativeness, retaining those above domain-specific thresholds.
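The filtering step can be pictured with a minimal sketch. The `ScoredImage` type, the `density` field, and every threshold value below are hypothetical; the source states only that a model-assigned score is compared against domain-specific thresholds.

```python
from dataclasses import dataclass


@dataclass
class ScoredImage:
    path: str
    domain: str
    density: float  # hypothetical complexity/informativeness score in [0, 1]


# Hypothetical per-domain cutoffs; the source does not report the actual values.
THRESHOLDS = {"natural": 0.6, "gui": 0.7}


def filter_dense(images, thresholds, default=0.5):
    """Keep images whose density score is above the threshold for their domain."""
    return [img for img in images if img.density > thresholds.get(img.domain, default)]


images = [
    ScoredImage("a.png", "natural", 0.65),
    ScoredImage("b.png", "gui", 0.65),
    ScoredImage("c.png", "gui", 0.80),
]
kept = filter_dense(images, THRESHOLDS)
print([img.path for img in kept])  # ['a.png', 'c.png']
```

The same score (0.65) passes in one domain and fails in another, which is the point of making the thresholds domain-specific.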

### Caption-Centric Perception Rubric Construction

The pipeline adopts an **intermediary strategy**: first transcribing visual information into text (golden captions), then distilling rules from text, bypassing the visual grounding gap of direct image-to-rubric methods.

#### Generating Golden Captions

A **two-step consensus-driven pipeline**:

1. **Circular Peer-Review**: Three distinct top-tier MLLMs (GPT-5.2, Gemini-3-Pro, Seed-1.8) generate independent descriptions, then iteratively compare, rank, and rewrite candidates to synthesize superior versions (limited to N ≤ 2 iterations).
2. **Strict Consensus Filtering**: Human experts intervene only as final verifiers. A **discard-on-divergence protocol** eliminates samples where models fail to reach unanimous agreement.
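The control flow of the two steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the callables `describers`, `reviewers`, and `unanimous` are hypothetical stand-ins for the MLLM calls and the agreement check, and stopping early once drafts agree is an assumption.

```python
from typing import Callable, List, Optional

Describer = Callable[[str], str]             # image id -> draft caption
Reviewer = Callable[[str, List[str]], str]   # image id + all drafts -> rewritten caption
Agreement = Callable[[List[str]], bool]      # True when drafts are judged unanimous


def golden_caption(
    image_id: str,
    describers: List[Describer],
    reviewers: List[Reviewer],
    unanimous: Agreement,
    max_rounds: int = 2,
) -> Optional[str]:
    # Step 1: independent drafts, then at most `max_rounds` review-and-rewrite passes.
    drafts = [describe(image_id) for describe in describers]
    for _ in range(max_rounds):
        if unanimous(drafts):
            break  # assumption: stop early once the drafts already agree
        drafts = [review(image_id, drafts) for review in reviewers]
    # Step 2: discard on divergence; survivors would go to human verification.
    if not unanimous(drafts):
        return None
    return drafts[0]


# Toy stand-ins (hypothetical): reviewers that adopt the longest draft converge,
# reviewers that each keep their own draft never do.
describers = [lambda _id, text=text: text for text in ("a cat", "a cat on a mat", "cat")]
keep_longest = lambda _id, drafts: max(drafts, key=len)
identical = lambda drafts: len(set(drafts)) == 1

converged = golden_caption("img-1", describers, [keep_longest] * 3, identical)
stubborn = [lambda _id, drafts, i=i: drafts[i] for i in range(3)]
diverged = golden_caption("img-1", describers, stubborn, identical)
print(converged, diverged)  # a cat on a mat None
```

Returning `None` rather than a best-effort caption mirrors the discard-on-divergence idea: a sample without consensus is dropped instead of repaired.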

#### Generating Perception Rubrics

Using Gemini-3-Pro as the rubric proposer, two complementary streams are constructed:

- **A Priori: Must-Right Rubrics** — Essential perceptual facts that a candidate must correctly identify, with domain-specific adaptive prompts.
- **A Posteriori: Easy-Wrong Rubrics** — Common pitfalls identified by analyzing discrepancies between actual model outputs and golden references, ensuring evaluation penalizes realistic mistakes.

### Evaluation Metric

An **LLM-as-a-Judge** framework (GPT-OSS-120B) evaluates each rubric item as boolean (True/False). The **gated scoring logic** is defined as:

**Must-Right as the Gate**: Let $R_m = \{ r_{m,1}, \dots, r_{m,j} \}$ be the set of $j$ Must-Right rubrics for an image. If the model fails any single criterion, the gate closes and the final score for that image is set to zero:

$$
G = \prod_{i=1}^{j} \mathbb{I}(r_{m,i} = \text{True})
$$

where $G \in \{0, 1\}$ represents the gate status.

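**Easy-Wrong for Granular Differentiation**: The final score $S$ multiplies the gate by the mean Easy-Wrong accuracy, so models that fail the gate ($G=0$) score zero and models that pass ($G=1$) are ranked by Easy-Wrong accuracy: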

$$
S = G \cdot \frac{1}{k} \sum_{i=1}^{k} \mathbb{I}(r_{e,i} = \text{True})
$$

where $R_e = \{ r_{e,1}, \dots, r_{e,k} \}$ are the Easy-Wrong rubrics.
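The two formulas translate directly into a few lines of Python. This is a minimal sketch that takes the judge's boolean verdicts as given; the function names are hypothetical, and returning 0.0 when an image has no Easy-Wrong rubrics is an assumption, since the source does not define that case.

```python
from typing import Sequence


def gate(must_right: Sequence[bool]) -> int:
    """G: 1 only if every Must-Right rubric is True (empty product is 1)."""
    return int(all(must_right))


def gated_score(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """S = G * mean(Easy-Wrong verdicts)."""
    g = gate(must_right)
    if g == 0 or not easy_wrong:
        return 0.0  # assumption: no Easy-Wrong rubrics yields a score of 0
    return g * sum(easy_wrong) / len(easy_wrong)


def linear_average(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """Ungated baseline for contrast: plain mean over all rubric verdicts."""
    verdicts = list(must_right) + list(easy_wrong)
    return sum(verdicts) / len(verdicts)


# Gate passes: the score is the Easy-Wrong accuracy.
print(gated_score([True, True, True], [True, False, True, True]))  # 0.75

# One Must-Right failure zeroes the score even though 6 of 7 checks pass.
failed = ([True, False, True], [True, True, True, True])
print(gated_score(*failed))               # 0.0
print(round(linear_average(*failed), 3))  # 0.857
```

The last two lines show the dilution effect that gating is meant to prevent: a linear average still reports about 0.86 for a response that misses a mandatory fact.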

## Empirical Validation / Results

### Benchmark Statistics

**Table 1: Detailed statistics of PerceptionRubrics.**

| Statistic | Value |
|---|---|
| Number of images | 1,038 |
| Number of captions | 1,038 |
| Average caption length (words) | 770.42 |
| Total number of rubrics | 10,718 |
| Must-right rubrics | 4,053 |
| Easy-wrong rubrics | 6,665 |
| Average rubrics per image | 10.33 |
| Must-right per image | 3.90 |
| Easy-wrong per image | 6.42 |
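The per-image averages follow from the totals in Table 1, as this short check shows (variable names are illustrative):

```python
images = 1038
must_right, easy_wrong = 4053, 6665

total = must_right + easy_wrong
print(total)                          # 10718
print(f"{total / images:.2f}")        # 10.33
print(f"{must_right / images:.2f}")   # 3.90
print(f"{easy_wrong / images:.2f}")   # 6.42
```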

### Main Results (Table 2 excerpt)

**Table 2: Performance breakdown for selected models (excerpt showing Overall, Natural, and GUI; the full table covers all 7 domains).**

| Model | Overall (%) | Natural (%) | GUI (%) |
|---|---|---|---|
| **Open-Source** | | | |
| Qwen2.5-VL-7B | 8.37 | 20.70 | 5.13 |
| Qwen3-VL-235B | 41.88 | 56.73 | 33.28 |
| Qwen3.5-397B | 61.61 | 68.51 | 54.76 |
| **Proprietary** | | | |
| GPT-4o-2024-05-13 | 12.59 | 23.89 | 7.01 |
| Gemini-3-Pro | 68.79 | 76.65 | 57.57 |
| Seed-2.0-Lite | **70.07** | **79.20** | **59.07** |

- **Seed-2.0-Lite** leads with **70.07%** overall score.
- **GPT-4o-2024-05-13** achieves only **12.59%**.
- Among the columns shown, every listed model scores highest on **Natural** and lowest on **GUI** (e.g., Qwen2.5-VL-7B drops to **5.13%**).
- The best open-source model (**Qwen3.5-397B**, 61.61%) trails the proprietary state of the art (70.07%) by **8.46 percentage points**.

### Reliability Gap

- **Atomic Accuracy** (mean accuracy of individual rubrics) is consistently high.
- **Must-Right Pass Rate** (average $G$) is substantially lower, revealing a systematic failure to satisfy strict conjunction of all constraints.
- This gap narrows as model capability increases.
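A simplified model shows why high atomic accuracy can coexist with a low pass rate. Assume, purely for illustration, that each of the $j$ Must-Right rubrics is satisfied independently with the same probability $p$ (real rubric outcomes are unlikely to be independent). The expected gate value is then:

$$
\mathbb{E}[G] = \prod_{i=1}^{j} p = p^{j}, \qquad p = 0.9,\ j = 4 \;\Rightarrow\; \mathbb{E}[G] = 0.9^{4} \approx 0.656
$$

Under this assumption, 90% atomic accuracy yields a pass rate of only about 66% with four mandatory facts, and the two quantities converge as $p \to 1$, consistent with the gap narrowing for stronger models.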

### Consistency of Perceptual Capabilities

Must-Right Pass Rate and Easy-Wrong accuracy show a **near-perfect linear correlation** ($R^2 \approx 0.98$), indicating that foundational perception is a prerequisite for fine-grained understanding.
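An $R^2$ of this kind comes from a least-squares line fitted to per-model points. The sketch below shows the computation; the data points are hypothetical placeholders and are not the paper's measurements.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus its R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical per-model values, for illustration only.
must_right_pass_rate = [0.10, 0.30, 0.50, 0.70]
easy_wrong_accuracy = [0.32, 0.41, 0.52, 0.60]

slope, intercept, r2 = linear_fit_r2(must_right_pass_rate, easy_wrong_accuracy)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```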

## Theoretical and Practical Implications

### Human Alignment

PerceptionRubrics achieves the strongest alignment with human preference (Vision Arena) among compared benchmarks:

| Metric | DOCCI | DetailCaps | **PerceptionRubrics** |
|---|---|---|---|
| Pearson | Weak | Moderate | **0.916** |
| Spearman | Low | Low | **1.000** |

- Existing benchmarks like DOCCI assign nearly indistinguishable scores to models with markedly different human-preference ratings.
- PerceptionRubrics provides a **more discriminative and human-aligned signal** for fine-grained perception evaluation.
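The two coefficients measure different things: Pearson captures linear agreement between scores, while Spearman depends only on rank order, so a Spearman value of 1.000 means the benchmark orders the models exactly as human preference does. The sketch below uses hypothetical numbers (not the paper's data) in which the ordering matches perfectly but the relationship is not linear.

```python
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical benchmark scores and human-preference ratings for four models.
benchmark_score = [1.0, 2.0, 3.0, 4.0]
human_rating = [1.0, 4.0, 9.0, 16.0]

print(round(spearman(benchmark_score, human_rating), 3))  # 1.0
print(pearson(benchmark_score, human_rating) < 1.0)       # True
```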

### Resistance to Length Bias

- Gemini-3.1-Pro shows **no statistically significant correlation** ($r = -0.079$, $p = 0.0758$) between caption length and score.
- Kimi-K2.6 shows a **weak positive correlation** ($r = 0.172$, $p = 1.09 \times 10^{-4}$) that is statistically significant but small in magnitude.
- PerceptionRubrics thus largely **decouples verbosity from evaluation outcomes**, rewarding precise and verifiable perception.
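A low $p$-value and a weak correlation are compatible because significance depends on sample size as well as effect size. Assuming the standard $t$-test for a Pearson correlation over $n$ paired samples (the source does not name the test used), the statistic and the share of variance explained are:

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad r = 0.172 \;\Rightarrow\; r^{2} \approx 0.030
$$

For a fixed $r$, $t$ grows with $\sqrt{n-2}$, so a small correlation can reach significance on a large sample. Even for Kimi-K2.6, caption length accounts for only about 3% of the variance in score.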

### Evaluation Robustness

- Repeated evaluations with two different judges (GPT-OSS-120B and GPT-5.5) yield **identical ranking orders** despite systematic score differences (~6.0%).
- Standard deviations remain consistently low across all configurations.
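Rank agreement can hold even when absolute scores differ, because a systematic shift does not reorder models. A minimal sketch with hypothetical scores, assuming for illustration that the second judge is uniformly 6 points stricter:

```python
def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores; model names and values are placeholders.
judge_a = {"model_x": 70.0, "model_y": 61.5, "model_z": 41.9}
judge_b = {model: score - 6.0 for model, score in judge_a.items()}

same_order = ranking(judge_a) == ranking(judge_b)
print(ranking(judge_a))  # ['model_x', 'model_y', 'model_z']
print(same_order)        # True
```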

### Rubric Coverage vs. Evaluation Stability

- Evaluation stability improves **monotonically** as rubric coverage increases from 20% to 80%.
- Sufficient rubric coverage is a prerequisite for **stable and reproducible perception assessment**.
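One way to picture this effect is to score repeatedly on random subsets of rubrics and watch the spread shrink as the subset grows. The sketch below is a simplification under stated assumptions: it uses plain mean accuracy over one hypothetical set of 20 verdicts rather than the gated score, and the function name, trial count, and seed are illustrative.

```python
import random
import statistics


def score_spread(outcomes, coverage, trials=200, seed=0):
    """Std. dev. of mean accuracy across random rubric subsets of a given coverage."""
    rng = random.Random(seed)
    k = max(1, round(coverage * len(outcomes)))
    means = [sum(rng.sample(outcomes, k)) / k for _ in range(trials)]
    return statistics.pstdev(means)


# Hypothetical judge verdicts: 14 rubrics passed, 6 failed.
outcomes = [True] * 14 + [False] * 6

for coverage in (0.2, 0.5, 0.8, 1.0):
    print(coverage, round(score_spread(outcomes, coverage), 3))
```

At full coverage every trial sees the same rubrics, so the spread is exactly zero; at low coverage the score depends heavily on which rubrics happen to be drawn.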

## Conclusion

PerceptionRubrics presents a rubric-based benchmark that calibrates multimodal evaluation to human perceptual judgment by decomposing dense image understanding into **atomic, verifiable rubrics** and enforcing a **gated scoring mechanism**. Key contributions:

- Exposes perceptual failures hidden by existing metrics, revealing a **clear reliability gap** between individual fact recognition and consistent conjunctive perception.
- Quantifies persistent weaknesses in **information-dense domains** such as GUIs.
- Demonstrates **strong alignment with human preferences**, outperforming conventional benchmarks.

The findings suggest that reliable multimodal evaluation should move beyond coarse similarity and explicitly audit critical visual facts. The framework provides a **sharper diagnostic tool** for measuring perceptual reliability and guiding the development of more trustworthy MLLMs.

**Future directions**: Extending the rubric-based approach to other multimodal tasks and further improving the automatic generation of rubrics to reduce human annotation costs.

