Visual Summary | PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Summary (Overview)

PerceptionRubrics introduces a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness in Multimodal Large Language Models (MLLMs).
The framework pairs 1,038 information-dense images with over 10,000 instance-specific rubrics, derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline.
Rubrics are organized into a dual-stream system: Must-Right (essential facts) and Easy-Wrong (fine-grained details), with a Gated Scoring mechanism that imposes sharp binary penalties for failure on mandatory visual facts.
Key findings reveal: (1) a Reliability Gap, in which models pass atomic checks but fail strict conjunctive constraints; (2) a persistent gap of roughly 8 percentage points between the open-source and proprietary frontiers; and (3) stronger human alignment than conventional benchmarks.
The benchmark correlates more closely with human preference than existing benchmarks (DOCCI, DetailCaps), reaching Pearson 0.916 and Spearman 1.000.

Introduction and Theoretical Foundation

Background and Motivation

Current perception benchmarks suffer from an evaluation paradox: leaderboards are increasingly saturated in the high-score regime, yet models remain perceptually brittle in real-world deployment. Top-tier systems appear nearly tied on metrics but exhibit drastically different failure modes — such as miscounting objects or inverting spatial relations — that are highly salient to users even when reported metric scores remain high.

Two Systemic Flaws in Existing Benchmarks

Insufficient perceptual detail coverage: Many benchmarks rely on information-poor images or narrow domains, often framing tasks as closed-form questions that allow models to "shortcut" through linguistic priors rather than genuine visual grounding.
Uncalibrated reward signals: Conventional metrics rely on linear averaging (e.g., CLIPScore, averaged multi-aspect schemes) that effectively "dilutes" fatal localized errors with general semantic overlap. In contrast, human perception is strictly non-linear: a single-digit hallucination in a financial table is a binary failure, not a permissible fluctuation.

Design Principles

Enforcing Perceptual Persistence: Prioritize complexity over scale using images with extreme information density to invalidate linguistic shortcuts.
Calibrating to Human Sensitivity: Mirror the error-sensitive nature of human judgment where localized errors represent binary failures. Evaluation must be grounded in objective, fact-based checks (True/False) and rigorously penalize hallucinations.

Methodology

Image Curation

1,038 images are curated across seven diverse categories:

Natural Scenes
Document & OCR
Digital UI & UX
Structured Data (charts, plots, tables)
STEM & Expert (scientific diagrams, medical imaging)
Logic & Puzzle
Creative & Cultural

A density-aware filtering process using Step3-VL-10B scores images on visual complexity and informativeness, retaining those above domain-specific thresholds.
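
A minimal sketch of the thresholding step, assuming each image has already been assigned a single scalar density score. The `ScoredImage` type, the `filter_by_density` helper, the domain keys, and the threshold values are all hypothetical; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredImage:
    image_id: str
    domain: str
    density: float  # assumed: one scalar score per image from the scoring model


def filter_by_density(images, thresholds):
    """Keep images whose score is above the threshold of their own domain."""
    kept = []
    for img in images:
        threshold = thresholds.get(img.domain)
        # Images from a domain with no configured threshold are dropped.
        if threshold is not None and img.density > threshold:
            kept.append(img)
    return kept


# Hypothetical thresholds and candidates, for illustration only.
thresholds = {"natural": 0.6, "gui": 0.7}
candidates = [
    ScoredImage("n1", "natural", 0.72),
    ScoredImage("n2", "natural", 0.41),
    ScoredImage("g1", "gui", 0.65),
    ScoredImage("g2", "gui", 0.88),
]
kept = filter_by_density(candidates, thresholds)
```

Because the threshold is looked up per domain, a score that survives in one category (0.65 would pass for "natural" here) can still be rejected in another.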

Caption-Centric Perception Rubric Construction

The pipeline adopts an intermediary strategy: it first transcribes visual information into text (golden captions), then distills rubrics from that text, bypassing the visual grounding gap of direct image-to-rubric methods.

Generating Golden Captions

A two-step consensus-driven pipeline:

Circular Peer-Review: Three distinct top-tier MLLMs (GPT-5.2, Gemini-3-Pro, Seed-1.8) generate independent descriptions, then iteratively compare, rank, and rewrite candidates to synthesize superior versions (limited to N ≤ 2 iterations).
Strict Consensus Filtering: Human experts intervene only as final verifiers. A discard-on-divergence protocol eliminates samples where models fail to reach unanimous agreement.
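
One way to read the two steps above is as a bounded revise-until-agreement loop. The sketch below illustrates that reading and is not the authors' implementation: the model calls are replaced by plain callables, `agree` stands in for whatever consensus check is actually used, and the final human verification is not modeled. All function names are hypothetical.

```python
def unanimous(captions, agree):
    """True when every caption agrees with the first one."""
    first = captions[0]
    return all(agree(first, other) for other in captions[1:])


def circular_peer_review(image, describe_fns, revise_fns, agree, max_rounds=2):
    """Return a consensus caption, or None if the reviewers still diverge."""
    # Step 1: each reviewer writes an independent description.
    captions = [describe(image) for describe in describe_fns]
    # Step 2: at most `max_rounds` rounds of revising against all candidates.
    for _ in range(max_rounds):
        if unanimous(captions, agree):
            break
        captions = [revise(image, captions) for revise in revise_fns]
    # Discard-on-divergence: no unanimous agreement means no golden caption.
    return captions[0] if unanimous(captions, agree) else None


# Toy stand-ins for the three reviewers (assumptions, not real model calls).
describe_fns = [
    lambda img: "A red car",
    lambda img: "a red car ",
    lambda img: "a red car",
]


def normalize(image, captions):
    return min(c.strip().lower() for c in captions)


def exact_match(a, b):
    return a == b


golden = circular_peer_review("img-001", describe_fns, [normalize] * 3, exact_match)

# Reviewers that never converge lead to the sample being discarded.
stubborn = [lambda img, caps, i=i: f"caption {i}" for i in range(3)]
discarded = circular_peer_review("img-002", describe_fns, stubborn, exact_match)
```

Returning `None` rather than a best-effort caption mirrors the discard-on-divergence idea: a sample without agreement contributes nothing downstream.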

Generating Perception Rubrics

Using Gemini-3-Pro as the rubric proposer, two complementary streams are constructed:

A Priori: Must-Right Rubrics — Essential perceptual facts that a candidate must correctly identify, with domain-specific adaptive prompts.
A Posteriori: Easy-Wrong Rubrics — Common pitfalls identified by analyzing discrepancies between actual model outputs and golden references, ensuring evaluation penalizes realistic mistakes.

Evaluation Metric

An LLM-as-a-Judge framework (GPT-OSS-120B) evaluates each rubric item as boolean (True/False). The gated scoring logic is defined as:

Must-Right as the Gate: Let $R_m = \{ r_{m,1}, \dots, r_{m,j} \}$ be the set of Must-Right rubrics. If the model fails any single criterion, the gate closes and the final score is set to zero:

$$G = \prod_{i=1}^{j} \mathbb{I}(r_{m,i} = \text{True})$$

where $G \in \{0, 1\}$ represents the gate status.

Easy-Wrong for Granular Differentiation: Let $R_e = \{ r_{e,1}, \dots, r_{e,k} \}$ be the set of Easy-Wrong rubrics. The final score $S$ scales the Easy-Wrong accuracy by the gate, so only models that pass the gate ($G=1$) can receive a nonzero score:

$$S = G \cdot \frac{1}{k} \sum_{i=1}^{k} \mathbb{I}(r_{e,i} = \text{True})$$
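
The two formulas translate directly into a few lines of Python. This is a minimal sketch: the function names are hypothetical, the judge verdicts are assumed to be available as plain booleans, and rejecting an empty Easy-Wrong set is an assumption, since the source does not say how $k = 0$ is handled. A plain linear average is included only for contrast.

```python
from typing import Sequence


def gated_score(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """S = G * (1/k) * sum of Easy-Wrong verdicts, with G the Must-Right gate."""
    if not easy_wrong:
        # Assumption: k >= 1; the source does not define the score for k = 0.
        raise ValueError("at least one Easy-Wrong rubric is required")
    gate = 1 if all(must_right) else 0  # product of indicator functions
    return gate * sum(easy_wrong) / len(easy_wrong)


def linear_average(must_right: Sequence[bool], easy_wrong: Sequence[bool]) -> float:
    """Ungated mean over all verdicts, shown only for comparison."""
    verdicts = list(must_right) + list(easy_wrong)
    return sum(verdicts) / len(verdicts)


# One failed Must-Right check zeroes the gated score ...
failed_gate = gated_score([True, True, False, True], [True] * 5 + [False])
# ... while a linear average over the same ten verdicts still looks strong.
diluted = linear_average([True, True, False, True], [True] * 5 + [False])
# With the gate open, the score is the Easy-Wrong accuracy.
passed_gate = gated_score([True] * 4, [True, True, True, False])
```

In this toy case the same response scores 0.0 under gating and 0.8 under averaging, which is the dilution effect the gate is meant to remove.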

Empirical Validation / Results

Benchmark Statistics

Table 1: Detailed statistics of PerceptionRubrics.

Statistic	Value
Number of images	1,038
Number of captions	1,038
Average caption length (words)	770.42
Total number of rubrics	10,718
Must-right rubrics	4,053
Easy-wrong rubrics	6,665
Average rubrics per image	10.33
Must-right per image	3.90
Easy-wrong per image	6.42
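
The per-image averages in Table 1 follow from the counts in the same table, which can be checked with a few lines of arithmetic (variable names are illustrative):

```python
num_images = 1038
must_right = 4053
easy_wrong = 6665

total_rubrics = must_right + easy_wrong  # 10718, matching the reported total

per_image = {
    "all": total_rubrics / num_images,
    "must_right": must_right / num_images,
    "easy_wrong": easy_wrong / num_images,
}
for name, value in per_image.items():
    print(f"{name}: {value:.2f}")
# prints:
# all: 10.33
# must_right: 3.90
# easy_wrong: 6.42
```

The two rounded stream averages sum to 10.32 rather than 10.33; the difference comes from rounding each value separately.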

Main Results (Table 2 excerpt)

Table 2: Fine-grained performance breakdown across 7 domains (selected models).

Model	Overall (%)	Natural (%)	GUI (%)
Open-Source
Qwen2.5-VL-7B	8.37	20.70	5.13
Qwen3-VL-235B	41.88	56.73	33.28
Qwen3.5-397B	61.61	68.51	54.76
Proprietary
GPT-4o-2024	12.59	23.89	7.01
Gemini-3-Pro	68.79	76.65	57.57
Seed-2.0-Lite	70.07	79.20	59.07

Seed-2.0-Lite leads with 70.07% overall score.
GPT-4o-2024-05-13 achieves only 12.59%.
Performance is highest on Natural domains and lowest on GUI (e.g., Qwen2.5-VL-7B drops to 5.13%).
The best open-source model (Qwen3.5-397B, 61.61%) trails the proprietary state of the art (Seed-2.0-Lite, 70.07%) by over 8 percentage points.

Reliability Gap

Atomic Accuracy (mean accuracy of individual rubrics) is consistently high.
Must-Right Pass Rate (the average of $G$) is substantially lower, revealing a systematic failure to satisfy the strict conjunction of all Must-Right constraints.
This gap narrows as model capability increases.
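
The contrast between the two quantities can be made concrete on toy data. The sketch below assumes the verdicts are available as one list of booleans per image and, for direct comparability, pools only Must-Right verdicts; which rubrics the source pools for Atomic Accuracy is not stated here. The function names and the data are hypothetical.

```python
def atomic_accuracy(per_image_verdicts):
    """Mean over every individual verdict, ignoring image boundaries."""
    flat = [v for verdicts in per_image_verdicts for v in verdicts]
    return sum(flat) / len(flat)


def must_right_pass_rate(per_image_verdicts):
    """Mean of the gate G: the fraction of images with no failed check."""
    gates = [all(verdicts) for verdicts in per_image_verdicts]
    return sum(gates) / len(gates)


# Four hypothetical images with four Must-Right verdicts each.
verdicts = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, True, True],
    [True, True, True, True],
]
atomic = atomic_accuracy(verdicts)          # 14 of 16 checks pass
pass_rate = must_right_pass_rate(verdicts)  # 2 of 4 images pass every check
```

Here 87.5% atomic accuracy coexists with a 50% pass rate. As a rough intuition, if each of $j$ checks passed independently with probability $p$, the gate would open with probability $p^j$; with $p = 0.9$ and $j = 4$ that is about 0.66. Independence is an assumption made only for this illustration.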

Consistency of Perceptual Capabilities

Must-Right Pass Rate and Easy-Wrong accuracy show a near-perfect linear correlation ($R^2 \approx 0.98$), suggesting that foundational perception is a prerequisite for fine-grained understanding.

Theoretical and Practical Implications

Human Alignment

PerceptionRubrics achieves the strongest alignment with human preference (Vision Arena) among compared benchmarks:

Metric	DOCCI	DetailCaps	PerceptionRubrics
Pearson	Weak	Moderate	0.916
Spearman	Low	Low	1.000

Existing benchmarks like DOCCI assign nearly indistinguishable scores to models with markedly different human-preference ratings.
PerceptionRubrics provides a more discriminative and human-aligned signal for fine-grained perception evaluation.
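
For readers unfamiliar with the two coefficients, the sketch below computes both on made-up numbers. The benchmark scores and human ratings are hypothetical and are not taken from the paper; the point is only that Spearman reaches 1.0 whenever the two rankings are identical, while Pearson also depends on how linear the relationship is.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-model values, for illustration only.
benchmark_scores = [12.0, 40.0, 60.0, 70.0]
human_ratings = [1000.0, 1100.0, 1150.0, 1300.0]

r_pearson = pearson(benchmark_scores, human_ratings)
r_spearman = spearman(benchmark_scores, human_ratings)
```

In this toy example the orderings match exactly, so `r_spearman` is 1.0, while `r_pearson` is below 1 because the points do not lie on a straight line.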

Resistance to Length Bias

Gemini-3.1-Pro shows no statistically significant correlation ($r = -0.079$, $p = 0.0758$) between caption length and score.
Kimi-K2.6 shows a weak, though statistically significant, positive correlation ($r = 0.172$, $p = 1.09 \times 10^{-4}$).
With correlations this small, PerceptionRubrics largely decouples verbosity from evaluation outcomes, rewarding precise and verifiable perception.

Evaluation Robustness

Repeated evaluations with two different judges (GPT-OSS-120B and GPT-5.5) yield identical ranking orders despite systematic score differences (~6.0%).
Standard deviations remain consistently low across all configurations.
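
A check of this kind can be sketched as follows, assuming each judge produces one overall score per model per run. The model names and scores are invented for illustration, and `ranking` is a hypothetical helper.

```python
from statistics import mean, stdev


def ranking(scores):
    """Model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical overall scores (%) from three repeated runs per judge.
runs_judge_a = {
    "model-x": [70.1, 69.9, 70.3],
    "model-y": [61.5, 61.8, 61.6],
    "model-z": [41.7, 42.0, 41.9],
}
runs_judge_b = {
    "model-x": [64.0, 64.2, 63.9],
    "model-y": [55.7, 55.5, 55.9],
    "model-z": [35.8, 36.1, 35.6],
}

mean_a = {model: mean(runs) for model, runs in runs_judge_a.items()}
mean_b = {model: mean(runs) for model, runs in runs_judge_b.items()}

# Rankings can agree even when one judge is systematically stricter.
same_order = ranking(mean_a) == ranking(mean_b)
offsets = {model: mean_a[model] - mean_b[model] for model in mean_a}
# Run-to-run spread per model under the first judge.
spread_a = {model: stdev(runs) for model, runs in runs_judge_a.items()}
```

Comparing rank order rather than raw scores is what makes a constant offset between judges harmless for leaderboard purposes.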

Rubric Coverage vs. Evaluation Stability

Evaluation stability improves monotonically as rubric coverage increases from 20% to 80%.
Sufficient rubric coverage is a prerequisite for stable and reproducible perception assessment.
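
The intuition can be illustrated with a small subsampling experiment. The source does not define its stability measure here, so the sketch assumes one: the spread of mean rubric accuracy across random rubric subsets of a given coverage. The `subsample_spread` helper and the toy verdicts are hypothetical, and plain accuracy is used instead of the gated score to keep the example short.

```python
import random
from statistics import pstdev


def subsample_spread(verdicts, coverage, trials=200, seed=0):
    """Std. dev. of accuracy over random subsets holding `coverage` of the rubrics."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same result
    size = max(1, round(coverage * len(verdicts)))
    accuracies = []
    for _ in range(trials):
        subset = rng.sample(verdicts, size)  # sample without replacement
        accuracies.append(sum(subset) / size)
    return pstdev(accuracies)


# Ten hypothetical rubric verdicts for one response: 7 passed, 3 failed.
verdicts = [True] * 7 + [False] * 3
spread_by_coverage = {
    coverage: subsample_spread(verdicts, coverage)
    for coverage in (0.2, 0.5, 0.8, 1.0)
}
```

At full coverage every subset is the complete rubric set, so the spread is exactly zero; at low coverage the score depends heavily on which rubrics happen to be drawn.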

Conclusion

PerceptionRubrics presents a rubric-based benchmark that calibrates multimodal evaluation to human perceptual judgment by decomposing dense image understanding into atomic, verifiable rubrics and enforcing a gated scoring mechanism. Key contributions:

Exposes perceptual failures hidden by existing metrics, revealing a clear reliability gap between individual fact recognition and consistent conjunctive perception.
Quantifies persistent weaknesses in information-dense domains such as GUIs.
Demonstrates strong alignment with human preferences, outperforming conventional benchmarks.

The findings suggest that reliable multimodal evaluation should move beyond coarse similarity and explicitly audit critical visual facts. The framework provides a sharper diagnostic tool for measuring perceptual reliability and guiding the development of more trustworthy MLLMs.

Future directions: Extending the rubric-based approach to other multimodal tasks and further improving the automatic generation of rubrics to reduce human annotation costs.