Summary (Overview)

Introduces CiteVQA: A novel Document Visual Question Answering (Doc-VQA) benchmark that requires models to provide both correct answers and precise, element-level bounding-box citations as supporting evidence. It comprises 1,897 questions from 711 multi-page, multi-domain PDFs.
Proposes Strict Attributed Accuracy (SAA): A core evaluation metric that credits a prediction only when the answer is correct and the cited evidence region is correct, moving beyond answer-only evaluation.
Exposes "Attribution Hallucination": A critical failure mode where models generate correct answers but ground them in incorrect or irrelevant visual evidence. An audit of 20 MLLMs shows a large gap between answer accuracy and SAA, with the strongest model (Gemini-3.1-Pro-Preview) achieving only 76.0 SAA.
Develops Automated Pipeline: A scalable, automated annotation pipeline for generating high-quality question-answer-citation triplets, validated by expert review, overcoming the cost and inconsistency of manual granular annotation.

Introduction and Theoretical Foundation

Multimodal Large Language Models (MLLMs) have advanced document understanding, but current Doc-VQA evaluations focus almost entirely on final answer accuracy. This masks a critical reliability issue: a model can arrive at a correct answer while basing it on the wrong source passage, a phenomenon termed "Attribution Hallucination." This poses a severe risk in high-stakes domains like law, finance, and medicine, where conclusions must be traceable to specific evidence.

The theoretical foundation is the need for Trustworthy Document Intelligence, which requires not just information extraction but also faithful evidence attribution. Existing benchmarks either lack evidence annotations or evaluate evidence and answers separately, failing to provide a joint, sample-level audit of reasoning faithfulness.

To address this, the paper introduces CiteVQA, a benchmark designed to evaluate models on their ability to perform faithful evidence attribution by requiring element-level visual citations alongside each answer.

Methodology

1. Dataset Construction via Automated Pipeline

A four-stage automated pipeline creates the CiteVQA dataset:

Multi-doc Linking: Aggregates semantically related documents into groups $D$ via vector similarity and LLM-based section alignment to support cross-document reasoning.
Evidence Package Extraction: Uses MinerU2.5 for fine-grained document parsing and MLLM agents to navigate the parsed bounding-box (BBox) space, concatenating scattered facts into a cohesive Evidence Package.
QA Construction: Distills real-world questions from open-source datasets into logical templates (e.g., Factual Retrieval, Complex Synthesis). An MLLM selects a template and synthesizes a QA pair based on the Evidence Package.
Quality Control: Includes:
- Answerability Verification: An MLLM confirms the question is answerable from the evidence.
- Relevance Filtering: Questions answerable without document context (common knowledge) are discarded.
- Crucial Evidence Identification: An ablation procedure in which each BBox in the Evidence Package is masked in turn; if masking it causes an MLLM to fail, that element is labeled as "Crucial Evidence" $B_{crucial}$.
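
The Crucial Evidence Identification step can be read as a leave-one-out loop over the Evidence Package. The following is a minimal sketch under that reading, not the authors' implementation: `find_crucial_evidence`, `answers_correctly`, and `toy_model` are hypothetical names, and the callback stands in for "run the MLLM on the unmasked evidence and check its answer".

```python
from typing import Callable, Hashable, List, Sequence


def find_crucial_evidence(
    evidence: Sequence[Hashable],
    answers_correctly: Callable[[List[Hashable]], bool],
) -> List[Hashable]:
    """Return the evidence elements whose masking makes the model fail.

    `evidence` holds one entry per BBox element of the Evidence Package.
    `answers_correctly` is a hypothetical stand-in for the MLLM call plus
    the answer check; it receives the elements that remain visible.
    """
    crucial = []
    for i, element in enumerate(evidence):
        # Mask exactly one element and keep all others visible.
        visible = [e for j, e in enumerate(evidence) if j != i]
        if not answers_correctly(visible):
            crucial.append(element)
    return crucial


# Toy stand-in for the MLLM: it succeeds only if both revenue facts are visible.
def toy_model(visible: List[Hashable]) -> bool:
    return {"revenue_2023", "revenue_2022"} <= set(visible)


package = ["revenue_2023", "revenue_2022", "ceo_letter"]
crucial = find_crucial_evidence(package, toy_model)
print(crucial)
```

In this toy run the two revenue elements are marked crucial, while the unrelated element is not, because masking it leaves the answer intact.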

2. Dataset Statistics

CiteVQA is diverse and complex, as summarized in Table 2:

711 documents across 7 macro-domains (Business Finance, Academic Tech, etc.) and 30 sub-categories.
Average document length: 40.6 pages.
1,897 questions covering:
- Scenarios: Single-doc (52.0%), Multi-doc with one gold document (25.7%), Multi-doc with multiple gold documents (22.3%).
- Question Types: Complex Synthesis (44.23%), Factual Retrieval (26.30%), Multimodal Parsing (18.56%), Quantitative Reasoning (10.91%).
Evidence sources are 70.12% text, 21.99% tables, 7.04% images, and 0.84% equations.

3. Evaluation Metrics

Each sample is $(D, Q, A_{gt}, B_{gt})$, with model output $\hat{Y} = \{(A_1, b_1), \ldots, (A_n, b_n)\}$, where each answer $A_i$ is paired with its cited evidence $b_i$, and $B_{pred} = \{b_1, \ldots, b_n\}$.

The key metrics are:

Recall (Rec.): Coarse-grained localization of crucial evidence. $\mathrm{Rec.} = \frac{1}{|B_{crucial}|} \sum_{b_{gt} \in B_{crucial}} \mathbb{1}\left[ \max_{b_{pred} \in B_{pred}} \mathrm{IoU}(b_{pred}, b_{gt}) \geq 0.5 \right]$
Relevance (Rel.): Logical alignment between each predicted evidence and its answer, scored 0–5 by an LLM judge $J_{rel}$. $\mathrm{Rel.} = \frac{1}{n} \sum_{i=1}^{n} J_{rel}(A_i, b_i) \in [0,5]$
Answer Correctness (Ans.): Semantic match between predicted and ground-truth answers, scored 0–5 by an LLM judge $J_{ans}$. $\mathrm{Ans.} = J_{ans}(\{A_1, A_2, \ldots, A_n\}, A_{gt}) \in [0,5]$
Strict Attributed Accuracy (SAA): The sample-level binary metric requiring both high answer quality and correct grounding. $\mathrm{SAA} = \mathbb{1}\left[ \mathrm{Ans.} \geq 4 \land (\mathrm{Rel.} \geq 4 \lor \mathrm{Rec.} \geq 0.6) \right]$
For reporting, scores are normalized to a 100-point scale (Rel. and Ans. are multiplied by 20).
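
The Recall and SAA formulas above translate directly into code. The sketch below is illustrative only: the `(x0, y0, x1, y1)` box format, the function names, and the handling of an empty crucial set are assumptions, and all boxes are assumed to lie on the same page. The judge scores Ans. and Rel. are taken as given inputs on their raw 0–5 scale, and Rec. as a fraction in [0, 1].

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(crucial: Sequence[Box], predicted: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of crucial boxes matched by some prediction with IoU >= thr."""
    if not crucial:
        return 0.0  # assumption: the source does not define this case
    # "max IoU >= thr" is equivalent to "any IoU >= thr", and also
    # handles an empty prediction list without a special case.
    hits = sum(1 for gt in crucial if any(iou(p, gt) >= thr for p in predicted))
    return hits / len(crucial)


def saa(ans: float, rel: float, rec: float) -> int:
    """SAA indicator on raw scores: Ans. and Rel. in [0, 5], Rec. in [0, 1]."""
    return int(ans >= 4 and (rel >= 4 or rec >= 0.6))


crucial_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted_boxes = [(0, 0, 10, 8)]
rec = recall(crucial_boxes, predicted_boxes)
print(rec, saa(ans=5, rel=3, rec=rec), saa(ans=5, rel=4, rec=rec))
```

In this example only one of the two crucial boxes is matched, so Rec. is 0.5. A correct answer with Rel. of 3 therefore fails SAA, while the same answer with Rel. of 4 passes through the relevance branch of the disjunction.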

Additional metrics include Page-level Recall, Precision, and F1-score for comprehensive localization assessment.
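
The summary does not spell out how the page-level metrics are computed. One plausible reading, shown here purely as an assumption, treats gold and predicted citations as sets of `(document id, page number)` pairs and applies the standard set-based precision, recall, and F1. The names `Page` and `page_prf` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Page = Tuple[str, int]  # (document id, page number); assumed representation


def page_prf(gold: Iterable[Page], predicted: Iterable[Page]) -> Dict[str, float]:
    """Set-based page-level precision, recall, and F1 (assumed definition)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold_pages = [("a.pdf", 3), ("a.pdf", 7)]
pred_pages = [("a.pdf", 3), ("b.pdf", 1), ("a.pdf", 9)]
scores = page_prf(gold_pages, pred_pages)
print(scores)
```

Here one of three cited pages is correct and one of two gold pages is found, which gives precision 1/3, recall 1/2, and F1 0.4.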

Empirical Validation / Results

The paper evaluates 20 state-of-the-art MLLMs (closed-source, open-source large, open-source small) on CiteVQA. The main results are presented in Table 3.

Table 3: Comprehensive Evaluation of CiteVQA across Different Document Scenarios (Overall scores shown for 8 of the 20 evaluated models).

Model	Rec.	Rel.	Ans.	SAA
Closed-source MLLMs
Gemini-3.1-Pro-Preview	66.0	83.6	86.1	76.0
Gemini-3-Flash-Preview	45.4	75.7	84.5	65.4
GPT-5.4	31.0	67.5	87.1	59.0
GPT-5.2	18.2	56.6	71.5	33.7
Qwen3.6-Plus	7.7	25.0	85.9	17.5
Open-source Large MLLMs
Qwen3-VL-235B-A22B	11.3	35.3	72.3	22.5
Qwen3.5-27B	5.3	25.3	75.6	17.3
Open-source Small MLLMs
Qwen3-VL-8B	1.0	14.7	61.2	7.5

Key Findings:

Pervasive Attribution Hallucination: A significant gap exists between Ans. and SAA across all models. For example, GPT-5.4 has an Ans. of 87.1 but an SAA of only 59.0. Models frequently give correct answers while citing wrong evidence.
Large Performance Disparity: The top closed-source models dominate, although not every closed-source model does (Qwen3.6-Plus reaches only 17.5 SAA). The strongest open-source model (Qwen3-VL-235B) achieves only 22.5 SAA, and small models often fall below 10.0, indicating high risk for deployment in critical domains.
Difficulty Scales with Complexity: Performance degrades from Single-Doc to Multi-Doc scenarios. For instance, Gemini-3.1-Pro-Preview's Recall drops from 68.9 (Single-Doc) to 55.3 (Multi N-Gold).
Coarse-grained Attribution is Deficient: As shown in supplementary Table 12, even Page-level Recall is low for many models (e.g., 57.8% for Qwen3-VL-235B), indicating models often fail to locate the correct page, not just the precise BBox.
Question Type Analysis: Models perform best on Quantitative Reasoning (objective logic) and worst on Multimodal Parsing (requires locating elements by descriptive cues before parsing).
Evidence Attribution as a Performance Driver: Ablation studies (Table 4) show that providing the model with the correct pages or gold documents (narrowing search space) leads to performance gains (e.g., +13.4% Ans. for Qwen3-VL-8B), suggesting better autonomous attribution could improve answer capability.
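
The size of the gap between Ans. and SAA can be read off the Table 3 values with a few lines of arithmetic. The snippet below is a small worked example, with scores copied from the table. Note that Ans. is a mean judge score rescaled to 100 while SAA is a binary pass rate, so the difference is an indicative measure rather than an exact rate of attribution hallucination.

```python
# Overall (Ans., SAA) scores copied from Table 3, on the 100-point scale.
table3 = {
    "Gemini-3.1-Pro-Preview": (86.1, 76.0),
    "Gemini-3-Flash-Preview": (84.5, 65.4),
    "GPT-5.4": (87.1, 59.0),
    "GPT-5.2": (71.5, 33.7),
    "Qwen3.6-Plus": (85.9, 17.5),
    "Qwen3-VL-235B-A22B": (72.3, 22.5),
    "Qwen3.5-27B": (75.6, 17.3),
    "Qwen3-VL-8B": (61.2, 7.5),
}

# Gap between answer correctness and strict attributed accuracy per model.
gaps = {model: round(ans - saa, 1) for model, (ans, saa) in table3.items()}

# Print models from smallest to largest gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:<24} Ans. - SAA = {gap:5.1f}")
```

With these values the smallest gap is 10.1 points (Gemini-3.1-Pro-Preview) and the largest is 68.4 points (Qwen3.6-Plus), whose answer score is close to the best in the table even though its SAA is among the lowest.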

Theoretical and Practical Implications

Theoretical Implications:

Shifts Evaluation Paradigm: CiteVQA moves Doc-VQA evaluation from answer-centric to joint evidence-answer verification, establishing a new standard for measuring reasoning faithfulness and traceability.
Highlights a Critical Gap: The discovery of widespread "Attribution Hallucination" reveals a fundamental logical fracture in current MLLMs that answer-only benchmarks completely overlook.
Suggests Synergy: The observed correlation between improving evidence quality and answer accuracy (Figure 6) hints that precise evidence localization may be more than post-hoc justification—it could be a functional foundation for correct reasoning in complex tasks.

Practical Implications:

Risk Assessment for High-Stakes Applications: The low SAA scores, especially for open-source and small models, provide clear evidence that these models are not yet reliable for domains like law, finance, and medicine, where traceability is mandatory.
Guidance for Model Development: CiteVQA provides a rigorous testbed for developing and improving models with faithful attribution capabilities. The benchmark and metrics can drive research towards more interpretable and trustworthy document intelligence systems.
Enables Reliable Auditing: The SAA metric and the benchmark allow for the auditing of model reliability in a way that mimics real-world professional scrutiny, where an unsourced correct answer is still considered unreliable.

Conclusion

CiteVQA advances the field towards trustworthy document intelligence by introducing a benchmark that requires faithful evidence attribution via element-level visual citations. The automated pipeline enables the creation of a large-scale, high-quality, multi-domain dataset. The comprehensive evaluation exposes the critical and pervasive issue of "Attribution Hallucination," where state-of-the-art models generate correct answers grounded in incorrect evidence. By providing the rigorous instrumentation needed to measure and close this reliability gap, CiteVQA establishes a new standard for developing interpretable and reliable multimodal systems for real-world, high-stakes applications.