Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Summary (Overview)

  • Introduces MADQA: A benchmark of 2,250 human-authored questions over 800 heterogeneous PDF documents designed to evaluate multimodal agentic systems' strategic reasoning over document collections.
  • Defines Agentic Document Collection VQA: A task formalized with six core properties (Extractive, Multi-Hop, Closed-World, Grounded, Agentic, Visual) distinguishing it from standard document QA.
  • Develops Novel Evaluation Protocol: Measures not only accuracy and attribution (Page/Doc F1) but also a novel calibration metric (Kuiper statistic) to assess the efficiency-effort trade-off of agentic behavior.
  • Key Finding: The best agents (e.g., the Gemini 3 Pro BM25 Agent) can match human accuracy (~82%) but succeed on largely different questions and rely on brute-force search, failing to close a ~17-point gap to oracle retrieval. They also exhibit far poorer effort calibration than humans.
  • Releases Dataset & Framework: Provides a validated benchmark, evaluation harness, and baseline implementations to facilitate research on efficient, strategic reasoning over documents.

Introduction and Theoretical Foundation

The paper addresses a critical gap in evaluating multimodal large language model (MLLM) based agentic systems for complex, multi-stage information retrieval and reasoning tasks in enterprise settings. Existing benchmarks are fragmented, suffering from limitations in:

  1. Format: Ignoring visual comprehension required for real-world PDFs.
  2. Scope: Being domain-specific or using single-step metrics that fail to capture iterative planning.
  3. Data Integrity: Using MLLM-generated questions/answers (introducing bias) or recycling documents (risk of contamination).

To address this, the authors introduce the Multimodal Agentic Document QA (MADQA) benchmark. The core task is defined as: Given a corpus C of multi-page documents and a natural language query q, produce an answer a and a minimal evidence set E ⊆ C.

The task is distinguished by six formal properties:

| # | Property | Definition |
|---|----------|------------|
| 1 | Extractive | Answer tokens must appear physically in the evidence set E. |
| 2 | Multi-Hop | E may span disjoint pages (cross-page) or documents (cross-doc). |
| 3 | Closed-World | The answer is derived solely from C; no external parametric knowledge. |
| 4 | Grounded | E must entail a and be minimal (no superfluous pages). |
| 5 | Agentic | No single retrieval query q' may exist such that RETRIEVE(q') ⊇ E. |
| 6 | Visual | Answering may require non-textual information (layout, tables, figures) in E. |

Properties 1, 3, and 4 are enforced by construction; properties 2, 5, and 6 are targeted by design. The agentic property necessitates planning, navigation, and aggregation.
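To make the Extractive property concrete, here is a minimal, illustrative check that every answer token occurs somewhere in the evidence text. This is a sketch under simplifying assumptions (whitespace tokenization, no punctuation normalization), not the paper's actual enforcement mechanism, which operates by construction during annotation:

```python
def is_extractive(answer: str, evidence_pages: list[str]) -> bool:
    """Sketch of the Extractive property: every whitespace-delimited
    answer token must occur in the concatenated evidence text.
    Punctuation normalization is deliberately omitted for brevity."""
    evidence_tokens = set(" ".join(evidence_pages).lower().split())
    return all(tok in evidence_tokens for tok in answer.lower().split())
```

A real verifier would also need to handle tokenization differences between OCR output and the annotated answer string.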

Methodology

Dataset Construction

  • Documents: 800 PDFs manually curated from DocumentCloud, covering 13 high-level domains (Financial, Reports, Government, Legal, etc.) with high layout diversity (see Figure 2 heatmap). Documents range from single-page to 800+ pages.
  • Questions: 2,250 QA pairs authored by professional annotators following strict guidelines (answerable from documents, specific but not revealing source easily). Distribution: 82.7% single-hop, 17.3% multi-hop (8.3% cross-page, 9.0% cross-doc).
  • Quality Assurance: Two-step verification combining GPT-5 (given oracle evidence) with manual review; only annotators whose verified work contained zero errors were retained.
  • Construct Validity Analysis:
    • Lexical Overlap: Precision of unigram/bigram/trigram matching with gold evidence is very low, confirming need for semantic understanding.
    • Parametric Knowledge: Models' "guessability" (answering without documents) averages 11.2%, with ~8% attributed to training data contamination.
    • Visual Necessity: Only 42% of questions can be answered from free text alone; 58% benefit from understanding structured layouts, tabular data, or visual artifacts (see Figure 4).
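The lexical-overlap analysis above can be sketched as an n-gram precision measure: the fraction of question n-grams that also appear in the gold evidence. This is an illustrative reimplementation (whitespace tokenization assumed), not the paper's exact script:

```python
def ngram_precision(question: str, evidence: str, n: int) -> float:
    """Fraction of the question's n-grams that also occur in the evidence.
    Low values for n = 1, 2, 3 indicate that questions cannot be answered
    by lexical matching alone and require semantic understanding."""
    def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    e_grams = set(ngrams(evidence, n))
    return sum(g in e_grams for g in q_grams) / len(q_grams)
```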

Principled Splits Creation

Applying Classical Test Theory, questions were evaluated on Difficulty (mean accuracy) and Discrimination (point-biserial correlation). A test set (n = 500) and a development set (n = 200) were created to maximize discrimination. Crucially, 20% of the test set (100 items, the "Sentinel Pool") consists of items no current model can solve, ensuring long-term relevance. The remaining 1,550 items form a Train set. The Test set achieves strong rank correlation with the full benchmark (Spearman's ρ > 0.85).
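The two Classical Test Theory quantities can be computed directly from a binary response matrix. The sketch below (NumPy, illustrative only; the paper does not publish its exact implementation) uses the common corrected point-biserial, correlating each item with the rest-score so the item does not correlate with itself:

```python
import numpy as np

def item_stats(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """responses: (n_examinees, n_items) binary 0/1 matrix.
    Returns per-item Difficulty (mean accuracy) and Discrimination
    (point-biserial correlation of the item with the rest-score).
    Note: Pearson correlation with a binary variable *is* the
    point-biserial correlation."""
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    disc = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]  # exclude item j from the total score
        disc[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, disc
```

Items with zero variance (everyone right or everyone wrong) yield an undefined (NaN) discrimination and would be filtered before split construction.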

Evaluation Protocol

  1. Answer Correctness (Extractive Property): Uses an LLM-based judge calibrated to human judgments, achieving quadratic-weighted Cohen's κ = 0.88 with human judgments on non-exact-match cases.
  2. Retrieval and Attribution (Grounded Property): Uses Page F1 (overlap between cited pages and the minimal evidence set E) and Doc F1 (document-level overlap). High Doc F1 with low Page F1 indicates "last-mile" navigation failure.
  3. Efficiency and Calibration (Agentic Property): Introduces a novel metric based on the Cumulative Difference method. Given evaluation tuples {(s_i, y_i)}_{i=1}^N with effort s_i ∈ ℕ (measured as step counts, i.e., tool calls) and correctness y_i ∈ {0, 1}, sort the tuples by nondecreasing effort via a permutation π and let ȳ = (1/N) Σ_{i=1}^N y_i be the mean accuracy. The cumulative deviation curve is

     D_0 = 0,   D_k = Σ_{j=1}^k (y_{π(j)} − ȳ),

     and the Kuiper range statistic quantifies the dependency between effort and accuracy:

     K = max_{0 ≤ k ≤ N} D_k − min_{0 ≤ k ≤ N} D_k.

     A low K indicates stable, "effort-invariant" performance, i.e., good calibration. A high K reveals poor calibration, where the agent expends significant budget on queries it fails to solve.
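The Kuiper calibration metric takes only a few lines to compute. This is an illustrative NumPy sketch of the construction described above, not the paper's released evaluation harness:

```python
import numpy as np

def kuiper_effort_statistic(effort, correct) -> float:
    """Kuiper range statistic for effort-accuracy calibration.
    Sort items by nondecreasing effort, accumulate deviations of
    per-item correctness from the mean accuracy, and return
    K = max_k D_k - min_k D_k (with D_0 = 0 included).
    Low K: correctness is roughly independent of effort (well calibrated).
    High K: accuracy collapses on high-effort items (poorly calibrated)."""
    effort = np.asarray(effort)
    y = np.asarray(correct, dtype=float)[np.argsort(effort, kind="stable")]
    d = np.concatenate([[0.0], np.cumsum(y - y.mean())])
    return float(d.max() - d.min())
```

For example, an agent whose failures cluster on its highest-effort episodes (correct = [1, 1, 0, 0] as effort grows) scores a higher K than one whose failures are spread evenly across effort levels.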

Baseline Approaches

Multiple baseline systems were evaluated:

  • BM25 MLLM Agent: Iterative system coupling text-based BM25 retrieval with a MLLM analyzing rendered page images.
  • Claude Agent with Semtools: Uses Claude Agents SDK with Unix-style tools (parse, search) and bash commands.
  • Recursive Language Models (RLM): Task-agnostic approach allowing LLMs to programmatically examine and recursively process the input.
  • MDocAgent: Fixed five-stage pipeline of specialized agents (General, Critical, Text, Image, Summarizing).
  • Managed RAG Services: "RAG-as-a-Service" solutions (Gemini File Search, OpenAI Assistants File Search).
  • M3DocRAG: Visual retrieval system encoding pages as images.
  • HEAVEN: Hybrid visual retrieval using DSE for candidate retrieval and ColQwen2.5 for re-ranking.
  • ColBERTv2 + LLaMA: Text-only late-interaction retrieval baseline.
  • Human Performance: Annotators using the same BM25 search engine ("Human BM25 Agent") and with oracle retrieval ("Human Oracle Retriever").

Empirical Validation / Results

Main Evaluation Results

Table 3 presents the main results. Key findings:

Table 3. Main evaluation results on MADQA.

| Model / Framework | Accuracy | X-Page | X-Doc | Page F1 | Doc F1 | Kuiper ↓ |
|---|---|---|---|---|---|---|
| Non-Agentic Systems | | | | | | |
| Gemini 3 Pro File Search | 78.6 ± 2.2 | 74.1 ± 3.6 | 75.0 ± 3.6 | 70.1 ± 2.0 | 94.2 ± 1.0 | — |
| Gemini 2.5 Flash File Search | 71.8 ± 2.4 | 61.3 ± 4.1 | 73.0 ± 3.7 | 52.2 ± 2.2 | 80.9 ± 1.8 | — |
| M3DocRAG | 61.6 ± 2.6 | 31.0 ± 3.9 | 35.0 ± 4.0 | 68.2 ± 2.1 | 82.6 ± 1.7 | — |
| GPT-5.2 (2024-08) HEAVEN | 52.9 ± 2.7 | 38.9 ± 4.1 | 53.0 ± 4.2 | 48.4 ± 2.2 | 62.3 ± 2.2 | — |
| Agentic Systems | | | | | | |
| Gemini 3 Pro BM25 Agent | 82.2 ± 2.0 | 66.8 ± 3.9 | 73.0 ± 3.7 | 78.5 ± 1.8 | 90.2 ± 1.3 | 25.8 |
| Claude Sonnet 4.5 (2025-09) BM25 Agent | 80.6 ± 2.1 | 66.8 ± 3.9 | 82.0 ± 3.2 | 79.1 ± 1.8 | 93.0 ± 1.1 | 35.1 |
| GPT-5 (2025-08) BM25 Agent | 77.7 ± 2.2 | 60.1 ± 4.1 | 74.0 ± 3.7 | 74.2 ± 2.0 | 86.5 ± 1.5 | 52.6 |
| Gemini 3 Pro RLM | 73.8 ± 2.3 | 66.8 ± 3.9 | 66.0 ± 3.9 | 69.1 ± 2.1 | 89.8 ± 1.4 | 22.9 |
| Human Performance | | | | | | |
| Human Oracle Retriever | 99.4 ± 0.4 | 100.0 | 98.0 ± 1.2 | — | — | — |
| Human BM25 Agent | 82.2 ± 2.0 | 79.6 ± 3.4 | 72.0 ± 3.7 | 79.3 ± 1.8 | 93.4 ± 1.1 | 14.6 |

  • Agentic systems outperform static RAG: The best agent (Gemini 3 Pro BM25 Agent) achieves 82.2% accuracy, a 3.6-point improvement over its non-agentic counterpart (Gemini 3 Pro File Search, 78.6%).
  • Oracle gap persists: The Human Oracle Retriever achieves 99.4% accuracy, revealing a ~17-point gap attributable to retrieval bottlenecks.
  • Specialized solutions perform well: M3DocRAG and MDocAgent (with 8B backbones) achieve >60% accuracy, rivaling larger commercial models.
  • Retrieval constraints are cost-effective: RLMs (e.g., Claude Sonnet 4.5 RLM) consume enormous token budgets (~270M tokens, ~$850) without outperforming constrained BM25 agents.
  • Calibration is distinct from accuracy: Kuiper scores vary widely and are not monotonic with accuracy. Humans have the best calibration (Kuiper 14.6).

Search Dynamics and Error Taxonomy

Analysis of BM25 MLLM Agent errors (3,273 total):

  • Failure modes: Retrieval failures (wrong document) 35.7%, comprehension failures (right page, wrong answer) 28.8%, navigation failures (right document, wrong page) 23.0%, refusals 12.6%.
  • Model-specific profiles: Weaker models (e.g., GPT-4.1 Nano) are dominated by refusals; stronger models (e.g., Claude Sonnet 4.5) shift toward comprehension errors (see Figure 8).
  • Query reformulation: Top-performing systems reformulate queries more aggressively (higher cosine drift). Claude Sonnet 4.5 has mean drift 0.38; GPT-4.1 Nano has drift 0.10.
  • Multi-hop difficulty: Driven by semantic distance between evidence pieces, not physical page gap. Accuracy drops 38 percentage points for conceptually dissimilar contexts.
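The query-reformulation ("cosine drift") analysis above can be sketched as the mean cosine distance between consecutive query embeddings in one search episode. The embedding model is left abstract here (any sentence encoder would do; the summary does not name the one used):

```python
import numpy as np

def mean_query_drift(query_embeddings: np.ndarray) -> float:
    """Mean cosine distance between consecutive reformulated queries.
    query_embeddings: (n_queries, dim) array, one row per search query
    issued in an episode. Higher drift means the agent reformulates
    more aggressively rather than re-issuing near-identical queries."""
    # L2-normalize so the row dot products are cosine similarities.
    e = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    cos_sim = (e[:-1] * e[1:]).sum(axis=1)
    return float(np.mean(1.0 - cos_sim))
```

Under this definition, an agent that repeats the same query has drift 0, matching the reported contrast between Claude Sonnet 4.5 (0.38) and GPT-4.1 Nano (0.10) in spirit, though the exact embedding space affects the numbers.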

Human-Agent Comparative Analysis

  • Same accuracy, different competencies: Humans and Gemini 3 Pro both achieve ~82% accuracy but show low pairwise item agreement (Cohen's κ = 0.24); they succeed on different questions. Human-specific failures are dominated by comprehension errors (64%), while model-specific failures split evenly between retrieval (43%) and comprehension (43%).
  • "Cold Start" disparity: Humans achieve ~50% accuracy on their first query. Gemini 3 Pro starts at ~12%, requiring many steps to recover (see Figure 9).
  • Human calibration superior: Humans have the lowest Kuiper score (14.6), indicating better effort-accuracy alignment.
  • Response time: Human responses take a median of 2 minutes (mean 3.3 minutes). Accuracy correlates inversely with time: answers under 1 minute reach 86% accuracy, while those over 10 minutes reach only 68%.
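The item-level agreement figure above is ordinary Cohen's kappa over per-question correctness. A minimal, self-contained sketch for two binary outcome vectors (illustrative; libraries such as scikit-learn provide an equivalent `cohen_kappa_score`):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary ratings, e.g., per-question
    correctness (0/1) of a human and a model on the same items.
    Kappa near 0 despite similar accuracies means the two succeed
    on largely independent question subsets."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                       # marginal rates
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)               # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)                    # undefined if p_exp == 1
```

This is the intuition behind the human-agent complementarity claim: at κ = 0.24, routing questions to whichever solver is more likely to succeed could beat either alone.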

Theoretical and Practical Implications

  • Benchmark Design: MADQA provides a rigorous, human-authored benchmark with principled splits based on Classical Test Theory, ensuring discriminative power and long-term relevance via a Sentinel Pool.
  • Evaluation Framework: The introduction of the Kuiper statistic provides a crucial new axis for evaluating agentic systems, moving beyond raw accuracy to assess calibration and efficiency.
  • System Design Insights:
    • Agentic iteration is beneficial: Simple iterative agents outperform strong static RAG, confirming the value of planning.
    • Constraints are necessary: Unconstrained RLMs are computationally inefficient without performance gains.
    • Retrieval remains the bottleneck: The persistent ~17-point oracle gap highlights that retrieval, not reasoning, is the primary challenge.
    • Human-agent complementarity: Low agreement between humans and top models suggests hybrid pipelines could exceed the ceiling of either alone.
  • Future Directions: The findings suggest research should focus on episodic memory (to learn corpus structure) and reinforcement learning with search tool feedback (to improve exploration policies).

Conclusion

The paper concludes that even frontier MLLM agents, while capable of answering challenging document-grounded questions, expend substantial effort without reliably calibrating that effort to problem difficulty. They rely on brute-force search to compensate for weak strategic planning and fail to close the oracle performance gap.

The release of MADQA, along with its evaluation protocol and baseline implementations, aims to support the community's shift from brute-force retrieval to calibrated, efficient reasoning. Future evaluations will adapt to target new bottlenecks, ensuring the benchmark remains a discriminative signal for frontier capabilities.