Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Summary (Overview)

  • Introduces MADQA: A benchmark of 2,250 human-authored questions over 800 heterogeneous PDF documents designed to evaluate multimodal agentic systems' strategic reasoning over document collections.
  • Defines Agentic Document Collection VQA: A task formalized with six core properties (Extractive, Multi-Hop, Closed-World, Grounded, Agentic, Visual) distinguishing it from standard document QA.
  • Develops Novel Evaluation Protocol: Measures not only accuracy and attribution (Page/Doc F1) but also a novel calibration metric (Kuiper statistic) to assess the efficiency-effort trade-off of agentic behavior.
  • Key Finding: The best agents (e.g., the Gemini 3 Pro BM25 Agent) can match human accuracy (~82%) but succeed on largely different questions and rely on brute-force search, failing to close a ~17-point gap to oracle retrieval. They also exhibit far poorer effort calibration than humans.
  • Releases Dataset & Framework: Provides a validated benchmark, evaluation harness, and baseline implementations to facilitate research on efficient, strategic reasoning over documents.

Introduction and Theoretical Foundation

The paper addresses a critical gap in evaluating multimodal large language model (MLLM) based agentic systems for complex, multi-stage information retrieval and reasoning tasks in enterprise settings. Existing benchmarks are fragmented, suffering from limitations in:

  1. Format: Ignoring visual comprehension required for real-world PDFs.
  2. Scope: Being domain-specific or using single-step metrics that fail to capture iterative planning.
  3. Data Integrity: Using MLLM-generated questions/answers (introducing bias) or recycling documents (risk of contamination).

To address this, the authors introduce the Multimodal Agentic Document QA (MADQA) benchmark. The core task is defined as: Given a corpus C of multi-page documents and a natural language query q, produce an answer a and a minimal evidence set E ⊆ C.

The task is distinguished by six formal properties:

| # | Property | Definition |
|---|----------|------------|
| 1 | Extractive | Answer tokens must appear physically in the evidence set E. |
| 2 | Multi-Hop | E may span disjoint pages (cross-page) or documents (cross-doc). |
| 3 | Closed-World | The answer is derived solely from C; no external parametric knowledge. |
| 4 | Grounded | E must entail a and be minimal (no superfluous pages). |
| 5 | Agentic | No single retrieval query q' may exist such that RETRIEVE(q') ⊇ E. |
| 6 | Visual | Answering may require non-textual information (layout, tables, figures) in E. |

Properties 1, 3, and 4 are enforced by construction; properties 2, 5, and 6 are targeted by design. The agentic property necessitates planning, navigation, and aggregation.
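To make the Extractive property concrete, here is a minimal, illustrative check that every answer token occurs somewhere in the evidence text. This is a sketch under simplifying assumptions (whitespace tokenization, no punctuation normalization), not the paper's actual enforcement mechanism, which operates by construction during annotation:

```python
def is_extractive(answer: str, evidence_pages: list[str]) -> bool:
    """Sketch of the Extractive property: every whitespace-delimited
    answer token must occur in the concatenated evidence text.
    Punctuation normalization is deliberately omitted for brevity."""
    evidence_tokens = set(" ".join(evidence_pages).lower().split())
    return all(tok in evidence_tokens for tok in answer.lower().split())
```

A real verifier would also need to handle tokenization differences between OCR output and the annotated answer string.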

Methodology

Dataset Construction

  • Documents: 800 PDFs manually curated from DocumentCloud, covering 13 high-level domains (Financial, Reports, Government, Legal, etc.) with high layout diversity (see Figure 2 heatmap). Documents range from single-page to 800+ pages.
  • Questions: 2,250 QA pairs authored by professional annotators following strict guidelines (answerable from documents, specific but not revealing source easily). Distribution: 82.7% single-hop, 17.3% multi-hop (8.3% cross-page, 9.0% cross-doc).
  • Quality Assurance: Two-step verification combining GPT-5 (given oracle evidence) with manual review; only annotators whose verified work contained zero errors were retained.
  • Construct Validity Analysis:
    • Lexical Overlap: Precision of unigram/bigram/trigram matching with gold evidence is very low, confirming need for semantic understanding.
    • Parametric Knowledge: Models' "guessability" (answering without documents) averages 11.2%, with ~8% attributed to training data contamination.
    • Visual Necessity: Only 42% of questions can be answered from free text alone; 58% benefit from understanding structured layouts, tabular data, or visual artifacts (see Figure 4).
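The lexical-overlap analysis above can be sketched as an n-gram precision measure: the fraction of question n-grams that also appear in the gold evidence. This is an illustrative reimplementation (whitespace tokenization assumed), not the paper's exact script:

```python
def ngram_precision(question: str, evidence: str, n: int) -> float:
    """Fraction of the question's n-grams that also occur in the evidence.
    Low values for n = 1, 2, 3 indicate that questions cannot be answered
    by lexical matching alone and require semantic understanding."""
    def ngrams(text: str, n: int) -> list[tuple[str, ...]]:
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    e_grams = set(ngrams(evidence, n))
    return sum(g in e_grams for g in q_grams) / len(q_grams)
```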

Principled Splits Creation

Applying Classical Test Theory, questions were evaluated on Difficulty (mean accuracy) and Discrimination (point-biserial correlation). A test set (n = 500) and a development set (n = 200) were created to maximize discrimination. Crucially, 20% of the test set (100 items, the "Sentinel Pool") consists of items no current model can solve, ensuring long-term relevance. The remaining 1,550 items form a Train set. The Test set achieves strong rank correlation with the full benchmark (Spearman's ρ > 0.85).
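The two Classical Test Theory quantities can be computed directly from a binary response matrix. The sketch below (NumPy, illustrative only; the paper does not publish its exact implementation) uses the common corrected point-biserial, correlating each item with the rest-score so the item does not correlate with itself:

```python
import numpy as np

def item_stats(responses: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """responses: (n_examinees, n_items) binary 0/1 matrix.
    Returns per-item Difficulty (mean accuracy) and Discrimination
    (point-biserial correlation of the item with the rest-score).
    Note: Pearson correlation with a binary variable *is* the
    point-biserial correlation."""
    difficulty = responses.mean(axis=0)
    total = responses.sum(axis=1)
    disc = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]  # exclude item j from the total score
        disc[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, disc
```

Items with zero variance (everyone right or everyone wrong) yield an undefined (NaN) discrimination and would be filtered before split construction.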

Evaluation Protocol

  1. Answer Correctness (Extractive Property): Uses an LLM-based judge calibrated to human judgments, achieving quadratic-weighted Cohen's κ = 0.88 with human judgments on non-exact-match cases.
  2. Retrieval and Attribution (Grounded Property): Uses Page F1 (overlap between cited pages and the minimal evidence set E) and Doc F1 (document-level overlap). High Doc F1 with low Page F1 indicates "last-mile" navigation failure.
  3. Efficiency and Calibration (Agentic Property): Introduces a novel metric based on the Cumulative Difference method. Given evaluation tuples {(s_i, y_i)}_{i=1}^N with effort s_i ∈ ℕ (measured as step counts, i.e., tool calls) and correctness y_i ∈ {0, 1}, sort the tuples by nondecreasing effort via a permutation π and let ȳ = (1/N) Σ_{i=1}^N y_i be the mean accuracy. The cumulative deviation curve is

     D_0 = 0,   D_k = Σ_{j=1}^k (y_{π(j)} − ȳ),

     and the Kuiper range statistic quantifies the dependency between effort and accuracy:

     K = max_{0 ≤ k ≤ N} D_k − min_{0 ≤ k ≤ N} D_k.

     A low K indicates stable, "effort-invariant" performance, i.e., good calibration. A high K reveals poor calibration, where the agent expends significant budget on queries it fails to solve.
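The Kuiper calibration metric takes only a few lines to compute. This is an illustrative NumPy sketch of the construction described above, not the paper's released evaluation harness:

```python
import numpy as np

def kuiper_effort_statistic(effort, correct) -> float:
    """Kuiper range statistic for effort-accuracy calibration.
    Sort items by nondecreasing effort, accumulate deviations of
    per-item correctness from the mean accuracy, and return
    K = max_k D_k - min_k D_k (with D_0 = 0 included).
    Low K: correctness is roughly independent of effort (well calibrated).
    High K: accuracy collapses on high-effort items (poorly calibrated)."""
    effort = np.asarray(effort)
    y = np.asarray(correct, dtype=float)[np.argsort(effort, kind="stable")]
    d = np.concatenate([[0.0], np.cumsum(y - y.mean())])
    return float(d.max() - d.min())
```

For example, an agent whose failures cluster on its highest-effort episodes (correct = [1, 1, 0, 0] as effort grows) scores a higher K than one whose failures are spread evenly across effort levels.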

Baseline Approaches

Multiple baseline systems were evaluated:

  • BM25 MLLM Agent: Iterative system coupling text-based BM25 retrieval with a MLLM analyzing rendered page images.
  • Claude Agent with Semtools: Uses Claude Agents SDK with Unix-style tools (parse, search) and bash commands.
  • Recursive Language Models (RLM): Task-agnostic approach allowing LLMs to programmatically examine and recursively process the input.
  • MDocAgent: Fixed five-stage pipeline of specialized agents (General, Critical, Text, Image, Summarizing).
  • Managed RAG Services: "RAG-as-a-Service" solutions (Gemini File Search, OpenAI Assistants File Search).
  • M3DocRAG: Visual retrieval system encoding pages as images.
  • HEAVEN: Hybrid visual retrieval using DSE for candidate retrieval and ColQwen2.5 for re-ranking.
  • ColBERTv2 + LLaMA: Text-only late-interaction retrieval baseline.
  • Human Performance: Annotators using the same BM25 search engine ("Human BM25 Agent") and with oracle retrieval ("Human Oracle Retriever").

Empirical Validation / Results

Main Evaluation Results

Table 3 presents the main results. Key findings:

Table 3. Main evaluation results on MADQA.

| Model / Framework | Accuracy | X-Page | X-Doc | Page F1 | Doc F1 | Kuiper ↓ |
|---|---|---|---|---|---|---|
| Non-Agentic Systems | | | | | | |
| Gemini 3 Pro File Search | 78.6 ± 2.2 | 74.1 ± 3.6 | 75.0 ± 3.6 | 70.1 ± 2.0 | 94.2 ± 1.0 | — |
| Gemini 2.5 Flash File Search | 71.8 ± 2.4 | 61.3 ± 4.1 | 73.0 ± 3.7 | 52.2 ± 2.2 | 80.9 ± 1.8 | — |
| M3DocRAG | 61.6 ± 2.6 | 31.0 ± 3.9 | 35.0 ± 4.0 | 68.2 ± 2.1 | 82.6 ± 1.7 | — |
| GPT-5.2 (2024-08) HEAVEN | 52.9 ± 2.7 | 38.9 ± 4.1 | 53.0 ± 4.2 | 48.4 ± 2.2 | 62.3 ± 2.2 | — |
| Agentic Systems | | | | | | |
| Gemini 3 Pro BM25 Agent | 82.2 ± 2.0 | 66.8 ± 3.9 | 73.0 ± 3.7 | 78.5 ± 1.8 | 90.2 ± 1.3 | 25.8 |
| Claude Sonnet 4.5 (2025-09) BM25 Agent | 80.6 ± 2.1 | 66.8 ± 3.9 | 82.0 ± 3.2 | 79.1 ± 1.8 | 93.0 ± 1.1 | 35.1 |
| GPT-5 (2025-08) BM25 Agent | 77.7 ± 2.2 | 60.1 ± 4.1 | 74.0 ± 3.7 | 74.2 ± 2.0 | 86.5 ± 1.5 | 52.6 |
| Gemini 3 Pro RLM | 73.8 ± 2.3 | 66.8 ± 3.9 | 66.0 ± 3.9 | 69.1 ± 2.1 | 89.8 ± 1.4 | 22.9 |
| Human Performance | | | | | | |
| Human Oracle Retriever | 99.4 ± 0.4 | 100.0 | 98.0 ± 1.2 | — | — | — |
| Human BM25 Agent | 82.2 ± 2.0 | 79.6 ± 3.4 | 72.0 ± 3.7 | 79.3 ± 1.8 | 93.4 ± 1.1 | 14.6 |

  • Agentic systems outperform static RAG: The best agent (Gemini 3 Pro BM25 Agent) achieves 82.2% accuracy, a 3.6-point improvement over its non-agentic counterpart (Gemini 3 Pro File Search, 78.6%).
  • Oracle gap persists: The Human Oracle Retriever achieves 99.4% accuracy, revealing a ~17-point gap attributable to retrieval bottlenecks.
  • Specialized solutions perform well: M3DocRAG and MDocAgent (with 8B backbones) achieve >60% accuracy, rivaling larger commercial models.
  • Retrieval constraints are cost-effective: RLMs (e.g., Claude Sonnet 4.5 RLM) consume enormous token budgets (~270M tokens, ~$850) without outperforming constrained BM25 agents.
  • Calibration is distinct from accuracy: Kuiper scores vary widely and are not monotonic with accuracy. Humans have the best calibration (Kuiper 14.6).

Search Dynamics and Error Taxonomy

Analysis of BM25 MLLM Agent errors (3,273 total):

  • Failure modes: Retrieval failures (wrong document) 35.7%, comprehension failures (right page, wrong answer) 28.8%, navigation failures (right document, wrong page) 23.0%, refusals 12.6%.
  • Model-specific profiles: Weaker models (e.g., GPT-4.1 Nano) are dominated by refusals; stronger models (e.g., Claude Sonnet 4.5) shift toward comprehension errors (see Figure 8).
  • Query reformulation: Top-performing systems reformulate queries more aggressively (higher cosine drift). Claude Sonnet 4.5 has mean drift 0.38; GPT-4.1 Nano has drift 0.10.
  • Multi-hop difficulty: Driven by semantic distance between evidence pieces, not physical page gap. Accuracy drops 38 percentage points for conceptually dissimilar contexts.
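The query-reformulation ("cosine drift") analysis above can be sketched as the mean cosine distance between consecutive query embeddings in one search episode. The embedding model is left abstract here (any sentence encoder would do; the summary does not name the one used):

```python
import numpy as np

def mean_query_drift(query_embeddings: np.ndarray) -> float:
    """Mean cosine distance between consecutive reformulated queries.
    query_embeddings: (n_queries, dim) array, one row per search query
    issued in an episode. Higher drift means the agent reformulates
    more aggressively rather than re-issuing near-identical queries."""
    # L2-normalize so the row dot products are cosine similarities.
    e = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    cos_sim = (e[:-1] * e[1:]).sum(axis=1)
    return float(np.mean(1.0 - cos_sim))
```

Under this definition, an agent that repeats the same query has drift 0, matching the reported contrast between Claude Sonnet 4.5 (0.38) and GPT-4.1 Nano (0.10) in spirit, though the exact embedding space affects the numbers.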

Human-Agent Comparative Analysis

  • Same accuracy, different competencies: Humans and Gemini 3 Pro both achieve ~82% accuracy but show low pairwise item agreement (Cohen's κ = 0.24); they succeed on different questions. Human-specific failures are dominated by comprehension errors (64%), while model-specific failures split evenly between retrieval (43%) and comprehension (43%).
  • "Cold Start" disparity: Humans achieve ~50% accuracy on their first query. Gemini 3 Pro starts at ~12%, requiring many steps to recover (see Figure 9).
  • Human calibration superior: Humans have the lowest Kuiper score (14.6), indicating better effort-accuracy alignment.
  • Response time: Human responses take a median of 2 minutes (mean 3.3 minutes). Accuracy correlates inversely with time: answers under 1 minute reach 86% accuracy, while those over 10 minutes reach only 68%.
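The item-level agreement figure above is ordinary Cohen's kappa over per-question correctness. A minimal, self-contained sketch for two binary outcome vectors (illustrative; libraries such as scikit-learn provide an equivalent `cohen_kappa_score`):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary ratings, e.g., per-question
    correctness (0/1) of a human and a model on the same items.
    Kappa near 0 despite similar accuracies means the two succeed
    on largely independent question subsets."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                       # marginal rates
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)               # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)                    # undefined if p_exp == 1
```

This is the intuition behind the human-agent complementarity claim: at κ = 0.24, routing questions to whichever solver is more likely to succeed could beat either alone.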

Theoretical and Practical Implications

  • Benchmark Design: MADQA provides a rigorous, human-authored benchmark with principled splits based on Classical Test Theory, ensuring discriminative power and long-term relevance via a Sentinel Pool.
  • Evaluation Framework: The introduction of the Kuiper statistic provides a crucial new axis for evaluating agentic systems, moving beyond raw accuracy to assess calibration and efficiency.
  • System Design Insights:
    • Agentic iteration is beneficial: Simple iterative agents outperform strong static RAG, confirming the value of planning.
    • Constraints are necessary: Unconstrained RLMs are computationally inefficient without performance gains.
    • Retrieval remains the bottleneck: The persistent ~17-point oracle gap highlights that retrieval, not reasoning, is the primary challenge.
    • Human-agent complementarity: Low agreement between humans and top models suggests hybrid pipelines could exceed the ceiling of either alone.
  • Future Directions: The findings suggest research should focus on episodic memory (to learn corpus structure) and reinforcement learning with search tool feedback (to improve exploration policies).

Conclusion

The paper concludes that even frontier MLLM agents, while capable of answering challenging document-grounded questions, expend substantial effort without reliably calibrating that effort to problem difficulty. They rely on brute-force search to compensate for weak strategic planning and fail to close the oracle performance gap.

The release of MADQA, along with its evaluation protocol and baseline implementations, aims to support the community's shift from brute-force retrieval to calibrated, efficient reasoning. Future evaluations will adapt to target new bottlenecks, ensuring the benchmark remains a discriminative signal for frontier capabilities.