Summary (Overview)
- PlanBench-XL is a new interactive benchmark for evaluating long-horizon adaptive planning by LLM tool-use agents in large-scale, retrieval-mediated tool ecosystems.
- It comprises 327 retail tasks over 1,665 tools, requiring agents to iteratively retrieve usable tools, infer implicit sub-goals, and adapt to dynamic environments.
- The benchmark introduces retrieval-time blocking with three failure types (explicit error, implicit silent failure, semantically misleading), forcing agents to detect disrupted paths and re-plan.
- Evaluation of 10 leading LLMs shows that even frontier models struggle: Gemini-3.1-Pro achieves 77.06% accuracy in the default (no-block) setting, and GPT-5.4 drops from 51.90% to 11.36% when blocking preserves only the longest valid path.
- Key failure modes include trajectory drift (agents obtain partial evidence then diverge), silent failures (hardest to detect), and inability to re-plan through longer recovery paths.
Introduction and Theoretical Foundation
Background and Motivation
- Real-world LLM agents operate in large-scale tool ecosystems (enterprise MCP servers, software APIs, web platforms) where context-length limits force retrieval-mediated tool access – only a relevant subset of tools is visible at each step.
- Long-horizon tasks require agents to explore intermediate sub-goals, discover tools incrementally, and adapt plans as new information emerges.
- Existing benchmarks assume fixed visible toolsets, explicit sub-goals, clean tool descriptions, or one-shot retrieval – they fail to capture the uncertainty introduced by partial tool visibility and unreliable tool retrieval.
Central Research Question
"Can LLM agents solve long-horizon tasks in large tool ecosystems by iteratively exploring partial tool retrieval results and adapting when plausible tool-use paths fail?"
Key Requirements for a New Benchmark
- Partial tool visibility: agents access only retrieved subsets of a large tool space and must iteratively discover tools for intermediate information.
- Unreliable/noisy tools: retrieved tools may be missing, failing, or misleading, requiring runtime adaptation.
Comparison with Prior Work (Table 1)
| Benchmark | Tool-Use | Tool Retrieval | Implicit Sub-goals | Bi-directional Exploration | Unreliable Tools | Long-Horizon | Scalable Generation |
|---|---|---|---|---|---|---|---|
| ToolBench (Qin et al., 2023a) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| PlanBench-XL (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
PlanBench-XL is the only benchmark that fully addresses all seven traits.
Methodology
Environment Setup
Tool Library Construction
- Define a set of typed retail datatypes (e.g., person_name, refund_status). A generation LLM proposes the initial set, and a second LLM filters out vague, redundant, or unrealistic types.
- Construct candidate tools by considering all admissible pairs of input and output datatype sets. The LLM proposes each tool's functionality, and the final library is obtained after filtering.
- Augment the library with noisy tools: tools that are semantically similar to valid ones but explicitly disclose unavailability or unreliability in their descriptions.
Backend Database
- For each retail case, instantiate values for the full datatype set, producing a complete structured record. Tool execution matches input arguments against the record and returns output values.
- Ensures answers cannot be inferred from common sense – tool use is necessary.
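The record-matching behavior described above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `Tool` class, the `execute` function, the `order_id` datatype, and all record values are hypothetical names chosen for the example; only person_name and refund_status appear in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset   # datatype names the tool requires as arguments
    outputs: frozenset  # datatype names the tool returns


def execute(tool, args, record):
    """Return the tool's output values only if every argument matches the case record."""
    if set(args) != set(tool.inputs):
        return {"error": "wrong argument set"}
    for key, value in args.items():
        if record.get(key) != value:
            return {"error": f"no record matches {key}={value!r}"}
    return {key: record[key] for key in tool.outputs}


# One fully instantiated retail case (hypothetical values).
record = {"person_name": "Ada Park", "order_id": "ORD-1", "refund_status": "approved"}
tool = Tool("lookup_refund", frozenset({"order_id"}), frozenset({"refund_status"}))

print(execute(tool, {"order_id": "ORD-1"}, record))
print(execute(tool, {"order_id": "ORD-9"}, record))
```

Because outputs are read from the record rather than generated, a guessed argument value yields an error instead of a plausible answer, which is what makes tool use necessary.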
Query Generation (Three-Step Pipeline)
- Specify each task internally as a pair consisting of an initial input datatype set and a target datatype set.
- Compute ground-truth tool-call sequences via state-graph reachability; discard unsolvable tasks.
- Instantiate the task with concrete values, verbalize it as a natural-language query, and derive the ground-truth answer by following one valid tool-call sequence.
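The reachability step can be illustrated with a breadth-first search in which each state is the set of datatypes obtained so far and a tool is applicable when its inputs are contained in that set. This is a sketch under assumptions: the function name, the tool dictionary layout, and every tool and datatype other than person_name, refund_status, get_order_status, and get_refund_status are hypothetical, and the paper's actual procedure may differ.

```python
from collections import deque


def shortest_tool_sequence(tools, initial, target):
    """BFS over sets of known datatypes; returns a shortest call sequence or None."""
    start = frozenset(initial)
    target = frozenset(target)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if target <= state:
            return path
        for name, (inputs, outputs) in tools.items():
            if frozenset(inputs) <= state:  # tool is callable from this state
                nxt = state | frozenset(outputs)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None  # target unreachable: the task would be discarded


# Hypothetical toy library: name -> (input datatypes, output datatypes).
tools = {
    "find_customer": ({"person_name"}, {"customer_id"}),
    "list_orders": ({"customer_id"}, {"order_id"}),
    "get_refund_status": ({"order_id"}, {"refund_status"}),
    "get_order_status": ({"order_id"}, {"order_status"}),
}

print(shortest_tool_sequence(tools, {"person_name"}, {"refund_status"}))
print(shortest_tool_sequence(tools, {"person_name"}, {"loyalty_tier"}))
```

The length of the returned sequence corresponds to the shortest path length reported in Table 2, and a `None` result marks a task as unsolvable.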
Environment Parameters (Table 2)
| Parameter | Value |
|---|---|
| Number of datatypes | 56 |
| Number of queries | 327 |
| Number of tools | 1,665 |
| Shortest path length | 5–9 |
| Maximum turns | 100 |
| Per-retrieval return cap | 30 |
| Global seed | 42 |
Agent–Environment Interaction
- At each step, the agent outputs one of three actions: retrieve, tool-call, or answer. The environment responds accordingly.
- The agent state consists of the query, the set of discovered callable tools, and the set of obtained datatypes.
- Tool Retriever supports three modes aligned with bi-directional anticipation:
- Input-conditioned retrieval (Forward Anticipation): "What can be reached from current evidence?"
- Output-conditioned retrieval (Backward Anticipation): "What tools lead to desired outcome?"
- Input-output-conditioned retrieval (Bridging): specify both available info and desired result.
- Retrieval-Time Blocking – an optional module that replaces path-critical tools with alternatives:
- Explicit Failure Blocks: return error messages (e.g., error: endpoint unavailable).
- Implicit Failure Blocks: return unhelpful responses that silently violate documented behavior (e.g., a wrong fixed value such as refund_status = tuna).
- Semantically Misleading Blocks: return tools with related-but-different functionality (e.g., get_order_status instead of get_refund_status).
- Each blocked instance preserves at least one feasible valid path.
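The three retrieval modes and the blocking substitution can be sketched with plain set matching. This is a simplified stand-in, not the benchmark's retriever, which may rank tools semantically: `Tool`, `retrieve`, `block`, the `catalog` contents, and the datatypes `customer_id`, `order_id`, and `order_status` are all hypothetical. The default cap of 30 follows the per-retrieval return cap in Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    outputs: frozenset


def retrieve(tools, have=None, want=None, cap=30):
    """have only: input-conditioned; want only: output-conditioned; both: bridging."""
    hits = []
    for tool in tools:
        forward_ok = have is None or tool.inputs <= have          # callable from current evidence
        backward_ok = want is None or bool(tool.outputs & want)   # yields a desired datatype
        if forward_ok and backward_ok:
            hits.append(tool)
    return hits[:cap]


def block(results, blocked, replacements):
    """Swap each path-critical tool in a retrieval result for its stand-in."""
    return [replacements[t.name] if t.name in blocked else t for t in results]


catalog = [
    Tool("find_customer", frozenset({"person_name"}), frozenset({"customer_id"})),
    Tool("list_orders", frozenset({"customer_id"}), frozenset({"order_id"})),
    Tool("get_refund_status", frozenset({"order_id"}), frozenset({"refund_status"})),
    Tool("get_order_status", frozenset({"order_id"}), frozenset({"order_status"})),
]

forward = retrieve(catalog, have={"order_id"})
backward = retrieve(catalog, want={"refund_status"})
bridge = retrieve(catalog, have={"order_id"}, want={"refund_status"})

# Semantically misleading block: the related order-status tool replaces the refund tool.
misled = block(backward, {"get_refund_status"}, {"get_refund_status": catalog[3]})
print([t.name for t in forward], [t.name for t in misled])
```

In this toy setting the agent that asks for refund_status receives only get_order_status, so it has to notice the mismatch and look for another route to the target.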
Metrics
- Accuracy (%): proportion of queries with correct final answer.
- Executed Ground-Truth Datatype Precision (EGT Prec. %): fraction of unique datatypes produced by executed tool calls that belong to ground-truth datatype set.
- Average Turns (Avg. Turns): average interaction turns per query.
- Mean Explored Datatypes (Mean EDT): average number of new datatypes uncovered beyond initial input datatypes.
- Search-to-Call Ratio (S/C Ratio): tool-retrieval / tool-call turn ratio.
- Invalid Tool Call Rate (ITCR %): fraction of structurally/procedurally invalid calls.
- Untrusted Input Rejection Rate (UIRR %): fraction of tool calls rejected because argument value comes from a noisy tool response.
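Three of these metrics reduce to simple ratios, sketched below for illustration. The function names and the trajectory representation (a list of action strings, a list of produced datatype names) are assumptions, and how the paper aggregates per-query values into the reported numbers is not specified here.

```python
def accuracy(predictions, answers):
    """Percentage of queries whose final answer matches the ground truth."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def egt_precision(produced, ground_truth):
    """Percentage of unique produced datatypes that belong to the ground-truth set."""
    produced = set(produced)
    if not produced:
        return 0.0
    return 100.0 * len(produced & set(ground_truth)) / len(produced)


def search_to_call_ratio(actions):
    """Number of retrieval turns divided by number of tool-call turns."""
    retrievals = actions.count("retrieve")
    calls = actions.count("tool-call")
    return retrievals / calls if calls else float("inf")


# Hypothetical single-query trajectory.
actions = ["retrieve", "tool-call", "retrieve", "tool-call", "tool-call", "answer"]
produced = ["customer_id", "order_id", "order_status", "order_id"]
ground_truth = {"customer_id", "order_id", "refund_status"}

print(egt_precision(produced, ground_truth))
print(search_to_call_ratio(actions))
print(accuracy(["approved", "denied"], ["approved", "pending"]))
```

Under this reading, a high S/C ratio with low EGT precision would indicate an agent that searches often but executes calls that do not advance the ground-truth path.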
Empirical Validation / Results
Main Results (Default Setting, Table 3)
| Model | Accuracy (%) ↑ | EGT Prec. (%) ↑ | Avg. Turns | Mean EDT | S/C Ratio | ITCR (%) ↓ | UIRR (%) ↓ |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 0.00 | 35.31 | 25.65 | 7.64 | 0.20 | 6.11 | 0.10 |
| Qwen3-14B | 0.92 | 47.77 | 35.74 | 12.01 | 0.09 | 3.94 | 0.93 |
| Qwen3-32B | 2.75 | 62.36 | 12.03 | 18.54 | 1.59 | 10.05 | 7.43 |
| Llama-3.1-8B-Instruct | 0.00 | 41.33 | 21.62 | 9.89 | 1.49 | 18.03 | 5.25 |
| Llama-3.3-70B-Instruct | 18.96 | 59.67 | 19.13 | 19.20 | 2.22 | 21.47 | 2.13 |
| DeepSeek-V4-Flash | 63.08 | 65.57 | 31.41 | 25.34 | 2.80 | 8.27 | 3.29 |
| Gemini-3.1-Pro | 77.06 | 91.47 | 19.55 | 27.41 | 1.59 | 0.68 | 0.30 |
| Gemini-3.5-Flash | 52.19 | 85.29 | 57.87 | 25.16 | 10.44 | 2.94 | 0.00 |
| GPT-5.4-Mini | 3.07 | 71.25 | 10.81 | 9.22 | 1.97 | 51.71 | 4.42 |
| GPT-5.4 | 51.90 | 72.92 | 22.92 | 20.65 | 2.70 | 6.28 | 1.91 |
Key Findings (Takeaways 1–4):
- Takeaway 1: Massive-tool planning remains highly challenging – in Table 3, only Gemini-3.1-Pro (77.06%) and DeepSeek-V4-Flash (63.08%) exceed 60% accuracy, and the 8B models achieve 0%.
- Takeaway 2: Broad exploration (Mean EDT) strongly correlates with accuracy (Pearson correlation), but frequent retrieval alone does not guarantee success. Bi-directional exploration matters: the frequency of output-conditioned retrieval also correlates with accuracy.
- Takeaway 3: Precise exploitation (EGT Precision) is also critical. Top models combine broad exploration with high execution relevance.
- Takeaway 4: Basic tool reliability (low ITCR, low UIRR) is necessary – invalid calls reduce accuracy.
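As an illustration of the correlation analysis behind Takeaway 2, the sketch below computes a Pearson coefficient between Mean EDT and accuracy using the ten model rows of Table 3. This is a model-level recomputation written for this summary; the helper name `pearson` is hypothetical, and the result need not match the coefficient reported in the paper, which may be computed at a different granularity.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Values copied from Table 3, in row order (Qwen3-8B ... GPT-5.4).
accuracy = [0.00, 0.92, 2.75, 0.00, 18.96, 63.08, 77.06, 52.19, 3.07, 51.90]
mean_edt = [7.64, 12.01, 18.54, 9.89, 19.20, 25.34, 27.41, 25.16, 9.22, 20.65]

r = pearson(mean_edt, accuracy)
print(f"Pearson r (Mean EDT vs. accuracy) = {r:.2f}")
```

The same helper can be applied to the EGT Prec. or ITCR columns to check the direction of the relationships described in Takeaways 3 and 4.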
Blocking Analysis (Takeaway 5, Figures 2–3)
Viable-path reduction sharply weakens performance: As the block ratio increases from 0 to 0.8, and then until only one valid path remains, accuracy drops across all models. GPT-5.4 falls from 51.90% to ~30% when a single path remains, and to 11.36% when only the longest path is preserved.
Silent tool failures are most harmful: Implicit failures yield the lowest accuracy among single-type perturbations. UIRR is highest under implicit failures (11.99%) vs. explicit (9.67%) and misleading (9.89%). Silent failures inject invalid values that propagate into later tool calls.
Agents struggle with longer recovery paths: When only the longest valid path remains, accuracy drops sharply – GPT-5.4 falls to slightly above 10%, vs. ~30% under standard blocking.
Inference-Time Augmentation (Takeaway 6, Figure 4)
- Enforced exploration (adding continuation prompts after incorrect termination) yields only limited gains (<5 percentage points). Block-setting performance remains far below no-block accuracy, indicating deeper limitations in adaptive re-planning.
Path Length Effects (Takeaway 7, Figure 5)
- Accuracy decreases as shortest valid solution length increases, both in default and block settings. Longer minimal horizons amplify difficulty.
Error Analysis (Takeaways 8–11)
- Most failures occur after partial progress: 72.4% of GPT-5.4 failures are "Irrecoverable Drift" – the agent makes progress, then a non-progress call, and never recovers.
- Drift is a tool-selection failure, not retrieval failure: In 78.0% of default cases, a valid progress tool had already been retrieved before the non-progress call.
- Recency bias: Models over-select recently retrieved tools; older valid tools are often ignored even when they reappear.
- Semantically misleading blocks are rarely invoked (≤3%), but implicit failure blocks are frequently invoked and cause value contamination (42.2% Value Reused across models).
- Model-specific termination policies:
- GPT-5.4: Surrenders (77.3% of failures) despite solvability.
- DeepSeek-V4-Flash, Llama-3.3-70B-Instruct: Commit wrong tool values (58.8% and 81.7% respectively).
- Gemini-3.5-Flash: Search exhaustion (90.8% of failures) – keeps searching without progress.
Theoretical and Practical Implications
Theoretical Significance
- Confirms that adaptive planning under partial observability is a distinct challenge from tool selection or task decomposition alone. The concept of bi-directional anticipation (forward from known evidence, backward from desired outcome) is formalized and empirically shown to be necessary.
- Highlights a hierarchical failure model: (1) solution-path execution weakness → (2) amplified by corrupted tools → (3) manifested through model-specific termination biases.
Practical Implications
- LLM agents deployed in real-world tool ecosystems (e.g., enterprise MCP servers, web APIs) will face exactly the conditions in PlanBench-XL: partial visibility, unreliable retrieval, and silent tool failures.
- Silent failures are especially dangerous – agents need mechanisms to detect and recover from superficially plausible but wrong tool outputs.
- Test-time compute scaling is insufficient – deeper architectural changes (e.g., dedicated planning modules, uncertainty-aware re-selection) are needed.
- The benchmark can serve as an RL playground for training agents that actively explore, detect unreliability, and re-plan.
Conclusion
PlanBench-XL introduces a scalable, interactive benchmark for evaluating long-horizon adaptive planning in massive, retrieval-mediated tool ecosystems. Through 327 retail tasks over 1,665 tools with bi-directional exploration and retrieval-time blocking, the benchmark reveals that current LLM agents remain brittle:
- Even frontier models struggle with reliable tool discovery and exploitation.
- Silent failures disrupt planning more than explicit errors or misleading tools.
- Longer recovery paths amplify failure rates, and extra interaction brings only marginal gains.
- The dominant failure mode is trajectory drift: agents make partial progress, then diverge and rarely recover.
Future directions (Appendix F.3) include extending to multiple domains, integrating more realistic dynamic failures, and using PlanBench-XL as a training environment for robust agentic planning.