Summary (Overview)

  • PlanBench-XL is a new interactive benchmark for evaluating long-horizon adaptive planning by LLM tool-use agents in large-scale, retrieval-mediated tool ecosystems.
  • It comprises 327 retail tasks over 1,665 tools, requiring agents to iteratively retrieve usable tools, infer implicit sub-goals, and adapt to dynamic environments.
  • The benchmark introduces retrieval-time blocking with three failure types (explicit error, implicit silent failure, semantically misleading), forcing agents to detect disrupted paths and re-plan.
  • Evaluation of 10 leading LLMs shows that even frontier models struggle: Gemini-3.1-Pro achieves 77.06% accuracy in the default (no-block) setting, but GPT-5.4 collapses from 51.90% to 11.36% under severe blocking.
  • Key failure modes include trajectory drift (agents obtain partial evidence then diverge), silent failures (hardest to detect), and inability to re-plan through longer recovery paths.

Introduction and Theoretical Foundation

Background and Motivation

  • Real-world LLM agents operate in large-scale tool ecosystems (enterprise MCP servers, software APIs, web platforms) where context-length limits force retrieval-mediated tool access – only a relevant subset of tools is visible at each step.
  • Long-horizon tasks require agents to explore intermediate sub-goals, discover tools incrementally, and adapt plans as new information emerges.
  • Existing benchmarks assume fixed visible toolsets, explicit sub-goals, clean tool descriptions, or one-shot retrieval – they fail to capture the uncertainty introduced by partial tool visibility and unreliable tool retrieval.

Central Research Question

"Can LLM agents solve long-horizon tasks in large tool ecosystems by iteratively exploring partial tool retrieval results and adapting when plausible tool-use paths fail?"

Key Requirements for a New Benchmark

  1. Partial tool visibility: agents access only retrieved subsets of a large tool space and must iteratively discover tools for intermediate information.
  2. Unreliable/noisy tools: retrieved tools may be missing, failing, or misleading, requiring runtime adaptation.

Comparison with Prior Work (Table 1)

Benchmark | Tool-Use | Tool Retrieval | Implicit Sub-goals | Bi-directional Exploration | Unreliable Tools | Long-Horizon | Scalable Generation
ToolBench (Qin et al., 2023a)
........................
PlanBench-XL (Ours)

PlanBench-XL is the only benchmark that fully addresses all seven traits.


Methodology

Environment Setup

Tool Library Construction

  • Define a set of typed retail datatypes $D$ (e.g., person_name, refund_status). Initial set proposed by generation LLM $M_{gen}$, then filtered by another LLM $M_{fil}$ to remove vague/redundant/unrealistic types.
  • Construct candidate tools by considering all pairs of input/output datatype sets $(D_{in}, D_{out})$ where $|D_{in}| = m$, $|D_{out}| = n$. $M_{gen}$ proposes tool functionality; final library $T$ obtained after filtering.
  • Augment with noisy tools: tools that are semantically similar to valid ones but whose descriptions explicitly disclose unavailability or unreliability.
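
The pair enumeration in the second bullet can be sketched as follows. This is a minimal illustration, not the authors' generator: the datatype list is hypothetical apart from person_name and refund_status, the function name is invented, and the sketch assumes that a tool's input and output sets are disjoint, which the summary does not state. The LLM proposal and filtering steps are not modeled.

```python
from itertools import combinations

def candidate_signatures(datatypes, m, n):
    """Yield (D_in, D_out) pairs with |D_in| = m and |D_out| = n.

    Assumption: a tool never outputs a datatype it takes as input.
    """
    ordered = sorted(datatypes)
    for d_in in combinations(ordered, m):
        remaining = [d for d in ordered if d not in d_in]
        for d_out in combinations(remaining, n):
            yield frozenset(d_in), frozenset(d_out)

# Hypothetical datatype set; only two of these names appear in the text.
D = {"person_name", "order_id", "refund_status", "email"}
sigs = list(candidate_signatures(D, m=1, n=1))
print(len(sigs))  # 4 choices of input x 3 remaining outputs = 12
```

Even this toy case shows why filtering is needed: the number of candidate signatures grows combinatorially with the number of datatypes and with m and n.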

Backend Database

  • For each retail case, instantiate values for the full datatype set, producing a complete structured record. Tool execution matches input arguments against the record and returns output values.
  • Ensures answers cannot be inferred from common sense – tool use is necessary.
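
A minimal sketch of record-backed tool execution is shown below. The `Tool` class, the `execute` function, the tool name, and the record values are all hypothetical; the only behaviour taken from the text is that input arguments are matched against the structured record and output values are read from it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    inputs: tuple   # input datatype names
    outputs: tuple  # output datatype names

def execute(tool, args, record):
    """Return output values only if every input argument matches the record."""
    if set(args) != set(tool.inputs):
        return {"error": "wrong argument set"}
    if any(record[k] != v for k, v in args.items()):
        return {"error": "no matching record"}
    return {k: record[k] for k in tool.outputs}

# Hypothetical retail case with instantiated values.
record = {
    "person_name": "Ana Lopez",
    "order_id": "ORD-1042",
    "refund_status": "approved",
}
lookup = Tool("get_order_by_name", ("person_name",), ("order_id",))
result = execute(lookup, {"person_name": "Ana Lopez"}, record)
miss = execute(lookup, {"person_name": "Bob"}, record)
```

Because the values are arbitrary instantiations, an agent cannot guess `order_id` without calling the tool, which is the property the second bullet describes.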

Query Generation (Three-Step Pipeline)

  1. Specify task internally as $r = (D_0, Y)$ where $D_0$ is initial input datatype set, $Y$ is target datatype set.
  2. Compute ground-truth tool-call sequences $\Pi(r)$ via state-graph reachability; discard unsolvable tasks.
  3. Instantiate with concrete values, verbalize as natural-language query $q$, derive ground-truth answer $o^*$ following one valid sequence $\pi \in \Pi(r)$.
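
Step 2 can be illustrated with a breadth-first search over sets of known datatypes. This is a sketch under assumptions, not the paper's procedure: a state is taken to be the set of datatypes obtained so far, a tool is applicable when its inputs are contained in the state, and the function returns only one shortest sequence rather than the full set $\Pi(r)$. The tool names and datatypes are hypothetical.

```python
from collections import deque

def shortest_tool_sequence(tools, d0, targets):
    """BFS over datatype-set states; returns a shortest list of tool names or None."""
    start = frozenset(d0)
    goal = frozenset(targets)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:
            return path
        for name, (d_in, d_out) in tools.items():
            # A tool is usable if its inputs are known and it adds something new.
            if d_in <= state and not d_out <= state:
                nxt = state | d_out
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None  # target unreachable: the task would be discarded

# Hypothetical three-tool chain.
tools = {
    "name_to_order": (frozenset({"person_name"}), frozenset({"order_id"})),
    "order_to_refund": (frozenset({"order_id"}), frozenset({"refund_id"})),
    "refund_to_status": (frozenset({"refund_id"}), frozenset({"refund_status"})),
}
path = shortest_tool_sequence(tools, {"person_name"}, {"refund_status"})
unsolvable = shortest_tool_sequence(tools, {"person_name"}, {"email"})
```

The length of the returned path plays the role of the shortest path length $L^*$, and a `None` result corresponds to a task that step 2 discards.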

Environment Parameters (Table 2)

Parameter | Value
Number of datatypes | 56
Number of queries | 327
Number of tools | 1,665
Shortest path length ($L^*$) | 5–9
Maximum turns ($T_{max}$) | 100
Per-retrieval return cap ($\Lambda^{cap}_{ret}$) | 30
Global seed ($\sigma_0$) | 42

Agent–Environment Interaction

  • At each step, agent outputs one of three actions: retrieve, tool-call, or answer. Environment responds accordingly.
  • Agent state $s_t = (q, U_t, D_t)$ where $q$ is query, $U_t$ is discovered callable tools, $D_t \subseteq D$ is obtained datatypes.
  • Tool Retriever supports three modes aligned with bi-directional anticipation:
    • Input-conditioned retrieval (Forward Anticipation): "What can be reached from current evidence?"
    • Output-conditioned retrieval (Backward Anticipation): "What tools lead to desired outcome?"
    • Input-output-conditioned retrieval (Bridging): specify both available info and desired result.
  • Retrieval-Time Blocking – an optional module that replaces path-critical tools with alternatives:
    • Explicit Failure Blocks: return error messages (e.g., error: endpoint unavailable).
    • Implicit Failure Blocks: return unhelpful responses that silently violate documented behavior (e.g., wrong fixed value like refund_status = tuna).
    • Semantically Misleading Blocks: return tools with related-but-different functionality (e.g., get_order_status instead of get_refund_status).
    • Each blocked instance preserves at least one feasible valid path.
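
The three retrieval modes can be sketched with a single function that filters on what the agent has, what it wants, or both. This is an illustrative stand-in only: it uses exact datatype matching, whereas the benchmark's retriever may rank tools differently, and the function name, parameters, and tools are hypothetical. The default cap of 30 follows Table 2.

```python
def retrieve(tools, have=None, want=None, cap=30):
    """Return up to `cap` tool names matching the given conditions.

    have: datatypes the agent already holds (input-conditioned, forward).
    want: datatypes the agent is looking for (output-conditioned, backward).
    Passing both gives the bridging mode.
    """
    hits = []
    for name, (d_in, d_out) in tools.items():
        if have is not None and not d_in <= set(have):
            continue  # tool needs inputs the agent does not hold
        if want is not None and not (d_out & set(want)):
            continue  # tool produces nothing the agent asked for
        hits.append(name)
    return hits[:cap]

# Hypothetical tool library.
tools = {
    "name_to_order": (frozenset({"person_name"}), frozenset({"order_id"})),
    "order_to_refund": (frozenset({"order_id"}), frozenset({"refund_id"})),
    "refund_to_status": (frozenset({"refund_id"}), frozenset({"refund_status"})),
}
forward = retrieve(tools, have={"person_name"})
backward = retrieve(tools, want={"refund_status"})
bridge = retrieve(tools, have={"order_id"}, want={"refund_id"})
```

Forward retrieval tells the agent where it can go next, while backward retrieval reveals which intermediate datatype (here `refund_id`) it still needs; the two views together are what the text calls bi-directional anticipation.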

Metrics

  1. Accuracy (%): proportion of queries with correct final answer.
  2. Executed Ground-Truth Datatype Precision (EGT Prec. %): fraction of unique datatypes produced by executed tool calls that belong to ground-truth datatype set.
  3. Average Turns (Avg. Turns): average interaction turns per query.
  4. Mean Explored Datatypes (Mean EDT): average number of new datatypes uncovered beyond initial input datatypes.
  5. Search-to-Call Ratio (S/C Ratio): tool-retrieval / tool-call turn ratio.
  6. Invalid Tool Call Rate (ITCR %): fraction of structurally/procedurally invalid calls.
  7. Untrusted Input Rejection Rate (UIRR %): fraction of tool calls rejected because argument value comes from a noisy tool response.
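
Several of these metrics can be computed from a single logged trajectory, as in the sketch below. The log format (a list of dicts with `action`, `valid`, and `outputs` keys) is an assumption made for illustration, and accuracy, Mean EDT, and Avg. Turns would be obtained by averaging the per-query values over all queries.

```python
def trajectory_metrics(turns, initial, ground_truth):
    """Per-query metric components from a hypothetical turn log."""
    calls = [t for t in turns if t["action"] == "call"]
    retrievals = [t for t in turns if t["action"] == "retrieve"]
    produced = set()
    for t in calls:
        if t["valid"]:
            produced.update(t["outputs"])
    explored = produced - set(initial)
    return {
        # Share of produced datatypes that lie in the ground-truth set.
        "egt_precision": len(produced & set(ground_truth)) / len(produced) if produced else 0.0,
        "explored_datatypes": len(explored),
        "sc_ratio": len(retrievals) / len(calls) if calls else float("inf"),
        "itcr": sum(not t["valid"] for t in calls) / len(calls) if calls else 0.0,
        "turns": len(turns),
    }

# Hypothetical six-turn trajectory.
turns = [
    {"action": "retrieve"},
    {"action": "call", "valid": True, "outputs": ["order_id"]},
    {"action": "retrieve"},
    {"action": "call", "valid": True, "outputs": ["shipping_address"]},
    {"action": "call", "valid": False, "outputs": []},
    {"action": "answer"},
]
m = trajectory_metrics(
    turns,
    initial={"person_name"},
    ground_truth={"order_id", "refund_id", "refund_status"},
)
```

In this example one of the two produced datatypes is off-path, so precision is 0.5 even though every successful call returned a value; this is the distinction between exploring broadly and exploiting precisely.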

Empirical Validation / Results

Main Results (Default Setting, Table 3)

Model | Accuracy (%) ↑ | EGT Prec. (%) ↑ | Avg. Turns | Mean EDT | S/C Ratio | ITCR (%) ↓ | UIRR (%) ↓
Qwen3-8B | 0.00 | 35.31 | 25.65 | 7.64 | 0.20 | 6.11 | 0.10
Qwen3-14B | 0.92 | 47.77 | 35.74 | 12.01 | 0.09 | 3.94 | 0.93
Qwen3-32B | 2.75 | 62.36 | 12.03 | 18.54 | 1.59 | 10.05 | 7.43
Llama-3.1-8B-Instruct | 0.00 | 41.33 | 21.62 | 9.89 | 1.49 | 18.03 | 5.25
Llama-3.3-70B-Instruct | 18.96 | 59.67 | 19.13 | 19.20 | 2.22 | 21.47 | 2.13
DeepSeek-V4-Flash | 63.08 | 65.57 | 31.41 | 25.34 | 2.80 | 8.27 | 3.29
Gemini-3.1-Pro | 77.06 | 91.47 | 19.55 | 27.41 | 1.59 | 0.68 | 0.30
Gemini-3.5-Flash | 52.19 | 85.29 | 57.87 | 25.16 | 10.44 | 2.94 | 0.00
GPT-5.4-Mini | 3.07 | 71.25 | 10.81 | 9.22 | 1.97 | 51.71 | 4.42
GPT-5.4 | 51.90 | 72.92 | 22.92 | 20.65 | 2.70 | 6.28 | 1.91

Key Findings (Takeaways 1–4):

  • Takeaway 1: Massive-tool planning remains highly challenging; only Gemini-3.1-Pro (77.06%) and DeepSeek-V4-Flash (63.08%) exceed 60% accuracy, and both 8B models achieve 0%.
  • Takeaway 2: Broad exploration (Mean EDT) strongly correlates with accuracy (Pearson $r = 0.902$), but frequent retrieval alone does not guarantee success. Bi-directional exploration matters: output-conditioned retrieval frequency correlates with accuracy ($r = 0.800$).
  • Takeaway 3: Precise exploitation (EGT Precision) is also critical ($r = 0.781$). Top models combine broad exploration with high execution relevance.
  • Takeaway 4: Basic tool reliability (low ITCR, low UIRR) is necessary; a higher invalid-call rate is associated with lower accuracy ($r = -0.443$).
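
The correlations quoted in the takeaways can be checked against the per-model values in Table 3. The sketch below computes Pearson's r by hand for two of them; the lists are the Accuracy, Mean EDT, and ITCR columns in row order, and the helper name `pearson` is ours.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Columns of Table 3, one entry per model in row order.
accuracy = [0.00, 0.92, 2.75, 0.00, 18.96, 63.08, 77.06, 52.19, 3.07, 51.90]
mean_edt = [7.64, 12.01, 18.54, 9.89, 19.20, 25.34, 27.41, 25.16, 9.22, 20.65]
itcr = [6.11, 3.94, 10.05, 18.03, 21.47, 8.27, 0.68, 2.94, 51.71, 6.28]

r_edt = pearson(mean_edt, accuracy)
r_itcr = pearson(itcr, accuracy)
print(round(r_edt, 3), round(r_itcr, 3))
```

Both values should land close to the reported 0.902 and -0.443. With only ten models, the ITCR correlation in particular is sensitive to single outliers such as GPT-5.4-Mini (ITCR 51.71%).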

Blocking Analysis (Takeaway 5, Figures 2–3)

Viable-path reduction sharply weakens performance: As the block ratio increases from 0 to 0.8, and then to the setting where only one valid path remains, accuracy drops across all models. GPT-5.4 falls from 51.90% to ~30% (1 path) and to 11.36% (longest path preserved).

Silent tool failures are most harmful: Implicit failures yield the lowest accuracy among single-type perturbations. UIRR is highest under implicit failures (11.99%) vs. explicit (9.67%) and misleading (9.89%). Silent failures inject invalid values that propagate into later tool calls.

Agents struggle with longer recovery paths: When only the longest valid path remains, accuracy drops sharply – GPT-5.4 falls to slightly above 10%, vs. ~30% under standard blocking.
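
The difference between an explicit and a silent failure, and how a silently wrong value ends up counted in UIRR, can be sketched as follows. The `Environment` class, its taint-tracking rule, and the identifiers `RF-7` and `refund_amount` are assumptions for illustration; only the error message, the `refund_status = tuna` example, and the UIRR definition come from the text. Semantically misleading blocks are not modeled here.

```python
def explicit_block(args):
    # Fails loudly: the agent can see that this path is disrupted.
    return {"error": "endpoint unavailable"}

def implicit_block(args):
    # Looks like a normal response but carries a wrong fixed value.
    return {"refund_status": "tuna"}

class Environment:
    """Tracks values that came from noisy tools and rejects their reuse."""

    def __init__(self):
        self.untrusted = set()
        self.calls = 0
        self.rejected = 0

    def call(self, tool, args, noisy=False):
        self.calls += 1
        if any(v in self.untrusted for v in args.values()):
            self.rejected += 1
            return {"error": "untrusted input rejected"}
        out = tool(args)
        if noisy:
            self.untrusted.update(v for k, v in out.items() if k != "error")
        return out

    def uirr(self):
        return self.rejected / self.calls if self.calls else 0.0

env = Environment()
loud = env.call(explicit_block, {"refund_id": "RF-7"}, noisy=True)
silent = env.call(implicit_block, {"refund_id": "RF-7"}, noisy=True)
# The agent trusts the silent value and feeds it into a later call.
follow_up = env.call(
    lambda a: {"refund_amount": "12.50"},
    {"refund_status": silent["refund_status"]},
)
```

The explicit failure yields nothing to reuse, whereas the silent one hands the agent a plausible-looking value; in this toy run the follow-up call is rejected, giving a UIRR of one call in three.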

Inference-Time Augmentation (Takeaway 6, Figure 4)

  • Enforced exploration (adding continuation prompts after incorrect termination) yields only limited gains (<5 percentage points). Block-setting performance remains far below no-block accuracy, indicating deeper limitations in adaptive re-planning.

Path Length Effects (Takeaway 7, Figure 5)

  • Accuracy decreases as shortest valid solution length $L^*$ increases, both in default and block settings. Longer minimal horizons amplify difficulty.

Error Analysis (Takeaways 8–11)

  • Most failures occur after partial progress: 72.4% of GPT-5.4 failures are "Irrecoverable Drift" – the agent makes progress, then a non-progress call, and never recovers.
  • Drift is a tool-selection failure, not retrieval failure: In 78.0% of default cases, a valid progress tool had already been retrieved before the non-progress call.
  • Recency bias: Models over-select recently retrieved tools; older valid tools are often ignored even when they reappear.
  • Semantically misleading blocks are rarely invoked (≤3%), whereas implicit failure blocks are frequently invoked and cause value contamination (a "Value Reused" rate of 42.2% across models).
  • Model-specific termination policies:
    • GPT-5.4: Surrenders (77.3% of failures) despite solvability.
    • DeepSeek-V4-Flash, Llama-3.3-70B-Instruct: Commit wrong tool values (58.8% and 81.7% respectively).
    • Gemini-3.5-Flash: Search exhaustion (90.8% of failures) – keeps searching without progress.
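
One way to operationalize the "Irrecoverable Drift" label from the first bullet is shown below. This is our reading of the one-sentence definition, not the paper's annotation procedure: a failed trajectory counts as drift if it contains at least one progress call and every call after the last progress call is a non-progress call.

```python
def is_irrecoverable_drift(progress_flags, solved):
    """progress_flags[i] is True if tool call i advanced along a valid path.

    Assumed rule: the task failed, some progress was made, and the
    trajectory ended with one or more non-progress calls.
    """
    if solved or True not in progress_flags:
        return False
    last_progress = len(progress_flags) - 1 - progress_flags[::-1].index(True)
    return last_progress < len(progress_flags) - 1

drifted = is_irrecoverable_drift([True, True, False, False], solved=False)
never_started = is_irrecoverable_drift([False, False], solved=False)
recovered = is_irrecoverable_drift([True, False, True], solved=True)
```

Under this rule a trajectory that never makes progress is a different failure type, which matches the text's emphasis that most failures happen after partial progress rather than at the start.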

Theoretical and Practical Implications

Theoretical Significance

  • Confirms that adaptive planning under partial observability is a distinct challenge from tool selection or task decomposition alone. The concept of bi-directional anticipation (forward from known evidence, backward from desired outcome) is formalized and empirically shown to be necessary.
  • Highlights a hierarchical failure model: (1) solution-path execution weakness → (2) amplified by corrupted tools → (3) manifested through model-specific termination biases.

Practical Implications

  • LLM agents deployed in real-world tool ecosystems (e.g., enterprise MCP servers, web APIs) will face exactly the conditions in PlanBench-XL: partial visibility, unreliable retrieval, and silent tool failures.
  • Silent failures are especially dangerous – agents need mechanisms to detect and recover from superficially plausible but wrong tool outputs.
  • Test-time compute scaling is insufficient – deeper architectural changes (e.g., dedicated planning modules, uncertainty-aware re-selection) are needed.
  • The benchmark can serve as an RL playground for training agents that actively explore, detect unreliability, and re-plan.

Conclusion

PlanBench-XL introduces a scalable, interactive benchmark for evaluating long-horizon adaptive planning in massive, retrieval-mediated tool ecosystems. Through 327 retail tasks over 1,665 tools with bi-directional exploration and retrieval-time blocking, the benchmark reveals that current LLM agents remain brittle:

  • Even frontier models struggle with reliable tool discovery and exploitation.
  • Silent failures disrupt planning more than explicit errors or misleading tools.
  • Longer recovery paths amplify failure rates, and extra interaction brings only marginal gains.
  • The dominant failure mode is trajectory drift: agents make partial progress, then diverge and rarely recover.

Future directions (Appendix F.3) include extending to multiple domains, integrating more realistic dynamic failures, and using PlanBench-XL as a training environment for robust agentic planning.
