Visual Summary | PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Summary (Overview)

PlanBench-XL is a new interactive benchmark for evaluating long-horizon adaptive planning by LLM tool-use agents in large-scale, retrieval-mediated tool ecosystems.
It comprises 327 retail tasks over 1,665 tools, requiring agents to iteratively retrieve usable tools, infer implicit sub-goals, and adapt to dynamic environments.
The benchmark introduces retrieval-time blocking with three failure types (explicit error, implicit silent failure, semantically misleading), forcing agents to detect disrupted paths and re-plan.
Evaluation of 10 leading LLMs shows that even frontier models struggle: the best model, Gemini-3.1-Pro, reaches 77.06% accuracy in the default (no-block) setting, and GPT-5.4 collapses from 51.90% to 11.36% when blocking preserves only the longest valid path.
Key failure modes include trajectory drift (agents obtain partial evidence then diverge), silent failures (hardest to detect), and inability to re-plan through longer recovery paths.

Introduction and Theoretical Foundation

Background and Motivation

Real-world LLM agents operate in large-scale tool ecosystems (enterprise MCP servers, software APIs, web platforms) where context-length limits force retrieval-mediated tool access – only a relevant subset of tools is visible at each step.
Long-horizon tasks require agents to explore intermediate sub-goals, discover tools incrementally, and adapt plans as new information emerges.
Existing benchmarks assume fixed visible toolsets, explicit sub-goals, clean tool descriptions, or one-shot retrieval – they fail to capture the uncertainty introduced by partial tool visibility and unreliable tool retrieval.

Central Research Question

"Can LLM agents solve long-horizon tasks in large tool ecosystems by iteratively exploring partial tool retrieval results and adapting when plausible tool-use paths fail?"

Key Requirements for a New Benchmark

Partial tool visibility: agents access only retrieved subsets of a large tool space and must iteratively discover tools for intermediate information.
Unreliable/noisy tools: retrieved tools may be missing, failing, or misleading, requiring runtime adaptation.

Comparison with Prior Work (Table 1)

Benchmark	Tool-Use	Tool Retrieval	Implicit Sub-goals	Bi-directional Exploration	Unreliable Tools	Long-Horizon	Scalable Generation
ToolBench (Qin et al., 2023a)	✓	✓	✗	✗	✗	✗	✓
...	...	...	...	...	...	...	...
PlanBench-XL (Ours)	✓	✓	✓	✓	✓	✓	✓

PlanBench-XL is the only benchmark that fully addresses all seven traits.

Methodology

Environment Setup

Tool Library Construction

Define a set $D$ of typed retail datatypes (e.g., person_name, refund_status). A generation LLM $M_{gen}$ proposes the initial set, and a second LLM $M_{fil}$ filters out vague, redundant, or unrealistic types.
Construct candidate tools by considering all pairs of input/output datatype sets $(D_{in}, D_{out})$ with $|D_{in}| = m$ and $|D_{out}| = n$. $M_{gen}$ proposes a functionality for each pair, and the final library $T$ is obtained after filtering.
Augment the library with noisy tools: tools that are semantically similar to valid ones but whose descriptions explicitly disclose that they are unavailable or unreliable.
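The pairwise enumeration in the second step can be sketched as follows. This is a minimal illustration, not the authors' generator: the datatype `order_id`, the helper name `candidate_signatures`, and the requirement that input and output sets be disjoint are assumptions made for the example.

```python
from itertools import combinations


def candidate_signatures(datatypes, m, n):
    """Yield (D_in, D_out) pairs with |D_in| = m and |D_out| = n.

    Assumption: a tool never outputs a datatype it takes as input,
    so D_out is drawn from the datatypes outside D_in.
    """
    ordered = sorted(datatypes)
    for d_in in combinations(ordered, m):
        remaining = [d for d in ordered if d not in d_in]
        for d_out in combinations(remaining, n):
            yield frozenset(d_in), frozenset(d_out)


# person_name and refund_status come from the text; order_id is hypothetical.
datatypes = {"person_name", "order_id", "refund_status"}
pairs = list(candidate_signatures(datatypes, m=1, n=1))
print(len(pairs))  # 3 choices of input x 2 remaining outputs = 6
```

Each enumerated pair is only a signature. In the pipeline described above, an LLM still has to propose a plausible functionality for it, and a filter decides whether it survives into $T$.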

Backend Database

For each retail case, instantiate values for the full datatype set, producing a complete structured record. Tool execution matches input arguments against the record and returns output values.
This ensures that answers cannot be inferred from common sense, so tool use is necessary.
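A minimal sketch of record-backed tool execution is shown below. The function name `execute_tool`, the record values, and the error strings are all hypothetical; the sketch only illustrates the idea of matching arguments against a case record and returning the stored outputs.

```python
def execute_tool(tool_inputs, tool_outputs, args, record):
    """Return the tool's output values if the arguments match the case record."""
    # The call must supply exactly the datatypes the tool declares as inputs.
    if set(args) != set(tool_inputs):
        return {"error": "invalid arguments"}
    # Every argument value must agree with the instantiated record.
    if any(record.get(name) != value for name, value in args.items()):
        return {"error": "no matching record"}
    return {name: record[name] for name in tool_outputs}


# Hypothetical record for one retail case.
record = {
    "person_name": "Dana Lee",
    "order_id": "ORD-1042",
    "refund_status": "approved",
}

ok = execute_tool({"order_id"}, {"refund_status"}, {"order_id": "ORD-1042"}, record)
bad = execute_tool({"order_id"}, {"refund_status"}, {"order_id": "ORD-9999"}, record)
print(ok)   # {'refund_status': 'approved'}
print(bad)  # {'error': 'no matching record'}
```

Because the output values exist only in the record, an agent cannot guess `refund_status` without first obtaining a valid `order_id`.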

Query Generation (Three-Step Pipeline)

Specify the task internally as $r = (D_0, Y)$, where $D_0$ is the initial input datatype set and $Y$ is the target datatype set.
Compute the ground-truth tool-call sequences $\Pi(r)$ via state-graph reachability, and discard unsolvable tasks.
Instantiate the task with concrete values, verbalize it as a natural-language query $q$, and derive the ground-truth answer $o^*$ by following one valid sequence $\pi \in \Pi(r)$.
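The reachability step can be sketched as a breadth-first search over sets of obtained datatypes. This is a simplified illustration under stated assumptions: it returns a single shortest sequence rather than the full set $\Pi(r)$, and the tool names and the datatypes `order_id`, `refund_id`, and `shipping_address` are hypothetical.

```python
from collections import deque


def shortest_plan(tools, d0, target):
    """Breadth-first search over states, where a state is the set of known datatypes.

    `tools` maps a tool name to (input datatypes, output datatypes).
    Returns a shortest list of tool names reaching `target`, or None if unsolvable.
    """
    start, goal = frozenset(d0), frozenset(target)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:
            return plan
        for name, (inputs, outputs) in tools.items():
            if frozenset(inputs) <= state:          # tool is callable in this state
                nxt = state | frozenset(outputs)    # calling it adds its outputs
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None


tools = {
    "get_order_id": ({"person_name"}, {"order_id"}),
    "get_refund_id": ({"order_id"}, {"refund_id"}),
    "get_refund_status": ({"refund_id"}, {"refund_status"}),
}

plan = shortest_plan(tools, {"person_name"}, {"refund_status"})
unsolvable = shortest_plan(tools, {"person_name"}, {"shipping_address"})
print(plan)        # ['get_order_id', 'get_refund_id', 'get_refund_status']
print(unsolvable)  # None, so this task would be discarded
```

The length of the returned plan corresponds to the shortest path length $L^*$ of the task.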

Environment Parameters (Table 2)

Parameter	Value
Number of datatypes	56
Number of queries	327
Number of tools	1,665
Shortest path length ($L^*$)	5–9
Maximum turns ($T_{max}$)	100
Per-retrieval return cap ($\Lambda^{cap}_{ret}$)	30
Global seed ($\sigma_0$)	42

Agent–Environment Interaction

At each step, the agent outputs one of three actions: retrieve, tool-call, or answer. The environment responds accordingly.
Agent state $s_t = (q, U_t, D_t)$, where $q$ is the query, $U_t$ is the set of callable tools discovered so far, and $D_t \subseteq D$ is the set of datatypes obtained so far.
Tool Retriever supports three modes aligned with bi-directional anticipation:
- Input-conditioned retrieval (Forward Anticipation): "What can be reached from current evidence?"
- Output-conditioned retrieval (Backward Anticipation): "What tools lead to desired outcome?"
- Input-output-conditioned retrieval (Bridging): specify both available info and desired result.
Retrieval-Time Blocking – an optional module that replaces path-critical tools with alternatives:
- Explicit Failure Blocks: return error messages (e.g., error: endpoint unavailable).
- Implicit Failure Blocks: return unhelpful responses that silently violate documented behavior (e.g., wrong fixed value like refund_status = tuna).
- Semantically Misleading Blocks: return tools with related-but-different functionality (e.g., get_order_status instead of get_refund_status).
- Each blocked instance preserves at least one feasible valid path.
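The three retrieval modes and the blocking types can be pictured with the sketch below. It makes strong simplifying assumptions: retrieval is modeled as exact set matching on datatypes rather than whatever ranking the real retriever uses, and the names `Tool`, `retrieve`, and `blocked_response`, as well as the datatypes `order_id` and `order_status`, are hypothetical. Only the cap of 30 and the example responses come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    outputs: frozenset


def retrieve(library, have=None, want=None, cap=30):
    """Return up to `cap` tools matching the given conditions.

    have only   -> input-conditioned (forward anticipation)
    want only   -> output-conditioned (backward anticipation)
    have + want -> input-output-conditioned (bridging)
    """
    hits = []
    for tool in library:
        forward_ok = have is None or tool.inputs <= have
        backward_ok = want is None or bool(tool.outputs & want)
        if forward_ok and backward_ok:
            hits.append(tool)
    return hits[:cap]


def blocked_response(kind, tool):
    """Illustrative responses for the two block types that replace a tool's behavior."""
    if kind == "explicit":
        return {"error": "endpoint unavailable"}
    if kind == "implicit":
        return {name: "tuna" for name in tool.outputs}  # wrong fixed value, no error
    raise ValueError(f"unknown block kind: {kind}")


library = [
    Tool("get_order_id", frozenset({"person_name"}), frozenset({"order_id"})),
    Tool("get_order_status", frozenset({"order_id"}), frozenset({"order_status"})),
    Tool("get_refund_status", frozenset({"order_id"}), frozenset({"refund_status"})),
]
have = frozenset({"order_id"})
want = frozenset({"refund_status"})

forward = [t.name for t in retrieve(library, have=have)]
backward = [t.name for t in retrieve(library, want=want)]
bridge = [t.name for t in retrieve(library, have=have, want=want)]

# A semantically misleading block: the path-critical tool is withheld, so the
# agent only sees the related-but-different get_order_status.
misleading = [t for t in library if t.name != "get_refund_status"]
misled = [t.name for t in retrieve(misleading, have=have)]
print(forward, backward, bridge, misled)
```

In this toy library the implicit block is the hardest to notice from the response alone: `blocked_response("implicit", library[2])` has the same shape as a valid answer.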

Metrics

Accuracy (%): proportion of queries with a correct final answer.
Executed Ground-Truth Datatype Precision (EGT Prec. %): fraction of the unique datatypes produced by executed tool calls that belong to the ground-truth datatype set.
Average Turns (Avg. Turns): average number of interaction turns per query.
Mean Explored Datatypes (Mean EDT): average number of new datatypes uncovered beyond the initial input datatypes.
Search-to-Call Ratio (S/C Ratio): ratio of tool-retrieval turns to tool-call turns.
Invalid Tool Call Rate (ITCR %): fraction of tool calls that are structurally or procedurally invalid.
Untrusted Input Rejection Rate (UIRR %): fraction of tool calls rejected because an argument value comes from a noisy tool response.
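As a rough illustration of how the per-query metrics could be computed from a logged trajectory, consider the sketch below. The log schema (`action`, `valid`, `outputs`), the function name, and the choice of denominators are assumptions for the example, not the benchmark's actual implementation.

```python
def trajectory_metrics(turns, initial, ground_truth):
    """Compute per-query metric values from a list of logged turns."""
    retrievals = [t for t in turns if t["action"] == "retrieve"]
    calls = [t for t in turns if t["action"] == "call"]

    # Unique datatypes produced by tool calls that actually executed.
    produced = set()
    for call in calls:
        if call["valid"]:
            produced |= call["outputs"]

    invalid = sum(1 for c in calls if not c["valid"])
    return {
        "egt_precision": 100 * len(produced & ground_truth) / len(produced) if produced else 0.0,
        "explored_datatypes": len(produced - initial),
        "sc_ratio": len(retrievals) / len(calls) if calls else float("inf"),
        "itcr": 100 * invalid / len(calls) if calls else 0.0,
    }


# Hypothetical trajectory: two retrievals, four calls (one invalid), one answer.
turns = [
    {"action": "retrieve"},
    {"action": "call", "valid": True, "outputs": {"order_id"}},
    {"action": "retrieve"},
    {"action": "call", "valid": False, "outputs": set()},
    {"action": "call", "valid": True, "outputs": {"order_status"}},
    {"action": "call", "valid": True, "outputs": {"refund_status"}},
    {"action": "answer"},
]
metrics = trajectory_metrics(
    turns,
    initial={"person_name"},
    ground_truth={"order_id", "refund_status"},
)
print(metrics)
```

Here two of the three produced datatypes lie on the ground-truth path, so EGT precision is about 66.7%, while the off-path `order_status` call still counts toward explored datatypes.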

Empirical Validation / Results

Main Results (Default Setting, Table 3)

Model	Accuracy (%) ↑	EGT Prec. (%) ↑	Avg. Turns	Mean EDT	S/C Ratio	ITCR (%) ↓	UIRR (%) ↓
Qwen3-8B	0.00	35.31	25.65	7.64	0.20	6.11	0.10
Qwen3-14B	0.92	47.77	35.74	12.01	0.09	3.94	0.93
Qwen3-32B	2.75	62.36	12.03	18.54	1.59	10.05	7.43
Llama-3.1-8B-Instruct	0.00	41.33	21.62	9.89	1.49	18.03	5.25
Llama-3.3-70B-Instruct	18.96	59.67	19.13	19.20	2.22	21.47	2.13
DeepSeek-V4-Flash	63.08	65.57	31.41	25.34	2.80	8.27	3.29
Gemini-3.1-Pro	77.06	91.47	19.55	27.41	1.59	0.68	0.30
Gemini-3.5-Flash	52.19	85.29	57.87	25.16	10.44	2.94	0.00
GPT-5.4-Mini	3.07	71.25	10.81	9.22	1.97	51.71	4.42
GPT-5.4	51.90	72.92	22.92	20.65	2.70	6.28	1.91

Key Findings (Takeaways 1–4):

Takeaway 1: Massive-tool planning remains highly challenging: only Gemini-3.1-Pro (77.06%) and DeepSeek-V4-Flash (63.08%) exceed 60% accuracy, and the 8B models achieve 0%.
Takeaway 2: Broad exploration (Mean EDT) strongly correlates with accuracy (Pearson $r = 0.902$), but frequent retrieval alone does not guarantee success: Gemini-3.5-Flash has the highest S/C Ratio (10.44) yet reaches only 52.19% accuracy. Bi-directional exploration matters: the frequency of output-conditioned retrieval correlates with accuracy ($r = 0.800$).
Takeaway 3: Precise exploitation (EGT Precision) is also critical ($r = 0.781$). Top models combine broad exploration with high execution relevance.
Takeaway 4: Basic tool-use reliability (low ITCR, low UIRR) is necessary: ITCR correlates negatively with accuracy ($r = -0.443$).
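The correlations with Mean EDT, EGT Precision, and ITCR can be recomputed from the ten rows of Table 3. The sketch below uses a hand-written Pearson function (the name `pearson` is ours) and the table values copied in row order; the results should come out close to the reported coefficients.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Columns of Table 3, in the row order given there (Qwen3-8B ... GPT-5.4).
accuracy = [0.00, 0.92, 2.75, 0.00, 18.96, 63.08, 77.06, 52.19, 3.07, 51.90]
mean_edt = [7.64, 12.01, 18.54, 9.89, 19.20, 25.34, 27.41, 25.16, 9.22, 20.65]
egt_prec = [35.31, 47.77, 62.36, 41.33, 59.67, 65.57, 91.47, 85.29, 71.25, 72.92]
itcr = [6.11, 3.94, 10.05, 18.03, 21.47, 8.27, 0.68, 2.94, 51.71, 6.28]

r_edt = pearson(mean_edt, accuracy)
r_egt = pearson(egt_prec, accuracy)
r_itcr = pearson(itcr, accuracy)
print(round(r_edt, 2), round(r_egt, 2), round(r_itcr, 2))
```

With only ten models, these coefficients describe the trend across this particular model set and should not be read as causal estimates.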

Blocking Analysis (Takeaway 5, Figures 2–3)

Viable-path reduction sharply weakens performance: As the block ratio increases (from 0 to 0.8, and then to the point where only one valid path remains), accuracy drops across all models. GPT-5.4 falls from 51.90% to ~30% when one path remains, and to 11.36% when only the longest path is preserved.

Silent tool failures are most harmful: Implicit failures yield the lowest accuracy among single-type perturbations. UIRR is highest under implicit failures (11.99%) vs. explicit (9.67%) and misleading (9.89%). Silent failures inject invalid values that propagate into later tool calls.
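One way to picture how a silently wrong value contaminates later calls, and how a UIRR-style counter could register it, is a small taint tracker. The class and method names below are hypothetical, and the benchmark's actual rejection logic is not described in this summary; the sketch only assumes that values returned by noisy tools are remembered and that any later call reusing them is rejected.

```python
class UntrustedInputTracker:
    """Remember values returned by noisy tools and reject calls that reuse them."""

    def __init__(self):
        self.untrusted_values = set()
        self.total_calls = 0
        self.rejected_calls = 0

    def record_response(self, response, from_noisy_tool):
        # Values from a noisy tool are marked as untrusted.
        if from_noisy_tool:
            self.untrusted_values.update(response.values())

    def admit_call(self, args):
        # A call is rejected if any argument value was produced by a noisy tool.
        self.total_calls += 1
        if any(value in self.untrusted_values for value in args.values()):
            self.rejected_calls += 1
            return False
        return True

    def uirr(self):
        if self.total_calls == 0:
            return 0.0
        return 100 * self.rejected_calls / self.total_calls


tracker = UntrustedInputTracker()
# An implicit failure block returns a plausible-looking but wrong value.
tracker.record_response({"refund_status": "tuna"}, from_noisy_tool=True)

first = tracker.admit_call({"order_id": "ORD-1042"})     # clean argument
second = tracker.admit_call({"refund_status": "tuna"})   # contaminated argument
print(first, second, tracker.uirr())  # True False 50.0
```

An explicit error gives the agent nothing to reuse, whereas a silent failure hands it a value that looks usable, which is consistent with UIRR being highest under implicit failures.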

Agents struggle with longer recovery paths: When only the longest valid path remains, accuracy drops sharply – GPT-5.4 falls to slightly above 10%, vs. ~30% under standard blocking.

Inference-Time Augmentation (Takeaway 6, Figure 4)

Enforced exploration (adding continuation prompts after incorrect termination) yields only limited gains (<5 percentage points). Block-setting performance remains far below no-block accuracy, indicating deeper limitations in adaptive re-planning.

Path Length Effects (Takeaway 7, Figure 5)

Accuracy decreases as shortest valid solution length $L^*$ increases, both in default and block settings. Longer minimal horizons amplify difficulty.

Error Analysis (Takeaways 8–11)

Most failures occur after partial progress: 72.4% of GPT-5.4 failures are "Irrecoverable Drift", in which the agent makes progress, then issues a non-progress call, and never recovers.
Drift is a tool-selection failure, not retrieval failure: In 78.0% of default cases, a valid progress tool had already been retrieved before the non-progress call.
Recency bias: Models over-select recently retrieved tools; older valid tools are often ignored even when they reappear.
Semantically misleading blocks are rarely invoked (≤3%), but implicit failure blocks are frequently invoked and cause value contamination (42.2% Value Reused across models).
Model-specific termination policies:
- GPT-5.4: Surrenders (77.3% of failures) despite solvability.
- DeepSeek-V4-Flash, Llama-3.3-70B-Instruct: Commit wrong tool values (58.8% and 81.7% respectively).
- Gemini-3.5-Flash: Search exhaustion (90.8% of failures) – keeps searching without progress.
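The "Irrecoverable Drift" category can be made concrete with a small classifier over a failed trajectory. This is one possible reading of the description above, not the authors' annotation procedure: each tool call is reduced to a boolean progress flag, and the function name and the exact rule are assumptions.

```python
def is_irrecoverable_drift(progress_flags, solved):
    """Classify a trajectory as irrecoverable drift.

    progress_flags[i] is True if the i-th tool call advanced along a valid path.
    Assumed rule: the query failed, at least one call made progress, a later
    call made no progress, and no call after that made progress again.
    """
    if solved or True not in progress_flags:
        return False
    first_progress = progress_flags.index(True)
    try:
        drift_at = progress_flags.index(False, first_progress + 1)
    except ValueError:
        return False  # no non-progress call after progress began
    return not any(progress_flags[drift_at + 1:])


drifted = is_irrecoverable_drift([True, True, False, False], solved=False)
never_started = is_irrecoverable_drift([False, False], solved=False)
recovered = is_irrecoverable_drift([True, False, True], solved=False)
print(drifted, never_started, recovered)  # True False False
```

Under this rule, a trajectory that never makes progress is a different failure type, which matches the observation that most failures come after partial progress rather than from a failure to start.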

Theoretical and Practical Implications

Theoretical Significance

Confirms that adaptive planning under partial observability is a distinct challenge from tool selection or task decomposition alone. The concept of bi-directional anticipation (forward from known evidence, backward from desired outcome) is formalized and empirically shown to be necessary.
Highlights a hierarchical failure model: (1) solution-path execution weakness → (2) amplified by corrupted tools → (3) manifested through model-specific termination biases.

Practical Implications

LLM agents deployed in real-world tool ecosystems (e.g., enterprise MCP servers, web APIs) are likely to face the conditions modeled in PlanBench-XL: partial visibility, unreliable retrieval, and silent tool failures.
Silent failures are especially dangerous – agents need mechanisms to detect and recover from superficially plausible but wrong tool outputs.
Test-time compute scaling alone is insufficient, as the limited gains from enforced exploration show; deeper architectural changes (e.g., dedicated planning modules, uncertainty-aware re-selection) are needed.
The benchmark can serve as an RL playground for training agents that actively explore, detect unreliability, and re-plan.

Conclusion

PlanBench-XL introduces a scalable, interactive benchmark for evaluating long-horizon adaptive planning in massive, retrieval-mediated tool ecosystems. Through 327 retail tasks over 1,665 tools with bi-directional exploration and retrieval-time blocking, the benchmark reveals that current LLM agents remain brittle:

Even frontier models struggle with reliable tool discovery and exploitation.
Silent failures disrupt planning more than explicit errors or misleading tools.
Longer recovery paths amplify failure rates, and extra interaction brings only marginal gains.
The dominant failure mode is trajectory drift: agents make partial progress, then diverge and rarely recover.

Future directions (Appendix F.3) include extending to multiple domains, integrating more realistic dynamic failures, and using PlanBench-XL as a training environment for robust agentic planning.