FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use - Summary

Summary (Overview)

  • First Real-World, Runnable Financial Tool Benchmark: Introduces FinToolBench, a benchmark comprising 760 executable financial tools and 295 tool-required queries, moving beyond static QA to evaluate agents in a realistic, execution-grounded environment.
  • Finance-Aware Evaluation Framework: Proposes novel evaluation metrics that go beyond binary execution success to assess compliance with financial constraints: Timeliness Mismatch Rate (TMR), Intent Mismatch Rate (IMR), and Domain Mismatch Rate (DMR).
  • Baseline Agent (FATR): Presents FATR (Finance-Aware Tool Routing), a lightweight baseline that injects finance-specific attributes into tool cards and employs stabilization techniques to improve agent reliability and compliance.
  • Key Findings: Experiments reveal a trade-off between aggressive tool use and precision. Models like Qwen3-8B show high tool invocation but low execution success, while GPT-4o is highly precise but conservative, often failing to invoke necessary tools. Explicit finance attribute injection improves conditional execution reliability and reduces compliance mismatches.

Introduction and Theoretical Foundation

The integration of Large Language Models (LLMs) into finance is shifting the paradigm from passive information retrieval to dynamic, agentic interaction. However, existing financial benchmarks primarily focus on static textual analysis or document-based QA, ignoring the complexities of real-world tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, which is characterized by high stakes, strict compliance, and rapid data volatility.

Current evaluation methods are blind to three critical failure modes essential for financial reliability:

  1. Timeliness: A query for "current" data is fundamentally unanswered if the agent retrieves stale data, even with a syntactically perfect API call.
  2. Intent Restraint: Agents must strictly differentiate between informational queries and transactional actions, never escalating to execution without explicit authorization.
  3. Domain Alignment: The chosen tool chain must strictly adhere to the regulatory and market domain of the query (e.g., using equity tools for a cryptocurrency query is a domain hallucination).
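The intent-restraint failure mode above can be sketched as a simple guard. This is an illustrative check, not the paper's implementation; the intent levels follow the `intent_type` values described later in the attribute schema (informational, advisory, transactional).

```python
# Hypothetical guard illustrating "Intent Restraint": a query at one intent
# level must never route to a tool at a higher (more action-taking) level
# without explicit authorization. Ranks are an assumption for illustration.
INTENT_RANK = {"informational": 0, "advisory": 1, "transactional": 2}

def escalates_intent(query_intent: str, tool_intent: str) -> bool:
    """True when the tool's intent level exceeds what the query authorizes."""
    return INTENT_RANK[tool_intent] > INTENT_RANK[query_intent]
```

For example, an informational query ("what is AAPL trading at?") routed to a transactional order-placement tool would trip this guard.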

FinToolBench is introduced to bridge this gap. It is a runnable benchmark built from real free-tier tools and tool-required questions, designed to produce auditable tool traces and evaluate both capability (invocation/execution success) and compliance (adherence to finance-specific constraints).

Methodology

1. Benchmark Construction Pipeline

The construction of FinToolBench follows an eight-stage pipeline (Figure 2):

  • Stages 1-4 (Tool Inventory): Collect raw tools from RapidAPI (diverse, real-time) and AkShare (stable, open-source). Tools are filtered for executability, normalized into a unified manifest schema, and annotated with finance attributes.
  • Stages 5-8 (Question Set): Source questions from existing finance QA datasets (e.g., FinanceBench). Filter for tool-required queries, align them with tools via semantic retrieval and LLM verification, and conduct human-in-the-loop quality assurance.

2. Tool Inventory & Annotation

The final tool library contains 760 tools (filtered from 5,470 candidates). Each tool is annotated with a lightweight finance attribute schema:

Table 1: Finance attribute schema used in FinToolBench.

| Attribute | Values | Evaluation role |
| --- | --- | --- |
| `update_frequency` | realtime, daily, as_filed, periodic, static | Penalize stale calls when timeliness is required. |
| `intent_type` | informational, advisory, transactional | Penalize escalation beyond user intent. |
| `regulatory_domain` | set-valued | Penalize domain-mismatched tool usage. |
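A manifest entry carrying these attributes might look like the following sketch. The tool name, description, and domain values are hypothetical; only the attribute field names and value vocabularies come from Table 1.

```python
# Hypothetical manifest entry following the Table 1 attribute schema.
quote_tool = {
    "tool_name": "stock_quote_latest",            # illustrative name
    "description": "Fetch the latest traded price for an equity ticker.",
    "update_frequency": "realtime",   # realtime | daily | as_filed | periodic | static
    "intent_type": "informational",   # informational | advisory | transactional
    "regulatory_domain": {"equities"},  # set-valued
}

def is_stale_for(query_needs_realtime: bool, tool: dict) -> bool:
    """A timeliness check in the spirit of TMR: a query for 'current' data
    served by a non-realtime tool counts as a stale call."""
    return query_needs_realtime and tool["update_frequency"] != "realtime"
```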

Tool execution is logged in a standardized trace format for auditing:

Table 2: Normalized tool-trace fields.

| Field | Description |
| --- | --- |
| `step` | The sequential order of the call within the multi-turn process. |
| `tool_name` | The identifier of the specific tool invoked. |
| `parameters` | The JSON-formatted arguments generated by the model. |
| `output` | The tool response, including data or structured error messages. |
| `error` | Overall execution status. |
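A minimal sketch of this trace record, assuming a JSON serialization for auditing. The field names follow Table 2; the dataclass layout and `error` being a boolean are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class ToolTraceStep:
    """One normalized tool-trace record (fields follow Table 2)."""
    step: int         # sequential order within the multi-turn process
    tool_name: str    # identifier of the invoked tool
    parameters: dict  # JSON-formatted arguments generated by the model
    output: str       # tool response: data or a structured error message
    error: bool       # overall execution status (assumed: True = failed)

def dump_trace(trace: list) -> str:
    """Serialize a trace as JSON for auditing."""
    return json.dumps([asdict(s) for s in trace], indent=2)
```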

3. Finance-Aware Tool Routing (FATR) Baseline

FATR is a model-agnostic baseline that operationalizes financial constraints (Figure 3). Its pipeline:

  1. Tool Retrieval: Uses BGE-M3 embeddings to retrieve a small candidate set (Top-K=20) of tools for a given query.
  2. Tool Card Formatting: Presents retrieved tools as Tool Cards that include the finance attributes [Timeliness: ...], [Intent Type: ...], [Regulatory Domains: ...] (Figure 4).
  3. Attribute-Aware Planning: An LLM planner runs a ReAct loop, first inferring the query's implied constraints (T(q), I(q), D(q)), then performing constraint-aware planning to select compatible tools.
  4. Stabilized Execution: Implements caching, retries, and output compression to handle API instability.
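Steps 1 and 2 of the pipeline can be sketched as follows. The toy vectors stand in for BGE-M3 embeddings (the paper's actual retriever), and the card template mirrors the bracketed attribute tags described above; the function names and dict layout are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, tool_vecs, k=20):
    """Step 1: rank tools by embedding similarity (BGE-M3 in the paper;
    toy vectors here) and keep the Top-K candidate set."""
    ranked = sorted(tool_vecs,
                    key=lambda name: cosine(query_vec, tool_vecs[name]),
                    reverse=True)
    return ranked[:k]

def format_tool_card(tool: dict) -> str:
    """Step 2: render a Tool Card with the injected finance attributes."""
    return (
        f"{tool['tool_name']}: {tool['description']}\n"
        f"[Timeliness: {tool['update_frequency']}] "
        f"[Intent Type: {tool['intent_type']}] "
        f"[Regulatory Domains: {', '.join(sorted(tool['regulatory_domain']))}]"
    )
```

The planner in step 3 would then see only these K attribute-tagged cards, so constraint compatibility can be checked before any call is made.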

4. Evaluation Metrics

Evaluation uses two groups of metrics derived from the tool trace:

  • Capability Metrics:
    • Tool Invocation Rate (TIR): Fraction of samples with non-empty tool calls.
    • Tool Execution Success Rate (TESR): Fraction of samples whose tool-augmented traces execute successfully.
    • Conditional Execution Rate (CER): CER = TESR / TIR (execution success given invocation).
    • Soft Score & CSS: LLM-judged answer correctness (Soft Score overall, CSS conditional on successful execution).
  • Compliance Metrics (Mismatch Rates): For a question q with trace τ = {(t_k, x_k, o_k)}_{k=1}^m, a mismatch occurs if any tool call violates the corresponding constraint. The rates are averaged over questions with tool calls.
    • Timeliness Mismatch Rate (TMR): TMR(q, τ) = 1[∃k: J_T(q, A(t_k), τ_k) = 0]
    • Intent Mismatch Rate (IMR): IMR(q, τ) = 1[∃k: J_I(q, A(t_k), τ_k) = 0]
    • Domain Mismatch Rate (DMR): DMR(q, τ) = 1[∃k: J_D(q, A(t_k), τ_k) = 0]
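The metric definitions above can be made concrete with a small sketch. The trace and judgment encodings here are assumptions for illustration: a trace is a list of (tool_name, success) calls, and a judge output of 0 marks a constraint violation, matching the indicator-function definitions.

```python
def capability_metrics(traces):
    """Compute TIR, TESR, and CER from per-question traces.
    Each trace is a list of (tool_name, ok) calls; an empty list
    means no tool was invoked for that question."""
    n = len(traces)
    invoked = [t for t in traces if t]                      # non-empty tool calls
    succeeded = [t for t in invoked if all(ok for _, ok in t)]
    tir = len(invoked) / n                                  # Tool Invocation Rate
    tesr = len(succeeded) / n                               # Tool Execution Success Rate
    cer = tesr / tir if tir else 0.0                        # CER = TESR / TIR
    return tir, tesr, cer

def mismatch_rate(judgments):
    """Per-question mismatch indicator, averaged over questions with tool calls.
    judgments[q] is a list of 0/1 judge outputs J(q, A(t_k), τ_k), one per call;
    a question mismatches if ANY call is judged 0 (the ∃k in the definitions)."""
    with_calls = [js for js in judgments if js]
    if not with_calls:
        return 0.0
    return sum(1 for js in with_calls if any(j == 0 for j in js)) / len(with_calls)
```

The same `mismatch_rate` function applies to TMR, IMR, and DMR: only the judge J_T, J_I, or J_D producing the 0/1 labels differs.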

Empirical Validation / Results

Main Results

Table 3: Main results on FinToolBench.

| Model | TIR | TESR | CER ↑ | Soft Score ↑ | CSS ↑ | TMR ↓ | IMR ↓ | DMR ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Doubao-Seed-1.6 | 0.6508 | 0.3254 | 0.5000 | 0.3958 | 0.4627 | 0.3438 | 0.6563 | 0.1719 |
| Qwen3-8B | 0.8712 | 0.2949 | 0.3385 | 0.4234 | 0.4040 | 0.3307 | 0.6887 | 0.1673 |
| GLM-4.7-Flash | 0.4407 | 0.2102 | 0.4769 | 0.2769 | 0.3791 | 0.4615 | 0.7231 | 0.1769 |
| GPT-4o | 0.2267 | 0.1400 | 0.6176 | 0.2302 | 0.6700 | 0.3529 | 0.5000 | 0.1176 |

  • Qwen3-8B is the most aggressive tool user (highest TIR) and achieves the highest Soft Score, but suffers from frequent execution/argument errors (lowest CER).
  • Doubao-Seed-1.6 shows the most balanced performance, with the highest TESR and strong CER.
  • GPT-4o is highly conservative (lowest TIR) but very precise when it does use tools, achieving the highest CER and CSS, and the lowest DMR.
  • GLM-4.7-Flash exhibits the weakest performance with high mismatch rates.

Impact of Finance Attribute Injection

An ablation study comparing FATR with and without finance tags in tool cards shows that attribute injection:

  • Slightly reduces tool invocation (TIR) as the planner becomes more cautious.
  • Improves conditional execution reliability (CER) and reduces compliance mismatch rates (TMR, IMR, DMR).
  • Demonstrates that explicit constraint awareness guides better tool selection and improves compliance alignment.

Tool Usage Distribution

Analysis of runs (Figure 6) shows:

  • 103 runs ended with no tool call.
  • 114 runs used a single tool.
  • 78 runs required multiple tools (with 3-tool traces being most common at 35.9%). This distribution justifies reporting both coverage (TIR) and conditional reliability (CER) metrics.

Theoretical and Practical Implications

  • Theoretical: FinToolBench establishes a new evaluation paradigm for financial AI agents, shifting focus from static answer correctness to dynamic, trace-level compliance auditing. It formally operationalizes domain-specific constraints (timeliness, intent, domain) as evaluable metrics.
  • Practical: The benchmark provides a standardized, runnable testbed for developing and auditing trustworthy financial agents. The FATR baseline offers a practical blueprint for building agents that are both capable and compliant. By open-sourcing the tool manifest and evaluation code, the work lowers the barrier to entry for rigorous financial agent evaluation, promoting reproducibility and comparability across future research.

Conclusion

FinToolBench addresses a critical gap in evaluating LLM agents for real-world financial tool use. By coupling a large-scale, executable tool ecosystem with a finance-aware evaluation framework, it sets a new standard for assessing both the capability and regulatory compliance of AI agents in high-stakes domains. Key takeaways include:

  • Effective financial agents must balance aggressive tool use with execution precision and strict compliance.
  • Making financial constraints explicit to the agent (via attribute injection) improves tool selection and reduces domain mismatches.
  • Future work should extend coverage to proprietary data feeds and study agent robustness against tool API drift.

Artifact Availability: The implementation is available at: https://github.com/Double-wk/FinToolBench.git.