T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search
Summary (Overview)
- Problem Focus: Proposes a novel red-teaming framework for LLM agents integrated with tools via standards like MCP, addressing vulnerabilities that emerge through multi-step tool execution, not just harmful text generation.
- Core Solution: T-MAP (Trajectory-aware MAP-Elites), an evolutionary algorithm that uses execution trajectories to guide the search for adversarial prompts. It incorporates two key mechanisms: Cross-Diagnosis for prompt-level strategy transfer and a Tool Call Graph (TCG) for action-level guidance.
- Key Results: Outperforms all baselines across five diverse MCP environments (CodeExecutor, Slack, Gmail, Playwright, Filesystem), achieving an average Attack Realization Rate (ARR) of 57.8%. It also discovers a greater number of distinct, successful attack trajectories.
- Generalization: Effective against nine frontier LLMs (e.g., GPT-5.2, Gemini-3-Pro) and extends to complex Multi-MCP chain environments requiring cross-server tool execution.
- Impact: Demonstrates that trajectory-aware evolution is essential for uncovering previously underexplored, operationally significant vulnerabilities in autonomous LLM agents.
Introduction and Theoretical Foundation
The deployment of LLM agents capable of executing tools via protocols like the Model Context Protocol (MCP) shifts safety risks from harmful text generation to harmful environmental actions (e.g., financial loss, data exfiltration). Prior red-teaming efforts have focused on eliciting harmful text, overlooking vulnerabilities that only emerge through complex, multi-step tool execution and planning.
Existing automated red-teaming methods, including those based on the MAP-Elites algorithm for diversity, operate primarily at the text level. Frameworks for evaluating agent safety exist but are limited to static threat assessment or operate in fixed environments. The open problem is the systematic discovery of diverse, multi-step harmful actions in open-ended agent settings.
This paper formalizes red-teaming for LLM agents, where success is measured by whether harmful objectives are realized through actual tool execution. The goal is to discover attack prompts $x$ that cause a target agent to generate an interactive trajectory comprising reasoning, actions, and observations over $T$ steps:

$$\tau = \big( (r_1, a_1, o_1), \dots, (r_T, a_T, o_T) \big),$$

where $r_t$ is the agent's reasoning, $a_t$ the tool call issued, and $o_t$ the resulting environment observation at step $t$. The harmfulness of $\tau$ is evaluated by an LLM-as-a-judge.
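The trajectory structure above can be sketched as a simple data type (all names here are illustrative, not taken from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    reasoning: str    # r_t: the agent's reasoning at step t
    action: str       # a_t: the tool call issued at step t
    observation: str  # o_t: the environment's response at step t

# A trajectory tau is the ordered sequence of (r_t, a_t, o_t) triples.
Trajectory = List[Step]

def tool_sequence(tau: Trajectory) -> List[str]:
    """Project a trajectory onto its tool-call sequence (the view the TCG consumes)."""
    return [step.action for step in tau]
```

The projection onto tool calls is what later feeds the Tool Call Graph, while the full triples feed the LLM-as-a-judge.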
Methodology
T-MAP is a Trajectory-aware MAP-Elites algorithm. It maintains a two-dimensional archive spanning risk categories (e.g., property loss, data leakage) and attack styles (e.g., role-playing, hypothetical framing), storing the best-performing prompt and its trajectory for each (risk category, attack style) cell.
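A minimal sketch of such a two-dimensional archive, assuming each cell keys on a (risk category, attack style) pair and keeps only the highest-scoring elite (class and field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Elite:
    prompt: str
    trajectory: list = field(default_factory=list)
    score: int = 0  # judge-assigned success level for this elite

class Archive:
    """MAP-Elites archive over risk categories x attack styles."""
    def __init__(self, risk_categories, attack_styles):
        self.cells = {(r, s): None for r in risk_categories for s in attack_styles}

    def update(self, risk: str, style: str, candidate: Elite) -> bool:
        """Replace the cell's elite only if the candidate scores strictly higher."""
        current: Optional[Elite] = self.cells[(risk, style)]
        if current is None or candidate.score > current.score:
            self.cells[(risk, style)] = candidate
            return True
        return False
```

Keeping one elite per cell is what forces the search to cover the full risk-by-style grid rather than collapsing onto one attack pattern.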
The algorithm proceeds through an iterative four-step cycle:
- Initialization: The archive is populated with seed prompts for each cell.
- Parent-Target Selection: A high-success parent cell and a uniformly sampled target cell are selected.
- Trajectory-Guided Mutation: A new candidate prompt is generated for the target cell using two complementary mechanisms:
- Cross-Diagnosis (Prompt-level): An LLM Analyst extracts success factors from the parent cell's trajectory and failure causes from the target cell's trajectory.
- Tool Call Graph (Action-level): A TCG $G = (V, E, M)$ is maintained, where $V$ is the set of tools, $E \subseteq V \times V$ the directed edges for tool transitions, and $M$ maps each edge to metadata. For an edge $(u, v) \in E$, the metadata is a tuple counting successes and failures and recording failure reasons. The LLM Mutator uses this graph to query empirical success rates and avoid high-failure transitions.
- Evaluation and Update: The new prompt is executed, and its trajectory is evaluated by an LLM Judge into discrete success levels. If superior, it replaces the elite in the target archive cell. The TCG is then updated with edge-level outcomes from the new trajectory.
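The edge-level bookkeeping the cycle relies on can be sketched as follows; the metadata field names (successes, failures, reasons) mirror the tuple described above but are otherwise illustrative:

```python
from collections import defaultdict

class ToolCallGraph:
    """Directed graph over tools; each edge (u, v) tracks empirical outcomes
    of the u -> v transition across all observed trajectories."""
    def __init__(self):
        self.edges = defaultdict(lambda: {"successes": 0, "failures": 0, "reasons": []})

    def record(self, tool_seq, succeeded, reason=None):
        """Update every consecutive tool transition observed in one trajectory."""
        for u, v in zip(tool_seq, tool_seq[1:]):
            meta = self.edges[(u, v)]
            if succeeded:
                meta["successes"] += 1
            else:
                meta["failures"] += 1
                if reason:
                    meta["reasons"].append(reason)

    def success_rate(self, u, v):
        """Empirical success rate the Mutator can query; None if never observed."""
        meta = self.edges[(u, v)]
        total = meta["successes"] + meta["failures"]
        return meta["successes"] / total if total else None
```

In T-MAP this rate would steer the Mutator away from transitions with a history of errors, complementing the prompt-level signal from Cross-Diagnosis.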
Empirical Validation / Results
Experiments were conducted across five MCP environments with GPT-5-mini as the primary target agent. Attack success was categorized into four levels:
- L0 (Refused): Agent refuses; no tool calls.
- L1 (Error): Tool calls issued but raise an error.
- L2 (Weak Success): Tool calls issued, harmful intent partially completed.
- L3 (Realized): Harmful intent fully realized through tool execution. Attack Realization Rate (ARR) is the proportion of L3 trajectories.
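The four-level rubric and the ARR metric reduce to a short computation (enum names are illustrative):

```python
from enum import IntEnum

class Level(IntEnum):
    L0_REFUSED = 0   # agent refuses; no tool calls
    L1_ERROR = 1     # tool calls issued but raise an error
    L2_WEAK = 2      # harmful intent partially completed
    L3_REALIZED = 3  # harmful intent fully realized via tool execution

def attack_realization_rate(levels):
    """ARR: the proportion of trajectories judged L3 (fully realized)."""
    return sum(1 for lv in levels if lv == Level.L3_REALIZED) / len(levels)
```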
Baselines: Zero-Shot (ZS), Multi-Trial (MT), Iterative Refinement (IR), and Standard Evolution (SE).
Main Results
Table 1: Comparison of refusal rate (RR, ↓) and attack realization rate (ARR, ↑) across different MCP environments.
| Method | CodeExecutor (RR / ARR) | Slack (RR / ARR) | Gmail (RR / ARR) |
|---|---|---|---|
| ZS | 100.0 / 0.0 | 90.6 / 0.0 | 84.4 / 0.0 |
| MT | 73.4 / 1.6 | 48.4 / 10.9 | 65.6 / 3.1 |
| IR | 70.3 / 3.1 | 34.4 / 10.9 | 45.3 / 15.6 |
| SE | 17.2 / 48.4 | 26.6 / 28.1 | 20.3 / 10.9 |
| T-MAP | 14.1 / 56.2 | 15.6 / 64.1 | 9.4 / 46.9 |
- Superior Performance: T-MAP achieves the highest ARR and lowest RR in all environments (average ARR 57.8%).
- Evolution & Coverage: T-MAP converges faster and achieves higher ARR than baselines over iterations (Fig. 4). Its archive coverage heatmaps show a wide distribution of realized attacks (L3), unlike baselines which are dominated by weak success or localized success (Fig. 5).
- Diversity: T-MAP discovers the largest number of distinct successful tool-invocation sequences on average, while maintaining low lexical (Self-BLEU = 0.25) and semantic (cosine similarity = 0.47) similarity among elite prompts (Table 2).
- Judge Reliability: The judge model (DeepSeek-V3.2) shows high correlation with other model judges and human annotators (Spearman > 0.83).
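The two diversity notions above can be sketched in a few lines. As a stand-in for the paper's metrics, this counts order-sensitive unique tool sequences and uses bag-of-words cosine similarity (the paper presumably uses embedding-based cosine; this simplification and all names are my own):

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def distinct_tool_sequences(trajectories):
    """Count unique successful tool-invocation sequences (order-sensitive)."""
    return len({tuple(seq) for seq in trajectories})

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(prompts):
    """Average cosine similarity over all prompt pairs; lower means more diverse."""
    vecs = [Counter(p.lower().split()) for p in prompts]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```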
Target Model Generalization
T-MAP maintains high ARR across nine frontier models, including GPT-5.2, Gemini-3-Pro, and Claude models (Fig. 6). Attacks discovered on one model (GPT-5.2) show significant cross-model transferability (Fig. 7).
Ablation Study
Table 4: Ablation results of T-MAP, averaged across all five MCP environments.
| Method | L0 (↓) | L1 (↓) | L2 (↑) | L3 (↑) | (↑) |
| :--- | :---: | :---: | :---: | :---: | :---: |
| w/o TCG | 13.09 | 20.13 | 21.09 | 45.71 | 21.38 |
| w/o Cross-Diagnosis | 15.63 | 11.51 | 23.05 | 49.81 | 21.13 |
| T-MAP | 11.93 | 10.95 | 18.75 | 58.40 | 23.88 |
Removing the TCG increases errors (L1) and reduces realized attacks (L3). Removing Cross-Diagnosis increases refusals (L0). Both components are crucial for maximizing attack success and diversity.
Generalization to Multi-MCP Chains
In complex settings where agents chain tools across multiple MCP servers (e.g., Slack+CodeExecutor), T-MAP consistently achieves the highest ARR (Fig. 8). Critically, 46.28% of its unique trajectories involve cross-server tool invocations, far exceeding baselines (14–23%) (Table 5), demonstrating its ability to discover sophisticated, multi-domain attack strategies.
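The cross-server statistic reported above amounts to the following check over unique trajectories, assuming each tool call carries a `server.tool` naming convention (an illustrative convention, not the paper's):

```python
def cross_server_fraction(trajectories):
    """Fraction of unique trajectories whose tool calls span more than one MCP server.
    Each call is assumed to be tagged 'server.tool', e.g. 'slack.post_message'."""
    unique = {tuple(seq) for seq in trajectories}

    def spans_servers(seq):
        return len({call.split(".", 1)[0] for call in seq}) > 1

    return sum(spans_servers(seq) for seq in unique) / len(unique)
```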
Theoretical and Practical Implications
- Theoretical: Formalizes a new objective for agent red-teaming—attack realization through tool execution—and demonstrates the necessity of incorporating trajectory-level feedback (via Cross-Diagnosis and TCG) into evolutionary search to effectively explore this objective.
- Practical: Provides a scalable, automated framework for proactively discovering severe, operationally relevant vulnerabilities in deployed LLM agents before they cause real harm. The discovered diverse attack trajectories and prompts can inform the development of more robust safety guardrails and monitoring systems for autonomous agents.
Conclusion
T-MAP effectively red-teams LLM agents by leveraging trajectory-aware evolutionary search. It significantly outperforms existing methods in realizing harmful multi-step tool executions across diverse environments and frontier models. The work highlights that vulnerabilities in autonomous agents are qualitatively different from those in chat-based LLMs and that trajectory-aware evolution is critical for their safe deployment.
Limitations & Ethics: Experiments were conducted in sandboxed environments, and deployed safeguards may lower real-world ARR. The framework's effectiveness may diminish as model safety alignment improves. The authors acknowledge the dual-use concern and state that the work is intended solely for proactive vulnerability discovery to improve agent safety; sensitive details are redacted.