AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Summary (Overview)

Framework: Introduces a lightweight and scalable framework for aligning AI agents (like OpenClaw) with safety and security requirements, addressing emergent risks from advanced execution scenarios.
Core Model: Develops AgentDoG 1.5, a family of small (0.8B to 8B parameter) diagnostic guardrail models trained with a novel data engine using only ~1k purified samples, achieving performance comparable to large frontier models (e.g., GPT-5.4).
Benchmark & Taxonomy: Updates and extends a three-dimensional agent safety taxonomy (Risk Source, Failure Mode, Real-world Harm) and the ATBench family of benchmarks (ATBench, ATBench-Claw, ATBench-Codex) to cover diverse agent execution environments.
Applications: Demonstrates two key applications: 1) Agentic Safety SFT & RL Training, where AgentDoG 1.5 serves as a data filter and reward model, and 2) Online Agent Safety Guardrail, where it acts as a training-free, low-latency runtime monitor for agent trajectories.
Performance: Shows state-of-the-art performance on trajectory-level safety evaluation and fine-grained risk diagnosis across multiple benchmarks, while enabling efficient deployment (e.g., reducing training environment overhead by two orders of magnitude).

Introduction and Theoretical Foundation

The rapid development of agentic AI systems (e.g., OpenClaw, Hermes) with powerful cross-environment execution capabilities introduces broad new safety risk surfaces. Concurrently, advanced frontier AI models lower the barrier for adversarial attacks, rendering existing agent alignment frameworks inadequate.

The paper argues that a robust alignment framework requires three key components:

A clear and standardized agentic safety taxonomy for unified evaluation.
A lightweight and scalable agentic safety training pipeline integrating a data engine, a safety verifier, and an efficient training environment.
A training-free system for online agent safety, including a systematic architecture and a lightweight guard model for low-cost runtime supervision.

The proposed framework builds upon the trajectory-level safety diagnosis concept from AgentDoG and ATBench, which decomposes safety analysis into three dimensions: Risk Source, Failure Mode, and Real-world Harm. This structured approach is essential because unsafe outcomes in agent systems are multi-faceted, arising from diverse sources (user instructions, tool descriptions, etc.), manifesting as specific failures (incorrect tool calls, over-privileged actions, etc.), and leading to concrete harms (privacy leakage, system damage, etc.).

Methodology

1. Updated Safety Taxonomy and ATBench Family

The framework retains the original three-dimensional decomposition but introduces a customization mechanism to adapt to new agent execution settings (like Codex and OpenClaw) without losing cross-setting comparability.

Customization Operations:
- Adding new leaf categories: For risks not covered by existing labels (e.g., OpenClaw's session contamination, approval bypass; Codex's repository artifact injection, dependency/MCP supply-chain compromise).
- Strengthening inherited categories: Sharpening the operational scope of base concepts for new settings (e.g., failure to validate tool outputs in Codex specifically covers validation of test outputs, build logs, etc.).
Benchmark Instances:
- ATBench: Base setting for general tool-use agents (1,000 trajectories).
- ATBench-Claw: Extended for OpenClaw agents, focusing on sessions, approvals, routing, etc. (500 trajectories).
- ATBench-Codex: Extended for Codex agents, focusing on repository artifacts, shell execution, patches, etc. (500 trajectories).

All instances share the same trajectory-level diagnosis task (binary safe/unsafe judgment + three fine-grained labels).
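The two customization operations can be pictured as operations on a small data structure. The sketch below is only an illustration under stated assumptions: the function `customize`, the dictionary layout, and the dimension assigned to each new leaf are hypothetical, and the leaf names are drawn from the examples in the text rather than from the full label set.

```python
# Illustrative base taxonomy: three dimensions, each holding leaf categories.
# Leaf names come from examples in the text; the real label set is larger.
BASE_TAXONOMY = {
    "risk_source": {"user_instruction", "tool_description"},
    "failure_mode": {"incorrect_tool_call", "over_privileged_action",
                     "failure_to_validate_tool_outputs"},
    "real_world_harm": {"privacy_leakage", "system_damage"},
}


def customize(base, added_leaves=None, scope_notes=None):
    """Derive a setting-specific taxonomy without dropping any base leaf.

    added_leaves: {dimension: iterable of new leaf names}
    scope_notes:  {(dimension, inherited_leaf): sharpened scope description}
    """
    taxonomy = {dim: set(leaves) for dim, leaves in base.items()}
    for dim, leaves in (added_leaves or {}).items():
        if dim not in taxonomy:
            raise KeyError(f"unknown dimension: {dim}")
        taxonomy[dim].update(leaves)  # operation 1: add new leaf categories
    notes = {}
    for (dim, leaf), note in (scope_notes or {}).items():
        if leaf not in base.get(dim, set()):
            raise KeyError(f"{leaf!r} is not an inherited leaf of {dim!r}")
        notes[(dim, leaf)] = note  # operation 2: strengthen an inherited category
    return taxonomy, notes


# Hypothetical Codex customization; the dimension chosen for each new leaf is assumed.
codex_taxonomy, codex_notes = customize(
    BASE_TAXONOMY,
    added_leaves={"risk_source": ["repository_artifact_injection",
                                  "dependency_mcp_supply_chain_compromise"]},
    scope_notes={("failure_mode", "failure_to_validate_tool_outputs"):
                 "covers validation of test outputs and build logs"},
)
```

Because base leaves are never removed in this sketch, a label assigned under the base taxonomy remains valid in every customized setting, which is one simple way to preserve cross-setting comparability.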

2. AgentDoG 1.5 Model Construction

Task Definition: AgentDoG 1.5 performs two diagnostic tasks:

Trajectory-level safety evaluation: Predict a binary label $y \in \{\text{safe}, \text{unsafe}\}$ for a given trajectory $T = \{t_1, \ldots, t_n\}$, where $y = \text{unsafe} \iff \exists i \in \{1, \ldots, n\}: \text{Unsafe}(t_i) = \text{True}$.
Fine-grained risk diagnosis: For unsafe trajectories, predict diagnostic labels $y_{fine} = (\ell_{mode}, \ell_{harm}, \ell_{risk}) \in L_{mode} \times L_{harm} \times L_{risk}$.
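The trajectory-level rule is an "any unsafe step" aggregation, and the fine-grained labels only apply once a trajectory is judged unsafe. A minimal sketch of that control flow follows; the step predicate and the fine-grained labeller are hypothetical stand-ins for the model's judgment, not part of the described system.

```python
from typing import Callable, Optional, Sequence, Tuple

FineLabels = Tuple[str, str, str]  # (failure mode, real-world harm, risk source)


def trajectory_label(steps: Sequence[str],
                     is_unsafe_step: Callable[[str], bool]) -> str:
    """Return "unsafe" if at least one step is unsafe, otherwise "safe"."""
    return "unsafe" if any(is_unsafe_step(t) for t in steps) else "safe"


def diagnose(steps: Sequence[str],
             is_unsafe_step: Callable[[str], bool],
             fine_grained: Callable[[Sequence[str]], FineLabels],
             ) -> Tuple[str, Optional[FineLabels]]:
    """Run both tasks; fine-grained labels are produced only for unsafe trajectories."""
    y = trajectory_label(steps, is_unsafe_step)
    y_fine = fine_grained(steps) if y == "unsafe" else None
    return y, y_fine


# Toy stand-ins (assumptions for illustration only).
def toy_step_check(step: str) -> bool:
    return "rm -rf" in step


def toy_fine_grained(steps: Sequence[str]) -> FineLabels:
    return ("over_privileged_action", "system_damage", "user_instruction")


safe_result = diagnose(["ls /tmp", "cat notes.txt"], toy_step_check, toy_fine_grained)
unsafe_result = diagnose(["ls /tmp", "rm -rf /"], toy_step_check, toy_fine_grained)
```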

Data Preparation via Taxonomy-Guided Data Engine:

Data Collection: A planner-based pipeline synthesizes long-horizon, tool-augmented interaction trajectories.
- Stage 1: Planning: Samples a risk configuration tuple (one category from each taxonomy dimension) and determines safety outcome (safe/unsafe).
- Stage 2: Trajectory Synthesis: Instantiates the sketch into a complete multi-turn trace, with safe/unsafe variants.
- Stage 3: Automatic Validation: Dual-layer (rule + model) filtering for structural and semantic quality.
Reasoning Chain-of-Thought (CoT) Augmentation: GPT-5.4 generates detailed step-by-step reasoning for each training sample.
Fine-grained Data Selection and Balance: Soft balancing to prevent overrepresented categories from dominating training.
Data Purification: Uses a preference-aware influence-function-based data selection method to retain the most informative examples (~1k samples).

Let $D_{raw} = \{z_i\}_{i=1}^n$ denote the raw SFT pool. For a set of safety target prompts $Q_{safe}$, paired responses $(y_q^+, y_q^-)$ define desired guardrail behavior. The preference-weighted target gradient, representing the parameter-space change to increase correct risk identification, is constructed as:

$$\hat{g}_{guard} = \frac{1}{|Q_{safe}|} \sum_{q \in Q_{safe}} \hat{\pi}_q \left( \hat{\bar{g}}(q, y_q^+) - \hat{\bar{g}}(q, y_q^-) \right)$$

where $\hat{\pi}_q$ is the normalized likelihood preference for the correct response, and $\hat{\bar{g}}(q, y)$ is the length-normalized gradient of the loss. Each candidate example $z = (x, y)$ is scored by its gradient alignment with this direction: $s_\pi(z) = \hat{g}_z^\top \hat{g}_{guard}$. High-scoring examples are retained.
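The scoring step can be sketched numerically once the gradients are available as vectors. The code below assumes the length-normalized gradients and the preference weights $\hat{\pi}_q$ have already been computed elsewhere (how they are obtained is not covered here), and the function names and the top-k selection rule are illustrative assumptions.

```python
import numpy as np


def guard_direction(pos_grads, neg_grads, prefs):
    """Preference-weighted target gradient.

    pos_grads, neg_grads: (Q, d) arrays of gradients for y_q^+ and y_q^-.
    prefs: (Q,) array of preference weights pi_q (assumed precomputed).
    """
    pos = np.asarray(pos_grads, dtype=float)
    neg = np.asarray(neg_grads, dtype=float)
    w = np.asarray(prefs, dtype=float)[:, None]
    # Average of pi_q * (g(q, y+) - g(q, y-)) over all target prompts.
    return (w * (pos - neg)).mean(axis=0)


def score_candidates(cand_grads, g_guard):
    """Alignment score s(z) = g_z^T g_guard for each candidate row."""
    return np.asarray(cand_grads, dtype=float) @ g_guard


def select_top_k(scores, k):
    """Indices of the k highest-scoring candidates (top-k is an assumed rule)."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]


# Tiny worked example with 2-dimensional "gradients".
g_guard = guard_direction(pos_grads=[[1.0, 0.0], [0.0, 1.0]],
                          neg_grads=[[0.0, 0.0], [0.0, 0.0]],
                          prefs=[1.0, 0.5])
scores = score_candidates([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], g_guard)
kept = select_top_k(scores, k=2)
```

In this toy case the third candidate points against the target direction, receives a negative score, and is dropped.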

Training:

Supervised Fine-Tuning (SFT): Optimizes the conditional likelihood of target outputs:

$$L_{SFT}(\theta) = -\mathbb{E}_{(x,y)\sim D} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$

The Qwen3.5 and Llama-3.1 base models are fine-tuned with a learning rate of $1 \times 10^{-5}$.
Reinforcement Learning (RL): Uses Group Reward-Decoupled Normalization Policy Optimization (GDPO) to refine the SFT policy with verifiable rewards. For each query $q_i$, $G$ responses are sampled and scored along three dimensions (failure mode, real-world harm, risk source), yielding a binary reward vector $(r_1, r_2, r_3)$. GDPO normalizes advantages per dimension, combines them with weights $(w_1, w_2, w_3) = (0.3, 0.4, 0.3)$, and applies batch-level normalization. The token-level clipped surrogate objective is:

$$J_{GDPO}(\theta) = \mathbb{E}_{q_i \sim D, \{o_{i,j}\}_{j=1}^G \sim \pi_{\theta_{old}}(\cdot \mid q_i)} \left[ \frac{1}{G} \sum_{j=1}^{G} \frac{1}{T_{i,j}} \sum_{t=1}^{T_{i,j}} \left( \ell_{i,j,t}^{clip}(\theta) - \beta D_{KL}\left(\pi_\theta(\cdot \mid q_i, o_{i,j}^{<t}) \parallel \pi_{ref}(\cdot \mid q_i, o_{i,j}^{<t})\right) \right) \right]$$

where

$$\ell_{i,j,t}^{clip}(\theta) = \min\left( s_{i,j,t}(\theta) \hat{A}_{sum}^{(i,j)}, \text{clip}\left(s_{i,j,t}(\theta), 1-\epsilon_{low}, 1+\epsilon_{high}\right) \hat{A}_{sum}^{(i,j)} \right)$$

Here $s_{i,j,t}(\theta)$ is the policy ratio, $\hat{A}_{sum}^{(i,j)}$ is the weighted, batch-normalized advantage of response $o_{i,j}$, and $T_{i,j}$ is the number of tokens in $o_{i,j}$. The hyperparameters are $\epsilon_{low}=0.2$, $\epsilon_{high}=0.28$, $\beta=0.001$, $G=8$, and a learning rate of $1 \times 10^{-6}$.
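The advantage computation described above (normalize per reward dimension within each group, combine with the weights, then normalize over the batch) and the asymmetric clipping can be sketched as follows. This is an illustration only: the small `eps` stabilizer, the use of the population standard deviation, and the function names are assumptions, and the KL term is omitted.

```python
import numpy as np


def gdpo_advantages(rewards, weights=(0.3, 0.4, 0.3), eps=1e-6):
    """rewards: (B, G, 3) binary rewards for B queries and G responses each.

    Returns a (B, G) array of combined, batch-normalized advantages.
    """
    r = np.asarray(rewards, dtype=float)
    # Step 1: normalize each reward dimension separately within its group.
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    per_dim = (r - mean) / (std + eps)
    # Step 2: weighted combination across the three dimensions.
    combined = per_dim @ np.asarray(weights, dtype=float)
    # Step 3: batch-level normalization over all responses.
    return (combined - combined.mean()) / (combined.std() + eps)


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Token-level clipped surrogate: min(s * A, clip(s, 1 - e_low, 1 + e_high) * A)."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)


# One query, four sampled responses, three binary reward dimensions.
adv = gdpo_advantages([[[1, 1, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]]])
```

With these weights, a response that is correct only on the second dimension's complement (for example `[1, 0, 0]`) ranks below one that is correct on the other two dimensions (`[0, 1, 1]`), since the per-dimension advantages are combined before the final normalization.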

3. Applications

Application 1: Agentic Safety SFT & RL Training

SFT: AgentDoG 1.5 filters high-quality safety trajectories from a raw ATBench corpus (32,787 pairs), resulting in 28,705 trajectories. Mixed with benign tool-use data, this forms the SFT dataset.
RL: Constructs lightweight RL environments via finite-state Python simulators with injected adversarial risks. AgentDoG 1.5 serves as an external judge to provide safety reward signals. The overall reward $R$ combines task utility $U$ and safety score $S$:

$$R = \begin{cases} U & \text{for clean tasks} \\ S & \text{for malicious query attacks} \\ 0.25 \cdot U \cdot S + 0.25 \cdot S + 0.5 \cdot U & \text{for environment injection attacks} \end{cases}$$

The environment design is highly scalable, supporting over 10,000 concurrent environments on a standard 8-core machine with low latency and a peak memory footprint below 2.5 GB.
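The piecewise reward $R$ defined above translates directly into code. In the sketch below the task-type strings and the function name are hypothetical, and $U$ and $S$ are assumed to lie in $[0, 1]$, which the text does not state explicitly.

```python
def overall_reward(task_type: str, utility: float, safety: float) -> float:
    """Combine task utility U and safety score S according to the task type."""
    if task_type == "clean":
        return utility
    if task_type == "malicious_query":
        return safety
    if task_type == "environment_injection":
        return 0.25 * utility * safety + 0.25 * safety + 0.5 * utility
    raise ValueError(f"unknown task type: {task_type}")


# Environment injection: completing the task safely earns the full reward,
# while completing it unsafely earns only the utility share.
full = overall_reward("environment_injection", utility=1.0, safety=1.0)
unsafe_but_useful = overall_reward("environment_injection", utility=1.0, safety=0.0)
safe_but_useless = overall_reward("environment_injection", utility=0.0, safety=1.0)
```

Under the stated range assumption, the three coefficients sum to 1, so the injection-attack reward stays within $[0, 1]$, and the product term rewards trajectories that are both useful and safe.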

Application 2: Online Agent Safety Guardrail

Design: A Pre-Reply guardrail pipeline deployed at runtime before the agent's final response is delivered. It buffers the execution trace, invokes AgentDoG 1.5 for judgment, and blocks unsafe replies.
Pipeline Stages:
1. Live agent execution (e.g., on OpenClaw).
2. Online guardrail service (trajectory formatting + AgentDoG 1.5 judgment).
3. Runtime monitoring dashboard.
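The buffer, judge, and block-or-release behavior of the Pre-Reply design can be sketched as a small wrapper. Everything here is a hypothetical illustration: the class, its method names, the trajectory format, and the toy judge are assumptions, and a real deployment would call the guard model instead of a string check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

BLOCKED_MESSAGE = "Reply withheld by the safety guardrail."


@dataclass
class PreReplyGuardrail:
    # judge takes a formatted trajectory and returns "safe" or "unsafe".
    judge: Callable[[str], str]
    trace: List[Dict[str, str]] = field(default_factory=list)

    def record(self, role: str, content: str) -> None:
        """Buffer one step of the live execution trace."""
        self.trace.append({"role": role, "content": content})

    def format_trajectory(self) -> str:
        """Flatten the buffered trace into the text passed to the judge."""
        return "\n".join(f"[{s['role']}] {s['content']}" for s in self.trace)

    def release(self, final_reply: str) -> str:
        """Judge the whole trajectory before the final reply is delivered."""
        self.record("agent", final_reply)
        verdict = self.judge(self.format_trajectory())
        return final_reply if verdict == "safe" else BLOCKED_MESSAGE


def toy_judge(trajectory: str) -> str:
    # Stand-in for a guard-model call (assumption for illustration only).
    return "unsafe" if "rm -rf /" in trajectory else "safe"


guard = PreReplyGuardrail(judge=toy_judge)
guard.record("user", "Clean up temporary files.")
guard.record("tool_call", "rm -rf /")
blocked = guard.release("Done, everything is cleaned up.")

benign = PreReplyGuardrail(judge=toy_judge)
benign.record("user", "List temporary files.")
allowed = benign.release("Here are the files in /tmp.")
```

Judging the full buffered trace, rather than only the final message, is what lets the guardrail block a benign-looking reply that follows an unsafe tool call.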

Empirical Validation / Results

Evaluation Setup

Benchmarks: R-Judge, ATBench, ATBench-Claw, ATBench-Codex.
Metrics: Accuracy, Precision, Recall, F1-score for trajectory-level safety evaluation; Accuracy for fine-grained risk diagnosis (Risk Source, Failure Mode, Real-world Harm).
Baselines: Closed-source frontier models (GPT-5.4, Gemini-3-Flash, etc.), general open-source models (Qwen3.5 series, Llama-3.1, etc.), specialized guard models (LlamaGuard, Qwen3-Guard, ShieldAgent, etc.).
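For reference, the four trajectory-level metrics can be computed from predicted and gold labels as below. Treating "unsafe" as the positive class is an assumption made for this sketch; the function name is likewise illustrative.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "unsafe") -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary safe/unsafe task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


metrics = binary_metrics(
    y_true=["unsafe", "unsafe", "safe", "safe"],
    y_pred=["unsafe", "safe", "unsafe", "safe"],
)
```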

Key Results

Trajectory-Level Safety Evaluation: AgentDoG 1.5 demonstrates strong performance, with the 4B variant achieving the best overall results among open-source and guard models.

Table 2: Performance comparison across R-Judge and ATBench (%). Of the four metrics (Accuracy, Precision, Recall, and F1-score), only Accuracy and F1-score are shown here.

Model	R-Judge Acc	R-Judge F1	ATBench Acc	ATBench F1
GPT-5.4	93.3	93.7	73.7	76.7
Gemini-3-Flash	95.2	95.3	76.4	74.9
Qwen3.5-397B-A17B	85.6	87.4	66.8	67.8
LlamaGuard4-12B	63.8	63.2	58.1	41.7
AgentDoG 1.5-4B	92.2	92.7	72.4	74.3
AgentDoG 1.5-4B-U	90.4	90.6	78.4	77.7
AgentDoG 1.5-0.8B	75.7	74.6	60.3	63.2

AgentDoG 1.5-4B outperforms much larger open-source models (e.g., Qwen3.5-397B) on ATBench and approaches closed-source frontier models.
Small variants (0.8B, 2B) show favorable efficiency-performance trade-offs, outperforming many larger guard and general models.
The unified variant AgentDoG 1.5-4B-U (trained on both coarse and fine-grained tasks) shows a "bonus effect," achieving the best ATBench trajectory-level performance among the variants.

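Fine-Grained Risk Diagnosis: AgentDoG 1.5 provides strong diagnostic capability, outperforming general-purpose and frontier models by a wide margin (55.2% average accuracy for the 4B variant versus 25.8% for GPT-5.4, the strongest baseline listed below).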

Table 3: Performance comparison on ATBench for fine-grained risk diagnosis (Accuracy %).

Model	Risk Source	Failure Mode	Real-world Harm	Avg.
GPT-5.4	33.6	13.5	30.2	25.8
Gemini-3-Flash	18.4	8.3	15.0	13.9
Qwen3.5-397B	7.7	3.6	6.8	6.0
AgentDoG 1.5-4B	75.2	27.5	62.9	55.2
AgentDoG 1.5-0.8B	65.7	18.4	44.9	43.0

AgentDoG 1.5-4B achieves the best overall performance (Avg. 55.2%), a 20.6-point improvement over AgentDoG 1.0-4B.
Compact variants (0.8B, 2B) already exceed all closed-source and general open-source baselines.

Performance Across Agentic Execution Environments: AgentDoG 1.5 generalizes robustly to Codex and OpenClaw settings.

ATBench-Codex: AgentDoG 1.5-4B achieves 80.0% accuracy.
ATBench-Claw: AgentDoG 1.5-4B achieves 84.0% accuracy.
The 0.8B variant also shows strong transferability (70.2% on ATBench-Codex, 78.4% on ATBench-Claw).

Application 1: Agentic Safety SFT & RL

SFT Data Filtering: Training with AgentDoG 1.5-filtered safety data improves safety metrics over using unfiltered data or no safety data.
- Reduces AgentHarm Harm Score from 57.49% (base) to 20.32% (filtered), increases Refusal Rate to 75.00%.
- Improves BFCL function-calling accuracy (81.12% vs. 78.69% for unfiltered).
RL Training: The joint SFT+RL approach enhances safety while recovering benign utility.
- Achieves the highest Refusal Rate (77.27%) and Safe Rate (59.32%), while maintaining a strong BFCL score (81.25%).

Application 2: Online Guardrail Effectiveness: AgentDoG 1.5 reduces the residual unsafe final-delivery rate (Attack Success Rate, ASR) in online deployment.

Table 6: Expanded Pre-Reply guardrail comparison under each benchmark’s native unsafe criterion.

Benchmark / Guardrail	ASR (%)	∆ ASR (%)
ClawSafety
w/o guardrail	56.25	–
AgentDoG 1.5-4B	18.75	-37.50
AgentDoG 1.5-0.8B	25.00	-31.25
AgentHazard
w/o guardrail	41.92	–
AgentDoG 1.5-4B	26.92	-15.00
AgentDoG 1.5-0.8B	29.23	-12.69
CIK-Bench
w/o guardrail	94.29	–
AgentDoG 1.5-4B	42.86	-51.43
AgentDoG 1.5-0.8B	85.71	-8.57

AgentDoG 1.5-4B obtains the lowest residual unsafe final-delivery rate on ClawSafety (18.75%) and CIK-Bench (42.86%).
Overhead remains practical: time to first token (TTFT) is sub-second, and time per output token (TPOT) is a few hundredths of a second.
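The ∆ ASR column is the guarded ASR minus the unguarded ASR, in percentage points. The short sketch below recomputes it for the 4B rows of Table 6; the function name and dictionary layout are illustrative assumptions.

```python
def asr_delta(asr_unguarded: float, asr_guarded: float) -> float:
    """Change in ASR (percentage points) after adding the guardrail."""
    return round(asr_guarded - asr_unguarded, 2)


# (ASR without guardrail, ASR with AgentDoG 1.5-4B), taken from Table 6.
table6_4b = {
    "ClawSafety": (56.25, 18.75),
    "AgentHazard": (41.92, 26.92),
    "CIK-Bench": (94.29, 42.86),
}

deltas = {name: asr_delta(base, guarded)
          for name, (base, guarded) in table6_4b.items()}
```

Recomputing deltas from the rounded table entries can differ from a reported value in the last digit when the reported delta was derived from unrounded rates.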

Theoretical and Practical Implications

Scalable Safety Framework: The taxonomy customization mechanism and benchmark family provide a scalable methodology for evaluating and aligning agents in new, evolving execution environments without requiring task redefinition.
Efficient Model Training: The influence-function-based data purification demonstrates that high-performance safety models can be trained with very small datasets (~1k samples), drastically reducing data collection and training costs.
Lightweight Deployment: The small model sizes (0.8B-8B) and efficient RL environment design (two orders of magnitude reduction in overhead) enable practical, low-cost deployment of safety alignment techniques in real-world systems.
Integrated Safety Alignment: The framework demonstrates that safety alignment can be effectively integrated into both the training phase (via SFT data filtering and RL reward modeling) and the deployment phase (via online guardrails), offering a comprehensive approach.
Open Release: All models and datasets are openly released,