ClawGym: A Scalable Framework for Building Effective Claw Agents
Summary (Overview)
- Framework Proposal: Introduces ClawGym, a comprehensive, data-centric framework to support the full lifecycle of developing Claw-style personal agents. These agents operate within users' computer environments (like OpenClaw) to perform multi-step workflows over local files, tools, and persistent workspace states.
- Synthetic Dataset: Constructs ClawGym-SynData, a large-scale, diverse dataset of 13.5K executable tasks, synthesized via a dual-route pipeline combining persona-driven top-down and skill-grounded bottom-up generation, paired with realistic mock workspaces and hybrid verification.
- Agent Training: Develops ClawGym-Agents through supervised fine-tuning (SFT) on 24.5K high-quality interaction trajectories collected via black-box rollouts on OpenClaw. Further explores a lightweight, sandbox-parallel reinforcement learning (RL) pipeline.
- Evaluation Benchmark: Establishes ClawGym-Bench, a reliable benchmark of 200 rigorously verified task instances for evaluating Claw agents, calibrated through automated difficulty filtering and human-LLM review.
- Empirical Results: Demonstrates that SFT on synthesized data yields substantial performance gains. For example, Qwen3-8B improves by 38.90% (relative) on PinchBench and by 43.46% (relative) on ClawGym-Bench. ClawGym-30A3B outperforms the much larger Qwen3-235B-A23B on ClawGym-Bench, showing that data quality can partially compensate for scale.
Introduction and Theoretical Foundation
Recent advancements in autonomous agent frameworks like OpenClaw have reshaped AI integration with daily digital life. Unlike traditional chatbots, these Claw-style agents are deployed directly within users' computer environments, where they can invoke tools, manage local file systems, and interact with web-enabled services to tackle real-world, multi-step tasks.
However, autonomous agents still struggle to reliably handle everyday digital tasks. This gap stems from the distinctive nature of Claw-style tasks:
- Grounded in Local Workspace States: Require reasoning over existing artifacts, executing tools, and updating the workspace through multi-step interactions.
- Operate Through Opaque Interfaces: Must handle ambiguous instructions, unexpected states, tool errors, and long-horizon dependencies across sessions.
- Lack of Large-Scale Data: Compared to static text-based or structured agent benchmarks, synthesizing large-scale, verifiable Claw-style task data is challenging due to the need for personalized requirements, long-horizon verifiability, and realistic mock workspaces.
To address these challenges, ClawGym is proposed as a systematic framework that unifies task synthesis, agent training, and performance evaluation for developing Claw-style personal agents.
Methodology
1. Task Definition
A Claw-style agent task is an environment-grounded instruction-execution problem. Formally, a task instance is denoted as $\tau = (I, s_0, \mathcal{A}, T, V)$, where:
- $I$ is the user instruction.
- $s_0$ is the initial environment state (workspace).
- $\mathcal{A}$ is the set of available actions (tools).
- $T$ describes how tool execution updates the environment state.
- $V$ is the task-specific verifier.
The agent produces a trajectory $(a_1, o_1, \dots, a_n, o_n)$, where $a_i$ are action segments and $o_i$ are observation segments. After $n$ executable actions, the final state is $s_n = T(s_{n-1}, a_n)$, and success is determined by the verifier score $r = V(I, s_n) \in [0, 1]$.
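A minimal Python sketch of this formalism follows; the class, field, and function names (`TaskInstance`, `run_episode`, `agent.act`) are illustrative assumptions, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Illustrative types: a state is a snapshot of the workspace
# (file paths mapped to contents); an action is a tool call.
State = Dict[str, str]
Action = Dict[str, str]

@dataclass
class TaskInstance:
    """One Claw-style task: tau = (I, s0, A, T, V)."""
    instruction: str                              # I: user instruction
    initial_state: State                          # s0: mock workspace
    tools: List[str]                              # A: available tool names
    transition: Callable[[State, Action], State]  # T: state update
    verifier: Callable[[str, State], float]       # V: score in [0, 1]

def run_episode(task: TaskInstance, agent, max_steps: int = 50) -> float:
    """Roll out actions a_1..a_n, then score the final state s_n."""
    state = dict(task.initial_state)
    for _ in range(max_steps):
        action: Optional[Action] = agent.act(task.instruction, state)
        if action is None:          # agent signals task completion
            break
        state = task.transition(state, action)
    return task.verifier(task.instruction, state)  # r = V(I, s_n)
```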
2. ClawGym-SynData: Task Synthesis Pipeline
The synthesis framework has four main stages (see Figure 1).
2.1 Task Generation
- Persona-Driven Top-Down Synthesis: Starts from high-level user contexts.
  - Seed Formulation: Combines a user persona $p$, a scenario category $c$, and a set of basic operations $O$ into a seed $d = (p, c, O)$. (Equation 5)
  - Instruction Generation: An LLM $G$ (e.g., GPT-5) generates the concrete user instruction $I = G(d)$. (Equation 6)
- Skill-Grounded Bottom-Up Synthesis: Starts from concrete OpenClaw capabilities.
  - Skill Annotation & Filtering: Raw skills from ClawHub are annotated by an LLM and filtered to retain the set of synthesizable skills $\mathcal{K}$. (Equation 7)
  - Skill-Composition Task Construction: Tasks are constructed from one primary skill $k_p \in \mathcal{K}$ and optional supporting skills $K_s \subseteq \mathcal{K}$; the LLM generates the instruction $I = G(k_p, K_s)$. (Equation 8) Both routes are sketched below.
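A minimal sketch of the two generation routes, assuming a generic `llm(prompt)` completion helper; the prompt wording and function names are hypothetical, not the paper's templates.

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to the generator LLM."""
    raise NotImplementedError

def topdown_instruction(persona: str, scenario: str, operations: list[str]) -> str:
    """Persona-driven route: seed d = (p, c, O) -> instruction I (Eqs. 5-6)."""
    seed = (f"Persona: {persona}\nScenario: {scenario}\n"
            f"Operations: {', '.join(operations)}")
    return llm(f"Write a concrete, executable user task from this seed:\n{seed}")

def bottomup_instruction(primary: str, supporting: list[str]) -> str:
    """Skill-grounded route: compose k_p with optional K_s into I (Eq. 8)."""
    skills = (f"Primary skill: {primary}\n"
              f"Supporting skills: {', '.join(supporting) or 'none'}")
    return llm(f"Write a user task exercising these OpenClaw skills:\n{skills}")
```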
2.2 Resource Preparation
For each instruction $I$, a resource specification $R = \{(f_i, t_i, c_i)\}$ is generated (Equation 9), where $f_i$ is the file path, $t_i$ the file type, and $c_i$ the content specification. An LLM-based generator materializes $R$ into concrete mock files (txt, markdown, json, csv, yaml) placed in the workspace, as sketched below.
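A minimal sketch of the materialization step; content expansion is stubbed here, whereas the paper uses an LLM-based generator, and all names are illustrative.

```python
import json
from pathlib import Path

def materialize(workspace: Path, resources: list[tuple[str, str, str]]) -> None:
    """Write each (file_path, file_type, content_spec) triple as a mock file.

    Content generation is stubbed with the spec text itself; the actual
    pipeline expands each spec into realistic content with an LLM.
    """
    for rel_path, file_type, content_spec in resources:
        target = workspace / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        if file_type == "json":
            target.write_text(json.dumps({"spec": content_spec}, indent=2))
        else:  # txt, markdown, csv, yaml: plain-text stand-in
            target.write_text(content_spec)
```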
2.3 Verification Design
A hybrid verification scheme is designed for each task.
- Code-Based Verification: A set of atomic verification points $P = \{p_1, \dots, p_m\}$ (Equation 10), each returning a binary score $v_i \in \{0, 1\}$ (Equation 11). The code-based score is the mean $S_{\text{code}} = \frac{1}{m}\sum_{i=1}^{m} v_i$ (Equation 12).
- Rubric-Based Verification: A set of rubric rules $Q = \{q_1, \dots, q_k\}$ (Equation 13) assigns ordinal scores $u_j$, normalized to $[0, 1]$ (Equation 14). The rubric-based score is the mean $S_{\text{rubric}} = \frac{1}{k}\sum_{j=1}^{k} u_j$ (Equation 15).
- Score Aggregation: The final task score combines the two components as $S = \lambda S_{\text{code}} + (1 - \lambda) S_{\text{rubric}}$ (Equation 17). The implementation weights the code-based component more heavily ($\lambda > 0.5$), prioritizing objective checks; see the sketch after this list.
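A minimal sketch of the aggregation; the code weight is an illustrative value, since the paper's exact $\lambda$ is not given here.

```python
def hybrid_score(code_checks: list[int], rubric_scores: list[float],
                 lam: float = 0.7) -> float:
    """S = lam * S_code + (1 - lam) * S_rubric (Equation 17).

    code_checks:   binary outcomes v_i in {0, 1} of atomic verification points.
    rubric_scores: ordinal rubric scores u_j, normalized to [0, 1].
    lam:           code weight; lam > 0.5 prioritizes objective checks
                   (0.7 is an illustrative value, not the reported setting).
    """
    s_code = sum(code_checks) / len(code_checks) if code_checks else 0.0
    if not rubric_scores:  # pure code-based verification
        return s_code
    s_rubric = sum(rubric_scores) / len(rubric_scores)
    return lam * s_code + (1 - lam) * s_rubric
```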
2.4 Automated Quality Assessment
Tasks and verifiers are filtered based on:
- Task Quality: Assessed for novelty (cosine-similarity filtering against existing tasks; see the sketch after this list), plausibility (LLM judge), and difficulty (LLM judge).
- Verification Quality: For code checkers, checks executability and task-checker alignment. For rubrics, ensures they complement rather than duplicate code checks.
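A minimal sketch of the novelty check, assuming an `embed` helper and an illustrative similarity threshold; neither is specified by the paper.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a sentence-embedding model."""
    raise NotImplementedError

def is_novel(candidate: str, accepted: list[str], max_sim: float = 0.85) -> bool:
    """Reject a candidate task that is too similar to any accepted task.

    The 0.85 cosine-similarity threshold is an assumption for illustration.
    """
    c = embed(candidate)
    for prev in accepted:
        p = embed(prev)
        sim = float(c @ p) / (float(np.linalg.norm(c)) * float(np.linalg.norm(p)))
        if sim > max_sim:
            return False
    return True
```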
3. ClawGym-Agents: Agent Training
- Black-Box Rollout: High-fidelity interaction trajectories are collected by executing synthesized tasks at scale through the OpenClaw harness in Docker environments, using teacher models (MiniMax-M2.5, GLM-5.1).
- Trajectory Aggregation & Selection: Raw logs are aggregated into coherent multi-turn sequences. Trajectories are selected if their final verifier score exceeds a reward threshold (optimal found at 0.5). The final dataset contains 24.5K trajectories.
- Supervised Fine-Tuning (SFT): Multi-turn SFT is performed on Qwen3-series models (4B, 8B, 30B-A3B). Context length is extended where needed (e.g., via YaRN to 64K for the 8B model). The loss is masked so training focuses on model-generated actions rather than environment feedback (see the masking sketch after this list).
- Reinforcement Learning (RL): A lightweight sandbox-parallel pipeline is explored using GRPO, showing performance gains even from vanilla and SFT-initialized models.
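A minimal sketch of action-only loss masking for multi-turn SFT; the role labels and helper name are illustrative assumptions.

```python
import torch

def action_loss_mask(token_roles: list[str]) -> torch.Tensor:
    """Per-token loss mask that trains only on model-generated actions.

    token_roles labels each position as "assistant" (agent actions and
    reasoning) or "user"/"env" (instructions, tool outputs). Only
    "assistant" tokens contribute to the SFT loss.
    """
    return torch.tensor([1.0 if r == "assistant" else 0.0 for r in token_roles])

# Usage with per-token cross-entropy losses `token_losses`:
#   mask = action_loss_mask(roles)
#   loss = (token_losses * mask).sum() / mask.sum().clamp(min=1.0)
```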
4. ClawGym-Bench: Benchmark Construction
A reliable evaluation benchmark of 200 tasks is constructed via a rigorous pipeline:
- Difficulty-Aware Filtering: From the candidate pool of synthesized tasks, a task is retained only if it satisfies rollout-based criteria: rollouts with a strong LLM agent confirm it is solvable, while rollouts with a small LLM agent confirm it is not trivially solved (see the sketch after this list).
- LLM-Assisted Human Review: A frontier LLM (GPT-5.4) performs diagnostic review; human reviewers make final decisions.
- Benchmark Composition: The final 200 tasks cover 6 categories (see Table 4). 156 use pure code-based verification; 44 use hybrid verification.
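A minimal sketch of the retention rule under the assumption above; the thresholds and exact criteria are hypothetical.

```python
def retain_task(strong_scores: list[float], small_scores: list[float],
                solve_at: float = 1.0, trivial_at: float = 1.0) -> bool:
    """Keep a candidate benchmark task only if it is solvable but non-trivial.

    strong_scores / small_scores are verifier scores from repeated rollouts
    of a strong and a small LLM agent. The criteria below (strong agent
    solves it at least once; small agent does not always solve it) are
    assumptions; the paper's concrete rules may differ.
    """
    solvable = max(strong_scores) >= solve_at   # verifiable solvability
    trivial = min(small_scores) >= trivial_at   # even a small agent always solves it
    return solvable and not trivial
```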
Empirical Validation / Results
1. Synthesized Data Analysis
- Task Distributions: Persona-driven synthesis covers diverse scenarios and atomic actions (Figure 2). Skill-grounded synthesis is anchored in a broad capability space (Table 1).
- Human Quality Analysis: A sample of 50 training tasks received positive ratings, with an overall average score of 4.06/5 (Table 2).
2. Main Evaluation Results
Models are evaluated on ClawGym-Bench and the external PinchBench. Key results are shown in Table 6.
Table 6: Performance Comparison on ClawGym-Bench and PinchBench (Excerpt)
| Model | PinchBench | ClawGym-Bench (Avg) |
|---|---|---|
| Proprietary Frontier Models | | |
| Claude-4.7-Opus | 79.40 | 77.81 |
| GPT-5.4 | 68.30 | 73.49 |
| Open-Weight Frontier Models | | |
| GLM-5.1 | 76.40 | 71.12 |
| Qwen3.5-Plus | 78.70 | 70.35 |
| Compact Open-Weight Models (Baselines) | | |
| Qwen3-8B | 54.50 | 35.02 |
| Qwen3-30B-A3B | 55.60 | 45.11 |
| Qwen3-235B-A23B | 60.60 | 54.48 |
| ClawGym-Agents (SFT on SynData) | | |
| ClawGym-8B | 75.70 | 50.24 |
| ClawGym-30A3B | 86.00 | 56.82 |
Key Findings:
- Effectiveness of Synthesized Data: SFT on ClawGym-SynData yields substantial gains. ClawGym-30A3B (56.82) outperforms the much larger Qwen3-235B-A23B (54.48).
- Generalization: Strong performance on the external PinchBench shows the data teaches transferable agentic principles.
- Synergy of Synthesis Strategies: Models trained on Mixed Synthesis (combining both pipelines) outperform those trained on either alone (Table 7).
- Training Dynamics: Performance peaks around epoch 3, then slightly declines, indicating an optimal training scale (Figure: Effect of training trajectory scale).
- RL Effectiveness: The lightweight RL pipeline provides consistent gains from both vanilla and SFT-initialized models (Figure 3).
3. Behavioral Analysis
Analysis of agent trajectories reveals key capability dimensions for Claw agents:
- Tool-Use Appropriateness: Effective agents compose tools into coherent discovery-inspection-computation-verification pipelines, not just invoke them (Figure 6).
- Long-Horizon Execution Robustness: Robust agents interpret feedback, recover from disruptions, and maintain coherent progress without losing task context (Figure 7).
- Fine-Grained Instruction Following: Reliable agents preserve detailed constraints (e.g., filtering rules) across generated artifacts and derived outputs (Figure 8).
Theoretical and Practical Implications
- Data-Centric Agent Development: ClawGym demonstrates the feasibility and impact of a systematic, data-centric approach for developing environment-grounded agents, moving beyond reliance solely on model scale or algorithm innovation.
- Scalable Synthesis Framework: The dual-route synthesis pipeline provides a blueprint for generating large-scale, verifiable training data for complex, interactive agent domains, addressing a critical data-scarcity bottleneck.
- Bridging the Capability Gap: High-quality, domain-specific training data can significantly elevate the performance of smaller, open-weight models, making capable personal agents more accessible.
- Reliable Evaluation: ClawGym-Bench establishes a rigorous protocol for evaluating Claw agents, emphasizing evaluation stability and verifiable solvability, which is crucial for meaningful progress tracking.
Conclusion
ClawGym presents a scalable, end-to-end framework for developing Claw-style personal agents. Its core contributions are:
- ClawGym-SynData: A large-scale synthesized dataset enabled by a novel dual-route pipeline.
- ClawGym-Agents: A family of agents trained via SFT on high-fidelity trajectories, showing substantial performance improvements.
- ClawGym-Bench: A reliable, human-verified benchmark for evaluation.
Empirical results validate the framework's effectiveness, with trained agents achieving competitive performance against larger models and proprietary systems. Behavioral analyses offer insights into the key capabilities—tool-use appropriateness, execution robustness, and fine-grained instruction following—that future work should target to build more reliable environment-grounded agents.