# ClawGym: A Scalable Framework for Building Effective Claw Agents

> ClawGym is a scalable framework that synthesizes realistic computer tasks and trains agents on them; fine-tuning Qwen3-8B on its data yields relative gains of 38.90% on PinchBench and 43.46% on ClawGym-Bench.

- **Source:** [arXiv](https://arxiv.org/abs/2604.26904)
- **Published:** 2026-05-01
- **Permalink:** https://picx.dev/p/D2DTJA
- **Whiteboard:** https://picx.dev/p/D2DTJA/image

## Summary (Overview)
*   **Framework Proposal**: Introduces **ClawGym**, a comprehensive, data-centric framework to support the full lifecycle of developing **Claw-style personal agents**. These agents operate within users' computer environments (like OpenClaw) to perform multi-step workflows over local files, tools, and persistent workspace states.
*   **Synthetic Dataset**: Constructs **ClawGym-SynData**, a large-scale, diverse dataset of **13.5K** executable tasks, synthesized via a dual-route pipeline combining **persona-driven top-down** and **skill-grounded bottom-up** generation, paired with realistic mock workspaces and hybrid verification.
*   **Agent Training**: Develops **ClawGym-Agents** through supervised fine-tuning (SFT) on **24.5K** high-quality interaction trajectories collected via black-box rollouts on OpenClaw. Further explores a lightweight, sandbox-parallel reinforcement learning (RL) pipeline.
*   **Evaluation Benchmark**: Establishes **ClawGym-Bench**, a reliable benchmark of **200** rigorously verified task instances for evaluating Claw agents, calibrated through automated difficulty filtering and human-LLM review.
*   **Empirical Results**: Demonstrates that SFT on synthesized data yields substantial performance gains. For example, **Qwen3-8B** improves by a relative **38.90%** on PinchBench (54.50 to 75.70) and **43.46%** on ClawGym-Bench (35.02 to 50.24). **ClawGym-30A3B** outperforms the much larger **Qwen3-235B-A23B** on ClawGym-Bench, showing that data quality can partially compensate for scale.

## Introduction and Theoretical Foundation
Recent advancements in autonomous agent frameworks like **OpenClaw** have reshaped AI integration with daily digital life. Unlike traditional chatbots, these **Claw-style agents** are deployed directly within users' computer environments, where they can invoke tools, manage local file systems, and interact with web-enabled services to tackle real-world, multi-step tasks.

However, autonomous agents still struggle to reliably handle everyday digital tasks. This gap stems from the distinctive nature of Claw-style tasks:
*   **Grounded in Local Workspace States**: Require reasoning over existing artifacts, executing tools, and updating the workspace through multi-step interactions.
*   **Operate Through Opaque Interfaces**: Must handle ambiguous instructions, unexpected states, tool errors, and long-horizon dependencies across sessions.
*   **Lack of Large-Scale Data**: Compared to static text-based or structured agent benchmarks, synthesizing large-scale, verifiable Claw-style task data is challenging due to the need for personalized requirements, long-horizon verifiability, and realistic mock workspaces.

Recognizing these challenges, **ClawGym** is proposed as a systematic framework to unify **task synthesis**, **agent training**, and **performance evaluation** for developing Claw-style personal agents.

## Methodology

### 1. Task Definition
A Claw-style agent task is an environment-grounded instruction-execution problem. Formally, a task instance is denoted as:
$$
\tau = \langle p, s_0, \mathcal{A}, \mathcal{F}, \mathcal{V}_\tau \rangle, \tag{1}
$$
where:
*   $p$ is the user instruction.
*   $s_0$ is the initial environment state (workspace).
*   $\mathcal{A}$ is the set of available actions (tools).
*   $\mathcal{F}$ describes how tool execution updates the environment state.
*   $\mathcal{V}_\tau$ is the task-specific verifier.

The agent produces a trajectory $\xi = (A_1, O_1, A_2, O_2, \ldots, A_K, O_K)$, where $A_k$ are action segments and $O_k$ are observation segments. Each executable action $a_t$ updates the environment via $s_t = \mathcal{F}(s_{t-1}, a_t)$ for $t = 1, \ldots, H$, so after $H$ executable actions the final state is $s_H$. Success is determined by the verifier score $v = \mathcal{V}_\tau(s_0, s_H, y)$ with $v \in [0,1]$, where $y$ presumably denotes the agent's final response.
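
The task tuple and the state-transition loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the workspace is modelled as a hypothetical `dict` from path to content, the transition function $\mathcal{F}$ is folded into each tool, `y` is assumed to be the agent's final response, and `write_file` and `scripted_policy` are made-up stand-ins for a real tool and a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]  # hypothetical workspace model: path -> file content


@dataclass
class Task:
    instruction: str                                   # p
    initial_state: State                               # s_0
    actions: Dict[str, Callable[[State, str], State]]  # A, with F folded into each tool
    verifier: Callable[[State, State, str], float]     # V_tau(s_0, s_H, y) in [0, 1]


def rollout(task: Task, policy, max_steps: int = 8) -> Tuple[State, str]:
    """Apply s_t = F(s_{t-1}, a_t) until the policy finishes or the budget runs out."""
    state = dict(task.initial_state)      # copy so s_0 stays untouched
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        tool, arg = policy(task.instruction, history)
        if tool == "finish":
            return state, arg             # arg plays the role of the final response y
        state = task.actions[tool](state, arg)
        history.append((tool, arg))
    return state, ""


def write_file(state: State, arg: str) -> State:
    """Toy tool: 'path=content' writes one file and returns the new state."""
    path, _, content = arg.partition("=")
    new_state = dict(state)
    new_state[path] = content
    return new_state


def scripted_policy(instruction: str, history: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Stand-in for a model: write one file, then stop."""
    if not history:
        return "write", "notes.txt=done"
    return "finish", "wrote notes.txt"


task = Task(
    instruction="Create notes.txt containing 'done'.",
    initial_state={},
    actions={"write": write_file},
    verifier=lambda s0, sH, y: 1.0 if sH.get("notes.txt") == "done" else 0.0,
)
final_state, response = rollout(task, scripted_policy)
score = task.verifier(task.initial_state, final_state, response)
print(score)  # 1.0
```

Keeping the verifier as a function of the initial state, the final state, and the response mirrors the signature above and lets the same rollout loop serve both data collection and evaluation.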

### 2. ClawGym-SynData: Task Synthesis Pipeline
The synthesis framework has four main stages (see Figure 1).

#### 2.1 Task Generation
*   **Persona-Driven Top-Down Synthesis**: Starts from high-level user contexts.
    1.  **Seed Formulation**: Combines a user persona $u$, a scenario category $c$, and a set of basic operations $\mathcal{G} = \{g_1, g_2, ..., g_n\}$ into a seed $z = (u, c, \mathcal{G})$. (Equation 5)
    2.  **Instruction Generation**: An LLM ($M_{task}$, e.g., GPT-5) generates the concrete user instruction: $p = M_{task}(\pi(z))$. (Equation 6)
*   **Skill-Grounded Bottom-Up Synthesis**: Starts from concrete OpenClaw capabilities.
    1.  **Skill Annotation & Filtering**: Raw skills from ClawHub are annotated by an LLM ($M_{ann}$), and filtered to retain synthesizable skills $\mathcal{K}^+$. (Equation 7)
    2.  **Skill-Composition Task Construction**: Tasks are constructed from one primary skill $k_{main}$ and optional supporting skills $\mathcal{K}_{support}$. An LLM generates the instruction: $p = M_{task}(\pi(k_{main}, \mathcal{K}_{support}))$. (Equation 8)
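
The two routes differ mainly in what is fed to the prompt template $\pi$. The sketch below shows one way the seed $z = (u, c, \mathcal{G})$ and the skill pair $(k_{main}, \mathcal{K}_{support})$ might be rendered into prompts. The prompt wording, the persona, and the skill names are all hypothetical, and the call to $M_{task}$ is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Seed:
    persona: str                 # u
    scenario: str                # c
    operations: Tuple[str, ...]  # G = {g_1, ..., g_n}


def render_top_down(seed: Seed) -> str:
    """pi(z): persona-driven prompt (wording is illustrative only)."""
    ops = ", ".join(seed.operations)
    return (
        f"User persona: {seed.persona}\n"
        f"Scenario category: {seed.scenario}\n"
        f"Required operations: {ops}\n"
        "Write one concrete instruction that this user would give an agent "
        "working inside their local workspace."
    )


def render_bottom_up(main_skill: str, support_skills: List[str]) -> str:
    """pi(k_main, K_support): skill-grounded prompt (wording is illustrative only)."""
    support = ", ".join(support_skills) if support_skills else "none"
    return (
        f"Primary skill: {main_skill}\n"
        f"Supporting skills: {support}\n"
        "Write one concrete instruction whose solution needs the primary skill."
    )


seed = Seed(
    persona="freelance accountant",
    scenario="file organization",
    operations=("read_csv", "rename_files"),
)
top_down_prompt = render_top_down(seed)
bottom_up_prompt = render_bottom_up("csv-analysis", ["chart-export"])
print(top_down_prompt)
print(bottom_up_prompt)
```

Either prompt would then be passed to the task-generation model to obtain the instruction $p$.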

#### 2.2 Resource Preparation
For each instruction $p$, a resource specification $f = \{(l_i, t_i, d_i)\}_{i=1}^m$ (Equation 9) is generated, where $l_i$ is the file path, $t_i$ is the file type, and $d_i$ is the content specification. An LLM-based generator materializes $f$ into concrete mock files (txt, markdown, json, csv, yaml) placed in the workspace.
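
A rough sketch of the materialization step follows. It assumes the content for each file has already been generated, so the third element of each triple holds literal content rather than a content specification, and the file names are invented for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical resource specification f = {(l_i, t_i, d_i)}; the third field
# stands in for content that an LLM-based generator would normally produce.
spec = [
    ("data/sales.csv", "csv", "month,revenue\n2026-01,1200\n2026-02,1350\n"),
    ("config/settings.json", "json", {"currency": "USD"}),
    ("README.md", "markdown", "# Sales workspace\n"),
]


def materialize(spec, root: Path):
    """Write every (path, type, content) triple under the workspace root."""
    written = []
    for rel_path, file_type, content in spec:
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # JSON content is serialized; other types are written as plain text.
        text = json.dumps(content, indent=2) if file_type == "json" else content
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


workspace = Path(tempfile.mkdtemp(prefix="mock_ws_"))
files = materialize(spec, workspace)
print(sorted(p.relative_to(workspace).as_posix() for p in files))
```

Writing into a fresh temporary directory keeps each mock workspace isolated, which matters when many tasks are rolled out in parallel.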

#### 2.3 Verification Design
A hybrid verification scheme is designed for each task.
*   **Code-Based Verification**: A set of atomic verification points $\mathcal{C} = \{c_1, c_2, ..., c_m\}$ (Equation 10) returns binary scores $b_i = \mathbb{I}[c_i(p, s_0, s_H, y) = true]$ (Equation 11). The code-based score is $s_{code} = \frac{1}{m} \sum_{i=1}^{m} b_i$ (Equation 12).
*   **Rubric-Based Verification**: A set of rubric rules $\mathcal{R} = \{r_1, r_2, ..., r_n\}$ (Equation 13) assigns ordinal scores $q_j \in \{0, 0.25, 0.5, 0.75, 1.0\}$ (Equation 14). The rubric-based score is $s_{rubric} = \frac{\sum_{j=1}^{n} w_j q_j}{\sum_{j=1}^{n} w_j}$ (Equation 15).
*   **Score Aggregation**: The final task score combines the two components: $s_{task} = \lambda s_{code} + (1-\lambda)s_{rubric}$ (Equation 17). The implementation uses $\lambda = 0.7$, prioritizing objective checks.
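
The three scoring formulas above (Equations 12, 15, and 17) translate directly into code. The check results, rubric scores, and weights in the example are invented for illustration; only $\lambda = 0.7$ comes from the text.

```python
from typing import Sequence


def code_score(checks: Sequence[bool]) -> float:
    """s_code: fraction of atomic verification points that pass."""
    return sum(1 for passed in checks if passed) / len(checks)


def rubric_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """s_rubric: weighted mean of ordinal rubric scores q_j."""
    return sum(w * q for w, q in zip(weights, scores)) / sum(weights)


def task_score(s_code: float, s_rubric: float, lam: float = 0.7) -> float:
    """s_task = lam * s_code + (1 - lam) * s_rubric."""
    return lam * s_code + (1.0 - lam) * s_rubric


s_code = code_score([True, True, False, True])               # 3 of 4 checks pass
s_rubric = rubric_score(scores=[1.0, 0.5], weights=[2, 1])   # (2*1.0 + 1*0.5) / 3
s_task = task_score(s_code, s_rubric)
print(round(s_code, 4), round(s_rubric, 4), round(s_task, 4))  # 0.75 0.8333 0.775
```

With $\lambda = 0.7$, a task that fails one of four code checks cannot score above 0.825 even with a perfect rubric score, which is what prioritizing objective checks means in practice.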

#### 2.4 Automated Quality Assessment
Tasks and verifiers are filtered based on:
*   **Task Quality**: Assessed for **novelty** (cosine similarity), **plausibility** (LLM judge), and **difficulty** (LLM judge).
*   **Verification Quality**: For code checkers, checks **executability** and **task-checker alignment**. For rubrics, ensures they **complement** rather than duplicate code checks.
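
A novelty filter based on cosine similarity might look like the sketch below. It is a simplified stand-in: a bag-of-words count vector replaces whatever embedding model the pipeline actually uses, and the 0.8 threshold is a hypothetical value, not one reported in the source.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (assumption, not the real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_novel(instructions: List[str], threshold: float = 0.8) -> List[str]:
    """Keep an instruction only if it is not too similar to any already-kept one."""
    kept: List[str] = []
    for text in instructions:
        vec = embed(text)
        if all(cosine(vec, embed(other)) < threshold for other in kept):
            kept.append(text)
    return kept


candidates = [
    "Summarize the sales report in report.md",
    "Summarize the sales report in report.md please",
    "Rename all png files by date",
]
novel = filter_novel(candidates)
print(novel)
```

The second candidate is dropped as a near-duplicate of the first, while the third shares no tokens with it and is kept.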

### 3. ClawGym-Agents: Agent Training
1.  **Black-Box Rollout**: High-fidelity interaction trajectories are collected by executing synthesized tasks at scale through the **OpenClaw harness** in Docker environments, using teacher models (MiniMax-M2.5, GLM-5.1).
2.  **Trajectory Aggregation & Selection**: Raw logs are aggregated into coherent multi-turn sequences. A trajectory is kept only if its final verifier score exceeds a **reward threshold**; a threshold of **0.5** was found to work best. The final dataset contains **24.5K** trajectories.
3.  **Supervised Fine-Tuning (SFT)**: Multi-turn SFT is performed on Qwen3-series models (4B, 8B, 30B-A3B). Context length is extended where needed (e.g., to 64K for the 8B model using YaRN). The loss is masked so that it covers only model-generated actions, not environment feedback.
4.  **Reinforcement Learning (RL)**: A lightweight **sandbox-parallel pipeline** is explored using GRPO, yielding performance gains when starting from either vanilla or SFT-initialized models.
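
Trajectory selection and loss masking (steps 2 and 3) can be sketched as follows. The trajectory records, role names, and string "tokens" are hypothetical placeholders; a real pipeline would operate on token ids from the model's tokenizer.

```python
from typing import Dict, List, Sequence, Tuple


def select_trajectories(trajectories: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Keep trajectories whose final verifier score strictly exceeds the threshold."""
    return [t for t in trajectories if t["score"] > threshold]


def build_loss_mask(turns: Sequence[Tuple[str, Sequence[str]]]) -> List[int]:
    """Return 1 for tokens the model generated and 0 for everything else."""
    mask: List[int] = []
    for role, tokens in turns:
        keep = 1 if role == "assistant" else 0
        mask.extend([keep] * len(tokens))
    return mask


trajectories = [
    {"id": "t1", "score": 0.9},
    {"id": "t2", "score": 0.5},   # equal to the threshold, so not kept
    {"id": "t3", "score": 0.2},
]
selected = select_trajectories(trajectories)

turns = [
    ("user", ["Sort", "the", "files"]),
    ("assistant", ["ls", "-la"]),
    ("tool", ["a.txt", "b.txt"]),      # environment feedback: excluded from the loss
    ("assistant", ["done"]),
]
mask = build_loss_mask(turns)
print([t["id"] for t in selected])  # ['t1']
print(mask)                         # [0, 0, 0, 1, 1, 0, 0, 1]
```

In PyTorch-based training code, masked positions are commonly given the label `-100`, the default `ignore_index` of the cross-entropy loss, so they contribute nothing to the gradient.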

### 4. ClawGym-Bench: Benchmark Construction
A reliable evaluation benchmark of **200** tasks is constructed via a rigorous pipeline:
1.  **Difficulty-Aware Filtering**: From the candidate pool $\mathcal{D}_{cand}$, tasks are retained only if they satisfy rollout-based criteria (using strong and small LLM agents):
    $$
    \begin{cases}
    \bar{s}_{strong}(\tau) \ge 0.2, \\
    \bar{s}_{small}(\tau) \le 0.6, \\
    \bar{s}_{strong}(\tau) > \bar{s}_{small}(\tau).
    \end{cases} \tag{18}
    $$
2.  **LLM-Assisted Human Review**: A frontier LLM (GPT-5.4) performs diagnostic review; human reviewers make final decisions.
3.  **Benchmark Composition**: The final 200 tasks cover 6 categories (see Table 4). **156** use pure code-based verification; **44** use hybrid verification.
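
The three conditions in Equation 18 amount to a simple predicate over rollout scores. The sketch below assumes that $\bar{s}$ is the mean verifier score over several rollouts of the same task; the example scores are invented.

```python
from statistics import mean
from typing import Sequence


def keep_task(strong_scores: Sequence[float], small_scores: Sequence[float]) -> bool:
    """Apply the three rollout-based criteria of Equation 18."""
    s_strong = mean(strong_scores)
    s_small = mean(small_scores)
    return (
        s_strong >= 0.2          # solvable: the strong agent makes real progress
        and s_small <= 0.6       # not trivial: the small agent does not saturate
        and s_strong > s_small   # discriminative: stronger agents score higher
    )


print(keep_task([0.9, 0.7], [0.3, 0.5]))  # True
print(keep_task([0.1, 0.1], [0.0, 0.0]))  # False: too hard even for the strong agent
print(keep_task([1.0, 1.0], [0.9, 0.9]))  # False: too easy for the small agent
print(keep_task([0.3, 0.3], [0.5, 0.5]))  # False: the small agent outscores the strong one
```

Each condition removes a different failure mode, so a task survives only if it is solvable, non-trivial, and able to separate agents of different strength.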

## Empirical Validation / Results

### 1. Synthesized Data Analysis
*   **Task Distributions**: Persona-driven synthesis covers diverse scenarios and atomic actions (Figure 2). Skill-grounded synthesis is anchored in a broad capability space (Table 1).
*   **Human Quality Analysis**: A sample of 50 training tasks received positive ratings, with an overall average score of **4.06/5** (Table 2).

### 2. Main Evaluation Results
Models are evaluated on **ClawGym-Bench** and the external **PinchBench**. Key results are shown in Table 6.

**Table 6: Performance Comparison on ClawGym-Bench and PinchBench (Excerpt)**
| Model | PinchBench | ClawGym-Bench (Avg) |
| :--- | :--- | :--- |
| **Proprietary Frontier Models** | | |
| Claude-4.7-Opus | 79.40 | **77.81** |
| GPT-5.4 | 68.30 | 73.49 |
| **Open-Weight Frontier Models** | | |
| GLM-5.1 | 76.40 | 71.12 |
| Qwen3.5-Plus | 78.70 | 70.35 |
| **Compact Open-Weight Models (Baselines)** | | |
| Qwen3-8B | 54.50 | 35.02 |
| Qwen3-30A3B | 55.60 | 45.11 |
| Qwen3-235A23B | 60.60 | 54.48 |
| **ClawGym-Agents (SFT on SynData)** | | |
| **ClawGym-8B** | **75.70** | **50.24** |
| **ClawGym-30A3B** | **86.00** | **56.82** |

**Key Findings:**
*   **Effectiveness of Synthesized Data**: SFT on ClawGym-SynData yields substantial gains. ClawGym-30A3B (**56.82**) outperforms the much larger Qwen3-235A23B (**54.48**).
*   **Generalization**: Strong performance on the external PinchBench shows the data teaches transferable agentic principles.
*   **Synergy of Synthesis Strategies**: Models trained on **Mixed Synthesis** (combining both pipelines) outperform those trained on either alone (Table 7).
*   **Training Dynamics**: Performance peaks around epoch 3 and then declines slightly, suggesting that further training beyond this point does not help (Figure: Effect of training trajectory scale).
*   **RL Effectiveness**: The lightweight RL pipeline provides consistent gains from both vanilla and SFT-initialized models (Figure 3).
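
The headline gains for Qwen3-8B are relative improvements over the baseline, and they can be recomputed from the scores in Table 6. The helper name `relative_gain` is ours; the scores are copied from the table.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100.0


# (baseline, fine-tuned) scores taken from Table 6
pinch_8b = relative_gain(54.50, 75.70)     # Qwen3-8B -> ClawGym-8B on PinchBench
bench_8b = relative_gain(35.02, 50.24)     # Qwen3-8B -> ClawGym-8B on ClawGym-Bench
pinch_30b = relative_gain(55.60, 86.00)    # Qwen3-30A3B -> ClawGym-30A3B on PinchBench
bench_30b = relative_gain(45.11, 56.82)    # Qwen3-30A3B -> ClawGym-30A3B on ClawGym-Bench

print(f"{pinch_8b:.2f} {bench_8b:.2f}")    # 38.90 43.46
print(f"{pinch_30b:.2f} {bench_30b:.2f}")  # 54.68 25.96
```

In absolute terms the 8B model gains 21.20 points on PinchBench and 15.22 points on ClawGym-Bench, so the percentage figures should not be read as point differences.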

### 3. Behavioral Analysis
Analysis of agent trajectories reveals key capability dimensions for Claw agents:
*   **Tool-Use Appropriateness**: Effective agents compose tools into coherent discovery-inspection-computation-verification pipelines rather than merely invoking them (Figure 6).
*   **Long-Horizon Execution Robustness**: Robust agents interpret feedback, recover from disruptions, and maintain coherent progress without losing task context (Figure 7).
*   **Fine-Grained Instruction Following**: Reliable agents preserve detailed constraints (e.g., filtering rules) across generated artifacts and derived outputs (Figure 8).

## Theoretical and Practical Implications
*   **Data-Centric Agent Development**: ClawGym demonstrates the feasibility and impact of a systematic, data-centric approach for developing environment-grounded agents, moving beyond reliance solely on model scale or algorithm innovation.
*   **Scalable Synthesis Framework**: The dual-route synthesis pipeline provides a blueprint for generating large-scale, verifiable training data for complex, interactive agent domains, addressing a critical data scarcity.
*   **Bridging the Capability Gap**: High-quality, domain-specific training data can significantly elevate the performance of smaller, open-weight models, making capable personal agents more accessible.
*   **Reliable Evaluation**: ClawGym-Bench establishes a rigorous protocol for evaluating Claw agents, emphasizing evaluation stability and verifiable solvability, which is crucial for meaningful progress tracking.

## Conclusion
ClawGym presents a scalable, end-to-end framework for developing Claw-style personal agents. Its core contributions are:
1.  **ClawGym-SynData**: A large-scale synthesized dataset enabled by a novel dual-route pipeline.
2.  **ClawGym-Agents**: A family of agents trained via SFT on high-fidelity trajectories, showing substantial performance improvements.
3.  **ClawGym-Bench**: A reliable, human-verified benchmark for evaluation.

Empirical results validate the framework's effectiveness, with trained agents achieving competitive performance against larger models and proprietary systems. Behavioral analyses offer insights into the key capabilities (tool-use appropriateness, execution robustness, and fine-grained instruction following) that future work should target to build more reliable environment-grounded agents.

---

_Markdown view of https://picx.dev/p/D2DTJA, served by PicX — AI-generated visual whiteboard summaries of research papers._