# DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

> DataClaw 0's Agentic Data Tailoring transforms raw multimodal streams into structured data via a learnable agent, rivaling GPT-4o and Gemini.

- **Source:** [arXiv](https://arxiv.org/abs/2606.21337)
- **Published:** 2026-06-24
- **Permalink:** https://picx.dev/p/e6D28X
- **Whiteboard:** https://picx.dev/p/e6D28X/image

## Summary (Overview)
- **Paradigm shift**: The paper introduces **Agentic Data Tailoring**, elevating data processing from passive heuristic rules to a learnable capability that actively filters, reasons, and reorganizes raw multimodal streams into intent-aligned structured outputs.
- **Two-stage pipeline**: A scalable automated construction pipeline first extracts deterministic **Factual Anchors** (e.g., object states, temporal boundaries) using lightweight experts, then uses strong VLMs for top-down semantic synthesis, producing 34,717 cross-domain training samples.
- **DataClaw 0 model**: A 9B VLM (Qwen3.5-9B base) trained with Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) with rule-driven rewards (format compliance, factual grounding, efficiency). Two deployment paradigms: **Omni** (unified model) and **Expert** (domain-decoupled) agents.
- **New benchmark**: Introduces **DataClaw 0-val** (including a **Fuzzy intent** subset) for systematic evaluation of structured data refinement, revealing deficiencies in existing general VLMs.
- **Downstream validation**: Under the same data budget, DataClaw 0-generated data improves performance on GUI navigation, action video generation, and spatio-temporal VQA, reaching **33.2% overall VQA accuracy** (vs. 9.8% zero-shot) and **51.2 Contact mAP** (vs. 18.5 zero-shot).

## Introduction and Theoretical Foundation
Raw multimodal streams (e.g., long tutorial videos, embodied trajectories, GUI logs) contain dense procedural knowledge but suffer from high **data entropy** – noise, redundancy, weak structure. Existing passive annotation (heuristic rules, coarse captioning, direct VLM prompting) is costly and fails for long streams requiring temporal reasoning and spatial consistency.

The authors propose **Agentic Data Tailoring**: given a high-level user intent $I$ and raw data $X_{\text{raw}}$, an agent actively filters task-critical evidence and reorganizes it into structured, verifiable outputs $Y_{\text{struct}}$ conforming to a schema $\Phi$. This recasts data production as a learnable, high-order capability rather than a static annotation process.

The optimization objective is defined as:
$$\theta^* = \arg\max_\theta \sum_{(X_{\text{raw}}, I, Y_{\text{struct}}) \in \mathcal{D}} \log P(Y_{\text{struct}} | X_{\text{raw}}, I; \theta) \cdot \mathbb{I}(Y_{\text{struct}} \in \Phi)$$
where $\mathbb{I}(\cdot)$ enforces structural schema compliance. Two core agentic capabilities are required: (1) information filtering (removing redundancy from long streams) and (2) structural reorganization (generating schema-conforming output).
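
The role of the indicator term can be illustrated with a minimal Python sketch. It assumes that schema compliance means "valid JSON object containing every required field" and that per-sample probabilities are already available; the field names, probabilities, and helper functions are hypothetical and not taken from the paper.

```python
import json
import math

def is_schema_compliant(y_struct: str, required_fields: set[str]) -> bool:
    """Indicator I(Y in Phi): assumed here to mean a JSON object with all required fields."""
    try:
        obj = json.loads(y_struct)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_fields.issubset(obj)

def masked_log_likelihood(samples, required_fields):
    """Sum log P(Y | X, I) over samples, keeping only schema-compliant targets."""
    total = 0.0
    for y_struct, prob in samples:
        if is_schema_compliant(y_struct, required_fields):
            total += math.log(prob)
    return total

# Hypothetical (target, model probability) pairs.
samples = [
    ('{"task": "open app", "answer": "click icon"}', 0.5),
    ('{"task": "open app"}', 0.9),  # missing "answer": indicator is 0
    ('not json', 0.8),              # invalid JSON: indicator is 0
]
score = masked_log_likelihood(samples, {"task", "answer"})
```

Only the first sample contributes to `score`, because the other two fail the compliance check and are masked out.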

## Methodology

### 3.1 Problem Formulation
Input: raw multimodal stream $X_{\text{raw}} = \{x_1, x_2, \ldots, x_T\}$ (e.g., video frames) and intent $I$. Output: structured asset $Y_{\text{struct}} = \{y_1, y_2, \ldots, y_L\}$ strictly adhering to schema $\Phi$.

### 3.2 Data Construction Pipeline
A two-stage pipeline generates training triplets $(X_{\text{raw}}, I, Y_{\text{struct}})$.

**Stage 1 – Factual Anchor Extraction**: A lightweight expert ensemble $H$ extracts deterministic anchors from raw data:
$$A = H(X_{\text{raw}}) = \{a_k = (t_k, p_k, c_k)\}_{k=1}^K$$
where each anchor records timestamp $t_k$, spatial position $p_k$, and local semantic content $c_k$ (e.g., object states, OCR text, GUI events). These provide reliable grounding to reduce hallucinations.
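
One possible in-memory representation of an anchor $a_k = (t_k, p_k, c_k)$ is sketched below. The choice of seconds for timestamps, a normalized bounding box for position, and a free-text string for content are all assumptions made for illustration; the source does not specify the storage format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Anchor:
    t: float                              # timestamp t_k, assumed to be in seconds
    p: tuple[float, float, float, float]  # position p_k, assumed (x1, y1, x2, y2) box
    c: str                                # local semantic content c_k

def anchors_in_window(anchors: list[Anchor], start: float, end: float) -> list[Anchor]:
    """Return anchors with start <= t <= end, ordered by timestamp."""
    return sorted((a for a in anchors if start <= a.t <= end), key=lambda a: a.t)

# Hypothetical anchors of the kinds named in the text (OCR, object state, GUI event).
anchors = [
    Anchor(t=12.0, p=(0.1, 0.2, 0.4, 0.5), c="OCR: 'Save as'"),
    Anchor(t=3.5, p=(0.3, 0.3, 0.6, 0.7), c="cup: grasped"),
    Anchor(t=7.25, p=(0.0, 0.0, 1.0, 1.0), c="GUI event: click"),
]
window = anchors_in_window(anchors, 0.0, 10.0)
```

Here `window` holds the two anchors that fall inside the first ten seconds, in temporal order.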

**Stage 2 – Generative Semantic Synthesis**: A strong VLM $S$ produces structured supervision conditioned on raw input, extracted anchors, and domain intent:
$$Y_{\text{struct}} = S(X_{\text{raw}}, A, I_{\text{domain}})$$
The resulting corpus covers five domains: daily life (46.4%), embodied (21.3%), GUI (16.5%), education (9.3%), and AIGC (6.5%).
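
As a rough worked example, the domain shares can be converted into approximate sample counts using the 34,717-sample corpus size reported in the summary. The counts are rounded estimates derived from the percentages, not figures reported by the source.

```python
TOTAL_SAMPLES = 34_717  # corpus size reported in the summary

# Domain shares in percent, as listed above.
shares = {
    "daily life": 46.4,
    "embodied": 21.3,
    "GUI": 16.5,
    "education": 9.3,
    "AIGC": 6.5,
}

# Rounded estimates; they need not sum exactly to TOTAL_SAMPLES.
approx_counts = {domain: round(TOTAL_SAMPLES * pct / 100) for domain, pct in shares.items()}

for domain, count in approx_counts.items():
    print(f"{domain}: ~{count} samples")
```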

### 3.3 Rule-Driven Reinforcement Learning via GRPO
After SFT, DataClaw 0 is further optimized with **Group Relative Policy Optimization (GRPO)** using deterministic rewards (no separate reward model):

$$R(Y) = \lambda_1 R_{\text{format}}(Y, \Phi) + \lambda_2 R_{\text{anchor}}(Y, A) + \lambda_3 R_{\text{eff}}(Y)$$

- $R_{\text{format}}$: checks schema compliance.
- $R_{\text{anchor}}$: measures alignment with extracted factual anchors.
- $R_{\text{eff}}$: penalizes overly verbose reasoning.
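
A minimal sketch of how such a composite reward might be computed is shown below. The three component functions, the timestamp tolerance, the length budget, and the weights are illustrative assumptions; the source names the reward terms but does not give their exact definitions.

```python
import json

def r_format(y: str, required: set[str]) -> float:
    """Assumed format reward: 1 if y is a JSON object with all required fields, else 0."""
    try:
        obj = json.loads(y)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and required.issubset(obj) else 0.0

def r_anchor(pred_times: list[float], anchor_times: list[float], tol: float = 0.5) -> float:
    """Assumed anchor reward: fraction of predicted timestamps within tol of some anchor."""
    if not pred_times or not anchor_times:
        return 0.0
    hits = sum(any(abs(t - a) <= tol for a in anchor_times) for t in pred_times)
    return hits / len(pred_times)

def r_eff(y: str, budget: int = 200) -> float:
    """Assumed efficiency reward: 1 within the character budget, decaying beyond it."""
    return min(1.0, budget / max(len(y), 1))

def total_reward(y, pred_times, anchor_times, required, lambdas=(1.0, 1.0, 0.2)):
    """Weighted sum R = l1 * R_format + l2 * R_anchor + l3 * R_eff (weights are hypothetical)."""
    l1, l2, l3 = lambdas
    return l1 * r_format(y, required) + l2 * r_anchor(pred_times, anchor_times) + l3 * r_eff(y)

y = '{"task": "pick cup", "steps": [1.0, 4.2]}'
reward = total_reward(y, pred_times=[1.0, 4.2], anchor_times=[1.1, 9.0], required={"task", "steps"})
```

In this example the output is well-formed and short, but only one of its two timestamps lies near an anchor, so the anchor term contributes half of its weight.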

For a group of sampled outputs, advantages are normalized:
$$\hat{A}^{(g)} = \frac{R^{(g)} - \mu_R}{\sigma_R}$$
The policy is updated with a clipped objective and KL regularizer against the reference model.
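
The group normalization and clipping steps can be sketched in plain Python as follows. This is a simplified illustration: it uses the population standard deviation, adds a small `eps` for numerical safety, and omits the KL regularizer and the computation of probability ratios, all of which are assumptions or simplifications rather than details confirmed by the source.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (R - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single sample (KL penalty omitted)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical rewards for a group of three sampled outputs.
advantages = group_advantages([1.0, 2.0, 3.0])

# A large policy ratio on a positive advantage is capped at (1 + clip) * advantage.
capped = clipped_term(ratio=1.5, advantage=1.0)
```

Because advantages are centered within each group, outputs are rewarded only relative to their siblings, which is why no separate value model is needed.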

### 3.4 Deployment Paradigms
- **DataClaw 0-O (Omni)**: Single unified model for all domains; simple but suffers from task interference.
- **DataClaw 0-E (Expert)**: Decoupled domain-specific experts; routing ensures specialization and better performance.

### 3.5 Benchmark
**DataClaw 0-val** combines 200 diversity-sampled examples with **DataClaw 0-Intent**, a set of fuzzy or ambiguous requests. Outputs are evaluated hierarchically: JSON validity → schema-field correctness → semantic alignment → trajectory-shape similarity.
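
The hierarchical evaluation order can be illustrated with a gated scoring function. This is only a sketch: the field names, the rule that later stages are skipped when fields are missing, the exact-match semantic check, and the use of `difflib.SequenceMatcher` as a stand-in for trajectory-shape similarity are all assumptions, since the source does not define the individual metrics.

```python
import json
from difflib import SequenceMatcher

REQUIRED = ["task", "answer", "steps"]  # hypothetical schema fields

def evaluate(pred: str, reference: dict) -> dict:
    """Score a prediction stage by stage; a failed stage leaves later scores at 0."""
    scores = {"json_valid": 0.0, "field": 0.0, "semantic": 0.0, "sequence": 0.0}
    try:
        obj = json.loads(pred)
    except json.JSONDecodeError:
        return scores
    if not isinstance(obj, dict):
        return scores
    scores["json_valid"] = 1.0

    # Stage 2: fraction of required fields present.
    scores["field"] = sum(f in obj for f in REQUIRED) / len(REQUIRED)
    if scores["field"] < 1.0:
        return scores  # assumed gate: incomplete outputs are not scored further

    # Stage 3: stand-in semantic check (exact match on the answer field).
    scores["semantic"] = 1.0 if obj["answer"] == reference["answer"] else 0.0

    # Stage 4: stand-in sequence similarity over the ordered step lists.
    scores["sequence"] = SequenceMatcher(None, obj["steps"], reference["steps"]).ratio()
    return scores

reference = {"task": "rename file", "answer": "done", "steps": ["open", "type"]}
pred = '{"task": "rename file", "answer": "done", "steps": ["open", "click", "type"]}'
result = evaluate(pred, reference)
```

Gating in this way means a malformed output cannot earn semantic or sequence credit, which mirrors the validity-first ordering described above.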

## Empirical Validation / Results

### 4.2 Main Results: Specialist vs. Generalist
**Table 1** compares DataClaw 0 against proprietary (Claude-Sonnet-4, GPT-4o, Gemini-3.1-Pro) and open-source (MiniMax-M2.7, Qwen3.6-plus, Qwen3.5-9B) models. Metrics: **Field** (schema completeness), **Semantic** (content correctness), **Sequence** (ordering/structural consistency).

| Model | Metric | GUI | Embodied | AIGC | Daily | Education | Fuzzy | Overall |
|-------|--------|-----|----------|------|-------|-----------|-------|---------|
| **DataClaw 0-E [Ours]** | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 85.17 | **97.53** |
| | Semantic | 89.18 | 82.93 | 75.36 | 49.72 | 76.43 | 76.28 | 74.94 |
| | Sequence | 96.33 | **71.60** | 15.26 | 42.59 | 19.75 | **50.31** | 48.86 |
| **DataClaw 0-O [Ours]** | Field | 100.00 | 100.00 | 85.00 | 92.50 | 70.00 | 78.42 | 87.65 |
| | Semantic | 80.01 | 63.37 | 55.70 | 62.61 | 45.71 | 67.35 | 62.46 |
| | Sequence | 85.70 | 67.01 | 23.90 | 35.05 | 17.41 | 39.84 | 44.82 |
| GPT-4o | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 83.61 | 97.27 |
| | Semantic | 84.81 | 87.55 | 69.38 | 54.58 | 80.21 | 74.39 | **75.15** |
| | Sequence | 80.69 | 46.33 | 46.95 | 29.33 | 50.71 | 42.57 | 49.43 |
| Gemini-3.1-Pro | Field | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 88.74 | **98.12** |
| | Semantic | 90.01 | 89.17 | 75.26 | 54.51 | 54.51 | 79.63 | 73.85 |
| | Sequence | 99.67 | 67.97 | 33.14 | 51.48 | 51.48 | 47.25 | **58.50** |

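- DataClaw 0-E is close to the proprietary models on overall Field score (97.53 vs. 98.12 for Gemini-3.1-Pro and 97.27 for GPT-4o) and achieves the best Sequence scores among the listed models on Embodied (71.60) and Fuzzy (50.31).
- Expert routing (DataClaw 0-E) outperforms Omni (DataClaw 0-O) on all three Overall metrics (Field 97.53 vs. 87.65, Semantic 74.94 vs. 62.46, Sequence 48.86 vs. 44.82), supporting the benefit of domain specialization. Omni still scores higher in two cells: Daily Semantic (62.61 vs. 49.72) and AIGC Sequence (23.90 vs. 15.26).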
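
For the GPT-4o and Gemini-3.1-Pro rows in the table above, the Overall column is consistent with an unweighted mean of the six per-subset scores. This is an observation from the reported numbers, not an aggregation rule stated by the source; the short check below reproduces two of the Overall values under that assumption.

```python
from statistics import fmean

# Per-subset scores copied from the table: GUI, Embodied, AIGC, Daily, Education, Fuzzy.
gpt4o_semantic = [84.81, 87.55, 69.38, 54.58, 80.21, 74.39]
gemini_sequence = [99.67, 67.97, 33.14, 51.48, 51.48, 47.25]

# Assumed aggregation: unweighted mean, rounded to two decimals.
gpt4o_semantic_overall = round(fmean(gpt4o_semantic), 2)
gemini_sequence_overall = round(fmean(gemini_sequence), 2)

print(gpt4o_semantic_overall, gemini_sequence_overall)
```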

### 4.3 Downstream Application: Targeted Refinement & Efficiency
**Table 2** compares models fine-tuned on data from different annotators (same base models, same data volume). Tasks: GUI navigation (AgentNet), action video generation (Ego4D → Wan2.2-I2V-5B), spatio-temporal VQA (ReMoT).

| Data Source | GUI (SSR↑, TSR↑) | Video Generation (FVD↓, Consis.↑, Contact mAP↑) | VQA (Partial↑, Overall↑) |
|-------------|------------------|---------------------------------------------------|---------------------------|
| Zero-shot | 12.4, 1.2 | 385.2, 68.4, 18.5 | 28.3, 9.8 |
| Base Model | 16.8, 3.5 | 362.1, 69.1, 24.2 | 33.5, 14.2 |
| Gemini-3.1-Pro | 39.5, 14.2 | 295.4, 76.2, 48.5 | 53.4, 31.5 |
| **DataClaw 0** | 38.2, **15.6** | **288.6**, 75.8, **51.2** | 52.1, **33.2** |

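- DataClaw 0-generated data yields results competitive with or superior to Gemini-3.1-Pro-generated data, especially on end-to-end metrics: TSR (15.6 vs. 14.2), Overall VQA accuracy (33.2 vs. 31.5), and Contact mAP (51.2 vs. 48.5). It trails slightly on SSR, Consis., and Partial accuracy.
- This indicates that DataClaw 0's data remains high-utility even when data volume is strictly matched.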

### 4.4 Scaling Laws and Emergent Diversity
- DataClaw 0-O shows unstable scaling due to task interference; DataClaw 0-E mitigates this via expert routing, achieving stable gains with more data.
- Feature-space analysis shows DataClaw 0 improves semantic diversity beyond simple pattern replication.

### 4.5 Ablation Studies
**Table 3a** – Reward design (on Embodied+GUI subset):

| Variant | Field | Semantic | Sequence |
|---------|-------|----------|----------|
| Minimal Init. | 82.50 | 36.79 | 45.40 |
| SFT Only | 100.00 | 82.54 | 70.83 |
| GRPO w/o $R_{\text{anchor}}$ | 100.00 | 83.32 | 70.11 |
| GRPO w/ $R_{\text{anchor}}$ | 100.00 | 82.36 | **71.96** |

- Adding the factual anchor reward $R_{\text{anchor}}$ improves Sequence from 70.11 to 71.96 at the cost of a small Semantic drop (83.32 to 82.36), which supports the value of explicit spatio-temporal grounding for sequence consistency.

**Table 3b** – Expert routing (correct vs. forced wrong):

| Domain (Expert) | Field | Semantic | Sequence |
|-----------------|-------|----------|----------|
| Embodied (GUI) | 0.00 | 0.00 | 50.00 |
| Embodied (Emb) | 96.50 | 74.21 | 63.48 |
| GUI (GUI) | 100.00 | 84.93 | 76.41 |
| GUI (Emb) | 0.00 | 52.55 | 0.00 |

- Cross-domain misrouting severely degrades performance, validating the need for accurate expert selection.

### 4.6 Qualitative Analysis
Figure 4 shows DataClaw 0-E correctly identifying hesitation loops in robot manipulation videos, selecting temporally consistent evidence, and producing structured JSON with task, question, answer, and chain-of-thought fields. Baseline VLMs produce missing fields, incorrect clip ranges, or unstructured outputs.

## Theoretical and Practical Implications
- **Paradigm shift**: Formalizes data processing as an active, learnable, intent-driven capability rather than a passive annotation task. This opens the door for scalable, high-quality data curation for multimodal model training.
- **Scalable data construction**: The two-stage pipeline (factual anchors + generative synthesis) provides a blueprint for generating large-scale, grounded training data without expensive human annotation, reducing hallucinations and improving spatio-temporal consistency.
- **Reinforcement learning for data quality**: GRPO with rule-driven rewards (format, factual grounding, efficiency) proves effective for aligning small models to produce structured, high-utility outputs, demonstrating that compact 9B agents can rival proprietary annotators.
- **Downstream efficiency**: DataClaw 0 enables “Targeted Refinement” – producing compact, task-relevant training subsets that outperform full-scale conventional datasets at drastically reduced computational costs. This addresses both data scarcity and training efficiency bottlenecks.
- **Domain specialization matters**: The Expert paradigm (domain-decoupled agents) substantially outperforms a unified model, indicating that heterogeneous multimodal streams require specialized processing. This informs future deployment strategies.

## Conclusion
DataClaw 0 presents a **structured multimodal data tailoring framework** that transforms raw, high-entropy streams into schema-aligned, high-quality training data. By combining deterministic factual anchors, generative synthesis via strong VLMs, and SFT followed by GRPO-based reinforcement learning, the 9B model achieves structured generation quality competitive with proprietary models (GPT-4o, Gemini-3.1-Pro) while offering superior controllability and domain specialization. Downstream experiments across GUI navigation, video generation, and VQA show that DataClaw 0-generated data improves task performance under limited data budgets.

**Future directions**: Scaling the framework to more domains (e.g., 3D, audio), exploring model-free anchor extraction, and further improving the robustness of expert routing for real-time, diverse user intents. The project page is available at [https://czjdsg.github.io/MakeAnyData](https://czjdsg.github.io/MakeAnyData).
