Visual Summary | DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Summary (Overview)

Paradigm shift: The paper introduces Agentic Data Tailoring, elevating data processing from passive heuristic rules to a learnable capability that actively filters, reasons, and reorganizes raw multimodal streams into intent-aligned structured outputs.
Two-stage pipeline: A scalable automated construction pipeline first extracts deterministic Factual Anchors (e.g., object states, temporal boundaries) using lightweight experts, then uses strong VLMs for top-down semantic synthesis, producing 34,717 cross-domain training samples.
DataClaw0 model: A 9B VLM (Qwen3.5-9B base) trained with Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) with rule-driven rewards (format compliance, factual grounding, efficiency). Two deployment paradigms: Omni (unified model) and Expert (domain-decoupled) agents.
New benchmark: Introduces DataClaw0-val (including a Fuzzy intent subset) for systematic evaluation of structured data refinement, revealing deficiencies in existing general VLMs.
Downstream validation: DataClaw0-generated data significantly improves performance on GUI navigation, action video generation, and spatio-temporal VQA under the same data budget, achieving up to 33.2% overall VQA accuracy (vs. 9.8% zero-shot) and 51.2% Contact mAP (vs. 18.5% zero-shot).

Introduction and Theoretical Foundation

Raw multimodal streams (e.g., long tutorial videos, embodied trajectories, GUI logs) contain dense procedural knowledge but suffer from high data entropy – noise, redundancy, weak structure. Existing passive annotation (heuristic rules, coarse captioning, direct VLM prompting) is costly and fails for long streams requiring temporal reasoning and spatial consistency.

The authors propose Agentic Data Tailoring: given a high-level user intent $I$ and raw data $X_{\text{raw}}$, an agent actively filters task-critical evidence and reorganizes it into structured, verifiable outputs $Y_{\text{struct}}$ conforming to a schema $\Phi$. This recasts data production as a learnable, high-order capability rather than a static annotation process.

The optimization objective is defined as:

$$\theta^* = \arg\max_\theta \sum_{(X_{\text{raw}}, I, Y_{\text{struct}}) \in \mathcal{D}} \log P(Y_{\text{struct}} \mid X_{\text{raw}}, I; \theta) \cdot \mathbb{I}(Y_{\text{struct}} \in \Phi)$$

where $\mathbb{I}(\cdot)$ enforces structural schema compliance. Two core agentic capabilities are required: (1) information filtering (removing redundancy from long streams) and (2) structural reorganization (generating schema-conforming output).
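
As a minimal sketch of how the indicator term gates the log-likelihood, the snippet below evaluates the objective on toy numbers. The function name and the probabilities are illustrative assumptions, not values from the paper.

```python
import math

def tailoring_objective(samples):
    """Sum log P(Y_struct | X_raw, I; theta) over schema-compliant samples.

    Each sample is a (prob, in_schema) pair, where prob is the model's
    probability of the target and in_schema plays the role of the
    indicator I(Y_struct in Phi).
    """
    total = 0.0
    for prob, in_schema in samples:
        indicator = 1.0 if in_schema else 0.0
        total += math.log(prob) * indicator
    return total

# Toy data (assumed): the third target violates the schema and is masked out.
toy_samples = [(0.5, True), (0.25, True), (0.9, False)]
objective = tailoring_objective(toy_samples)
```

Only the first two samples contribute, so the result equals log(0.5) + log(0.25).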

Methodology

3.1 Problem Formulation

Input: raw multimodal stream $X_{\text{raw}} = \{x_1, x_2, \ldots, x_T\}$ (e.g., video frames) and intent $I$. Output: structured asset $Y_{\text{struct}} = \{y_1, y_2, \ldots, y_L\}$ strictly adhering to schema $\Phi$.
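
The schema constraint can be pictured as a membership test on the output. The sketch below assumes a JSON output and a hypothetical schema $\Phi$ given as required field names with expected types; the summary does not specify the actual schema, so the field names here are placeholders.

```python
import json

# Hypothetical schema Phi: required field -> expected Python type (assumed).
PHI = {"task": str, "question": str, "answer": str, "clip_range": list}

def conforms(raw_output: str, schema: dict) -> bool:
    """Return True when raw_output parses as a JSON object matching schema."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # A missing field yields None, which fails every isinstance check here.
    return all(isinstance(obj.get(key), typ) for key, typ in schema.items())

good = json.dumps({"task": "vqa", "question": "What was clicked?",
                   "answer": "Save", "clip_range": [3.0, 4.5]})
missing_field = json.dumps({"task": "vqa", "answer": "Save"})
```

Here `conforms(good, PHI)` holds, while `conforms(missing_field, PHI)` and `conforms("not json", PHI)` do not.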

3.2 Data Construction Pipeline

A two-stage pipeline generates training triplets $(X_{\text{raw}}, I, Y_{\text{struct}})$.

Stage 1 – Factual Anchor Extraction: A lightweight expert ensemble $H$ extracts deterministic anchors from raw data:

$$A = H(X_{\text{raw}}) = \{a_k = (t_k, p_k, c_k)\}_{k=1}^K$$

where each anchor records timestamp $t_k$, spatial position $p_k$, and local semantic content $c_k$ (e.g., object states, OCR text, GUI events). These provide reliable grounding to reduce hallucinations.
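
One way to picture the anchor set $A = H(X_{\text{raw}})$ is as a time-sorted list of small records produced by independent experts. The sketch below is an assumption-laden toy: the `Anchor` class, the two expert functions, and the frame dictionaries are hypothetical stand-ins for the lightweight experts, which the summary does not describe in detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

@dataclass(frozen=True)
class Anchor:
    t: float             # timestamp t_k
    p: Tuple[int, int]   # spatial position p_k (here a pixel coordinate)
    c: str               # local semantic content c_k

Frame = Dict[str, object]
Expert = Callable[[Sequence[Frame]], List[Anchor]]

def ocr_expert(frames: Sequence[Frame]) -> List[Anchor]:
    # Toy OCR expert: reads pre-annotated text from each frame.
    return [Anchor(f["t"], f["text_pos"], "ocr: " + f["text"])
            for f in frames if "text" in f]

def click_expert(frames: Sequence[Frame]) -> List[Anchor]:
    # Toy GUI-event expert: reports click locations.
    return [Anchor(f["t"], f["click"], "click") for f in frames if "click" in f]

def extract_anchors(x_raw: Sequence[Frame], experts: Sequence[Expert]) -> List[Anchor]:
    """Apply every expert in the ensemble H and merge anchors by timestamp."""
    anchors = [a for expert in experts for a in expert(x_raw)]
    return sorted(anchors, key=lambda a: a.t)

frames = [
    {"t": 0.0},
    {"t": 2.0, "click": (40, 12)},
    {"t": 1.0, "text": "File", "text_pos": (38, 10)},
]
anchors = extract_anchors(frames, [ocr_expert, click_expert])
```

The resulting list holds the OCR anchor at 1.0 s followed by the click anchor at 2.0 s; the frame with no detectable event contributes nothing.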

Stage 2 – Generative Semantic Synthesis: A strong VLM $S$ produces structured supervision conditioned on raw input, extracted anchors, and domain intent:

$$Y_{\text{struct}} = S(X_{\text{raw}}, A, I_{\text{domain}})$$
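
The synthesis step can be sketched as a function that conditions a VLM call on the anchors and the domain intent, then packages the result as a training triplet. Everything below is hypothetical: the prompt wording, the simplified `(timestamp, content)` anchor form, and `fake_vlm`, which stands in for the strong VLM $S$ so the example runs without any model.

```python
from typing import Callable, List, Tuple

Anchor = Tuple[float, str]  # simplified (timestamp, content); position omitted

def build_prompt(anchors: List[Anchor], intent_domain: str) -> str:
    """Render anchors and intent into a text prompt (wording is assumed)."""
    lines = [f"Intent: {intent_domain}", "Verified anchors:"]
    lines += [f"- t={t:.1f}s: {c}" for t, c in sorted(anchors)]
    lines.append("Return JSON that cites only the anchors above.")
    return "\n".join(lines)

def synthesize(x_raw, anchors: List[Anchor], intent_domain: str,
               vlm: Callable[[object, str], str]):
    """Produce one (X_raw, I, Y_struct) triplet using the supplied VLM."""
    y_struct = vlm(x_raw, build_prompt(anchors, intent_domain))
    return x_raw, intent_domain, y_struct

def fake_vlm(x_raw, prompt: str) -> str:
    # Stand-in for S: reports how many anchor lines it received.
    n = sum(1 for line in prompt.splitlines() if line.startswith("- t="))
    return '{"num_anchors": %d}' % n

triplet = synthesize("video.mp4", [(3.0, "click Save"), (1.5, "ocr: File")],
                     "GUI", fake_vlm)
```

Passing the model as a callable keeps the anchor-conditioning logic testable independently of whichever VLM is used.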

The resulting corpus covers five domains: daily life (46.4%), embodied (21.3%), GUI (16.5%), education (9.3%), and AIGC (6.5%).

3.3 Rule-Driven Reinforcement Learning via GRPO

After SFT, DataClaw0 is further optimized with Group Relative Policy Optimization (GRPO) using deterministic rewards (no separate reward model):

$$R(Y) = \lambda_1 R_{\text{format}}(Y, \Phi) + \lambda_2 R_{\text{anchor}}(Y, A) + \lambda_3 R_{\text{eff}}(Y)$$

$R_{\text{format}}$: checks schema compliance.
$R_{\text{anchor}}$: measures alignment with extracted factual anchors.
$R_{\text{eff}}$: penalizes overly verbose reasoning.
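
A rough sketch of how such a composite rule-based reward could be computed is given below. The three component definitions, the `evidence` and `reasoning` field names, the word budget, and the weights $\lambda_i$ are all assumptions chosen for illustration; the summary does not give the actual rules or weights.

```python
def r_format(y: dict, required) -> float:
    # 1.0 when every required schema field is present, else 0.0.
    return 1.0 if all(k in y for k in required) else 0.0

def r_anchor(y: dict, anchors) -> float:
    # Fraction of extracted anchors that the output cites as evidence.
    if not anchors:
        return 0.0
    cited = set(y.get("evidence", []))
    return sum(1 for a in anchors if a in cited) / len(anchors)

def r_eff(y: dict, budget: int = 50) -> float:
    # Linear penalty on reasoning length, measured in whitespace tokens.
    n_words = len(y.get("reasoning", "").split())
    return max(0.0, 1.0 - n_words / budget)

def reward(y: dict, required, anchors, lambdas=(1.0, 1.0, 0.5)) -> float:
    l1, l2, l3 = lambdas  # assumed weights
    return l1 * r_format(y, required) + l2 * r_anchor(y, anchors) + l3 * r_eff(y)

y = {"task": "gui", "answer": "Save", "evidence": ["click:Save"],
     "reasoning": "five words of short reasoning"}
total = reward(y, required=("task", "answer"),
               anchors=["click:Save", "ocr:File"])
```

With these toy inputs the components are 1.0 (format), 0.5 (one of two anchors cited), and 0.9 (five words against a budget of 50), giving a total of 1.95.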

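For a group of sampled outputs indexed by $g$, each reward $R^{(g)}$ is normalized by the group mean $\mu_R$ and standard deviation $\sigma_R$ to give the advantage:

$$\hat{A}^{(g)} = \frac{R^{(g)} - \mu_R}{\sigma_R}$$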

The policy is updated with a clipped objective and KL regularizer against the reference model.
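
The group normalization and the clipped term can be sketched in a few lines of plain Python. This is an illustrative simplification: the small `eps` in the denominator, the use of the population standard deviation, and the clip range of 0.2 are assumptions, and the KL regularizer is omitted.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (R - mean) / std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one sample (no KL penalty here)."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)

rewards = [1.0, 2.0, 3.0, 2.0]            # toy rewards for four sampled outputs
advantages = group_advantages(rewards)
ratios = [1.5, 1.0, 0.9, 1.1]             # toy new/old policy probability ratios
surrogate = statistics.fmean(
    [clipped_term(r, a) for r, a in zip(ratios, advantages)]
)
```

Outputs that score above the group mean receive positive advantages and those below receive negative ones, so no learned value function or reward model is needed to rank samples within a group.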

3.4 Deployment Paradigms

DataClaw0-O (Omni): Single unified model for all domains; simple but suffers from task interference.
DataClaw0-E (Expert): Decoupled domain-specific experts; routing ensures specialization and better performance.
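
The Expert paradigm amounts to a dispatch step in front of a set of per-domain models. The sketch below is purely illustrative: the summary does not say how routing is implemented, so the keyword-based `classify_domain` and the string-returning experts are hypothetical placeholders.

```python
from typing import Callable, Dict

Expert = Callable[[str], str]

def make_expert(domain: str) -> Expert:
    # Placeholder for a domain-specific model.
    def expert(sample: str) -> str:
        return f"[{domain}] structured output for {sample}"
    return expert

DOMAINS = ("gui", "embodied", "aigc", "daily", "education")
EXPERTS: Dict[str, Expert] = {d: make_expert(d) for d in DOMAINS}

def classify_domain(sample: str) -> str:
    # Assumed keyword heuristic; a real router could be a learned classifier.
    if sample.endswith(".log"):
        return "gui"
    if "robot" in sample:
        return "embodied"
    return "daily"

def route(sample: str, experts: Dict[str, Expert] = EXPERTS,
          classify: Callable[[str], str] = classify_domain) -> str:
    """Send the sample to the expert chosen by the classifier."""
    domain = classify(sample)
    if domain not in experts:
        raise KeyError(f"no expert registered for domain {domain!r}")
    return experts[domain](sample)

gui_out = route("session.log")
emb_out = route("robot_arm.mp4")
```

Failing loudly on an unknown domain, rather than falling back to an arbitrary expert, is a deliberate choice in this sketch.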

3.5 Benchmark

DataClaw0-val: 200 diversity-sampled examples + DataClaw0-Intent (fuzzy/ambiguous requests). Evaluated hierarchically: JSON validity → schema-field correctness → semantic alignment → trajectory-shape similarity.
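
A toy version of this evaluation order is sketched below. The scoring functions are stand-ins chosen so the example runs with the standard library only: token-set overlap approximates semantic alignment and `difflib.SequenceMatcher` approximates trajectory-shape similarity. Neither is claimed to be the benchmark's actual metric, and the choice to gate only on JSON validity is also an assumption.

```python
import json
from difflib import SequenceMatcher

def evaluate(raw_output: str, reference: dict, required) -> dict:
    scores = {"json": 0.0, "field": 0.0, "semantic": 0.0, "sequence": 0.0}
    # Stage 1: JSON validity gates everything else.
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return scores
    if not isinstance(pred, dict):
        return scores
    scores["json"] = 1.0
    # Stage 2: fraction of required schema fields present.
    scores["field"] = sum(k in pred for k in required) / len(required)
    # Stage 3: Jaccard overlap of answer tokens (stand-in for semantic score).
    pred_tokens = set(str(pred.get("answer", "")).lower().split())
    ref_tokens = set(str(reference.get("answer", "")).lower().split())
    union = pred_tokens | ref_tokens
    scores["semantic"] = len(pred_tokens & ref_tokens) / len(union) if union else 0.0
    # Stage 4: ordered-step similarity (stand-in for trajectory shape).
    steps = pred.get("steps")
    steps = steps if isinstance(steps, list) else []
    scores["sequence"] = SequenceMatcher(None, steps, reference.get("steps", [])).ratio()
    return scores

reference = {"answer": "press save", "steps": ["open", "type", "save"]}
required = ("task", "answer", "steps")
perfect = evaluate(json.dumps({"task": "gui", **reference}), reference, required)
invalid = evaluate("not json", reference, required)
partial = evaluate('{"answer": "press"}', reference, required)
```

The `partial` case keeps a nonzero semantic score even though two fields are missing, which shows the stages being scored separately once the output parses.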

Empirical Validation / Results

4.2 Main Results: Specialist vs. Generalist

Table 1 compares DataClaw0 against proprietary (Claude-Sonnet-4, GPT-4o, Gemini-3.1-Pro) and open-source (MiniMax-M2.7, Qwen3.6-plus, Qwen3.5-9B) models; the excerpt below shows the two DataClaw0 variants alongside GPT-4o and Gemini-3.1-Pro. Metrics: Field (schema completeness), Semantic (content correctness), Sequence (ordering/structural consistency).

Model	Metric	GUI	Embodied	AIGC	Daily	Education	Fuzzy	Overall
DataClaw0-E [Ours]	Field	100.00	100.00	100.00	100.00	100.00	85.17	97.53
	Semantic	89.18	82.93	75.36	49.72	76.43	76.28	74.94
	Sequence	96.33	71.60	15.26	42.59	19.75	50.31	48.86
DataClaw0-O [Ours]	Field	100.00	100.00	85.00	92.50	70.00	78.42	87.65
	Semantic	80.01	63.37	55.70	62.61	45.71	67.35	62.46
	Sequence	85.70	67.01	23.90	35.05	17.41	39.84	44.82
GPT-4o	Field	100.00	100.00	100.00	100.00	100.00	83.61	97.27
	Semantic	84.81	87.55	69.38	54.58	80.21	74.39	75.15
	Sequence	80.69	46.33	46.95	29.33	50.71	42.57	49.43
Gemini-3.1-Pro	Field	100.00	100.00	100.00	100.00	100.00	88.74	98.12
	Semantic	90.01	89.17	75.26	54.51	54.51	79.63	73.85
	Sequence	99.67	67.97	33.14	51.48	51.48	47.25	58.50

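DataClaw0-E nearly matches proprietary models on Overall Field score (97.53 vs. 98.12 for Gemini-3.1-Pro and 97.27 for GPT-4o) and achieves the best Sequence scores among the listed models on Embodied (71.60) and Fuzzy (50.31).
Expert routing (DataClaw0-E) outperforms Omni (DataClaw0-O) on all three Overall metrics (Field 97.53 vs. 87.65, Semantic 74.94 vs. 62.46, Sequence 48.86 vs. 44.82), supporting the benefit of domain specialization; Omni still scores higher on Daily Semantic (62.61 vs. 49.72) and AIGC Sequence (23.90 vs. 15.26).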

4.3 Downstream Application: Targeted Refinement & Efficiency

Table 2 compares models fine-tuned on data from different annotators (same base models, same data volume). Tasks: GUI navigation (AgentNet), action video generation (Ego4D → Wan2.2-I2V-5B), spatio-temporal VQA (ReMoT).

Data Source	GUI SSR↑	GUI TSR↑	Video FVD↓	Video Consis.↑	Video Contact mAP↑	VQA Partial↑	VQA Overall↑
Zero-shot	12.4	1.2	385.2	68.4	18.5	28.3	9.8
Base Model	16.8	3.5	362.1	69.1	24.2	33.5	14.2
Gemini-3.1-Pro	39.5	14.2	295.4	76.2	48.5	53.4	31.5
DataClaw0	38.2	15.6	288.6	75.8	51.2	52.1	33.2

DataClaw0-generated data yields results competitive with or superior to Gemini-3.1-Pro data, especially on end-to-end metrics: TSR (15.6% vs. 14.2%), VQA Overall accuracy (33.2% vs. 31.5%), and Contact mAP (51.2% vs. 48.5%).
This indicates that DataClaw0 produces high-utility data even when data volume is held equal across annotators.
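
The comparison with Gemini-3.1-Pro can be checked mechanically from the Table 2 rows. The short script below copies those two rows and counts the metrics on which DataClaw0-generated data comes out ahead, treating FVD as lower-is-better; the helper name is an illustrative choice.

```python
# (metric, higher_is_better, Gemini-3.1-Pro, DataClaw0), values from Table 2.
ROWS = [
    ("GUI SSR", True, 39.5, 38.2),
    ("GUI TSR", True, 14.2, 15.6),
    ("Video FVD", False, 295.4, 288.6),
    ("Video Consis.", True, 76.2, 75.8),
    ("Video Contact mAP", True, 48.5, 51.2),
    ("VQA Partial", True, 53.4, 52.1),
    ("VQA Overall", True, 31.5, 33.2),
]

def dataclaw_wins(rows):
    """Return the metrics where the DataClaw0 value beats the Gemini value."""
    wins = []
    for name, higher_is_better, gemini, dataclaw in rows:
        better = dataclaw > gemini if higher_is_better else dataclaw < gemini
        if better:
            wins.append(name)
    return wins

wins = dataclaw_wins(ROWS)
```

DataClaw0 leads on four of the seven metrics (TSR, FVD, Contact mAP, VQA Overall), while Gemini-3.1-Pro leads on SSR, Consis., and VQA Partial.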

4.4 Scaling Laws and Emergent Diversity

DataClaw0-O shows unstable scaling due to task interference; DataClaw0-E mitigates this via expert routing, achieving stable gains with more data.
Feature-space analysis shows DataClaw0 improves semantic diversity beyond simple pattern replication.

4.5 Ablation Studies

Table 3a – Reward design (on Embodied+GUI subset):

Variant	Field	Semantic	Sequence
Minimal Init.	82.50	36.79	45.40
SFT Only	100.00	82.54	70.83
GRPO w/o $R_{\text{anchor}}$	100.00	83.32	70.11
GRPO w/ $R_{\text{anchor}}$	100.00	82.36	71.96

Adding the factual anchor reward $R_{\text{anchor}}$ improves Sequence consistency (70.11 → 71.96) at the cost of a small drop in Semantic score (83.32 → 82.36), which supports the value of explicit spatio-temporal grounding.

Table 3b – Expert routing (correct vs. forced wrong):

Domain (Expert)	Field	Semantic	Sequence
Embodied (GUI)	0.00	0.00	50.00
Embodied (Emb)	96.50	74.21	63.48
GUI (GUI)	100.00	84.93	76.41
GUI (Emb)	0.00	52.55	0.00

Cross-domain misrouting severely degrades performance, validating the need for accurate expert selection.

4.6 Qualitative Analysis

Figure 4 demonstrates that DataClaw0-E correctly identifies hesitation loops in robot manipulation videos, selects temporally consistent evidence, and outputs structured JSON with task, question, answer, and chain-of-thought. Baseline VLMs produce missing fields, incorrect clip ranges, or unstructured outputs.

Theoretical and Practical Implications

Paradigm shift: Formalizes data processing as an active, learnable, intent-driven capability rather than a passive annotation task. This opens the door for scalable, high-quality data curation for multimodal model training.
Scalable data construction: The two-stage pipeline (factual anchors + generative synthesis) provides a blueprint for generating large-scale, grounded training data without expensive human annotation, reducing hallucinations and improving spatio-temporal consistency.
Reinforcement learning for data quality: GRPO with rule-driven rewards (format, factual grounding, efficiency) proves effective for aligning small models to produce structured, high-utility outputs, demonstrating that compact 9B agents can rival proprietary annotators.
Downstream efficiency: DataClaw0 enables “Targeted Refinement” – producing compact, task-relevant training subsets that outperform full-scale conventional datasets at drastically reduced computational costs. This addresses both data scarcity and training efficiency bottlenecks.
Domain specialization matters: The Expert paradigm (domain-decoupled agents) substantially outperforms a unified model, indicating that heterogeneous multimodal streams require specialized processing. This informs future deployment strategies.

Conclusion

DataClaw0 presents a structured multimodal data tailoring framework that transforms raw, high-entropy streams into schema-aligned, high-quality training data. By combining deterministic factual anchors, generative synthesis via strong VLMs, and two-stage training (SFT followed by GRPO), the 9B model achieves structured generation quality competitive with proprietary models (GPT-4o, Gemini-3.1-Pro) while offering superior controllability and domain specialization. Downstream experiments across GUI navigation, video generation, and VQA show that DataClaw0-generated data significantly improves task performance under limited data budgets.

Future directions: Scaling the framework to more domains (e.g., 3D, audio), exploring model-free anchor extraction, and further improving the robustness of expert routing for real-time, diverse user intents. The project page is available at https://czjdsg.github.io/MakeAnyData.