Summary (Overview)
- Contribution: LabVLA is a Vision-Language-Action (VLA) model designed for scientific laboratory automation, trained primarily on synthetic data to execute fixed multistep protocols.
- Data Engine: RoboGenesis, a simulation-based workflow and data engine built on Isaac Sim, generates a large-scale, annotated laboratory demonstration corpus (LabEmbodied-Data) with cross-embodiment support, success filtering, and domain randomization.
- Training Recipe: LabVLA uses a two-stage training pipeline: (1) FAST action token pretraining on the Qwen3-VL-4B-Instruct backbone to align vision-language representations with action semantics, and (2) flow matching posttraining with a DiT action expert under knowledge insulation to avoid interference with VLM representations.
- Benchmark Results: On the LabUtopia benchmark, LabVLA achieves the highest average success rate of all compared methods: 71.1% in-distribution and 70.0% out-of-distribution, outperforming π0 by 7.8 and 6.8 pp, respectively.
- Real-World Transfer: LabVLA transfers from simulation to a physical Franka platform across four benchtop tasks, reaching an 86.5% clean in-domain average, comparable to DreamZero.
Introduction and Theoretical Foundation
Scientific laboratories increasingly rely on AI for literature search, hypothesis generation, and protocol planning, but physical execution of experiments (e.g., pipetting, heating, stirring) still requires a human operator. The gap between digital scientific reasoning and real experimental work is one of embodiment.
Vision-Language-Action (VLA) models can interface written protocols with robot execution, but existing VLA policies are trained on household and tabletop demonstrations (e.g., Open X-Embodiment, DROID, BridgeData V2) and lack exposure to laboratory-specific instruments (pipettes, centrifuges, thermal cyclers), transparent liquids, or fixed protocol workflows.
The authors identify data and embodiment as central bottlenecks alongside model design. Laboratory manipulation differs from tabletop manipulation in its failure modes, which stem from fine spatial precision, contact control, and physical state changes such as liquid flow and heating. Collecting real laboratory data is expensive because it requires specialized instruments, domain supervision, and safety procedures.
To address these bottlenecks, the paper presents:
- RoboGenesis: a programmable simulation-based data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across 16 robot profiles.
- LabVLA: a VLA policy trained with FAST action token pretraining (to make the VLM backbone action-aware) and flow matching posttraining with a DiT action expert under knowledge insulation (stop-gradient to protect VLM representations).
Methodology
RoboGenesis: Data Engine
RoboGenesis operates in three stages (Figure 2 in paper):
1. Environment Building
- Asset generation pipeline: Text descriptions → structured prompts → text-to-image API → TRELLIS 2.0 (image-to-3D mesh) → postprocessing (USD format, PBR textures, collision mesh, URDF with mass/friction). Result: LabAssetLibrary of 2,947 annotated assets.
- Automated scene construction: Greedy placement pipeline with six passes (main table, walls with lab counters, bench clusters, floor equipment, shelves/glassware, yaw correction). Ten validation checks; scenes below threshold rejected. Generated 10,000 lab scenes with diversity across topologies, themes, and clutter.
- Robot profiles: 16 single-arm, bimanual, and mobile manipulator configurations (Franka Panda, FR3, UR-series, Piper, Rizon4, Festo, ARX X5/R5, Split ALOHA, Lift2, FR3 Duo, Ridgebase variants) stored independently from scenes/protocols.
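The greedy, pass-ordered placement with validation described above can be illustrated with a toy 2D sketch. Everything here is an assumption made for illustration: the `Box` type, the overlap rule, the three passes, and the two-condition validator are hypothetical stand-ins, not the RoboGenesis implementation or its ten checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned floor footprint in metres (hypothetical type)."""
    name: str
    x: float
    y: float
    w: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints share interior area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def greedy_place(passes, room_w, room_d):
    """Run passes in order; keep a candidate only if it fits in the room
    and does not overlap anything placed earlier."""
    placed = []
    for candidates in passes:
        for box in candidates:
            inside = (box.x >= 0 and box.y >= 0
                      and box.x + box.w <= room_w
                      and box.y + box.d <= room_d)
            if inside and not any(overlaps(box, p) for p in placed):
                placed.append(box)
    return placed


def passes_validation(placed, required, min_objects):
    """Toy validator: required items are present and the scene is not empty-ish."""
    names = {b.name for b in placed}
    return required <= names and len(placed) >= min_objects


passes = [
    [Box("main_table", 2.0, 2.0, 1.6, 0.8)],
    # counter_b collides with counter_a and is skipped by the greedy rule.
    [Box("counter_a", 0.0, 0.0, 3.0, 0.6), Box("counter_b", 2.5, 0.3, 1.0, 0.6)],
    [Box("centrifuge", 4.5, 4.0, 0.6, 0.6)],
]
scene = greedy_place(passes, room_w=6.0, room_d=5.0)
scene_ok = passes_validation(scene, required={"main_table"}, min_objects=3)
print([b.name for b in scene], scene_ok)
```

Placing high-priority items first means later, lower-priority candidates are the ones dropped on conflict, which is the usual trade-off of a greedy scheme.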
2. Agentic Workflow Generation
- Atomic skill library: pick, place, pour, stir, shake, move, press, pressZ, open, close, navigation skills. Extensible.
- Workflow authoring: Agent-assisted path from natural language instruction to executable YAML workflow; offline validator checks reachability, conflicts, repick risks. Manual YAML path also available.
- Domain randomization: Six configurable axes (scene, camera, lighting, object, spatial, clutter) applied after validation; instruction paraphrasing. Randomization never rewrites protocol semantics.
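A minimal sketch of the idea that randomization varies the rendering and layout context while leaving the protocol untouched. The workflow is written as a Python dict rather than YAML to keep the example dependency-free; the field names, the example instruction, and the per-axis seed scheme are assumptions, not the RoboGenesis schema. Navigation skills are omitted for brevity.

```python
import copy
import random

# Manipulation skills named in the summary (navigation skills omitted here).
ATOMIC_SKILLS = {"pick", "place", "pour", "stir", "shake",
                 "move", "press", "pressZ", "open", "close"}
RANDOM_AXES = ("scene", "camera", "lighting", "object", "spatial", "clutter")

# Hypothetical workflow; keys are illustrative only.
workflow = {
    "instruction": "Pour the sample into the beaker and stir it.",
    "steps": [
        {"skill": "pick", "target": "sample_tube"},
        {"skill": "pour", "target": "beaker"},
        {"skill": "place", "target": "tube_rack"},
        {"skill": "pick", "target": "stir_rod"},
        {"skill": "stir", "target": "beaker"},
    ],
}


def validate(wf):
    """Accept only workflows composed of known atomic skills."""
    return all(step["skill"] in ATOMIC_SKILLS for step in wf["steps"])


def randomize(wf, seed):
    """Return a copy with one random seed per axis; steps are copied unchanged."""
    rng = random.Random(seed)
    variant = copy.deepcopy(wf)
    variant["randomization"] = {axis: rng.randrange(10_000) for axis in RANDOM_AXES}
    return variant


# Randomization is applied only after validation succeeds.
variants = [randomize(workflow, seed) for seed in range(3)] if validate(workflow) else []
steps_preserved = all(v["steps"] == workflow["steps"] for v in variants)
print(len(variants), steps_preserved)
```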
3. Knowledgeable LabEmbodied-Data
- Exports only successful rollouts (per-skill success checkers: grasp stability, liquid transfer, position tolerance; contact safety monitor).
- Each episode stores: multicamera RGB, robot joint states, actions, language instruction. 15 annotation providers (robot state, camera intrinsics/extrinsics, step timing, instruction alignment, object state, scene relations, object semantics, success explanation, collision events, temporal segments, subgoals, quality scores, intervention flags, episode metadata).
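Success filtering can be pictured as a predicate over per-skill checker results plus the safety monitor. The sketch below is illustrative only: the `Rollout` record, its fields, and the zero-violation rule are assumptions, since the summary does not specify how the checkers are combined.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical rollout record; field names are illustrative."""
    episode_id: int
    skill_results: dict          # skill name -> bool from its success checker
    contact_violations: int = 0  # events flagged by the contact safety monitor


def is_exportable(rollout: Rollout) -> bool:
    """Export only if every skill succeeded and no contact violation occurred."""
    return all(rollout.skill_results.values()) and rollout.contact_violations == 0


rollouts = [
    Rollout(0, {"pick": True, "pour": True, "place": True}),
    Rollout(1, {"pick": True, "pour": False, "place": True}),   # failed liquid transfer
    Rollout(2, {"pick": True, "pour": True, "place": True}, contact_violations=1),
]
exported_ids = [r.episode_id for r in rollouts if is_exportable(r)]
print(exported_ids)
```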
LabVLA Architecture
LabVLA pairs a Qwen3-VL-4B-Instruct backbone with a DiT action expert (18 layers, width 1024, 8 attention heads, head dimension 128). The VLM encodes up to V RGB views, a language instruction, and the robot state into a sequence of hidden states.
A linear projection maps these hidden states to the DiT width, and the action expert, conditioned on them, predicts a multi-step chunk of continuous actions.
Embodiment-agnostic batch format: state and action vectors are padded to a fixed dimension and paired with a validity mask, so all datasets share a single batch format.
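A small sketch of padding-plus-mask batching across embodiments. The padded width (`MAX_DIM = 16`), the zero pad value, and the two example vector sizes are assumptions for illustration; the summary does not preserve the dimension the paper uses.

```python
MAX_DIM = 16  # assumed padded width, not taken from the paper


def pad_with_mask(vec, max_dim=MAX_DIM, pad_value=0.0):
    """Pad a state or action vector to max_dim and return (padded, mask).

    mask[i] is 1 for real dimensions and 0 for padding.
    """
    if len(vec) > max_dim:
        raise ValueError(f"vector of length {len(vec)} exceeds max_dim={max_dim}")
    n_pad = max_dim - len(vec)
    padded = list(vec) + [pad_value] * n_pad
    mask = [1] * len(vec) + [0] * n_pad
    return padded, mask


# Illustrative sizes: a single arm with gripper and a bimanual platform.
single_arm_action = [0.1] * 8
bimanual_action = [0.2] * 14

batch = [pad_with_mask(v) for v in (single_arm_action, bimanual_action)]
widths = [len(padded) for padded, _ in batch]
valid_counts = [sum(mask) for _, mask in batch]
print(widths, valid_counts)
```

Because every sample has the same width, robots with different degrees of freedom can be stacked in one batch, and the mask lets a loss ignore the padded entries.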
Training Recipe (Two-Stage)
Stage 1: VLM Pretraining with FAST Tokens
Continuous actions are tokenized with FAST (per-dimension statistics, encoding, padding). The training sequence is built from image tokens, state-conditioned instruction tokens (the state is discretized and serialized), annotation tokens, and FAST action tokens.
The VLM is trained with a masked next-token prediction loss over this sequence.
When annotation targets are available, an annotation loss is combined with the action-token loss to form the VLM loss.
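To make the masked next-token objective concrete, here is a toy calculation. It is a sketch under stated assumptions: the segment order, the choice to supervise only annotation and action positions, and the weight `LAMBDA_ANN` are all hypothetical, and the probabilities are made-up numbers rather than model outputs.

```python
import math


def masked_nll(correct_token_probs, loss_mask):
    """Mean negative log-likelihood over positions whose mask entry is 1.

    correct_token_probs[t] is the probability the model assigns to the
    ground-truth token at position t.
    """
    total = sum(-math.log(p) for p, m in zip(correct_token_probs, loss_mask) if m)
    count = sum(1 for m in loss_mask if m)
    return total / count if count else 0.0


# Toy sequence: image, instruction, annotation, then FAST action tokens.
segments = ["image"] * 3 + ["instruction"] * 2 + ["annotation"] * 2 + ["action"] * 3
probs = [0.9, 0.9, 0.9, 0.8, 0.8, 0.25, 0.25, 0.5, 0.5, 0.5]

action_mask = [1 if s == "action" else 0 for s in segments]
annotation_mask = [1 if s == "annotation" else 0 for s in segments]

loss_action = masked_nll(probs, action_mask)          # mean of -log(0.5)
loss_annotation = masked_nll(probs, annotation_mask)  # mean of -log(0.25)

LAMBDA_ANN = 1.0  # assumed weighting; the paper's value is not given here
loss_vlm = loss_action + LAMBDA_ANN * loss_annotation
print(loss_action, loss_annotation, loss_vlm)
```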
Stage 2: Flow Matching Posttraining
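The DiT action expert is attached to the pretrained VLM. For a sampled flow time, a noisy action chunk and its target velocity are formed from the ground-truth action chunk and sampled noise.
The DiT predicts this velocity from the noisy chunk, the flow time, and the VLM hidden states.
The flow loss is a masked mean squared error (MSE) between the predicted and target velocities.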
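The training target for flow matching can be sketched in a few lines. This example assumes one common linear-interpolation convention, with pure noise at flow time 0 and the clean action at flow time 1, so the target velocity is `action - noise`; the paper's exact convention and time sampling are not recoverable from this summary. The action chunk is flattened to a single list for simplicity.

```python
import random


def make_flow_pair(action, noise, tau):
    """Return (noisy_action, target_velocity) for flow time tau in [0, 1].

    Assumed convention: x_tau = (1 - tau) * noise + tau * action,
    so the velocity along the path is action - noise.
    """
    noisy = [(1.0 - tau) * e + tau * a for a, e in zip(action, noise)]
    velocity = [a - e for a, e in zip(action, noise)]
    return noisy, velocity


def masked_mse(pred, target, mask):
    """Mean squared error over entries whose mask is 1 (padding is ignored)."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    return num / den if den else 0.0


rng = random.Random(0)
action = [0.5, -0.2, 0.0, 0.0]  # last two entries are padding
mask = [1, 1, 0, 0]
noise = [rng.gauss(0.0, 1.0) for _ in action]
tau = rng.random()

noisy, target = make_flow_pair(action, noise, tau)

# A prediction that is wrong only in padded dimensions incurs no loss.
pred_wrong_in_padding = target[:2] + [9.0, 9.0]
loss_padding_only = masked_mse(pred_wrong_in_padding, target, mask)

# A prediction off by 1.0 in one valid dimension gives 1.0 / 2 valid dims.
pred_off_by_one = [target[0] + 1.0] + target[1:]
loss_off_by_one = masked_mse(pred_off_by_one, target, mask)
print(loss_padding_only, loss_off_by_one)
```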
Knowledge Insulation
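A stop-gradient is applied to the VLM hidden states before they enter DiT cross-attention, so the flow loss does not update the VLM representations.
The joint objective combines the VLM loss with the flow loss.
Inference: the action chunk is generated by Euler integration of the predicted velocity over a fixed number of steps.
The policy outputs the leading continuous actions of the generated chunk.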
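Euler integration at inference can be illustrated with an oracle velocity field standing in for the trained DiT. The sketch assumes integration from noise at flow time 0 to the action at flow time 1; the step count (10) and the number of executed actions (2) are placeholders, since the summary does not preserve the paper's values.

```python
def euler_integrate(velocity_fn, x0, num_steps):
    """Integrate dx/dtau = velocity_fn(x, tau) from tau = 0 to tau = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_chunk = [0.5, -0.2, 0.3, 0.1]  # flattened toy action chunk
noise = [1.0, -1.0, 0.5, 0.0]         # starting sample


def oracle_velocity(x, tau):
    """Stand-in for the DiT: the exact constant velocity of a straight path."""
    return [a - e for a, e in zip(target_chunk, noise)]


chunk = euler_integrate(oracle_velocity, noise, num_steps=10)
executed = chunk[:2]  # only the leading actions are sent to the robot
print(chunk, executed)
```

With a perfectly straight path the Euler solution is exact up to floating-point error; a learned velocity field is only approximately straight, which is why the number of integration steps matters in practice.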
Empirical Validation / Results
LabUtopia Benchmark
Evaluated on six tasks (Pick Up, Press Button, Open Door, Pour Liquid, Heat Beaker, Transport Beaker) under in-distribution (ID) and out-of-distribution (OOD) settings. 120 episodes per task per setting.
Table 2: Success rates (%) on LabUtopia tasks.
| Method | Size | Pick Up | Press Button | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg |
|---|---|---|---|---|---|---|---|---|
| In-Distribution | | | | | | | | |
| SmolVLA | <1B | 15.8 | 97.5 | 16.7 | 0.8 | 96.7 | 85.8 | 52.2 |
| X-VLA | <1B | 27.5 | 98.3 | 65.0 | 45.0 | 25.8 | 83.3 | 57.5 |
| GR00T N1.5 | 3B | 40.8 | 99.2 | 6.7 | 0 | 99.2 | 69.2 | 52.5 |
| π0 | 3B | 21.7 | 92.5 | 51.6 | 37.5 | 90.0 | 86.7 | 63.3 |
| π0.5 | 3B | 38.3 | 60.0 | 55.8 | 29.2 | 40.8 | 90.0 | 52.4 |
| π0-FAST | 3B | 16.7 | 37.5 | 17.5 | 5.8 | 3.3 | 20.8 | 16.9 |
| InternVLA-A1 | 3B | 25.8 | 93.3 | 38.3 | 2.5 | 82.5 | 67.5 | 51.7 |
| Wall-oss-flow | 4B | 11.7 | 54.2 | 0.83 | 0 | 0 | 29.2 | 16.0 |
| LabVLA (ours) | 4B | 49.2 | 100 | 65.0 | 43.3 | 83.3 | 85.8 | 71.1 |
| Out-of-Distribution | | | | | | | | |
| SmolVLA | <1B | 11.7 | 99.2 | 18.3 | 1.67 | 98.3 | 89.2 | 53.1 |
| X-VLA | <1B | 27.5 | 99.2 | 59.2 | 25.0 | 39.2 | 67.5 | 52.9 |
| GR00T N1.5 | 3B | 33.3 | 92.5 | 8.3 | 0 | 99.2 | 66.7 | 50.0 |
| π0 | 3B | 19.2 | 89.1 | 53.3 | 38.3 | 90.8 | 88.3 | 63.2 |
| π0.5 | 3B | 30.0 | 68.3 | 59.2 | 29.2 | 40.0 | 85.8 | 52.1 |
| π0-FAST | 3B | 14.2 | 45.0 | 15.8 | 7.5 | 11.7 | 24.2 | 19.7 |
| InternVLA-A1 | 3B | 19.2 | 95.8 | 63.3 | 0.83 | 84.2 | 57.5 | 53.5 |
| Wall-oss-flow | 4B | 7.5 | 61.7 | 0 | 0 | 0 | 26.7 | 16.0 |
| LabVLA (ours) | 4B | 48.3 | 98.3 | 65.8 | 34.2 | 87.5 | 85.8 | 70.0 |
Key findings:
- LabVLA achieves highest average in both ID (71.1%) and OOD (70.0%).
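- LabVLA leads on Pick Up in both settings and on Open Door OOD (65.8%); in ID it ties X-VLA on Open Door (65.0%) and has the best Press Button result (100%).
- Pour Liquid is LabVLA's weakest task (43.3% ID, 34.2% OOD), and no policy exceeds 45.0% on it in either setting.
- LabVLA's average drops by only 1.1 pp from ID to OOD, consistent with generalization gained from domain randomization.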
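The headline averages and margins can be re-derived from the per-task rows of Table 2. The short script below only recomputes numbers already in the table; the key `"pi0"` is an ASCII stand-in for π0.

```python
TASKS = ["Pick Up", "Press Button", "Open Door",
         "Pour Liquid", "Heat Beaker", "Transport Beaker"]

# Per-task success rates (%) copied from Table 2, in TASKS order.
rates = {
    ("LabVLA", "ID"):  [49.2, 100.0, 65.0, 43.3, 83.3, 85.8],
    ("LabVLA", "OOD"): [48.3, 98.3, 65.8, 34.2, 87.5, 85.8],
    ("pi0", "ID"):     [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
    ("pi0", "OOD"):    [19.2, 89.1, 53.3, 38.3, 90.8, 88.3],
}


def mean(values):
    return sum(values) / len(values)


avg = {key: mean(vals) for key, vals in rates.items()}
margin = {s: avg[("LabVLA", s)] - avg[("pi0", s)] for s in ("ID", "OOD")}
id_to_ood_drop = avg[("LabVLA", "ID")] - avg[("LabVLA", "OOD")]

for setting in ("ID", "OOD"):
    print(setting,
          round(avg[("LabVLA", setting)], 1),
          round(avg[("pi0", setting)], 1),
          round(margin[setting], 1))
print("LabVLA ID->OOD drop:", round(id_to_ood_drop, 1))
```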
Transferability of LabEmbodied-Data
Fine-tuning X-VLA (<1B) on LabEmbodied-Data improves its average over the five tasks in Table 3 (Press Button is not included) by 15.0 pp ID and 19.3 pp OOD.
Table 3: Transferability of LabEmbodied-Data to X-VLA.
| Method | Size | Pick Up | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg | Δ |
|---|---|---|---|---|---|---|---|---|
| In-Distribution | | | | | | | | |
| X-VLA | <1B | 27.5 | 65.0 | 45.0 | 25.8 | 83.3 | 49.3 | — |
| X-VLA + LabEmbodied | <1B | 26.7 | 69.2 | 59.2 | 68.3 | 98.3 | 64.3 | +15.0 |
| Out-of-Distribution | | | | | | | | |
| X-VLA | <1B | 27.5 | 59.2 | 25.0 | 39.2 | 67.5 | 43.7 | — |
| X-VLA + LabEmbodied | <1B | 31.7 | 63.3 | 65.0 | 65.0 | 90.0 | 63.0 | +19.3 |
Largest gains on Heat Beaker (ID: 25.8%→68.3%) and Pour Liquid (OOD: 25.0%→65.0%).
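The per-task gains behind these statements can be recomputed directly from Table 3. The script uses only the table's values and reports, for each setting, the task with the largest gain and the mean gain.

```python
# Success rates (%) copied from Table 3.
baseline = {
    "ID":  {"Pick Up": 27.5, "Open Door": 65.0, "Pour Liquid": 45.0,
            "Heat Beaker": 25.8, "Transport Beaker": 83.3},
    "OOD": {"Pick Up": 27.5, "Open Door": 59.2, "Pour Liquid": 25.0,
            "Heat Beaker": 39.2, "Transport Beaker": 67.5},
}
finetuned = {
    "ID":  {"Pick Up": 26.7, "Open Door": 69.2, "Pour Liquid": 59.2,
            "Heat Beaker": 68.3, "Transport Beaker": 98.3},
    "OOD": {"Pick Up": 31.7, "Open Door": 63.3, "Pour Liquid": 65.0,
            "Heat Beaker": 65.0, "Transport Beaker": 90.0},
}

# Gain in percentage points for each task and setting.
gains = {
    setting: {task: round(finetuned[setting][task] - baseline[setting][task], 1)
              for task in baseline[setting]}
    for setting in baseline
}
largest = {s: max(gains[s], key=gains[s].get) for s in gains}
mean_gain = {s: round(sum(gains[s].values()) / len(gains[s]), 1) for s in gains}

for setting in ("ID", "OOD"):
    print(setting, gains[setting], largest[setting], mean_gain[setting])
```

Note that Pick Up is the one task where fine-tuning does not help in ID (27.5% to 26.7%), so the average gain is driven mainly by Heat Beaker, Pour Liquid, and Transport Beaker.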
Real Robot Experiments
Deployed on physical Franka platform. Four tasks (Shake Liquid