Summary (Overview)

  • Contribution: LabVLA is a Vision-Language-Action (VLA) model designed for scientific laboratory automation, trained primarily on synthetic data to execute fixed multistep protocols.
  • Data Engine: RoboGenesis, a simulation-based workflow and data engine built on Isaac Sim, generates a large-scale, annotated laboratory demonstration corpus (LabEmbodied-Data) with cross-embodiment support, success filtering, and domain randomization.
  • Training Recipe: LabVLA uses a two-stage training pipeline: (1) FAST action token pretraining on the Qwen3-VL-4B-Instruct backbone to align vision-language representations with action semantics, and (2) flow matching posttraining with a DiT action expert under knowledge insulation to avoid interference with VLM representations.
  • Benchmark Results: On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all baselines: 71.1% in-distribution and 70.0% out-of-distribution, outperforming π0 by 7.8 and 6.8 pp respectively.
  • Real-World Transfer: LabVLA demonstrates sim-to-real transfer on a physical Franka platform across four benchtop tasks, reaching an 86.5% clean in-domain average, comparable to DreamZero.

Introduction and Theoretical Foundation

Scientific laboratories increasingly rely on AI for literature search, hypothesis generation, and protocol planning, but physical execution of experiments (e.g., pipetting, heating, stirring) still requires a human operator. The gap between digital scientific reasoning and real experimental work is one of embodiment.

Vision-Language-Action (VLA) models can interface written protocols with robot execution, but existing VLA policies are trained on household and tabletop demonstrations (e.g., Open X-Embodiment, DROID, BridgeData V2) and lack exposure to laboratory-specific instruments (pipettes, centrifuges, thermal cyclers), transparent liquids, or fixed protocol workflows.

The authors identify data and embodiment as central bottlenecks alongside model design. Laboratory manipulation differs from tabletop manipulation in failure modes (fine spatial precision, contact control, physical state changes like liquid flow and heating). Collecting real laboratory data is expensive due to specialized instruments, domain supervision, and safety procedures.

To address these bottlenecks, the paper presents:

  • RoboGenesis: a programmable simulation-based data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across 16 robot profiles.
  • LabVLA: a VLA policy trained with FAST action token pretraining (to make the VLM backbone action-aware) and flow matching posttraining with a DiT action expert under knowledge insulation (stop-gradient to protect VLM representations).

Methodology

RoboGenesis: Data Engine

RoboGenesis operates in three stages (Figure 2 in paper):

1. Environment Building

  • Asset generation pipeline: Text descriptions → structured prompts → text-to-image API → TRELLIS 2.0 (image-to-3D mesh) → postprocessing (USD format, PBR textures, collision mesh, URDF with mass/friction). Result: LabAssetLibrary of 2,947 annotated assets.
  • Automated scene construction: Greedy placement pipeline with six passes (main table, walls with lab counters, bench clusters, floor equipment, shelves/glassware, yaw correction). Ten validation checks; scenes below threshold rejected. Generated 10,000 lab scenes with diversity across topologies, themes, and clutter.
  • Robot profiles: 16 single-arm, bimanual, and mobile manipulator configurations (Franka Panda, FR3, UR-series, Piper, Rizon4, Festo, ARX X5/R5, Split ALOHA, Lift2, FR3 Duo, Ridgebase variants) stored independently from scenes/protocols.

2. Agentic Workflow Generation

  • Atomic skill library: pick, place, pour, stir, shake, move, press, pressZ, open, close, navigation skills. Extensible.
  • Workflow authoring: Agent-assisted path from natural language instruction to executable YAML workflow; offline validator checks reachability, conflicts, repick risks. Manual YAML path also available.
  • Domain randomization: Six configurable axes (scene, camera, lighting, object, spatial, clutter) applied after validation; instruction paraphrasing. Randomization never rewrites protocol semantics.
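As a rough illustration of how the randomization step can vary an episode without touching the protocol, the following sketch samples one value per enabled axis and returns the workflow unchanged. The axis names come from the list above; `Workflow`, `AXES`, `randomize`, and all value ranges are hypothetical and are not part of the RoboGenesis API.

```python
import random
from dataclasses import dataclass

# Axis names follow the six axes listed above; the candidate values are
# illustrative placeholders, not RoboGenesis defaults.
AXES = {
    "scene": ["lab_a", "lab_b", "lab_c"],
    "camera": [-5.0, 0.0, 5.0],          # yaw jitter in degrees
    "lighting": [0.6, 1.0, 1.4],         # intensity scale
    "object": ["beaker_50ml", "beaker_100ml"],
    "spatial": [-0.02, 0.0, 0.02],       # object offset in metres
    "clutter": [0, 2, 4],                # number of distractor objects
}

@dataclass(frozen=True)
class Workflow:
    instruction: str
    steps: tuple  # ordered atomic skills; never modified by randomization

def randomize(workflow, enabled_axes, rng):
    """Sample one value per enabled axis; the workflow itself is returned as-is."""
    variation = {axis: rng.choice(AXES[axis]) for axis in enabled_axes}
    return workflow, variation

workflow = Workflow("Pour the liquid into the beaker.",
                    ("pick", "move", "pour", "place"))
rng = random.Random(0)
episodes = [randomize(workflow, ["camera", "lighting", "clutter"], rng)
            for _ in range(3)]
```

Every sampled episode shares the same `steps` tuple, so variation is confined to scene-level parameters, which is the property the bullet above describes.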

3. Knowledgeable LabEmbodied-Data

  • Exports only successful rollouts (per-skill success checkers: grasp stability, liquid transfer, position tolerance; contact safety monitor).
  • Each episode stores: multicamera RGB, robot joint states, actions, language instruction. 15 annotation providers (robot state, camera intrinsics/extrinsics, step timing, instruction alignment, object state, scene relations, object semantics, success explanation, collision events, temporal segments, subgoals, quality scores, intervention flags, episode metadata).
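The success filter described above can be pictured as a per-skill checker lookup plus a contact-safety gate. This is a minimal sketch under assumptions: the checker functions, measurement field names, and thresholds are invented for illustration and do not come from the paper.

```python
# Hypothetical per-skill success checkers; measurement names and thresholds
# are illustrative assumptions, not values from the paper.
def check_pick(m):
    return m["grasp_held"]                              # grasp stability

def check_pour(m):
    return m["transferred_ml"] >= 0.9 * m["target_ml"]  # liquid transfer

def check_place(m):
    return m["position_error_m"] <= 0.01                # position tolerance

CHECKERS = {"pick": check_pick, "pour": check_pour, "place": check_place}

def is_exportable(rollout):
    """Export only if no contact violation occurred and every step
    passes the checker registered for its skill."""
    if rollout["contact_violation"]:
        return False
    return all(CHECKERS[step["skill"]](step["measurements"])
               for step in rollout["steps"])

rollouts = [
    {"id": "ep0", "contact_violation": False, "steps": [
        {"skill": "pick", "measurements": {"grasp_held": True}},
        {"skill": "pour", "measurements": {"transferred_ml": 48.0, "target_ml": 50.0}},
        {"skill": "place", "measurements": {"position_error_m": 0.004}},
    ]},
    {"id": "ep1", "contact_violation": False, "steps": [
        {"skill": "pick", "measurements": {"grasp_held": True}},
        {"skill": "pour", "measurements": {"transferred_ml": 20.0, "target_ml": 50.0}},
    ]},
    {"id": "ep2", "contact_violation": True, "steps": [
        {"skill": "pick", "measurements": {"grasp_held": True}},
    ]},
]

exported = [r["id"] for r in rollouts if is_exportable(r)]
```

Here `ep1` is dropped because the pour step transfers too little liquid, and `ep2` is dropped by the contact-safety gate regardless of its step results.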

LabVLA Architecture

LabVLA pairs a Qwen3-VL-4B-Instruct backbone with a DiT action expert (18 layers, width 1024, 8 attention heads, head dimension 128). The VLM encodes up to $V$ RGB views $I_t^{1:V}$, a language instruction $\ell$, and robot state $q_t^r$, producing hidden states:

$$H_\phi = f_\phi(I_t^{1:V}, \ell) \in \mathbb{R}^{L_h \times d_{vlm}}$$

A linear projection $\Pi$ maps $H_\phi$ to the DiT width. The action expert predicts a $K$-step continuous action chunk:

$$A_t^r = [a_t^r, \dots, a_{t+K-1}^r] \in \mathbb{R}^{K \times d^r}$$

Embodiment-agnostic batch format: state and action vectors are padded to $d_{\max}$ with a valid mask $M^{\text{act}}$, so all datasets share a single batch format.
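The padding scheme can be sketched in a few lines. The helper names, the choice of zero as the pad value, and `D_MAX = 16` are assumptions made for illustration; the source only states that vectors are padded to a common width and accompanied by a valid mask.

```python
def pad_with_mask(vec, d_max):
    """Right-pad a state or action vector to d_max and return (padded, mask)."""
    if len(vec) > d_max:
        raise ValueError("vector is wider than d_max")
    pad = d_max - len(vec)
    # Zero padding is an assumption; the mask marks which entries are real.
    return list(vec) + [0.0] * pad, [1] * len(vec) + [0] * pad

def pad_chunk(chunk, d_max):
    """Pad every step of a K x d^r action chunk to K x d_max, with a mask."""
    padded, mask = zip(*(pad_with_mask(step, d_max) for step in chunk))
    return list(padded), list(mask)

D_MAX = 16  # assumed value for illustration only

single_arm = [[0.1] * 7, [0.2] * 7]    # K=2 steps, 7-dim actions
bimanual = [[0.3] * 14, [0.4] * 14]    # K=2 steps, 14-dim actions

arm_values, arm_mask = pad_chunk(single_arm, D_MAX)
bi_values, bi_mask = pad_chunk(bimanual, D_MAX)
```

Both embodiments end up with rows of width `D_MAX`, so they can be stacked into one batch while the mask keeps padded dimensions out of any loss.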

Training Recipe (Two-Stage)

Stage 1: VLM Pretraining with FAST Tokens

Continuous actions are tokenized with FAST (per-dimension statistics, encoding, padding). Sequence:

$$X_{\text{pre}} = [v_t; c_t; y_t; z_{1:L_z}]$$

where $v_t$ = image tokens, $c_t$ = state-conditioned instruction tokens (state discretized and serialized), $y_t$ = annotation tokens, and $z_{1:L_z}$ = FAST action tokens.

Masked next-token prediction loss:

$$\mathcal{L}_{\text{FAST}} = -\frac{1}{\sum_{i=1}^{L_z} m_i} \sum_{i=1}^{L_z} m_i \log p_\phi(z_i \mid v_t, c_t, y_t, z_{<i})$$
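To make the masking concrete, here is a small numeric sketch of the loss. It takes the per-token probabilities $p_\phi(z_i \mid \cdot)$ as given numbers rather than computing them from a model; the function name and the zero-mask guard are assumptions added for the example.

```python
import math

def masked_fast_loss(token_probs, mask):
    """Mean negative log-likelihood over the tokens selected by the mask.

    token_probs[i] stands in for p_phi(z_i | v_t, c_t, y_t, z_<i).
    """
    n_valid = sum(mask)
    if n_valid == 0:
        return 0.0  # guard added for the sketch; the formula is undefined here
    nll = -sum(m * math.log(p) for p, m in zip(token_probs, mask))
    return nll / n_valid

# Three action tokens; the last one is padding and is masked out.
probs = [0.5, 0.25, 0.9]
mask = [1, 1, 0]
loss = masked_fast_loss(probs, mask)  # equals 1.5 * ln(2)
```

The masked token contributes nothing to either the numerator or the normalizer, so padding does not dilute the average.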

Combined VLM loss (if annotation targets available):

$$\mathcal{L}_{\text{VLM}} = \mathcal{L}_{\text{FAST}} + \sum_j \lambda_j \mathcal{L}_{\text{CE}}^{(j)}$$

Stage 2: Flow Matching Posttraining

Attach the DiT action expert. For flow time $\tau$, form the noisy action $X_\tau$ and target velocity $U_\tau$:

$$X_\tau = \tau \tilde{A}_t^r + (1-\tau)\epsilon, \quad U_\tau = \tilde{A}_t^r - \epsilon$$
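The construction of one training pair can be sketched as follows. Standard Gaussian noise for $\epsilon$ and a uniform draw for $\tau$ are assumptions of this example (the summary does not specify either), and `flow_pair` is a hypothetical helper name.

```python
import random

def flow_pair(action_chunk, tau, rng):
    """Build (X_tau, U_tau, eps) for a K x d action chunk at flow time tau."""
    # Assumption: eps is standard Gaussian noise with the chunk's shape.
    eps = [[rng.gauss(0.0, 1.0) for _ in step] for step in action_chunk]
    x_tau = [[tau * a + (1.0 - tau) * e for a, e in zip(a_row, e_row)]
             for a_row, e_row in zip(action_chunk, eps)]
    u_tau = [[a - e for a, e in zip(a_row, e_row)]
             for a_row, e_row in zip(action_chunk, eps)]
    return x_tau, u_tau, eps

rng = random.Random(0)
chunk = [[0.1, -0.2], [0.3, 0.0]]   # K=2 steps, 2-dim actions
tau = rng.random()                  # assumption: tau drawn uniformly from [0, 1)
x_tau, u_tau, eps = flow_pair(chunk, tau, rng)
```

With this convention $\tau = 0$ gives pure noise and $\tau = 1$ gives the clean action, and $X_\tau + (1-\tau)U_\tau$ recovers the clean chunk at any $\tau$.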

DiT predicts:

$$V_\theta = g_\theta(X_\tau, \tau, q_t^r, \Pi(H_\phi))$$

Flow loss (masked MSE):

$$\mathcal{L}_{\text{FM}} = \begin{cases} S_M^{-1} \sum_{k,d} M_{k,d}^{\text{act}} (V_{\theta,k,d} - U_{\tau,k,d})^2, & S_M > 0 \\ 0, & S_M = 0 \end{cases}$$

where $S_M = \sum_{k,d} M_{k,d}^{\text{act}}$.
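A direct transcription of the masked MSE, including the $S_M = 0$ case, might look like the sketch below. The function name and the toy values are illustrative only.

```python
def masked_flow_loss(v_pred, u_target, mask):
    """Masked MSE over a K x d_max chunk; returns 0 when no entry is valid."""
    s_m = sum(sum(row) for row in mask)
    if s_m == 0:
        return 0.0
    squared_error = sum(
        m * (v - u) ** 2
        for v_row, u_row, m_row in zip(v_pred, u_target, mask)
        for v, u, m in zip(v_row, u_row, m_row)
    )
    return squared_error / s_m

# K=2 steps, d_max=3; the third dimension is padding and is masked out.
v_pred = [[1.0, 2.0, 9.0], [0.0, 0.0, 9.0]]
u_target = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[1, 1, 0], [1, 1, 0]]

loss = masked_flow_loss(v_pred, u_target, mask)  # (1 + 4 + 0 + 0) / 4 = 1.25
```

The large errors in the padded third dimension are ignored, which is what lets embodiments with different action widths share one loss.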

Knowledge Insulation

Stop-gradient on VLM hidden states before DiT cross-attention:

$$\tilde{H}_{\phi,p}^{\text{KI}} = \text{sg}(\text{slice}_p(f_\phi(X_{\text{KI}})))$$

Joint objective:

$$\mathcal{L}_{\text{KI}} = \alpha \mathcal{L}_{\text{FM}} + \mathcal{L}_{\text{FAST}} + \sum_j \lambda_j \mathcal{L}_{\text{CE}}^{(j)}, \quad \alpha = 10$$

Inference: Euler integration with $N=10$ steps:

$$X_{\tau+\Delta\tau} = X_\tau + \Delta\tau \cdot g_\theta(X_\tau, \tau, q_t^r, \Pi(H_\phi))$$

Outputs the first $K$ continuous actions.
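The sampling loop can be sketched with a toy velocity field standing in for the trained DiT $g_\theta$. Everything except the Euler update and $N=10$ is an assumption: the chunk is flattened to a vector, the conditioning on $q_t^r$ and $\Pi(H_\phi)$ is omitted, and `toy_velocity` simply points along the straight line toward a fixed target.

```python
def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dX/dtau = velocity_fn(X, tau) from tau=0 (noise) to tau=1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for g_theta: the velocity that carries any point on the
# straight noise-to-target path to the target at tau = 1.
target = [0.3, -0.1, 0.5]

def toy_velocity(x, tau):
    return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]

noise = [1.2, 0.4, -0.7]                  # stands in for a sampled X_0
sample = euler_sample(toy_velocity, noise)
```

Because the toy field is constant along each straight path, ten Euler steps land on `target` up to floating-point error; a learned field is only approximately straight, so the step count trades accuracy against inference latency.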

Empirical Validation / Results

LabUtopia Benchmark

Evaluated on six tasks (Pick Up, Press Button, Open Door, Pour Liquid, Heat Beaker, Transport Beaker) under in-distribution (ID) and out-of-distribution (OOD) settings. 120 episodes per task per setting.

Table 2: Success rates (%) on LabUtopia tasks.

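| Setting | Method | Size | Pick Up | Press Button | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg |
|---|---|---|---|---|---|---|---|---|---|
| ID | SmolVLA | <1B | 15.8 | 97.5 | 16.7 | 0.8 | 96.7 | 85.8 | 52.2 |
| ID | X-VLA | <1B | 27.5 | 98.3 | 65.0 | 45.0 | 25.8 | 83.3 | 57.5 |
| ID | GR00T N1.5 | 3B | 40.8 | 99.2 | 6.7 | 0 | 99.2 | 69.2 | 52.5 |
| ID | π0 | 3B | 21.7 | 92.5 | 51.6 | 37.5 | 90.0 | 86.7 | 63.3 |
| ID | π0.5 | 3B | 38.3 | 60.0 | 55.8 | 29.2 | 40.8 | 90.0 | 52.4 |
| ID | π0-FAST | 3B | 16.7 | 37.5 | 17.5 | 5.8 | 3.3 | 20.8 | 16.9 |
| ID | InternVLA-A1 | 3B | 25.8 | 93.3 | 38.3 | 2.50 | 82.5 | 67.5 | 51.7 |
| ID | Wall-oss-flow | 4B | 11.7 | 54.2 | 0.83 | 0 | 0 | 29.2 | 16.0 |
| ID | LabVLA (ours) | 4B | 49.2 | 100 | 65.0 | 43.3 | 83.3 | 85.8 | 71.1 |
| OOD | SmolVLA | <1B | 11.7 | 99.2 | 18.3 | 1.67 | 98.3 | 89.2 | 53.1 |
| OOD | X-VLA | <1B | 27.5 | 99.2 | 59.2 | 25.0 | 39.2 | 67.5 | 52.9 |
| OOD | GR00T N1.5 | 3B | 33.3 | 92.5 | 8.3 | 0 | 99.2 | 66.7 | 50.0 |
| OOD | π0 | 3B | 19.2 | 89.1 | 53.3 | 38.3 | 90.8 | 88.3 | 63.2 |
| OOD | π0.5 | 3B | 30.0 | 68.3 | 59.2 | 29.2 | 40.0 | 85.8 | 52.1 |
| OOD | π0-FAST | 3B | 14.2 | 45.0 | 15.8 | 7.5 | 11.7 | 24.2 | 19.7 |
| OOD | InternVLA-A1 | 3B | 19.2 | 95.8 | 63.3 | 0.83 | 84.2 | 57.5 | 53.5 |
| OOD | Wall-oss-flow | 4B | 7.50 | 61.7 | 0 | 0 | 0 | 26.7 | 16.0 |
| OOD | LabVLA (ours) | 4B | 48.3 | 98.3 | 65.8 | 34.2 | 87.5 | 85.8 | 70.0 |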

Key findings:

  • LabVLA achieves highest average in both ID (71.1%) and OOD (70.0%).
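  • LabVLA leads on Pick Up in both settings, ties X-VLA on Open Door ID (65.0%) and leads it OOD (65.8%), and is the only policy to reach 100% on Press Button ID.
  • Pour Liquid remains difficult for every policy: the best ID result is 45.0% (X-VLA), with LabVLA at 43.3%.
  • LabVLA's average drops by only 1.1 pp from ID to OOD (71.1% → 70.0%), consistent with generalization from domain randomization.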
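The headline averages and margins can be recomputed from the per-task rows of Table 2. The sketch below uses only the LabVLA and π0 rows as reported; variable names are illustrative.

```python
# Per-task success rates (%) from Table 2, in the order: Pick Up, Press Button,
# Open Door, Pour Liquid, Heat Beaker, Transport Beaker.
labvla = {"ID": [49.2, 100, 65.0, 43.3, 83.3, 85.8],
          "OOD": [48.3, 98.3, 65.8, 34.2, 87.5, 85.8]}
pi0 = {"ID": [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
       "OOD": [19.2, 89.1, 53.3, 38.3, 90.8, 88.3]}

def mean(values):
    return sum(values) / len(values)

labvla_avg = {s: round(mean(v), 1) for s, v in labvla.items()}
pi0_avg = {s: round(mean(v), 1) for s, v in pi0.items()}

# Margin over pi0 in percentage points, per setting.
margin = {s: round(labvla_avg[s] - pi0_avg[s], 1) for s in labvla}

# LabVLA's drop from in-distribution to out-of-distribution.
id_to_ood_drop = round(labvla_avg["ID"] - labvla_avg["OOD"], 1)
```

This reproduces the reported 71.1% and 70.0% averages, the 7.8 pp and 6.8 pp margins over π0, and the 1.1 pp ID-to-OOD drop.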

Transferability of LabEmbodied-Data

Fine-tuning X-VLA (<1B) on LabEmbodied-Data improves five-task average by +15.0 pp ID and +19.3 pp OOD.

Table 3: Transferability of LabEmbodied-Data to X-VLA.

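| Setting | Method | Size | Pick Up | Open Door | Pour Liquid | Heat Beaker | Transport Beaker | Avg | Δ |
|---|---|---|---|---|---|---|---|---|---|
| ID | X-VLA | <1B | 27.5 | 65.0 | 45.0 | 25.8 | 83.3 | 49.3 | |
| ID | X-VLA + LabEmbodied | <1B | 26.7 | 69.2 | 59.2 | 68.3 | 98.3 | 64.3 | +15.0 |
| OOD | X-VLA | <1B | 27.5 | 59.2 | 25.0 | 39.2 | 67.5 | 43.7 | |
| OOD | X-VLA + LabEmbodied | <1B | 31.7 | 63.3 | 65.0 | 65.0 | 90.0 | 63.0 | +19.3 |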

Largest gains on Heat Beaker (ID: 25.8%→68.3%) and Pour Liquid (OOD: 25.0%→65.0%).
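The per-task gains behind this statement follow directly from Table 3. The sketch below recomputes them from the reported rows; the variable names are illustrative.

```python
tasks = ["Pick Up", "Open Door", "Pour Liquid", "Heat Beaker", "Transport Beaker"]

# Success rates (%) from Table 3.
base = {"ID": [27.5, 65.0, 45.0, 25.8, 83.3],
        "OOD": [27.5, 59.2, 25.0, 39.2, 67.5]}
tuned = {"ID": [26.7, 69.2, 59.2, 68.3, 98.3],
         "OOD": [31.7, 63.3, 65.0, 65.0, 90.0]}

# Gain in percentage points for each task and setting.
gains = {
    setting: {task: round(after - before, 1)
              for task, before, after in zip(tasks, base[setting], tuned[setting])}
    for setting in base
}

# Task with the largest gain, and the mean gain across the five tasks.
largest = {s: max(gains[s], key=gains[s].get) for s in gains}
mean_gain = {s: round(sum(gains[s].values()) / len(tasks), 1) for s in gains}
```

The computation confirms Heat Beaker (+42.5 pp) as the largest ID gain and Pour Liquid (+40.0 pp) as the largest OOD gain. It also shows that the one regression is Pick Up ID, which falls by 0.8 pp.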

Real Robot Experiments

Deployed on physical Franka platform. Four tasks (Shake Liquid
