Visual Summary | LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Summary (Overview)

Contribution: LabVLA is a Vision-Language-Action (VLA) model designed for scientific laboratory automation, trained primarily on synthetic data to execute fixed multistep protocols.
Data Engine: RoboGenesis, a simulation-based workflow and data engine built on Isaac Sim, generates a large-scale, annotated laboratory demonstration corpus (LabEmbodied-Data) with cross-embodiment support, success filtering, and domain randomization.
Training Recipe: LabVLA uses a two-stage training pipeline: (1) FAST action token pretraining on the Qwen3-VL-4B-Instruct backbone to align vision-language representations with action semantics, and (2) flow matching posttraining with a DiT action expert under knowledge insulation to avoid interference with VLM representations.
Benchmark Results: On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all baselines: 71.1% in-distribution and 70.0% out-of-distribution, outperforming π0 by 7.8 and 6.8 pp respectively.
Real-World Transfer: LabVLA demonstrates sim-to-real transfer on a physical Franka platform across four benchtop tasks, reaching an 86.5% clean in-domain average, comparable to DreamZero.

Introduction and Theoretical Foundation

Scientific laboratories increasingly rely on AI for literature search, hypothesis generation, and protocol planning, but physical execution of experiments (e.g., pipetting, heating, stirring) still requires a human operator. The gap between digital scientific reasoning and real experimental work is one of embodiment.

Vision-Language-Action (VLA) models can interface written protocols with robot execution, but existing VLA policies are trained on household and tabletop demonstrations (e.g., Open X-Embodiment, DROID, BridgeData V2) and lack exposure to laboratory-specific instruments (pipettes, centrifuges, thermal cyclers), transparent liquids, or fixed protocol workflows.

The authors identify data and embodiment as central bottlenecks alongside model design. Laboratory manipulation fails differently from tabletop manipulation because it demands fine spatial precision, contact control, and handling of physical state changes such as liquid flow and heating. Collecting real laboratory data is expensive because it requires specialized instruments, domain supervision, and safety procedures.

To address these bottlenecks, the paper presents:

RoboGenesis: a programmable simulation-based data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across 16 robot profiles.
LabVLA: a VLA policy trained with FAST action token pretraining (to make the VLM backbone action-aware) and flow matching posttraining with a DiT action expert under knowledge insulation (stop-gradient to protect VLM representations).

Methodology

RoboGenesis: Data Engine

RoboGenesis operates in three stages (Figure 2 in paper):

1. Environment Building

Asset generation pipeline: Text descriptions → structured prompts → text-to-image API → TRELLIS 2.0 (image-to-3D mesh) → postprocessing (USD format, PBR textures, collision mesh, URDF with mass/friction). Result: LabAssetLibrary of 2,947 annotated assets.
Automated scene construction: Greedy placement pipeline with six passes (main table, walls with lab counters, bench clusters, floor equipment, shelves/glassware, yaw correction). Ten validation checks; scenes below threshold rejected. Generated 10,000 lab scenes with diversity across topologies, themes, and clutter.
Robot profiles: 16 single-arm, bimanual, and mobile manipulator configurations (Franka Panda, FR3, UR-series, Piper, Rizon4, Festo, ARX X5/R5, Split ALOHA, Lift2, FR3 Duo, Ridgebase variants) stored independently from scenes/protocols.

2. Agentic Workflow Generation

Atomic skill library: pick, place, pour, stir, shake, move, press, pressZ, open, close, navigation skills. Extensible.
Workflow authoring: Agent-assisted path from natural language instruction to executable YAML workflow; offline validator checks reachability, conflicts, repick risks. Manual YAML path also available.
Domain randomization: Six configurable axes (scene, camera, lighting, object, spatial, clutter) applied after validation; instruction paraphrasing. Randomization never rewrites protocol semantics.
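
The rule that randomization is applied after validation and never rewrites protocol semantics can be illustrated with a minimal sketch. Only the six axis names come from the summary; the workflow dictionary layout, the `randomize` function, and the per-axis seed values are hypothetical and do not reflect RoboGenesis's actual YAML schema.

```python
import copy
import random

# Axis names come from the summary; everything else here is hypothetical.
AXES = ("scene", "camera", "lighting", "object", "spatial", "clutter")


def randomize(workflow, enabled_axes, rng):
    """Return a copy of a validated workflow with per-axis random seeds attached.

    The protocol steps are copied untouched: randomization only adds
    presentation parameters, it never edits what the robot is asked to do.
    """
    unknown = set(enabled_axes) - set(AXES)
    if unknown:
        raise ValueError(f"unknown randomization axes: {sorted(unknown)}")
    variant = copy.deepcopy(workflow)
    variant["randomization"] = {
        axis: rng.randrange(2**31) for axis in AXES if axis in enabled_axes
    }
    return variant


# Toy workflow (illustrative field names).
workflow = {
    "instruction": "Pour the liquid into the beaker.",
    "steps": [
        {"skill": "pick", "object": "flask"},
        {"skill": "pour", "target": "beaker"},
    ],
}
variant = randomize(workflow, {"camera", "lighting"}, random.Random(0))
print(sorted(variant["randomization"]))
```

Keeping the randomization parameters in a separate key makes it straightforward to check that the `steps` list of every variant is identical to the validated original.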

3. Knowledgeable LabEmbodied-Data

Exports only successful rollouts (per-skill success checkers: grasp stability, liquid transfer, position tolerance; contact safety monitor).
Each episode stores: multicamera RGB, robot joint states, actions, language instruction. 15 annotation providers (robot state, camera intrinsics/extrinsics, step timing, instruction alignment, object state, scene relations, object semantics, success explanation, collision events, temporal segments, subgoals, quality scores, intervention flags, episode metadata).
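
The export rule (keep an episode only when every per-skill checker passes and the contact safety monitor reports no violation) can be sketched as below. The `Episode` record and its field names are hypothetical illustrations, and treating an episode with no recorded checks as a failure is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical rollout record; field names are illustrative only."""
    episode_id: str
    skill_checks: dict = field(default_factory=dict)  # checker name -> passed?
    contact_violation: bool = False  # set by a contact safety monitor


def is_exportable(ep: Episode) -> bool:
    # Assumption: an episode with no recorded checks is unverified and dropped.
    if not ep.skill_checks:
        return False
    return all(ep.skill_checks.values()) and not ep.contact_violation


def filter_successful(episodes):
    """Keep only rollouts that pass every checker and the safety monitor."""
    return [ep for ep in episodes if is_exportable(ep)]


rollouts = [
    Episode("ep0", {"grasp_stability": True, "liquid_transfer": True}),
    Episode("ep1", {"grasp_stability": True, "position_tolerance": False}),
    Episode("ep2", {"grasp_stability": True}, contact_violation=True),
]
kept_ids = [ep.episode_id for ep in filter_successful(rollouts)]
print(kept_ids)
```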

LabVLA Architecture

LabVLA pairs a Qwen3-VL-4B-Instruct backbone with a DiT action expert (18 layers, width 1024, 8 attention heads, head dimension 128). The VLM encodes up to $V$ RGB views $I_t^{1:V}$, a language instruction $\ell$, and robot state $q_t^r$, producing hidden states:

H_\phi = f_\phi(I_t^{1:V}, \ell) \in \mathbb{R}^{L_h \times d_{vlm}}

A linear projection $\Pi$ maps $H_\phi$ to DiT width. The action expert predicts a $K$-step continuous action chunk:

A_t^r = [a_t^r, \dots, a_{t+K-1}^r] \in \mathbb{R}^{K \times d^r}

Embodiment-agnostic batch format: state/action vectors are padded to $d_{\max}$ with a valid mask $M^{\text{act}}$, so all datasets share a single batch format.
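
A minimal sketch of this padding scheme follows, assuming zero padding and a binary mask with 1 on valid dimensions. The function names and the toy dimensions are illustrative, not taken from the paper.

```python
def pad_with_mask(vec, d_max):
    """Pad a 1-D state/action vector to length d_max; return (padded, mask)."""
    if len(vec) > d_max:
        raise ValueError("vector is longer than d_max")
    n_pad = d_max - len(vec)
    padded = list(vec) + [0.0] * n_pad      # assumption: pad with zeros
    mask = [1] * len(vec) + [0] * n_pad     # 1 = valid dimension, 0 = padding
    return padded, mask


def pad_chunk(chunk, d_max):
    """Pad every step of a K-step action chunk (K x d_r) to K x d_max."""
    pairs = [pad_with_mask(step, d_max) for step in chunk]
    return [p for p, _ in pairs], [m for _, m in pairs]


# Toy example: a 3-DoF action chunk with K=2 steps, padded to d_max=5.
chunk = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded_chunk, mask_chunk = pad_chunk(chunk, d_max=5)
print(padded_chunk)
print(mask_chunk)
```

With this layout, robots of different action dimension can sit in the same batch, and any loss over actions multiplies by the mask so padded entries contribute nothing.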

Training Recipe (Two-Stage)

Stage 1: VLM Pretraining with FAST Tokens

Continuous actions are tokenized with FAST (per-dimension statistics, encoding, padding). Sequence:

X_{\text{pre}} = [v_t; c_t; y_t; z_{1:L_z}]

where $v_t$ = image tokens, $c_t$ = state-conditioned instruction tokens (state discretized and serialized), $y_t$ = annotation tokens, $z_{1:L_z}$ = FAST action tokens.

Masked next-token prediction loss:

\mathcal{L}_{\text{FAST}} = -\frac{1}{\sum_{i=1}^{L_z} m_i} \sum_{i=1}^{L_z} m_i \log p_\phi(z_i | v_t, c_t, y_t, z_{<i})

Combined VLM loss (if annotation targets available):

\mathcal{L}_{\text{VLM}} = \mathcal{L}_{\text{FAST}} + \sum_j \lambda_j \mathcal{L}_{\text{CE}}^{(j)}
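
The two Stage 1 losses can be worked through numerically. In this sketch, `token_probs[i]` stands in for $p_\phi(z_i | v_t, c_t, y_t, z_{<i})$ as it would come from the model; the probabilities, the mask, and the single auxiliary term with $\lambda = 0.5$ are made-up values. Returning 0 for an all-zero mask is an assumption of the sketch.

```python
import math


def masked_fast_loss(token_probs, mask):
    """Mean negative log-likelihood over positions where the binary mask is 1."""
    if len(token_probs) != len(mask):
        raise ValueError("token_probs and mask must have the same length")
    n_valid = sum(mask)
    if n_valid == 0:
        return 0.0  # assumption: fully masked sequences contribute no loss
    # Skip masked positions entirely so padded probabilities are never logged.
    log_sum = sum(math.log(p) for p, m in zip(token_probs, mask) if m)
    return -log_sum / n_valid


def vlm_loss(l_fast, ce_terms):
    """L_VLM = L_FAST + sum_j lambda_j * L_CE^(j); ce_terms is [(lambda_j, loss_j)]."""
    return l_fast + sum(lam * ce for lam, ce in ce_terms)


probs = [0.5, 0.25, 0.9]   # made-up per-token probabilities
mask = [1, 1, 0]           # third token is padding
l_fast = masked_fast_loss(probs, mask)
l_vlm = vlm_loss(l_fast, [(0.5, 2.0)])  # one hypothetical annotation loss
print(l_fast, l_vlm)
```

Here the masked loss is $-(\ln 0.5 + \ln 0.25)/2 = 1.5 \ln 2$, and the third token is ignored regardless of its probability.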

Stage 2: Flow Matching Posttraining

Attach DiT action expert. For flow time $\tau$, form noisy action $X_\tau$ and target velocity $U_\tau$:

X_\tau = \tau \tilde{A}_t^r + (1-\tau)\epsilon \quad , \quad U_\tau = \tilde{A}_t^r - \epsilon

DiT predicts:

V_\theta = g_\theta(X_\tau, \tau, q_t^r, \Pi(H_\phi))

Flow loss (masked MSE):

\mathcal{L}_{\text{FM}} = \begin{cases} S_M^{-1} \sum_{k,d} M_{k,d}^{\text{act}} (V_{\theta,k,d} - U_{\tau,k,d})^2, & S_M > 0 \\ 0, & S_M = 0 \end{cases}

where $S_M = \sum_{k,d} M_{k,d}^{\text{act}}$.
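
The interpolation and the masked flow loss can be checked on a toy chunk. This is a sketch in plain Python with $K=1$ and three action dimensions, the last of which is padding; the `pred` values stand in for the DiT output $V_\theta$ and are made up.

```python
def flow_pair(action, noise, tau):
    """Return (X_tau, U_tau) for a K x d action chunk and matching noise."""
    x_tau = [[tau * a + (1 - tau) * e for a, e in zip(a_row, e_row)]
             for a_row, e_row in zip(action, noise)]
    u_tau = [[a - e for a, e in zip(a_row, e_row)]
             for a_row, e_row in zip(action, noise)]
    return x_tau, u_tau


def masked_flow_loss(pred, target, mask):
    """Masked MSE: sum of squared errors on valid entries divided by S_M."""
    s_m = sum(sum(row) for row in mask)
    if s_m == 0:
        return 0.0  # matches the S_M = 0 case of the loss definition
    total = sum(m * (p - t) ** 2
                for p_row, t_row, m_row in zip(pred, target, mask)
                for p, t, m in zip(p_row, t_row, m_row))
    return total / s_m


action = [[1.0, 2.0, 0.0]]   # last dimension is padding
noise = [[0.0, 1.0, 5.0]]
mask = [[1, 1, 0]]
x_tau, u_tau = flow_pair(action, noise, tau=0.25)
pred = [[1.0, 0.0, 100.0]]   # made-up velocity prediction
loss = masked_flow_loss(pred, u_tau, mask)
print(x_tau, u_tau, loss)
```

The large error on the padded dimension does not affect the loss: only the two valid entries are averaged, giving $(0 + 1)/2 = 0.5$.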

Knowledge Insulation

Stop-gradient on VLM hidden states before DiT cross-attention:

\tilde{H}_{\phi,p}^{\text{KI}} = \text{sg}(\text{slice}_p(f_\phi(X_{\text{KI}})))

Joint objective:

\mathcal{L}_{\text{KI}} = \alpha \mathcal{L}_{\text{FM}} + \mathcal{L}_{\text{FAST}} + \sum_j \lambda_j \mathcal{L}_{\text{CE}}^{(j)}, \quad \alpha = 10

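Inference: Euler integration with $N=10$ steps, starting from noise at $\tau=0$ and ending at $\tau=1$ (consistent with the interpolation $X_\tau$ above):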

X_{\tau+\Delta\tau} = X_\tau + \Delta\tau \cdot g_\theta(X_\tau, \tau, q_t^r, \Pi(H_\phi))

Outputs first $K$ continuous actions.
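
The Euler update can be sketched with a toy velocity field in place of the trained DiT. Assumptions in this sketch: uniform steps $\Delta\tau = 1/N$ over $[0, 1]$, a flattened action vector, and no state or VLM conditioning. The stand-in `toy_velocity` is the exact velocity of the linear path toward a single known target, which is not how the learned $g_\theta$ works but makes the result easy to verify.

```python
def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dX/dtau = velocity_fn(X, tau) from tau=0 to tau=1 with Euler steps."""
    dt = 1.0 / n_steps
    x = list(x0)
    for i in range(n_steps):
        tau = i * dt
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [1.0, -2.0, 0.5]  # toy "clean" action the flow should reach


def toy_velocity(x, tau):
    # For X_tau = tau*A + (1-tau)*eps, the velocity A - eps equals (A - X_tau)/(1 - tau).
    return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]


noise = [0.3, 0.9, -1.2]   # plays the role of X_0 = epsilon
sample = euler_sample(toy_velocity, noise, n_steps=10)
print(sample)
```

Because the toy field is constant along each straight path, ten Euler steps land on the target up to floating-point error. A learned velocity field is only approximately straight, so the step count trades accuracy against latency.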

Empirical Validation / Results

LabUtopia Benchmark

Evaluated on six tasks (Pick Up, Press Button, Open Door, Pour Liquid, Heat Beaker, Transport Beaker) under in-distribution (ID) and out-of-distribution (OOD) settings. 120 episodes per task per setting.

Table 2: Success rates (%) on LabUtopia tasks.

Method	Size	Pick Up	Press Button	Open Door	Pour Liquid	Heat Beaker	Transport Beaker	Avg
In-Distribution
SmolVLA	<1B	15.8	97.5	16.7	0.8	96.7	85.8	52.2
X-VLA	<1B	27.5	98.3	65.0	45.0	25.8	83.3	57.5
GR00T N1.5	3B	40.8	99.2	6.7	0	99.2	69.2	52.5
π0	3B	21.7	92.5	51.6	37.5	90.0	86.7	63.3
π0.5	3B	38.3	60.0	55.8	29.2	40.8	90.0	52.4
π0-FAST	3B	16.7	37.5	17.5	5.8	3.3	20.8	16.9
InternVLA-A1	3B	25.8	93.3	38.3	2.50	82.5	67.5	51.7
Wall-oss-flow	4B	11.7	54.2	0.83	0	0	29.2	16.0
LabVLA (ours)	4B	49.2	100	65.0	43.3	83.3	85.8	71.1
Out-of-Distribution
SmolVLA	<1B	11.7	99.2	18.3	1.67	98.3	89.2	53.1
X-VLA	<1B	27.5	99.2	59.2	25.0	39.2	67.5	52.9
GR00T N1.5	3B	33.3	92.5	8.3	0	99.2	66.7	50.0
π0	3B	19.2	89.1	53.3	38.3	90.8	88.3	63.2
π0.5	3B	30.0	68.3	59.2	29.2	40.0	85.8	52.1
π0-FAST	3B	14.2	45.0	15.8	7.5	11.7	24.2	19.7
InternVLA-A1	3B	19.2	95.8	63.3	0.83	84.2	57.5	53.5
Wall-oss-flow	4B	7.50	61.7	0	0	0	26.7	16.0
LabVLA (ours)	4B	48.3	98.3	65.8	34.2	87.5	85.8	70.0

Key findings:

LabVLA achieves highest average in both ID (71.1%) and OOD (70.0%).
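LabVLA leads on Pick Up in both settings and on Open Door OOD (65.8%); on Open Door ID it ties X-VLA at 65.0%. It is the only policy to reach 100% on Press Button (ID).
Pour Liquid remains difficult for every policy: the best result is 45.0% (X-VLA, ID), and LabVLA reaches 43.3% ID and 34.2% OOD.
LabVLA's average drops only 1.1 pp from ID to OOD. Several weaker baselines show smaller gaps (e.g., π0 at 0.1 pp), so the small drop is consistent with, but does not by itself establish, generalization from domain randomization.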
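
The headline averages and margins can be recomputed from the Table 2 rows. The sketch below copies the LabVLA and π0 rows verbatim and rounds to one decimal, as the table does; the variable names are illustrative.

```python
# Per-task success rates (%) copied from Table 2, in column order:
# Pick Up, Press Button, Open Door, Pour Liquid, Heat Beaker, Transport Beaker.
ID_ROWS = {
    "LabVLA": [49.2, 100, 65.0, 43.3, 83.3, 85.8],
    "pi0": [21.7, 92.5, 51.6, 37.5, 90.0, 86.7],
}
OOD_ROWS = {
    "LabVLA": [48.3, 98.3, 65.8, 34.2, 87.5, 85.8],
    "pi0": [19.2, 89.1, 53.3, 38.3, 90.8, 88.3],
}


def avg(values):
    """Mean success rate, rounded to one decimal like the table."""
    return round(sum(values) / len(values), 1)


id_avg = {name: avg(row) for name, row in ID_ROWS.items()}
ood_avg = {name: avg(row) for name, row in OOD_ROWS.items()}

# Margins in percentage points, computed from the rounded averages.
id_margin = round(id_avg["LabVLA"] - id_avg["pi0"], 1)
ood_margin = round(ood_avg["LabVLA"] - ood_avg["pi0"], 1)
labvla_drop = round(id_avg["LabVLA"] - ood_avg["LabVLA"], 1)

print(id_avg, ood_avg)
print(id_margin, ood_margin, labvla_drop)
```

The recomputed values reproduce the reported 71.1% and 70.0% averages, the 7.8 and 6.8 pp margins over π0, and the 1.1 pp ID→OOD drop.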

Transferability of LabEmbodied-Data

Fine-tuning X-VLA (<1B) on LabEmbodied-Data improves five-task average by +15.0 pp ID and +19.3 pp OOD.

Table 3: Transferability of LabEmbodied-Data to X-VLA.

Method	Size	Pick Up	Open Door	Pour Liquid	Heat Beaker	Transport Beaker	Avg	Δ
In-Distribution
X-VLA	<1B	27.5	65.0	45.0	25.8	83.3	49.3	—
X-VLA + LabEmbodied	<1B	26.7	69.2	59.2	68.3	98.3	64.3	+15.0
Out-of-Distribution
X-VLA	<1B	27.5	59.2	25.0	39.2	67.5	43.7	—
X-VLA + LabEmbodied	<1B	31.7	63.3	65.0	65.0	90.0	63.0	+19.3

Largest gains on Heat Beaker (ID: 25.8%→68.3%) and Pour Liquid (OOD: 25.0%→65.0%).
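
The per-task gains behind this statement can be recomputed from Table 3. The rows below are copied from the table; the helper names are illustrative.

```python
TASKS = ["Pick Up", "Open Door", "Pour Liquid", "Heat Beaker", "Transport Beaker"]

# Success rates (%) copied from Table 3: (X-VLA, X-VLA + LabEmbodied).
ID = ([27.5, 65.0, 45.0, 25.8, 83.3], [26.7, 69.2, 59.2, 68.3, 98.3])
OOD = ([27.5, 59.2, 25.0, 39.2, 67.5], [31.7, 63.3, 65.0, 65.0, 90.0])


def per_task_gain(before, after):
    """Gain in percentage points for each task, rounded to one decimal."""
    return {task: round(a - b, 1) for task, b, a in zip(TASKS, before, after)}


def largest_gain(gains):
    """Return (task, gain) for the task with the biggest improvement."""
    task = max(gains, key=gains.get)
    return task, gains[task]


id_gains = per_task_gain(*ID)
ood_gains = per_task_gain(*OOD)
print(id_gains)
print(ood_gains)
print(largest_gain(id_gains), largest_gain(ood_gains))
```

The same numbers show that the improvement is not uniform: Pick Up changes by -0.8 pp in-distribution, while Heat Beaker (ID) and Pour Liquid (OOD) account for the largest gains.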

Real Robot Experiments

Deployed on physical Franka platform. Four tasks (Shake Liquid