Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Summary (Overview)

Unified Embodied Foundation Model: Qwen-VLA extends the Qwen vision-language model to continuous action generation via a Diffusion Transformer (DiT)-based action decoder, unifying manipulation, navigation, and trajectory prediction tasks into a single framework.
Embodiment-aware Prompt Conditioning: A novel method enables support for multiple robot platforms within a shared model by prepending robot-specific textual descriptions to specify embodiment and control conventions.
Large-scale Joint Pretraining: The model is trained on a diverse mixture of data including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation, navigation data, and auxiliary vision-language data.
Progressive Training Recipe: A four-stage training approach (Text-to-Action DiT pretraining, Continued Pretraining, Supervised Fine-Tuning, Reinforcement Learning) stabilizes learning and improves transfer.
Strong Generalist Performance: Qwen-VLA achieves state-of-the-art or competitive results across multiple benchmarks (LIBERO, Simpler-WidowX, RoboTwin, R2R, RxR) and demonstrates robust out-of-distribution generalization in real-world and dynamic manipulation tasks.

Introduction and Theoretical Foundation

Embodied intelligence aims to build agents that perceive the physical world, understand language instructions, reason over context, and execute actions. Current systems are often specialized to narrow task families, robot embodiments, or settings (e.g., manipulation vs. navigation), limiting transfer and scalability.

Core Insight: Despite heterogeneous output formats (end-effector poses, joint positions, waypoints), embodied tasks share a common computational structure: an agent must condition on visual observations, language instructions, and embodiment constraints to predict future actions/trajectories.

Motivation: This shared structure calls for a unified formulation. Qwen-VLA builds on it with a joint pretraining framework that absorbs diverse embodied data into a single vision-language-action model, enabling generalization across embodiments and tasks.

Formulation: The problem is cast as unified conditional prediction. At time step $t$, the model receives:

Visual context $o_t$
Language instruction $x$
Embodiment description $e$
Optional task identifier $z$

The model predicts a target sequence $y_{t:t+H-1}$ over horizon $H$:

p_\theta(y_{t:t+H-1} | o_t, x, e, z)

The target $y_{t:t+H-1}$ is task-dependent but represented in a unified action-and-trajectory space (e.g., robot actions, navigation waypoints, human motion trajectories).

Methodology

Model Architecture

Vision-language backbone: Qwen3.5, a natively multimodal model with early vision-language fusion (visual tokens interleaved with text). Uses hybrid attention (gated linear + grouped-query softmax).

Action expert: A single-stream DiT-style flow-matching policy attached to the backbone. It concatenates VLM hidden states with a noisy action chunk and processes them through joint self-attention with AdaLN timestep conditioning and multi-section RoPE.

Parameters: Action expert contains ~1.15B parameters: 16 DiT blocks (70.8M each, 1.13B combined), action projection MLPs (4.9M), linear layer for VLM states (3.9M), timestep embedding (2.8M), output AdaLN modulation (4.7M).
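The component sizes listed above can be summed as a quick consistency check. The sketch below only restates the figures from the text; the variable and function names are illustrative, not from the paper.

```python
# Component sizes in millions of parameters, copied from the breakdown above.
DIT_BLOCKS = 16
PARAMS_PER_BLOCK_M = 70.8
OTHER_COMPONENTS_M = {
    "action projection MLPs": 4.9,
    "linear layer for VLM states": 3.9,
    "timestep embedding": 2.8,
    "output AdaLN modulation": 4.7,
}


def dit_blocks_params_m() -> float:
    """Combined size of the DiT blocks, in millions of parameters."""
    return DIT_BLOCKS * PARAMS_PER_BLOCK_M


def action_expert_params_m() -> float:
    """Total action-expert size, in millions of parameters."""
    return dit_blocks_params_m() + sum(OTHER_COMPONENTS_M.values())


print(f"DiT blocks: {dit_blocks_params_m():.1f}M")
print(f"Action expert: {action_expert_params_m():.1f}M")
```

The blocks come to about 1.13B and the full expert to about 1.15B, in line with the rounded totals quoted above.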

Embodiment-aware Prompt Conditioning

A textual prompt prepended to each training example specifies the current platform, configuration, and control convention. Template:

The robot is {robot_tag} with {single arm / dual arms}[, waist][, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.
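A minimal sketch of how such a prompt could be assembled from the template above. The function name, argument names, and the example control frequency and chunk size are assumptions made for illustration; only the template wording comes from the text.

```python
def build_embodiment_prompt(
    robot_tag: str,
    dual_arm: bool,
    fps: int,
    chunk_size: int,
    instruction: str,
    waist: bool = False,
    mobile_base: bool = False,
) -> str:
    """Fill the embodiment prompt template with platform-specific values."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    # Optional clauses appear in the same order as in the template.
    if waist:
        body += ", waist"
    if mobile_base:
        body += ", and mobile base"
    return (
        f"{body}. The control frequency is {fps} Hz. "
        f"Please predict the next {chunk_size} control actions "
        f"to execute the following task: {instruction}."
    )


# Illustrative values only: the 5 Hz rate and chunk size of 16 are hypothetical.
print(build_embodiment_prompt("WidowX", False, 5, 16, "pick up the spoon"))
```

Because the embodiment is described in plain text, adding a new platform under this scheme needs a new prompt string rather than a new model head.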

Unified Action and Trajectory Representation

Control signal types: Two families covered:

Manipulation signals: Delta end-effector position ( $\Delta x, \Delta y, \Delta z$ ), end-effector rotation (Euler/quaternions), absolute joint positions, gripper aperture, dexterous-hand joint angles.
Navigation trajectory signals: VLN convention represented as ( $\Delta x, \Delta y, \Delta \theta$ ) per waypoint.

Channel layout: Target tensor $Y \in \mathbb{R}^{H \times K}$, where $H$ is the fixed prediction horizon and $K$ is the fixed channel dimension shared across control modes. For a control mode using $c \le K$ channels, the relevant values occupy the leading $c$ dimensions; the remaining $K-c$ dimensions are zero-padded. A per-channel binary mask $M \in \{0,1\}^{H \times K}$ records valid signals: $M_{h,k} = 1$ iff channel $k < c$ and time step $h$ falls within the task's chunk length $H_{\text{task}} \le H$.
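The padding and masking rule can be sketched in a few lines. This is an illustrative version using nested Python lists; the function name is hypothetical, and the example horizon and channel count are made up for the demonstration.

```python
def pad_and_mask(actions, horizon, channels):
    """Place an H_task x c action chunk into a zero-padded H x K target.

    Returns (Y, M), where M[h][k] is 1 only for entries copied from `actions`.
    """
    h_task = len(actions)
    c = len(actions[0]) if actions else 0
    if h_task > horizon or c > channels:
        raise ValueError("chunk does not fit into the fixed H x K layout")
    Y = [[0.0] * channels for _ in range(horizon)]
    M = [[0] * channels for _ in range(horizon)]
    for h, row in enumerate(actions):
        if len(row) != c:
            raise ValueError("all time steps must use the same c channels")
        for k, value in enumerate(row):
            Y[h][k] = float(value)  # leading c channels hold the signal
            M[h][k] = 1             # mark the entry as valid
    return Y, M


# A 2-step, 3-channel chunk placed into a hypothetical 4 x 5 layout.
Y, M = pad_and_mask([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], horizon=4, channels=5)
print(M)
```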

Training Objectives

Flow-matching action loss: Samples with continuous control targets are supervised with a conditional flow-matching objective. Given clean target $Y_0 \in \mathbb{R}^{H \times K}$ and noise $Y_1 \sim \mathcal{N}(0, I)$, form the linear interpolant $Y_\tau = (1-\tau)Y_0 + \tau Y_1$ with $\tau \in [0,1]$. The expert $v_\theta$ is trained to predict the conditional velocity field $Y_1 - Y_0$.

To keep padding from dominating the gradient, the loss is applied per channel and per step with two-level averaging. For a sample with $c$ active channels and $H_{\text{task}}$ time steps, the mask $M$ is defined as above. The mean squared error for each active channel $k < c$ is:
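As a small illustration of the interpolant, the sketch below builds $Y_\tau$ and the regression target for one flattened action vector. The target $Y_1 - Y_0$ is the derivative of the linear interpolant with respect to $\tau$. The function name and the use of Python's `random` module are choices made for this example only.

```python
import random


def flow_matching_pair(y0, tau, rng):
    """Return (y_tau, velocity_target) for a flattened clean target y0.

    y1 is sampled from a standard normal, y_tau is the linear interpolant,
    and the velocity target is d(y_tau)/d(tau) = y1 - y0.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    y1 = [rng.gauss(0.0, 1.0) for _ in y0]
    y_tau = [(1.0 - tau) * a + tau * b for a, b in zip(y0, y1)]
    velocity = [b - a for a, b in zip(y0, y1)]
    return y_tau, velocity


rng = random.Random(0)
y_tau, velocity = flow_matching_pair([0.5, -1.0, 0.25], tau=0.3, rng=rng)
print(y_tau, velocity)
```

At $\tau = 0$ the interpolant equals the clean target, and at $\tau = 1$ it equals pure noise.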

\ell_k = \frac{\sum_{h=1}^{H} M_{h,k} \big\| v_\theta(Y_\tau, \tau | o_{1:t}, x, e, z) - (Y_1 - Y_0) \big\|_{h,k}^2}{\sum_{h=1}^{H} M_{h,k}}

Then average uniformly over $c$ active channels:

L_{\text{act}} = \mathbb{E}_{\tau, Y_0, Y_1} \left[ \frac{1}{c} \sum_{k=0}^{c-1} \ell_k \right]
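A minimal sketch of the two-level averaging for a single sample, written with nested lists so it runs without extra dependencies. The function name is hypothetical, and the outer expectation over $\tau$, $Y_0$, and $Y_1$ would in practice be approximated by averaging over a batch, which is omitted here.

```python
def masked_action_loss(pred, target, mask):
    """Compute (1/c) * sum_k l_k for one sample.

    pred, target, mask are H x K nested lists. For each active channel the
    squared error is averaged over its valid time steps (level one), and the
    per-channel means are then averaged uniformly (level two).
    """
    horizon, channels = len(mask), len(mask[0])
    per_channel = []
    for k in range(channels):
        valid_steps = sum(mask[h][k] for h in range(horizon))
        if valid_steps == 0:
            continue  # fully padded channel: contributes nothing
        squared_error = sum(
            mask[h][k] * (pred[h][k] - target[h][k]) ** 2
            for h in range(horizon)
        )
        per_channel.append(squared_error / valid_steps)
    if not per_channel:
        return 0.0
    return sum(per_channel) / len(per_channel)


# Two active channels over two valid steps; the padded entries hold 9.0
# to show that they do not influence the loss.
pred = [[1.0, 0.0, 9.0], [3.0, 0.0, 9.0], [9.0, 9.0, 9.0]]
target = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
print(masked_action_loss(pred, target, mask))
```

Averaging per channel first means a control mode with few active channels is not down-weighted relative to one that fills most of the $K$ slots.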

Vision-language loss: Standard next-token prediction loss on auxiliary vision-language data:

L_{\text{vl}} = -\sum_i \log p_\theta(w_i | w_{<i}, o_{1:t})

Joint objective: Weighted combination:

L = \lambda_{\text{act}} L_{\text{act}} + \lambda_{\text{vl}} L_{\text{vl}}

Training Recipe

A four-stage progressive recipe:

Stage I: Text-to-action DiT pretraining (T2A): Freeze VLM, train only DiT conditioned on text and embodiment prompt, but without images. Decoder learns language-to-action decompression, building structured action prior.

Stage II: Continued pretraining (CPT): Unfreeze both modules, train on heterogeneous mixture (Section 3.2). Focuses on grounding actions in visual observations.

Stage III: Supervised fine-tuning (SFT): Branch into two parallel tracks:

Multi-task SFT on heterogeneous tasks (VQA, spatial grounding, manipulation, navigation).
Fine-tuning on in-house teleoperation data for real-world deployment.

Stage IV: Reinforcement learning (RL): Starting from multi-task SFT checkpoint, fine-tune with sparse binary success rewards in simulation (SimplerEnv). Uses PPO with GAE. Policy optimization objective:

L_{\text{actor}}(\theta) = -\mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right]

where $r_t(\theta) = \pi_\theta(a_t | s_t) / \pi_{\theta_{\text{old}}}(a_t | s_t)$, $\hat{A}_t$ is the GAE advantage estimate, and $\epsilon=0.2$. Total loss:

L(\theta) = L_{\text{actor}}(\theta) + c_v L_{\text{value}}(\theta)

with $c_v=1$.
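The clipped objective and the advantage estimate can be sketched as follows. This is a generic PPO/GAE illustration, not the authors' training code: the function names are hypothetical, and the discount `gamma` and GAE parameter `lam` defaults are assumptions, since the text gives only $\epsilon = 0.2$ and $c_v = 1$.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    `values` must have len(rewards) + 1 entries (the last is the bootstrap
    value). gamma and lam are assumed defaults, not taken from the text.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate, averaged over time steps."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def total_loss(actor_loss, value_loss, c_v=1.0):
    """Actor loss plus weighted value loss."""
    return actor_loss + c_v * value_loss


# Sparse binary success reward: only the final step is rewarded.
adv = gae_advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv, ppo_actor_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], adv))
```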

Log-probability estimation under flow matching: Convert deterministic probability-flow ODE into corresponding SDE by injecting controlled noise at each Euler denoising step, enabling analytic computation of Gaussian log-probability.
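To make the idea concrete, the sketch below evaluates the Gaussian log-density of one noisy Euler step. It is a simplified illustration under stated assumptions: the step mean is taken as $Y - v\,\Delta\tau$ (moving from noise at $\tau = 1$ toward data at $\tau = 0$), the noise is isotropic with variance $\sigma^2 \Delta\tau$, and any drift correction or noise schedule used by the authors is not specified in this summary and is not modeled. The function name and arguments are hypothetical.

```python
import math


def noisy_euler_step_logprob(y_next, y, velocity, dt, sigma):
    """Log-density of y_next under N(y - velocity * dt, sigma^2 * dt * I).

    All vector arguments are flattened lists of equal length. The mean and
    variance forms are assumptions made for this sketch.
    """
    if dt <= 0.0 or sigma <= 0.0:
        raise ValueError("dt and sigma must be positive")
    variance = sigma * sigma * dt
    logp = 0.0
    for yn, yi, vi in zip(y_next, y, velocity):
        mean = yi - vi * dt
        logp += -0.5 * ((yn - mean) ** 2 / variance + math.log(2.0 * math.pi * variance))
    return logp


# One-dimensional example: a sample at the step mean versus one farther away.
at_mean = noisy_euler_step_logprob([0.9], [1.0], [1.0], dt=0.1, sigma=0.5)
off_mean = noisy_euler_step_logprob([0.5], [1.0], [1.0], dt=0.1, sigma=0.5)
print(at_mean, off_mean)
```

Summing such per-step terms along a denoising trajectory would give a tractable log-probability for the PPO ratio, which the deterministic ODE sampler alone does not provide.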

Pretraining Data

Large-scale heterogeneous mixture spanning five families:

Table 1: Pretraining data mixture composition.

Data Source	Proportion (%)
Robot Manipulation Trajectories	74.2
Human Egocentric Trajectories	6.0
Navigation Trajectories	7.5
Synthetic Simulation Trajectories (ours)	3.7
General Vision-Language Data	3.4
Spatial Grounding (2D)	2.5
Autonomous Driving VQA	2.4
Fine-Grained Embodied Action Caption	0.2
Total	100.0

Robot Manipulation Trajectories: Core of pretraining corpus (74.2%). Includes public datasets (RobotSet, Galaxea, AgiBot World, RoboCOIN, RoboMIND V1/V2, RDT-1B, DROID, BridgeData V2, RH20T, RT-1, BC-Z) and proprietary datasets. Covers tabletop, mobile, bimanual, dexterous control.

Embodiment-aware prompt conditioning: Each example prepended with robot-specific prompt.

Table 2: Representative robot embodiments in the pretraining corpus.

Robot	Arms	Action type
WidowX	Single	$\Delta$ EEF + G
Google Robot	Single	$\Delta$ EEF + G
Franka Panda	Single / Dual	$\Delta$ EEF + G; Abs Joint + G
ARX5	Dual	$\Delta$ EEF + G
Fourier GR-1	Dual	$\Delta$ EEF + G
Mobile ALOHA	Dual	$\Delta$ EEF + G; Abs Joint + G
AgiBot A2-D	Dual	Abs Joint + G; Abs Joint + DH
Galaxea R1	Dual	Abs Joint + G
AIRBOT MMK2	Dual	Abs Joint + DH
TienKung	Dual	Abs Joint + G; Abs Joint + DH
Real Human	Dual	$\Delta$ EEF (from MANO)

Action representation: Preserve each dataset's original action format. Normalize per dataset using quantile statistics:

\tilde{a}_d = 2 \cdot \frac{a_d - q^k_{01}}{q^k_{99} - q^k_{01}} - 1

clipped to $[-1, 1]$.
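A minimal sketch of the normalization for a single scalar, assuming $q_{01}$ and $q_{99}$ are the 1st and 99th percentile statistics computed per dataset. The function names are hypothetical, and the inverse mapping is included only as an illustration of how predictions could be mapped back to the original units.

```python
def quantile_normalize(a, q01, q99):
    """Map a raw action value to [-1, 1] using per-dataset quantiles."""
    if q99 <= q01:
        raise ValueError("q99 must be greater than q01")
    scaled = 2.0 * (a - q01) / (q99 - q01) - 1.0
    return max(-1.0, min(1.0, scaled))  # clip outliers beyond the quantiles


def quantile_denormalize(a_norm, q01, q99):
    """Inverse mapping for values that were not clipped."""
    return (a_norm + 1.0) / 2.0 * (q99 - q01) + q01


# Hypothetical quantiles for one action channel.
print(quantile_normalize(5.0, 0.0, 10.0))   # midpoint of the range
print(quantile_normalize(20.0, 0.0, 10.0))  # outlier, clipped to the upper bound
```

Using quantiles rather than the minimum and maximum keeps a few extreme values from compressing the bulk of the data into a narrow band.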

Egocentric Human Data (6.0%): From Ego4D, EPIC-KITCHENS (processed by VITRA), EgoDex, EgoVerse, Xperience. Action representation: for each hand, wrist motion as SE(3) transformation (6 dimensions), hand articulation via PCA on 45-dimensional axis-angle joint pose, retaining first 10 principal components ("eigengrasps"). Total: 32 action dimensions per time step.
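The 32-dimensional total follows from two hands, each with 6 wrist dimensions and 10 eigengrasp coefficients. The sketch below assembles such a vector; the left-then-right ordering and the names are assumptions for illustration, since the summary does not state the channel order.

```python
WRIST_DIMS = 6        # SE(3) wrist motion per hand
EIGENGRASP_DIMS = 10  # retained PCA components per hand
HANDS = ("left", "right")  # ordering is an assumption

ACTION_DIMS = len(HANDS) * (WRIST_DIMS + EIGENGRASP_DIMS)


def human_action_vector(wrist, eigengrasps):
    """Concatenate per-hand wrist motion and eigengrasp coefficients.

    `wrist` and `eigengrasps` are dicts keyed by hand name.
    """
    vector = []
    for hand in HANDS:
        if len(wrist[hand]) != WRIST_DIMS:
            raise ValueError(f"{hand} wrist needs {WRIST_DIMS} values")
        if len(eigengrasps[hand]) != EIGENGRASP_DIMS:
            raise ValueError(f"{hand} hand needs {EIGENGRASP_DIMS} coefficients")
        vector.extend(wrist[hand])
        vector.extend(eigengrasps[hand])
    return vector


vec = human_action_vector(
    {"left": [0.0] * 6, "right": [1.0] * 6},
    {"left": [0.0] * 10, "right": [1.0] * 10},
)
print(ACTION_DIMS, len(vec))
```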

Synthetic Simulation Data (3.7%): Two components:

Vision-language-action data: Generated via internal pipeline (ROBOINF). Contains 359,848 full successful trajectories including subtask segments.
Language-action data: Text-only action dataset for semantic/behavioral pretraining. Six task template families (pick-and-place, pushing, pulling, rotation, rotation toward viewpoint, positional swapping). ~7.2M trajectories (~14,000 hours).

Navigation Data (7.5%): Instruction-following (4.3%), object-searching (2.3%), target-tracking (1.0%).

Vision-language Data (8.5% combined):

Fine-grained embodied action caption (0.2%): ~48,000 video-caption pairs with dense step-by-step descriptions.
Autonomous driving VQA (2.4%): Focuses on temporal scene understanding, surround-view spatial reasoning, language-grounded localization, planning-aware reasoning.
Spatial grounding (2.5%): 2D bounding box grounding data.
General vision-language data (3.4%): Curated mixture for robust visual perception and language grounding.

Empirical Validation / Results

Main Results

Table 4: Robot manipulation results across benchmarks: specialists vs. a single generalist.

Method	Type	LIBERO	RoboCasa-GR1	Simpler-WidowX	RoboTwin-Easy	RoboTwin-Hard
$\pi_0$ (Black et al., 2024)	Specialist	94.4	–	–	65.9	58.4
StarVLA-OFT (Community, 2026)	Specialist	96.6	48.8	64.6	50.4	–
GR00T N1.6 (NVIDIA et al., 2025)	Specialist	97.2	49.9	63.2	47.6	–
$\pi_{0.5}$ (Black et al., 2025)	Specialist	97.6	37.0	46.9	82.7	76.8
ABot-M0 (Yang et al., 2026)	Specialist	98.6	58.3	–	86.0	85.0
Being-H0.5 (Luo et al., 2026)	Specialist	97.6	53.3	–	–	–
Qwen-VLA-Base	Generalist	90.8	40.4	64.3	64.3	66.4
Qwen-VLA-Instruct	Generalist	97.9	56.7	73.7	86.1	87.2

Key Findings:

Single generalist (Qwen-VLA-Instruct) posts the best score in Table 4 on Simpler-WidowX, RoboTwin-Easy, and RoboTwin-Hard, and is second only to ABot-M0 on LIBERO and RoboCasa-GR1.
Pretraining provides a strong foundation; instruction tuning yields substantial gains over Qwen-VLA-Base (+7.1 points on LIBERO, +16.3 on RoboCasa-GR1, +9.4 on Simpler-WidowX, +21.8 on RoboTwin-Easy, +20.8 on RoboTwin-Hard).
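The reported gains can be reproduced directly from the Qwen-VLA-Base and Qwen-VLA-Instruct rows of Table 4. The short check below copies those two rows; the dictionary and function names are illustrative.

```python
# Scores copied from the Qwen-VLA-Base and Qwen-VLA-Instruct rows of Table 4.
BASE = {
    "LIBERO": 90.8,
    "RoboCasa-GR1": 40.4,
    "Simpler-WidowX": 64.3,
    "RoboTwin-Easy": 64.3,
    "RoboTwin-Hard": 66.4,
}
INSTRUCT = {
    "LIBERO": 97.9,
    "RoboCasa-GR1": 56.7,
    "Simpler-WidowX": 73.7,
    "RoboTwin-Easy": 86.1,
    "RoboTwin-Hard": 87.2,
}


def gains(base, tuned):
    """Absolute difference per benchmark, rounded to one decimal place."""
    return {name: round(tuned[name] - base[name], 1) for name in base}


for name, delta in gains(BASE, INSTRUCT).items():
    print(f"{name}: +{delta}")
```

The differences are absolute score differences between the two rows, not relative percentage changes.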

Real-World Manipulation Results (ALOHA)

Table 5: In-domain performance across short-horizon and long-horizon task categories.

Model	Pick and Place	Table Cleaning	Bowl Stacking	Bowl Pick & Place	Towel Folding	Fine-grained Manipulation	Avg.
GR00T N1.6 (NVIDIA et al., 2025)	30.8	38.5	53.8	19.2	19.2	10.3	28.6
$\pi_{0.5}$ (Black et al., 2025)	73.1	84.6	88.5	69.2	80.8	33.3	71.6
Qwen-VLA-aloha w/o pretrain	30.8	53.8	61.5	64.1	50.0	30.8	48.5
Qwen-VLA-aloha w/ pretrain	96.2	92.3	98.7	87.2	65.4	61.5	83.6

Table 6: OOD performance across generalization categories.

Model	Color	Instance	Position	Background	Instruction	Avg.
GR00T N1.6 (NVIDIA et al., 2025)	46.2	38.5	3.8	19.2	19.2	25.4
$\pi_{0.5}$ (Black et al., 2025)	57.7	61.5	19.2	26.9	42.3	41.5
Qwen-VLA-aloha w/o pretrain	42.3	30.8	34.6	30.8	42.3	36.2
Qwen-VLA-aloha w/ pretrain	88.5	76.9	53.8	80.8	84.6	76.9