Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Summary (Overview)
- Unified Embodied Foundation Model: Qwen-VLA extends the Qwen vision-language model to continuous action generation via a Diffusion Transformer (DiT)-based action decoder, unifying manipulation, navigation, and trajectory prediction tasks into a single framework.
- Embodiment-aware Prompt Conditioning: A novel method enables support for multiple robot platforms within a shared model by prepending robot-specific textual descriptions to specify embodiment and control conventions.
- Large-scale Joint Pretraining: The model is trained on a diverse mixture of data including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation, navigation data, and auxiliary vision-language data.
- Progressive Training Recipe: A four-stage training approach (Text-to-Action DiT pretraining, Continued Pretraining, Supervised Fine-Tuning, Reinforcement Learning) stabilizes learning and improves transfer.
- Strong Generalist Performance: Qwen-VLA achieves state-of-the-art or competitive results across multiple benchmarks (LIBERO, Simpler-WidowX, RoboTwin, R2R, RxR) and demonstrates robust out-of-distribution generalization in real-world and dynamic manipulation tasks.
Introduction and Theoretical Foundation
Embodied intelligence aims to build agents that perceive the physical world, understand language instructions, reason over context, and execute actions. Current systems are often specialized to narrow task families, robot embodiments, or settings (e.g., manipulation vs. navigation), limiting transfer and scalability.
Core Insight: Despite heterogeneous output formats (end-effector poses, joint positions, waypoints), embodied tasks share a common computational structure: an agent must condition on visual observations, language instructions, and embodiment constraints to predict future actions/trajectories.
Motivation: This shared structure motivates a unified formulation. Qwen-VLA builds on it with a joint pretraining framework that absorbs diverse embodied data into a single vision-language-action model, enabling generalization across embodiments and tasks.
Formulation: The problem is formulated as a unified conditional prediction framework. At time step $t$, the model receives:
- Visual context
- Language instruction
- Embodiment description
- Optional task identifier
The model predicts a target sequence over a horizon of $H$ future steps.
The target is task-dependent but represented in a unified action-and-trajectory space (e.g., robot actions, navigation waypoints, human motion trajectories).
Methodology
Model Architecture
Vision-language backbone: Qwen3.5, a natively multimodal model with early vision-language fusion (visual tokens interleaved with text). Uses hybrid attention (gated linear + grouped-query softmax).
Action expert: A single-stream DiT-style flow-matching policy attached to the backbone. It concatenates VLM hidden states with a noisy action chunk and processes them through joint self-attention with AdaLN timestep conditioning and multi-section RoPE.
Parameters: Action expert contains ~1.15B parameters: 16 DiT blocks (70.8M each, 1.13B combined), action projection MLPs (4.9M), linear layer for VLM states (3.9M), timestep embedding (2.8M), output AdaLN modulation (4.7M).
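As a quick sanity check on the parameter breakdown above, the listed components can be summed directly. This is only arithmetic on the figures quoted in the text; the dictionary keys are illustrative names, not identifiers from the model code.

```python
# Component sizes in millions of parameters, as quoted in the text above.
DIT_BLOCKS = 16
PARAMS_PER_BLOCK_M = 70.8

# Key names are illustrative labels, not identifiers from the model code.
other_components_m = {
    "action_projection_mlps": 4.9,
    "vlm_state_linear": 3.9,
    "timestep_embedding": 2.8,
    "output_adaln_modulation": 4.7,
}


def action_expert_total_m() -> float:
    """Return the total action-expert size in millions of parameters."""
    blocks_m = DIT_BLOCKS * PARAMS_PER_BLOCK_M
    return blocks_m + sum(other_components_m.values())


blocks_m = DIT_BLOCKS * PARAMS_PER_BLOCK_M
total_m = action_expert_total_m()
print(f"DiT blocks: {blocks_m:.1f}M, total: {total_m:.1f}M")
```

The blocks account for about 1.13B parameters and the full expert for about 1.15B, matching the stated totals.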
Embodiment-aware Prompt Conditioning
A textual prompt prepended to each training example specifies the current platform, configuration, and control convention. Template:
The robot is {robot_tag} with {single arm / dual arms}[, waist][, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.
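A minimal sketch of how such a prompt could be assembled from the template above. The function name, argument names, and the example values for frequency and chunk size are hypothetical; only the sentence structure comes from the template.

```python
def build_embodiment_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                            has_waist=False, has_mobile_base=False):
    """Fill the embodiment prompt template (hypothetical helper)."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    # Optional segments, in the order they appear in the template.
    if has_waist:
        body += ", waist"
    if has_mobile_base:
        body += ", and mobile base"
    return (f"{body}. The control frequency is {fps} Hz. "
            f"Please predict the next {chunk_size} control actions "
            f"to execute the following task: {instruction}.")


# Illustrative values only; the real frequency and chunk size depend on the dataset.
prompt = build_embodiment_prompt("WidowX", dual_arm=False, fps=5, chunk_size=16,
                                 instruction="put the spoon on the towel")
print(prompt)
```

Because the embodiment is expressed purely in text, adding a new robot under this scheme means writing a new description rather than adding a new output head.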
Unified Action and Trajectory Representation
Control signal types: Two families covered:
- Manipulation signals: Delta end-effector position ($\Delta x, \Delta y, \Delta z$), end-effector rotation (Euler angles or quaternions), absolute joint positions, gripper aperture, dexterous-hand joint angles.
- Navigation trajectory signals: VLN convention, with each waypoint represented as a fixed tuple of values.
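Channel layout: The target is a tensor $Y \in \mathbb{R}^{H \times D}$, where $H$ is the fixed prediction horizon and $D$ is the fixed channel dimension shared across control modes. For a control mode using $d \le D$ channels, the relevant values occupy the leading $d$ dimensions; the remaining $D - d$ dimensions are zero-padded. A per-channel binary mask $M \in \{0, 1\}^{H \times D}$ records valid signals: $M_{h,c} = 1$ iff channel $c$ is one of the $d$ active channels and time step $h$ falls within the task's chunk length $T \le H$.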
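The layout described above can be sketched with NumPy as follows. The function name and the example sizes (horizon 8, 32 shared channels, a 7-channel signal over 4 steps) are hypothetical and chosen only for illustration.

```python
import numpy as np


def pad_and_mask(actions: np.ndarray, horizon: int, channels: int):
    """Zero-pad a (steps, dims) action chunk to (horizon, channels) and build its mask."""
    steps, dims = actions.shape
    if steps > horizon or dims > channels:
        raise ValueError("chunk does not fit in the shared layout")
    target = np.zeros((horizon, channels), dtype=np.float32)
    mask = np.zeros((horizon, channels), dtype=bool)
    target[:steps, :dims] = actions  # valid values occupy the leading block
    mask[:steps, :dims] = True       # everything else counts as padding
    return target, mask


chunk = np.ones((4, 7), dtype=np.float32)  # 4 steps of a 7-channel signal
target, mask = pad_and_mask(chunk, horizon=8, channels=32)
print(target.shape, int(mask.sum()))  # (8, 32) 28
```

Keeping the mask alongside the padded tensor lets later stages tell a true zero action apart from padding.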
Training Objectives
Flow-matching action loss: Samples with continuous control targets are supervised with a conditional flow-matching objective. Given a clean target $Y$ and noise $\epsilon$, form a linear interpolant $Y_\tau$ between them at flow time $\tau \in [0, 1]$. The expert is trained to predict the conditional velocity field.
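To keep padding from dominating the gradient, apply a per-channel, per-step loss with two-level averaging. For a sample with $d$ active channels and $T$ valid time steps, the mask is defined as above. Compute the mean squared error for each active channel $c$, where $\hat{v}$ is the predicted velocity and $v$ is the target velocity:
$$\ell_c = \frac{1}{T} \sum_{h=1}^{T} \left( \hat{v}_{h,c} - v_{h,c} \right)^2$$
Then average uniformly over active channels:
$$\mathcal{L}_{\text{FM}} = \frac{1}{d} \sum_{c=1}^{d} \ell_c$$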
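The flow-matching target and the two-level masked average described above can be sketched as follows. Several details are assumptions rather than facts from the source: the interpolant runs from pure noise at flow time 0 to the clean target at flow time 1, the flow time is sampled uniformly, and all function names are hypothetical.

```python
import numpy as np


def flow_matching_pair(clean, rng):
    """Sample noise and a flow time, then build the interpolant and its velocity.

    Assumed convention: tau = 0 is pure noise, tau = 1 is the clean target.
    """
    noise = rng.standard_normal(clean.shape)
    tau = rng.uniform(0.0, 1.0)
    noisy = (1.0 - tau) * noise + tau * clean
    velocity = clean - noise  # derivative of the linear path with respect to tau
    return noisy, tau, velocity


def masked_flow_loss(pred, target, mask):
    """Two-level average: over valid steps per channel, then over active channels."""
    sq_err = (pred - target) ** 2 * mask
    steps_per_channel = mask.sum(axis=0)  # valid steps for each channel
    active = steps_per_channel > 0
    if not active.any():
        return 0.0
    per_channel = sq_err.sum(axis=0)[active] / steps_per_channel[active]
    return float(per_channel.mean())


rng = np.random.default_rng(0)
clean = np.zeros((8, 32))
clean[:4, :7] = 1.0
mask = np.zeros((8, 32), dtype=bool)
mask[:4, :7] = True

noisy, tau, velocity = flow_matching_pair(clean, rng)
pred = np.zeros_like(velocity)  # stand-in for the expert's prediction
pred[~mask] = 1e3               # garbage in padded slots must not matter
loss = masked_flow_loss(pred, velocity, mask)
```

Averaging per channel first means a 7-channel arm and a 32-channel humanoid contribute on a comparable scale, and padded entries never reach the gradient.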
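Vision-language loss: Standard next-token prediction loss on auxiliary vision-language data, where $w_i$ is the $i$-th target token:
$$\mathcal{L}_{\text{VL}} = -\sum_{i} \log p_\theta\left(w_i \mid w_{<i}, \text{visual context}\right)$$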
Joint objective: A weighted combination of the flow-matching action loss and the vision-language loss.
Training Recipe
A four-stage progressive recipe:
Stage I: Text-to-action DiT pretraining (T2A): Freeze VLM, train only DiT conditioned on text and embodiment prompt, but without images. Decoder learns language-to-action decompression, building structured action prior.
Stage II: Continued pretraining (CPT): Unfreeze both modules, train on heterogeneous mixture (Section 3.2). Focuses on grounding actions in visual observations.
Stage III: Supervised fine-tuning (SFT): Branch into two parallel tracks:
- Multi-task SFT on heterogeneous tasks (VQA, spatial grounding, manipulation, navigation).
- Fine-tuning on in-house teleoperation data for real-world deployment.
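Stage IV: Reinforcement learning (RL): Starting from the multi-task SFT checkpoint, fine-tune with sparse binary success rewards in simulation (SimplerEnv). Uses PPO with GAE. Policy optimization objective:
$$\mathcal{L}_{\text{PPO}}(\theta) = -\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio, $\hat{A}_t$ is the GAE advantage estimate, and $\epsilon$ is the clipping range. The total loss combines this policy term with a weighted value-function loss.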
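For readers unfamiliar with PPO and GAE, the following is a minimal sketch of the standard clipped surrogate and the advantage recursion. It assumes the common clipped variant of PPO; the default values for `gamma`, `lam`, and `clip_eps` are conventional placeholders, not the settings used in the source.

```python
import math


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    `values` holds one more entry than `rewards` (the bootstrap value at the end).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over steps."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# Sparse binary success reward: only the final step is rewarded.
adv = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv)  # [1.0, 1.0, 1.0]
```

With a sparse terminal reward, GAE is what spreads credit back to earlier action chunks, and the clip keeps any single update from moving the policy too far from the one that collected the rollouts.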
Log-probability estimation under flow matching: Convert deterministic probability-flow ODE into corresponding SDE by injecting controlled noise at each Euler denoising step, enabling analytic computation of Gaussian log-probability.
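Once noise is injected, each denoising step becomes a Gaussian transition whose log-density has a closed form. The sketch below assumes a plain Euler-Maruyama step with isotropic noise; the function and argument names are hypothetical, and the exact drift and noise schedule used in the source are not specified here.

```python
import math


def euler_sde_step_logprob(x, x_next, drift, sigma, dt):
    """Log-density of x_next under N(x + drift * dt, sigma^2 * dt * I).

    All vector arguments are equal-length sequences of floats.
    """
    var = sigma * sigma * dt
    logp = 0.0
    for xi, xn, di in zip(x, x_next, drift):
        mean = xi + di * dt
        logp += -0.5 * ((xn - mean) ** 2 / var + math.log(2.0 * math.pi * var))
    return logp


# A one-dimensional step that lands exactly on the mean.
lp = euler_sde_step_logprob([0.0], [0.0], [0.0], sigma=1.0, dt=1.0)
```

Summing these per-step terms along a denoising trajectory gives the trajectory log-probability that a PPO-style ratio needs.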
Pretraining Data
Large-scale heterogeneous mixture spanning five families:
Table 1: Pretraining data mixture composition.
| Data Source | Proportion (%) |
|---|---|
| Robot Manipulation Trajectories | 74.2 |
| Human Egocentric Trajectories | 6.0 |
| Navigation Trajectories | 7.5 |
| Synthetic Simulation Trajectories (ours) | 3.7 |
| General Vision-Language Data | 3.4 |
| Spatial Grounding (2D) | 2.5 |
| Autonomous Driving VQA | 2.4 |
| Fine-Grained Embodied Action Caption | 0.2 |
| Total | 100.0 |
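One simple way to realize such a mixture is to draw the source of each training example with probability proportional to its share in Table 1. Whether the source samples per example, per batch, or by some other scheme is not stated, so this is an assumption; the dictionary keys are illustrative names.

```python
import random

# Proportions (%) copied from Table 1; key names are illustrative.
MIXTURE = {
    "robot_manipulation": 74.2,
    "human_egocentric": 6.0,
    "navigation": 7.5,
    "synthetic_simulation": 3.7,
    "general_vision_language": 3.4,
    "spatial_grounding_2d": 2.5,
    "autonomous_driving_vqa": 2.4,
    "embodied_action_caption": 0.2,
}

VISION_LANGUAGE = (
    "general_vision_language",
    "spatial_grounding_2d",
    "autonomous_driving_vqa",
    "embodied_action_caption",
)


def sample_source(rng: random.Random) -> str:
    """Draw one data source with probability proportional to its share."""
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[n] for n in names], k=1)[0]


vl_share = round(sum(MIXTURE[n] for n in VISION_LANGUAGE), 1)
rng = random.Random(0)
draws = [sample_source(rng) for _ in range(1000)]
print(vl_share, draws.count("robot_manipulation"))
```

The four vision-language families together account for 8.5% of the mixture.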
Robot Manipulation Trajectories: Core of pretraining corpus (74.2%). Includes public datasets (RobotSet, Galaxea, AgiBot World, RoboCOIN, RoboMIND V1/V2, RDT-1B, DROID, BridgeData V2, RH20T, RT-1, BC-Z) and proprietary datasets. Covers tabletop, mobile, bimanual, dexterous control.
Embodiment-aware prompt conditioning: Each example prepended with robot-specific prompt.
Table 2: Representative robot embodiments in the pretraining corpus.
| Robot | Arms | Action type |
|---|---|---|
| WidowX | Single | EEF + G |
| Google Robot | Single | EEF + G |
| Franka Panda | Single / Dual | EEF + G; Abs Joint + G |
| ARX5 | Dual | EEF + G |
| Fourier GR-1 | Dual | EEF + G |
| Mobile ALOHA | Dual | EEF + G; Abs Joint + G |
| AgiBot A2-D | Dual | Abs Joint + G; Abs Joint + DH |
| Galaxea R1 | Dual | Abs Joint + G |
| AIRBOT MMK2 | Dual | Abs Joint + DH |
| TienKung | Dual | Abs Joint + G; Abs Joint + DH |
| Real Human | Dual | EEF (from MANO) |
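Action representation: Preserve each dataset's original action format. Normalize per dataset using quantile statistics:
$$\tilde{a} = 2 \cdot \frac{a - q_{\text{low}}}{q_{\text{high}} - q_{\text{low}}} - 1$$
clipped to $[-1, 1]$, where $q_{\text{low}}$ and $q_{\text{high}}$ are the lower and upper per-dataset quantiles of each action dimension.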
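A minimal sketch of per-dataset quantile normalization. The quantile levels (1st and 99th percentile) and the $[-1, 1]$ target range are assumptions based on common practice, since the exact values are not recoverable from the text; the function name is hypothetical.

```python
import numpy as np


def quantile_normalize(actions, q_low=0.01, q_high=0.99):
    """Map each action dimension so its [q_low, q_high] quantile range spans [-1, 1].

    `actions` has shape (num_samples, action_dim); statistics are per dimension.
    """
    lo = np.quantile(actions, q_low, axis=0)
    hi = np.quantile(actions, q_high, axis=0)
    scale = np.maximum(hi - lo, 1e-8)  # guard against constant dimensions
    return np.clip(2.0 * (actions - lo) / scale - 1.0, -1.0, 1.0)


rng = np.random.default_rng(0)
raw = rng.standard_normal((1000, 7)) * 5.0 + 3.0  # synthetic stand-in for one dataset
normalized = quantile_normalize(raw)
print(normalized.min(), normalized.max())
```

Quantiles are less sensitive to rare outlier actions than min-max statistics, which is the usual reason for preferring them when pooling teleoperation data.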
Egocentric Human Data (6.0%): From Ego4D, EPIC-KITCHENS (processed by VITRA), EgoDex, EgoVerse, Xperience. Action representation: for each hand, wrist motion as SE(3) transformation (6 dimensions), hand articulation via PCA on 45-dimensional axis-angle joint pose, retaining first 10 principal components ("eigengrasps"). Total: 32 action dimensions per time step.
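The dimension count above (two hands, each with 6 wrist dimensions plus 10 eigengrasp coefficients) can be illustrated with a small PCA sketch. The data here is synthetic, the SVD-based PCA is one standard way to compute principal components and not necessarily the source's pipeline, and the function names are hypothetical.

```python
import numpy as np


def fit_eigengrasps(hand_poses, n_components=10):
    """PCA via SVD on (num_samples, 45) hand poses; returns mean and basis."""
    mean = hand_poses.mean(axis=0)
    _, _, vt = np.linalg.svd(hand_poses - mean, full_matrices=False)
    return mean, vt[:n_components]  # basis rows are the principal directions


def encode_hand(pose, mean, basis):
    """Project one 45-D pose onto the retained principal components."""
    return (pose - mean) @ basis.T


rng = np.random.default_rng(0)
poses = rng.standard_normal((200, 45))  # synthetic stand-in for axis-angle hand poses
mean, basis = fit_eigengrasps(poses)

WRIST_DIMS = 6
per_hand = WRIST_DIMS + encode_hand(poses[0], mean, basis).shape[0]
print(per_hand * 2)  # 32
```

Compressing 45 joint-angle values into 10 coefficients keeps the human-hand action width comparable to robot action vectors in the shared channel layout.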
Synthetic Simulation Data (3.7%): Two components:
- Vision-language-action data: Generated via internal pipeline (ROBOINF). Contains 359,848 full successful trajectories including subtask segments.
- Language-action data: Text-only action dataset for semantic/behavioral pretraining. Six task template families (pick-and-place, pushing, pulling, rotation, rotation toward viewpoint, positional swapping). ~7.2M trajectories (~14,000 hours).
Navigation Data (7.5%): Instruction-following (4.3%), object-searching (2.3%), target-tracking (1.0%).
Vision-language Data (8.5% combined):
- Fine-grained embodied action caption (0.2%): ~48,000 video-caption pairs with dense step-by-step descriptions.
- Autonomous driving VQA (2.4%): Focuses on temporal scene understanding, surround-view spatial reasoning, language-grounded localization, planning-aware reasoning.
- Spatial grounding (2.5%): 2D bounding box grounding data.
- General vision-language data (3.4%): Curated mixture for robust visual perception and language grounding.
Empirical Validation / Results
Main Results
Table 4: Robot manipulation results across benchmarks: specialists vs. a single generalist.
| Method | Type | LIBERO | RoboCasa-GR1 | Simpler-WidowX | RoboTwin-Easy | RoboTwin-Hard |
|---|---|---|---|---|---|---|
| $\pi_0$ (Black et al., 2024) | Specialist | 94.4 | – | – | 65.9 | 58.4 |
| StarVLA-OFT (Community, 2026) | Specialist | 96.6 | 48.8 | 64.6 | 50.4 | – |
| GR00T N1.6 (NVIDIA et al., 2025) | Specialist | 97.2 | 49.9 | 63.2 | 47.6 | – |
| $\pi_{0.5}$ (Black et al., 2025) | Specialist | 97.6 | 37.0 | 46.9 | 82.7 | 76.8 |
| ABot-M0 (Yang et al., 2026) | Specialist | 98.6 | 58.3 | – | 86.0 | 85.0 |
| Being-H0.5 (Luo et al., 2026) | Specialist | 97.6 | 53.3 | – | – | – |
| Qwen-VLA-Base | Generalist | 90.8 | 40.4 | 64.3 | 64.3 | 66.4 |
| Qwen-VLA-Instruct | Generalist | 97.9 | 56.7 | 73.7 | 86.1 | 87.2 |
Key Findings:
- A single generalist (Qwen-VLA-Instruct) achieves the best score on Simpler-WidowX, RoboTwin-Easy, and RoboTwin-Hard, and is second only to ABot-M0 on LIBERO and RoboCasa-GR1.
- Pretraining provides a strong foundation; instruction tuning yields substantial absolute gains (+7.1 points on LIBERO, +16.3 on RoboCasa-GR1, +9.4 on Simpler-WidowX, +21.8 on RoboTwin-Easy, +20.8 on RoboTwin-Hard).
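The gains quoted above follow directly from the Qwen-VLA-Base and Qwen-VLA-Instruct rows of Table 4. A short check, using only numbers from the table:

```python
# Success rates (%) copied from the two Qwen-VLA rows of Table 4.
BASE = {"LIBERO": 90.8, "RoboCasa-GR1": 40.4, "Simpler-WidowX": 64.3,
        "RoboTwin-Easy": 64.3, "RoboTwin-Hard": 66.4}
INSTRUCT = {"LIBERO": 97.9, "RoboCasa-GR1": 56.7, "Simpler-WidowX": 73.7,
            "RoboTwin-Easy": 86.1, "RoboTwin-Hard": 87.2}

# Absolute gain in percentage points, rounded to one decimal like the table.
gains = {name: round(INSTRUCT[name] - BASE[name], 1) for name in BASE}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

These are differences in success rate, so they are percentage points rather than relative percentages.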
Real-World Manipulation Results (ALOHA)
Table 5: In-domain performance across short-horizon and long-horizon task categories.
| Model | Pick and Place | Table Cleaning | Bowl Stacking | Bowl Pick & Place | Towel Folding | Fine-grained Manipulation | Avg. |
|---|---|---|---|---|---|---|---|
| GR00T N1.6 (NVIDIA et al., 2025) | 30.8 | 38.5 | 53.8 | 19.2 | 19.2 | 10.3 | 28.6 |
| $\pi_{0.5}$ (Black et al., 2025) | 73.1 | 84.6 | 88.5 | 69.2 | 80.8 | 33.3 | 71.6 |
| Qwen-VLA-aloha w/o pretrain | 30.8 | 53.8 | 61.5 | 64.1 | 50.0 | 30.8 | 48.5 |
| Qwen-VLA-aloha w/ pretrain | 96.2 | 92.3 | 98.7 | 87.2 | 65.4 | 61.5 | 83.6 |
Table 6: OOD performance across generalization categories.
| Model | Color | Instance | Position | Background | Instruction | Avg. |
|---|---|---|---|---|---|---|
| GR00T N1.6 (NVIDIA et al., 2025) | 46.2 | 38.5 | 3.8 | 19.2 | 19.2 | 25.4 |
| $\pi_{0.5}$ (Black et al., 2025) | 57.7 | 61.5 | 19.2 | 26.9 | 42.3 | 41.5 |
| Qwen-VLA-aloha w/o pretrain | 42.3 | 30.8 | 34.6 | 30.8 | 42.3 | 36.2 |
| Qwen-VLA-aloha w/ pretrain | 88.5 | 76.9 | 53.8 | 80.8 | 84.6 | 76.9 |
Key Findings:
- Pretraining provides critical foundation: Qwen-VLA-aloha w/ pretrain achieves 83.6% average success vs. 48.5% for w/o pretrain.
- Strong OOD generalization: 76.9% average OOD success, outperforming $\pi_{0.5}$ (41.5%) by 35.4 percentage points.
Navigation Results
Table 7: Comparison with open-source baselines on VLN-CE.
| Method | R2R Val-Unseen (NE ↓ / OS ↑ / SR ↑ / SPL ↑) | RxR Val-Unseen (NE ↓ / SR ↑ / SPL ↑ / nDTW ↑) |
|---|---|---|
| NaVid (Zhang et al., 2024) | 5.7 49.2 41.9 36.5 | 5.7 45.7 38.2 – |
| Uni-NaVid (Zhang et al., 2025b) | 5.6 53.3 47.0 42.7 | 6.2 48.7 |