Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Summary (Overview)

  • Unified Embodied Foundation Model: Qwen-VLA extends the Qwen vision-language model to continuous action generation via a Diffusion Transformer (DiT)-based action decoder, unifying manipulation, navigation, and trajectory prediction tasks into a single framework.
  • Embodiment-aware Prompt Conditioning: A novel method enables support for multiple robot platforms within a shared model by prepending robot-specific textual descriptions to specify embodiment and control conventions.
  • Large-scale Joint Pretraining: The model is trained on a diverse mixture of data including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation, navigation data, and auxiliary vision-language data.
  • Progressive Training Recipe: A four-stage training approach (Text-to-Action DiT pretraining, Continued Pretraining, Supervised Fine-Tuning, Reinforcement Learning) stabilizes learning and improves transfer.
  • Strong Generalist Performance: Qwen-VLA achieves state-of-the-art or competitive results across multiple benchmarks (LIBERO, Simpler-WidowX, RoboTwin, R2R, RxR) and demonstrates robust out-of-distribution generalization in real-world and dynamic manipulation tasks.

Introduction and Theoretical Foundation

Embodied intelligence aims to build agents that perceive the physical world, understand language instructions, reason over context, and execute actions. Current systems are often specialized to narrow task families, robot embodiments, or settings (e.g., manipulation vs. navigation), limiting transfer and scalability.

Core Insight: Despite heterogeneous output formats (end-effector poses, joint positions, waypoints), embodied tasks share a common computational structure: an agent must condition on visual observations, language instructions, and embodiment constraints to predict future actions/trajectories.

Motivation: This shared structure motivates a unified formulation. Qwen-VLA builds on it with a joint pretraining framework that absorbs diverse embodied data into a single vision-language-action model, enabling generalization across embodiments and tasks.

Formulation: The problem is formulated as a unified conditional prediction framework. At time step $t$, the model receives:

  • Visual context $o_t$
  • Language instruction $x$
  • Embodiment description $e$
  • Optional task identifier $z$

The model predicts a target sequence $y_{t:t+H-1}$ over horizon $H$:

$$p_\theta(y_{t:t+H-1} \mid o_t, x, e, z)$$

The target $y_{t:t+H-1}$ is task-dependent but represented in a unified action-and-trajectory space (e.g., robot actions, navigation waypoints, human motion trajectories).

Methodology

Model Architecture

Vision-language backbone: Qwen3.5, a natively multimodal model with early vision-language fusion (visual tokens interleaved with text). Uses hybrid attention (gated linear + grouped-query softmax).

Action expert: A single-stream DiT-style flow-matching policy attached to the backbone. It concatenates VLM hidden states with a noisy action chunk and processes them through joint self-attention with AdaLN timestep conditioning and multi-section RoPE.

Parameters: Action expert contains ~1.15B parameters: 16 DiT blocks (70.8M each, 1.13B combined), action projection MLPs (4.9M), linear layer for VLM states (3.9M), timestep embedding (2.8M), output AdaLN modulation (4.7M).
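The component sizes listed above can be tallied as a quick consistency check. This is only an arithmetic sketch; the variable names are illustrative and the figures are the rounded values quoted in the text.

```python
# Component sizes in millions of parameters, as quoted in the text above.
DIT_BLOCKS = 16
PARAMS_PER_BLOCK_M = 70.8

other_components_m = {
    "action projection MLPs": 4.9,
    "linear layer for VLM states": 3.9,
    "timestep embedding": 2.8,
    "output AdaLN modulation": 4.7,
}

dit_total_m = DIT_BLOCKS * PARAMS_PER_BLOCK_M
expert_total_m = dit_total_m + sum(other_components_m.values())

print(f"DiT blocks:    {dit_total_m / 1000:.2f}B")
print(f"Action expert: {expert_total_m / 1000:.2f}B")
```

The two totals round to 1.13B and 1.15B, matching the figures stated for the DiT blocks and for the whole action expert.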

Embodiment-aware Prompt Conditioning

A textual prompt prepended to each training example specifies the current platform, configuration, and control convention. Template:

The robot is {robot_tag} with {single arm / dual arms}[, waist][, and mobile base]. The control frequency is {FPS} Hz. Please predict the next {chunk_size} control actions to execute the following task: {ori_instruction}.
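A minimal sketch of how the template might be filled in. The function name, argument names, and the example values (robot, 5 Hz, chunk size 16, instruction) are hypothetical and chosen only for illustration; the wording of the prompt follows the template above.

```python
def build_embodiment_prompt(robot_tag, dual_arm, fps, chunk_size, instruction,
                            has_waist=False, has_mobile_base=False):
    """Fill the embodiment-aware prompt template (hypothetical helper)."""
    arms = "dual arms" if dual_arm else "single arm"
    body = f"The robot is {robot_tag} with {arms}"
    if has_waist:                      # optional "[, waist]" slot
        body += ", waist"
    if has_mobile_base:                # optional "[, and mobile base]" slot
        body += ", and mobile base"
    return (f"{body}. The control frequency is {fps} Hz. "
            f"Please predict the next {chunk_size} control actions "
            f"to execute the following task: {instruction}.")


# Illustrative values only; they are not taken from the source.
prompt = build_embodiment_prompt(
    "WidowX", dual_arm=False, fps=5, chunk_size=16,
    instruction="put the spoon on the towel",
)
print(prompt)
```

Because the embodiment is expressed purely as text, supporting a new platform in this sketch only requires new argument values, not a new model head.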

Unified Action and Trajectory Representation

Control signal types: Two families covered:

  1. Manipulation signals: Delta end-effector position ($\Delta x, \Delta y, \Delta z$), end-effector rotation (Euler/quaternions), absolute joint positions, gripper aperture, dexterous-hand joint angles.
  2. Navigation trajectory signals: VLN convention represented as ($\Delta x, \Delta y, \Delta \theta$) per waypoint.

Channel layout: Target tensor $Y \in \mathbb{R}^{H \times K}$, where $H$ is the fixed prediction horizon and $K$ is the fixed channel dimension shared across control modes. For a control mode using $c \le K$ channels, the relevant values occupy the leading $c$ dimensions; the remaining $K-c$ dimensions are zero-padded. A per-channel binary mask $M \in \{0,1\}^{H \times K}$ records valid signals: $M_{h,k} = 1$ iff channel $k < c$ and time step $h$ falls within the task's chunk length $H_{\text{task}} \le H$.
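The channel layout can be sketched in a few lines of plain Python. The helper name `pad_chunk`, the horizon H=4, the channel count K=5, and the waypoint values are assumptions made for the example; only the padding and masking rule comes from the text.

```python
def pad_chunk(actions, horizon, num_channels):
    """Zero-pad an H_task x c chunk to H x K and build the validity mask M."""
    h_task = len(actions)
    c = len(actions[0]) if actions else 0
    if h_task == 0 or h_task > horizon or c > num_channels:
        raise ValueError("chunk must be non-empty and fit the fixed H x K layout")
    padded = [[0.0] * num_channels for _ in range(horizon)]
    mask = [[0] * num_channels for _ in range(horizon)]
    for h, step in enumerate(actions):
        if len(step) != c:
            raise ValueError("every step must use the same c channels")
        for k, value in enumerate(step):
            padded[h][k] = float(value)  # leading c channels carry the signal
            mask[h][k] = 1               # valid: k < c and h < H_task
    return padded, mask


# Hypothetical navigation chunk: 2 waypoints of (dx, dy, dtheta), padded to H=4, K=5.
waypoints = [[0.10, 0.00, 0.05], [0.12, 0.01, 0.00]]
Y, M = pad_chunk(waypoints, horizon=4, num_channels=5)
for row in M:
    print(row)
```

The mask has ones only in the top-left 2 x 3 block, so padded channels and padded time steps can be excluded from any loss computed on `Y`.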

Training Objectives

Flow-matching action loss: Samples with continuous control targets are supervised with a conditional flow-matching objective. Given a clean target $Y_0 \in \mathbb{R}^{H \times K}$ and noise $Y_1 \sim \mathcal{N}(0, I)$, form the linear interpolant $Y_\tau = (1-\tau)Y_0 + \tau Y_1$ with $\tau \in [0,1]$. The expert $v_\theta$ is trained to predict the conditional velocity field, whose target along this path is $Y_1 - Y_0$.
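The interpolant and its velocity can be illustrated on a toy chunk. This is a sketch under stated assumptions: the chunk values, the seed, and the helper names are made up, and nested lists stand in for tensors.

```python
import random


def interpolate(y0, y1, tau):
    """Y_tau = (1 - tau) * Y_0 + tau * Y_1, applied elementwise."""
    return [[(1.0 - tau) * a + tau * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(y0, y1)]


def velocity_target(y0, y1):
    """d Y_tau / d tau for the linear path, i.e. Y_1 - Y_0."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(y0, y1)]


rng = random.Random(0)                     # fixed seed, illustration only
y0 = [[0.2, -0.1], [0.4, 0.3]]             # hypothetical clean chunk (H=2, K=2)
y1 = [[rng.gauss(0.0, 1.0) for _ in row] for row in y0]  # noise Y_1 ~ N(0, I)
tau = 0.7
y_tau = interpolate(y0, y1, tau)
target = velocity_target(y0, y1)

# Moving back along the true velocity by tau recovers the clean chunk.
recovered = [[yt - tau * v for yt, v in zip(rt, rv)]
             for rt, rv in zip(y_tau, target)]
print(recovered)
```

The last step shows why a good velocity estimate is enough for generation: integrating it from the noise end of the path back toward zero lands on the clean action chunk.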

To keep padded entries from dominating the gradient, the loss is applied per channel and per step with two-level averaging. For a sample with $c$ active channels and $H_{\text{task}}$ time steps, the mask $M$ is defined as above. The mean squared error for each active channel $k < c$ is:

$$\ell_k = \frac{\sum_{h=1}^{H} M_{h,k} \big\| v_\theta(Y_\tau, \tau \mid o_{1:t}, x, e, z) - (Y_1 - Y_0) \big\|_{h,k}^2}{\sum_{h=1}^{H} M_{h,k}}$$

The per-channel losses are then averaged uniformly over the $c$ active channels:

$$L_{\text{act}} = \mathbb{E}_{\tau, Y_0, Y_1} \left[ \frac{1}{c} \sum_{k=0}^{c-1} \ell_k \right]$$
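A single-sample version of the two-level average might look as follows. The function name and the toy numbers are hypothetical, the predicted velocity is passed in directly instead of coming from a network, and the expectation over $\tau$, $Y_0$, $Y_1$ would be approximated by averaging this value over a batch.

```python
def masked_action_loss(pred_velocity, y0, y1, mask):
    """Mean over valid steps per channel, then mean over active channels."""
    horizon, num_channels = len(mask), len(mask[0])
    channel_losses = []
    for k in range(num_channels):
        valid_steps = sum(mask[h][k] for h in range(horizon))
        if valid_steps == 0:
            continue  # padded channel: contributes nothing
        sq_err = sum(
            mask[h][k] * (pred_velocity[h][k] - (y1[h][k] - y0[h][k])) ** 2
            for h in range(horizon)
        )
        channel_losses.append(sq_err / valid_steps)
    if not channel_losses:
        raise ValueError("sample has no active channels")
    return sum(channel_losses) / len(channel_losses)


# Toy sample: H=2, K=3, with c=2 active channels; the third channel is padding.
mask = [[1, 1, 0], [1, 1, 0]]
y0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
y1 = [[1.0, 2.0, 9.0], [3.0, 4.0, 9.0]]
pred = [[1.0, 1.0, 100.0], [3.0, 2.0, 100.0]]  # large error sits only in padding

loss = masked_action_loss(pred, y0, y1, mask)
print(loss)
```

Channel 0 is predicted exactly, channel 1 has squared errors 1 and 4 (mean 2.5), and the padded channel is ignored, so the loss is (0 + 2.5) / 2 = 1.25 despite the large error in the padding.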

Vision-language loss: Standard next-token prediction loss on auxiliary vision-language data:

$$L_{\text{vl}} = -\sum_i \log p_\theta(w_i \mid w_{<i}, o_{1:t})$$

Joint objective: Weighted combination:

$$L = \lambda_{\text{act}} L_{\text{act}} + \lambda_{\text{vl}} L_{\text{vl}}$$

Training Recipe

A four-stage progressive recipe:

Stage I: Text-to-action DiT pretraining (T2A): Freeze VLM, train only DiT conditioned on text and embodiment prompt, but without images. Decoder learns language-to-action decompression, building structured action prior.

Stage II: Continued pretraining (CPT): Unfreeze both modules, train on heterogeneous mixture (Section 3.2). Focuses on grounding actions in visual observations.

Stage III: Supervised fine-tuning (SFT): Branch into two parallel tracks:

  1. Multi-task SFT on heterogeneous tasks (VQA, spatial grounding, manipulation, navigation).
  2. Fine-tuning on in-house teleoperation data for real-world deployment.

Stage IV: Reinforcement learning (RL): Starting from the multi-task SFT checkpoint, the model is fine-tuned with sparse binary success rewards in simulation (SimplerEnv) using PPO with GAE. Policy optimization objective:

$$L_{\text{actor}}(\theta) = -\mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$, $\hat{A}_t$ is the GAE advantage estimate, and $\epsilon = 0.2$. Total loss:

$$L(\theta) = L_{\text{actor}}(\theta) + c_v L_{\text{value}}(\theta)$$

with $c_v = 1$.
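The clipped objective can be sketched on scalars. The function names and the two sample transitions are hypothetical; advantages are taken as given (the GAE computation is not shown), and the value loss is passed in as a number because its exact form is not specified here.

```python
import math


def ppo_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative mean of the clipped surrogate, with eps = 0.2 as in the text."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)             # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def total_loss(actor_loss, value_loss, c_v=1.0):
    """L = L_actor + c_v * L_value, with c_v = 1 as in the text."""
    return actor_loss + c_v * value_loss


# Two made-up transitions: ratio 1.5 with positive advantage, ratio 0.5 with negative.
logp_old = [0.0, 0.0]
logp_new = [math.log(1.5), math.log(0.5)]
advantages = [1.0, -1.0]

actor = ppo_actor_loss(logp_new, logp_old, advantages)
print(round(actor, 6), round(total_loss(actor, value_loss=0.3), 6))
```

In both transitions the clipped term is the smaller one (1.2 instead of 1.5, and -0.8 instead of -0.5), so the update is bounded once the ratio leaves the interval [0.8, 1.2].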

Log-probability estimation under flow matching: Convert deterministic probability-flow ODE into corresponding SDE by injecting controlled noise at each Euler denoising step, enabling analytic computation of Gaussian log-probability.
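Once noise is injected, each denoising step becomes a Gaussian transition, so its log-probability has a closed form. The sketch below assumes an isotropic Gaussian per step and takes the step mean and standard deviation as inputs; how they are derived from the velocity field and the noise schedule is not specified here, and all names and numbers are illustrative.

```python
import math


def gaussian_step_log_prob(x_next, mean, sigma):
    """Log-density of x_next under N(mean, sigma^2 I), summed over dimensions."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - 0.5 * ((x - m) / sigma) ** 2
               for x, m in zip(x_next, mean))


def trajectory_log_prob(steps):
    """Sum of per-step log-probabilities along one denoising trajectory."""
    return sum(gaussian_step_log_prob(x, m, s) for x, m, s in steps)


# Two hypothetical denoising steps over a 2-dimensional action.
steps = [
    ([0.9, -0.2], [1.0, 0.0], 0.5),   # (sampled next state, step mean, step std)
    ([0.6, -0.1], [0.7, -0.1], 0.3),
]
print(trajectory_log_prob(steps))
```

The summed value is the kind of quantity a PPO ratio needs: evaluating the same sampled trajectory under the old and new policies gives the two log-probabilities whose difference defines $r_t(\theta)$.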

Pretraining Data

Large-scale heterogeneous mixture spanning five families:

Table 1: Pretraining data mixture composition.

Data Source | Proportion (%)
Robot Manipulation Trajectories | 74.2
Human Egocentric Trajectories | 6.0
Navigation Trajectories | 7.5
Synthetic Simulation Trajectories (ours) | 3.7
General Vision-Language Data | 3.4
Spatial Grounding (2D) | 2.5
Autonomous Driving VQA | 2.4
Fine-Grained Embodied Action Caption | 0.2
Total | 100.0
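One way to use such a mixture is to draw the source of each training example with probability proportional to its share. The sketch below assumes the Table 1 proportions act as per-example sampling weights, which the text does not state; the function name and seed are illustrative.

```python
import random
from collections import Counter

# Proportions (%) from Table 1. The listed entries sum to 99.9 because of rounding.
MIXTURE = {
    "Robot Manipulation Trajectories": 74.2,
    "Human Egocentric Trajectories": 6.0,
    "Navigation Trajectories": 7.5,
    "Synthetic Simulation Trajectories": 3.7,
    "General Vision-Language Data": 3.4,
    "Spatial Grounding (2D)": 2.5,
    "Autonomous Driving VQA": 2.4,
    "Fine-Grained Embodied Action Caption": 0.2,
}


def sample_sources(n, seed=0):
    """Draw n data sources with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())   # random.choices normalizes the weights
    return rng.choices(names, weights=weights, k=n)


batch = sample_sources(1000)
for name, count in Counter(batch).most_common(3):
    print(f"{name}: {count}")
```

With weights this skewed, small sources such as the action-caption data appear only a handful of times per thousand draws, which is worth keeping in mind when reading per-source results.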

Robot Manipulation Trajectories: Core of pretraining corpus (74.2%). Includes public datasets (RobotSet, Galaxea, AgiBot World, RoboCOIN, RoboMIND V1/V2, RDT-1B, DROID, BridgeData V2, RH20T, RT-1, BC-Z) and proprietary datasets. Covers tabletop, mobile, bimanual, dexterous control.

Embodiment-aware prompt conditioning: Each example prepended with robot-specific prompt.

Table 2: Representative robot embodiments in the pretraining corpus.

Robot | Arms | Action type
WidowX | Single | $\Delta$ EEF + G
Google Robot | Single | $\Delta$ EEF + G
Franka Panda | Single / Dual | $\Delta$ EEF + G; Abs Joint + G
ARX5 | Dual | $\Delta$ EEF + G
Fourier GR-1 | Dual | $\Delta$ EEF + G
Mobile ALOHA | Dual | $\Delta$ EEF + G; Abs Joint + G
AgiBot A2-D | Dual | Abs Joint + G; Abs Joint + DH
Galaxea R1 | Dual | Abs Joint + G
AIRBOT MMK2 | Dual | Abs Joint + DH
TienKung | Dual | Abs Joint + G; Abs Joint + DH
Real Human | Dual | $\Delta$ EEF (from MANO)

Action representation: Preserve each dataset's original action format. Normalize per dataset using quantile statistics:

$$\tilde{a}_d = 2 \cdot \frac{a_d - q^k_{01}}{q^k_{99} - q^k_{01}} - 1$$

clipped to $[-1, 1]$.
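The normalization can be sketched for a single action dimension. The sketch assumes the 1st and 99th percentiles are computed per dimension within each dataset and uses linear interpolation between order statistics; the helper names and the toy data are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def fit_quantiles(values):
    """Return (q01, q99) for one action dimension of one dataset."""
    ordered = sorted(values)
    return quantile(ordered, 0.01), quantile(ordered, 0.99)


def normalize(a, q01, q99):
    """Map [q01, q99] linearly onto [-1, 1] and clip outliers."""
    if q99 == q01:
        raise ValueError("degenerate quantile range")
    scaled = 2.0 * (a - q01) / (q99 - q01) - 1.0
    return max(-1.0, min(1.0, scaled))


# Toy dimension with values 0, 1, ..., 100.
values = [float(v) for v in range(101)]
q01, q99 = fit_quantiles(values)
print(q01, q99)
print(normalize(50.0, q01, q99), normalize(0.0, q01, q99), normalize(100.0, q01, q99))
```

Using the 1st and 99th percentiles instead of the minimum and maximum keeps a few extreme actions from compressing the rest of the range; those extremes are simply clipped to the boundary.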

Egocentric Human Data (6.0%): From Ego4D, EPIC-KITCHENS (processed by VITRA), EgoDex, EgoVerse, Xperience. Action representation: for each hand, wrist motion is an SE(3) transformation (6 dimensions) and hand articulation is obtained via PCA on the 45-dimensional axis-angle joint pose, retaining the first 10 principal components ("eigengrasps"). Total: 2 × (6 + 10) = 32 action dimensions per time step.

Synthetic Simulation Data (3.7%): Two components:

  1. Vision-language-action data: Generated via internal pipeline (ROBOINF). Contains 359,848 full successful trajectories including subtask segments.
  2. Language-action data: Text-only action dataset for semantic/behavioral pretraining. Six task template families (pick-and-place, pushing, pulling, rotation, rotation toward viewpoint, positional swapping). ~7.2M trajectories (~14,000 hours).

Navigation Data (7.5%): Instruction-following (4.3%), object-searching (2.3%), target-tracking (1.0%).

Vision-language Data (8.5% combined):

  • Fine-grained embodied action caption (0.2%): ~48,000 video-caption pairs with dense step-by-step descriptions.
  • Autonomous driving VQA (2.4%): Focuses on temporal scene understanding, surround-view spatial reasoning, language-grounded localization, planning-aware reasoning.
  • Spatial grounding (2.5%): 2D bounding box grounding data.
  • General vision-language data (3.4%): Curated mixture for robust visual perception and language grounding.

Empirical Validation / Results

Main Results

Table 4: Robot manipulation results across benchmarks: specialists vs. a single generalist.

Method | Type | LIBERO, RoboCasa-GR1, Simpler-WidowX, RoboTwin-Easy, RoboTwin-Hard
$\pi_0$ (Black et al., 2024) | Specialist | 94.4 65.9 58.4
StarVLA-OFT (Community, 2026) | Specialist | 96.6 48.8 64.6 50.4
GR00T N1.6 (NVIDIA et al., 2025) | Specialist | 97.2 49.9 63.2 47.6
$\pi_{0.5}$ (Black et al., 2025) | Specialist | 97.6 37.0 46.9 82.7 76.8
ABot-M0 (Yang et al., 2026) | Specialist | 98.6 58.3 86.0 85.0
Being-H0.5 (Luo et al., 2026) | Specialist | 97.6 53.3
Qwen-VLA-Base | Generalist | 90.8 40.4 64.3 64.3 66.4
Qwen-VLA-Instruct | Generalist | 97.9 56.7 73.7 86.1 87.2

Key Findings:

  • Single generalist (Qwen-VLA-Instruct) outperforms most specialists across benchmarks.
  • Pretraining provides a strong foundation; instruction tuning yields substantial gains over Qwen-VLA-Base (+7.1 points on LIBERO, +16.3 on RoboCasa-GR1, +9.4 on Simpler-WidowX, +21.8 on RoboTwin-Easy, +20.8 on RoboTwin-Hard).
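The reported gains follow directly from the Qwen-VLA-Base and Qwen-VLA-Instruct rows of Table 4, as this short check shows. The variable names are illustrative; the numbers are copied from the table.

```python
BENCHMARKS = ["LIBERO", "RoboCasa-GR1", "Simpler-WidowX",
              "RoboTwin-Easy", "RoboTwin-Hard"]
BASE = [90.8, 40.4, 64.3, 64.3, 66.4]        # Qwen-VLA-Base row
INSTRUCT = [97.9, 56.7, 73.7, 86.1, 87.2]    # Qwen-VLA-Instruct row

# Absolute differences in success rate, i.e. percentage points.
gains = {name: round(after - before, 1)
         for name, before, after in zip(BENCHMARKS, BASE, INSTRUCT)}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The differences are absolute (percentage points), not relative improvements; a relative figure for RoboCasa-GR1, for example, would be considerably larger than 16.3.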

Real-World Manipulation Results (ALOHA)

Table 5: In-domain performance across short-horizon and long-horizon task categories.

Model | Pick and Place | Table Cleaning | Bowl Stacking | Bowl Pick & Place | Towel Folding | Fine-grained Manipulation | Avg.
GR00T N1.6 (NVIDIA et al., 2025) | 30.8 | 38.5 | 53.8 | 19.2 | 19.2 | 10.3 | 28.6
$\pi_{0.5}$ (Black et al., 2025) | 73.1 | 84.6 | 88.5 | 69.2 | 80.8 | 33.3 | 71.6
Qwen-VLA-aloha w/o pretrain | 30.8 | 53.8 | 61.5 | 64.1 | 50.0 | 30.8 | 48.5
Qwen-VLA-aloha w/ pretrain | 96.2 | 92.3 | 98.7 | 87.2 | 65.4 | 61.5 | 83.6

Table 6: OOD performance across generalization categories.

Model | Color | Instance | Position | Background | Instruction | Avg.
GR00T N1.6 (NVIDIA et al., 2025) | 46.2 | 38.5 | 3.8 | 19.2 | 19.2 | 25.4
$\pi_{0.5}$ (Black et al., 2025) | 57.7 | 61.5 | 19.2 | 26.9 | 42.3 | 41.5
Qwen-VLA-aloha w/o pretrain | 42.3 | 30.8 | 34.6 | 30.8 | 42.3 | 36.2
Qwen-VLA-aloha w/ pretrain | 88.5 | 76.9 | 53.8 | 80.8 | 84.6 | 76.9

Key Findings:

  • Pretraining provides critical foundation: Qwen-VLA-aloha w/ pretrain achieves 83.6% average success vs. 48.5% for w/o pretrain.
  • Strong OOD generalization: 76.9% average OOD success, outperforming $\pi_{0.5}$ by 35.4 percentage points.
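The OOD averages and the gap quoted above can be reproduced from the per-category numbers in Table 6. The check assumes the Avg. column is the unweighted mean of the five categories; the dictionary keys are shortened labels chosen for the example.

```python
# Per-category OOD success (%) from Table 6:
# Color, Instance, Position, Background, Instruction.
OOD = {
    "GR00T N1.6": [46.2, 38.5, 3.8, 19.2, 19.2],
    "pi_0.5": [57.7, 61.5, 19.2, 26.9, 42.3],
    "Qwen-VLA-aloha w/o pretrain": [42.3, 30.8, 34.6, 30.8, 42.3],
    "Qwen-VLA-aloha w/ pretrain": [88.5, 76.9, 53.8, 80.8, 84.6],
}

averages = {model: round(sum(scores) / len(scores), 1)
            for model, scores in OOD.items()}
gap = round(averages["Qwen-VLA-aloha w/ pretrain"] - averages["pi_0.5"], 1)

print(averages)
print(f"gap over pi_0.5: {gap} percentage points")
```

All four computed means agree with the Avg. column, which supports reading it as a simple mean over categories.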

Navigation Results

Table 7: Comparison with open-source baselines on VLN-CE.

Method | R2R Val-Unseen (NE ↓, OS ↑, SR ↑, SPL ↑) | RxR Val-Unseen (NE ↓, SR ↑, SPL ↑, nDTW ↑)
NaVid (Zhang et al., 2024) | 5.7 49.2 41.9 36.5 | 5.7 45.7 38.2 –
Uni-NaVid (Zhang et al., 2025b) | 5.6 53.3 47.0 42.7 | 6.2 48.7