UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Summary (Overview)

  • Core Innovation: Introduces UniT, a unified latent action tokenizer that uses visual anchoring and cross-reconstruction to project heterogeneous human and humanoid actions into a shared, embodiment-agnostic discrete latent space, creating a "unified physical language."
  • Dual Application: UniT is successfully deployed in two paradigms: VLA-UniT for policy learning achieves superior data efficiency, out-of-distribution (OOD) generalization, and zero-shot task transfer; WM-UniT for world modeling enables effective human-to-humanoid dynamics transfer and improves action-conditioned video generation.
  • Key Mechanism: A tri-branch architecture with a rigorous cross-reconstruction objective forces visual and action features to reconstruct each other, anchoring kinematics to visual outcomes and filtering out irrelevant noise, thereby distilling pure physical intent.
  • Empirical Validation: Demonstrates state-of-the-art performance on the RoboCasa GR1 simulation benchmark, strong real-world deployment on the IRON-R01-1.11 humanoid, and effective cross-embodiment conditioning for world models, validated by t-SNE visualizations showing aligned representations.
  • Scalable Path: Provides a data-driven, scalable alternative to manual motion retargeting, enabling the distillation of vast human knowledge into general-purpose humanoid capabilities for both control and simulation.

Introduction and Theoretical Foundation

Scaling foundation models for humanoids is fundamentally bottlenecked by the scarcity of high-quality robotic data. While massive, low-cost egocentric human motion data offers a scalable alternative rich in physical priors, leveraging it requires bridging a major cross-embodiment gap caused by biomechanical and hardware differences (heterogeneous state-action spaces, mismatched degrees of freedom). Traditional methods like motion retargeting are labor-intensive and unscalable.

Existing approaches for unified representations have critical limitations:

  • Action-Only Methods: Rely on proprioceptive reconstruction and suffer from severe distribution shifts between humans and robots due to lack of external grounding.
  • Vision-Only Methods: Infer intent directly from pixels but entangle with low-level appearance confounders (textures, lighting) and miss fine-grained motor details.
  • Decoupled Vision-Action Methods: Employ independent tokenizers for each modality, resulting in disjoint vocabularies without deep representational unification.

UniT's Theoretical Insight: While human and humanoid kinematics differ, the physical outcomes of their intents share a consistent visual representation. Therefore, visual observations can serve as a universal anchor to ground and align disparate kinematic spaces. UniT operationalizes this via cross-reconstruction, acting as a cross-modal information bottleneck to distill the underlying, embodiment-agnostic physical intent.

Methodology

3.1 Overview

The goal is to establish a unified physical language bridging human and humanoid action spaces. Given demonstrations as sequences $(o_t, s_t, a_t)$, UniT learns a discrete latent action representation. This token representation is then deployed in:

  1. VLA-UniT: Policy learning by predicting future action chunks via UniT token prediction.
  2. WM-UniT: World modeling by predicting future visual observations using UniT tokens as a universal conditioning signal.

3.2 UniT: Unified Latent Action Tokenizer via Vision Anchoring

UniT uses a tri-branch architecture (visual, action, fusion) with cross-reconstruction (Fig. 3).

Tri-Branch Encoding:

  • Visual branch $E_v$: an Inverse Dynamics Model (IDM) that takes frozen DINOv2 features of the observation pair $(o_t, o_{t+k})$ and produces a latent representation of the physical transition.
  • Action branch $E_a$: encodes the current state $s_t$ and action chunk $a_{t:t+k}$. Raw actions are padded and projected by embodiment-specific MLPs.
  • Fusion branch $E_m$: takes features from the vision and action branches and produces a fused visuo-motor latent.

Shared Discrete Quantization: Continuous latents from all three branches are quantized via a shared Residual Quantization (RQ-VAE) codebook $\mathcal{C}$:

$$\hat{z}_i = \text{RQ}(z_i; \mathcal{C}), \quad i \in \{v, a, m\}.$$
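The shared residual quantization step can be sketched as follows. This is a minimal greedy RQ pass; the codebook size, depth, and straight-through training details are assumptions, since the summary only names the RQ-VAE scheme. Because $z_v$, $z_a$, and $z_m$ all pass through the same codebook, their tokens share one vocabulary.

```python
import numpy as np

def residual_quantize(z, codebook, depth=4):
    """Greedy residual quantization against one shared codebook, as in
    RQ-VAE: each level quantizes what the previous levels missed.
    (Illustrative sketch; codebook contents and depth are assumptions.)"""
    residual = z.copy()
    indices = []
    quantized = np.zeros_like(z)
    for _ in range(depth):
        # pick the codeword nearest to the current residual
        dists = np.sum((codebook - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        quantized = quantized + codebook[k]
        residual = residual - codebook[k]
    return indices, quantized
```

The returned index list is the discrete "word" for the transition; summing the selected codewords recovers the quantized latent $\hat{z}_i$.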

Cross-Reconstruction: Every quantized token $\hat{z}_i$ is decoded by both a shared visual decoder $D_v$ and an embodiment-specific action decoder $D_a$:

$$\hat{f}^{(i)}_{t+k} = D_v(\hat{z}_i, f_t), \quad \hat{a}^{(i)}_{t:t+k} = D_a(\hat{z}_i, s_t),$$

where $f_t$ is the DINOv2 feature of $o_t$. Conditioning the visual decoder on $f_t$ forces the token to capture the relative physical change rather than absolute appearance.

Training Objective: The total loss aggregates cross-reconstruction and quantization terms:

$$\mathcal{L} = \sum_{i \in \{v,a,m\}} \left[ \lambda_v \mathcal{L}_{\text{cos}}(\hat{f}^{(i)}_{t+k}, f_{t+k}) + \lambda_a \mathcal{L}_{\text{act}}(\hat{a}^{(i)}_{t:t+k}, a_{t:t+k}) \right] + \mathcal{L}_{\text{RQ}}.$$
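The objective above can be sketched numerically. The MSE form of $\mathcal{L}_{\text{act}}$ and the unit weights are assumptions (the summary only names $\mathcal{L}_{\text{cos}}$ and $\mathcal{L}_{\text{act}}$), and the quantization term $\mathcal{L}_{\text{RQ}}$ is omitted for brevity:

```python
import numpy as np

def cosine_loss(pred, target):
    """1 - cosine similarity, averaged over feature tokens (L_cos)."""
    num = np.sum(pred * target, axis=-1)
    den = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + 1e-8
    return float(np.mean(1.0 - num / den))

def unit_loss(recons, f_target, a_target, lam_v=1.0, lam_a=1.0):
    """Cross-reconstruction objective summed over the branch tokens.
    `recons[i]` = (f_hat, a_hat) decoded from branch i's quantized token.
    The MSE form of L_act and the weights lam_v/lam_a are assumptions."""
    total = 0.0
    for f_hat, a_hat in recons.values():  # i in {v, a, m}
        total += lam_v * cosine_loss(f_hat, f_target)
        total += lam_a * float(np.mean((a_hat - a_target) ** 2))
    return total
```

Note that every branch token is penalized against both the visual and the action target, which is the bidirectional coupling that anchors kinematics to visual outcomes.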

3.3 VLA-UniT: Cross Embodiment Policy Learning via UniT

VLA-UniT is built upon the GR00T framework with a Qwen2.5-VL backbone; policy learning is decomposed into:

  1. UniT Token Prediction: The VLM, conditioned on $(o_t, \ell)$, predicts UniT discrete codes $c_t$ via learnable queries $q_t$: $\hat{p}_t = f_{\text{VLM}}(o_t, \ell, q_t), \quad \mathcal{L}_{\text{token}} = \text{CE}(\hat{p}_t, c_t)$.
  2. Flow Matching Action Generation: A lightweight flow head generates embodiment-specific actions $A_t = [a_t, \dots, a_{t+H-1}]$ conditioned on VLM features $x_t$: $\mathcal{L}_{\text{fm}} = \mathbb{E}_{\tau,\epsilon}\left[ \| V_\theta(A^\tau_t \mid x_t, \text{Enc}(o_t), \tau) - (A_t - \epsilon) \|^2_2 \right]$.

The total objective is $\mathcal{L}_{\text{VLA}} = \mathcal{L}_{\text{token}} + \lambda_{\text{fm}} \mathcal{L}_{\text{fm}}$.
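The two-part objective can be sketched as below. `v_pred` stands in for the flow head's output $V_\theta$, and the default weight `lam_fm` is an assumption; only the CE-plus-velocity-regression structure comes from the summary:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy for one predicted UniT code position."""
    logits = logits - logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[target_idx] + 1e-12))

def vla_loss(token_logits, unit_codes, v_pred, a_chunk, eps, lam_fm=1.0):
    """Sketch of L_VLA: CE over predicted UniT codes plus flow-matching
    velocity regression toward (A_t - eps). `v_pred` stands in for
    V_theta's output; lam_fm is an assumed default weight."""
    l_token = float(np.mean([cross_entropy(l, c)
                             for l, c in zip(token_logits, unit_codes)]))
    l_fm = float(np.mean((v_pred - (a_chunk - eps)) ** 2))
    return l_token + lam_fm * l_fm
```

The token branch supervises *what* physical change to produce in the shared vocabulary, while the flow branch supervises *how* to realize it in the robot's own action space.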

3.4 WM-UniT: Cross Embodiment World Modeling via UniT

Built upon Cosmos Predict 2.5, WM-UniT uses UniT action-branch features as a unified conditioning interface in place of raw actions. Given state $s_t$ and action chunk $a_{t:t+k}$, the continuous pre-quantization feature $\tilde{z}^a_t = E_a(s_t, a_{t:t+k})$ is projected and injected via cross-attention. The generation model is trained with flow matching on latent future frames $X$:

$$\mathcal{L}_{\text{WM}} = \mathbb{E}_{\tau,\epsilon}\left[ \| V_\phi(X^\tau_t \mid o_t, \text{MLP}(\tilde{z}^a_t), \tau) - (X_t - \epsilon) \|^2_2 \right].$$
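The cross-attention injection pathway can be sketched in isolation. Everything here (single head, projection shapes, residual form) is an illustrative assumption, not the Cosmos Predict 2.5 internals; the summary only states that $\tilde{z}^a_t$ is MLP-projected and injected via cross-attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_action_condition(video_tokens, z_a, W_q, W_k, W_v):
    """Single-head cross-attention sketch of the conditioning pathway:
    video latent tokens (queries) attend to projected UniT action
    features (keys/values) and receive a residual update. Shapes and
    weights are illustrative assumptions."""
    q = video_tokens @ W_q                  # (T, d)
    k = z_a @ W_k                           # (M, d)
    v = z_a @ W_v                           # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, M)
    return video_tokens + attn @ v          # residual injection
```

Because the conditioning vector comes from $E_a$ rather than from raw joint values, the same interface accepts human or humanoid action chunks without retargeting.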

Empirical Validation / Results

5.1 Unified Representation: Alignment and Robustness

  • Cross-Embodiment Token Alignment: t-SNE visualizations (Fig. 7a) show raw human and humanoid actions form separated clusters, while UniT token embeddings become highly overlapping, confirming successful projection into a shared manifold.
  • Robustness to Action Noise: When Gaussian noise is injected into action trajectories, UniT demonstrates superior denoising capability compared to action-only tokenizers (FAST, Action Tokenizer). At noise level $\sigma = 0.2$, UniT's reconstruction degrades by only 1.7×, versus 2.7× for action-only and 10.7× for FAST (Fig. 8).
  • Downstream Representation Alignment: t-SNE of internal features from VLA and WM models shows that UniT-based models produce highly overlapping cross-embodiment representations, whereas vanilla baselines maintain separated clusters (Fig. 7b,c).

5.2 Policy Learning

5.2.1 Benchmark Performance and Data Efficiency

  • Overall Performance (RoboCasa GR1, Full Data): VLA-UniT achieves a 66.7% overall success rate, outperforming all baselines (Fig. 9). It surpasses the GR00T baseline (47.8%) by 18.9 percentage points.
  • Data Efficiency: With only 10% of the training data (few-shot), VLA-UniT achieves 45.5% success, approaching the GR00T baseline trained on full data (47.8%), demonstrating a ~10× reduction in data requirements (Fig. 10 left).

5.2.2 Human-to-Humanoid Transfer (Simulation)

Co-training on EgoDex human data and few-shot robot data improves performance:

  • In-domain average increases from 45.5% to 50.0%.
  • OOD generalization improves across Unseen Appearance, Combinations, and Object Types, with OOD average rising from 34.7% to 38.5% (Fig. 10 right).

5.2.3 Real-World Generalization (IRON-R01-1.11)

  • In-Domain Performance: VLA-UniT with human co-training achieves 78% (Pick & Place) and 75% (Pouring) success, significantly outperforming the GR00T baseline (30%, 5%) (Fig. 11 left).
  • OOD Generalization: Across five OOD axes (Geometry, Distractor, Target, Background, Combinational), VLA-UniT with human data consistently achieves the strongest performance, with gains up to ~40% in some categories (Fig. 11 right).
  • Zero-Shot Task Transfer: On an unseen stacking task, VLA-UniT with human co-training achieves 60% success, transferring both task logic and emergent upper-body coordination (waist rotation, head turning) from human videos (Fig. 12).

5.3 World Modeling

5.3.1 Controllable Generation

Table 1: Controllable generation results on DROID and co-training datasets.

| Dataset | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FVD ↓ | EPE ↓ |
|---|---|---|---|---|---|---|
| DROID | Raw Action | 21.02 | 0.820 | 0.097 | 76.38 | 0.2662 |
| DROID | WM-Action | 20.86 | 0.819 | 0.102 | 80.30 | 0.2593 |
| DROID | WM-UniT | 21.32 | 0.823 | 0.095 | 76.44 | 0.2588 |
| EgoDex (Co-train) | Raw Action | 24.84 | 0.800 | 0.164 | 171.37 | 0.706 |
| EgoDex (Co-train) | WM-UniT | 28.06 | 0.858 | 0.086 | 130.87 | 0.519 |
| RoboCasa-GR1 (Co-train) | Raw Action | 13.45 | 0.590 | 0.259 | 237.13 | 0.558 |
| RoboCasa-GR1 (Co-train) | WM-UniT | 17.66 | 0.718 | 0.142 | 166.50 | 0.453 |

WM-UniT achieves the best controllability (lowest EPE) and improves reconstruction metrics.

5.3.2 Human-Humanoid Transfer

  • Co-Training: WM-UniT outperforms Raw Action conditioning on both human and humanoid subsets when co-trained (Table 1).
  • Pre-Training: Pre-training on human data (EgoDex) then fine-tuning on robot data (RoboCasa) improves all metrics, especially controllability (Table 2).
  • Cross-Embodiment Conditioning: Qualitative (Fig. 13, 14) and quantitative evaluations (Table 3) show that UniT tokens enable more faithful action transfer between embodiments (human→robot, robot→human), preserving fine-grained semantics, magnitude sensitivity, and temporal coherence better than raw actions.

Table 3: Cross-embodiment conditioning consistency (MLLM evaluation, 1-5 scale).

| Direction | Method | Semantic ↑ | Temporal ↑ | Geometric ↑ | Overall ↑ |
|---|---|---|---|---|---|
| Robot-to-Human | Raw Action | 2.96 | 3.12 | 2.74 | 2.92 |
| Robot-to-Human | WM-UniT | 3.91 | 3.98 | 3.66 | 3.84 |
| Human-to-Robot | Raw Action | 2.98 | 3.16 | 2.72 | 2.95 |
| Human-to-Robot | WM-UniT | 3.28 | 3.43 | 3.09 | 3.27 |

5.4 Tokenizer Design Ablation

Ablation studies on RoboCasa GR1 (with human co-training) validate UniT's design principles (Fig. 15):

  • Vision-Action Synergy: VLA-UniT (OOD avg. 49.9%) outperforms both single-modality variants: VLA-Vision (45.2%) and VLA-Action (42.1%).
  • Cross-Reconstruction Necessity: VLA-UniT w/o Cross-Recon performs worst (30.3%), showing multi-modal input alone is insufficient without explicit alignment.
  • Bidirectional Advantage: VLA-UniT (66.8% in-domain) outperforms VLA-Villa (63.1%) which uses unidirectional vision-to-action reconstruction.

Theoretical and Practical Implications

  • Theoretical: Provides a principled, data-driven framework for cross-embodiment alignment based on the insight of visual anchoring. The cross-reconstruction mechanism offers a general method for distilling modality-invariant concepts.
  • Practical for Robotics:
    • Scalable Human Knowledge Transfer: Enables effective use of massive, low-cost human data for humanoid policy learning and world modeling, bypassing the need for manual, case-by-case motion retargeting.
    • Improved Data Efficiency & Generalization: The structured, shared latent space allows policies to learn intent more efficiently from limited data and generalize better to OOD scenarios.
    • Unified Interface for Embodied AI: The same UniT token representation serves as a stable interface for both policy learning (target) and world modeling (condition), enabling potential closed-loop co-evolution (e.g., planning via latent space search).
  • Broader AI Impact: The visual-anchored approach opens a path to leverage unlabeled internet video as a source of physical priors. The framework's scalability suggests potential for learning complex coordination and dexterous control directly from diverse human demonstrations.

Conclusion

UniT establishes a unified physical language via visual-anchored cross-reconstruction, effectively bridging the human-humanoid chasm. It demonstrates significant gains in policy learning (data efficiency, OOD generalization, zero-shot transfer) and world modeling (improved controllability, cross-embodiment dynamics transfer). The core design principles—vision-action synergy and bidirectional cross-reconstruction—are empirically validated as essential. UniT provides a scalable, data-driven path to distill vast human knowledge into general-purpose humanoid capabilities, with promising future directions in leveraging unlabeled video and enabling closed-loop policy-world model co-evolution.