# UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

> UniT introduces a unified latent action tokenizer that uses visual anchoring and cross-reconstruction to align human and humanoid actions into a shared space, enabling superior policy transfer and world modeling.

- **Source:** [arXiv](https://arxiv.org/abs/2604.19734)
- **Published:** 2026-04-25
- **Permalink:** https://picx.dev/p/SabpM3
- **Whiteboard:** https://picx.dev/p/SabpM3/image

## Summary (Overview)
*   **Core Innovation:** Introduces UniT, a unified latent action tokenizer that uses visual anchoring and cross-reconstruction to project heterogeneous human and humanoid actions into a shared, embodiment-agnostic discrete latent space, creating a "unified physical language."
*   **Dual Application:** UniT is successfully deployed in two paradigms: **VLA-UniT** for policy learning achieves superior data efficiency, out-of-distribution (OOD) generalization, and zero-shot task transfer; **WM-UniT** for world modeling enables effective human-to-humanoid dynamics transfer and improves action-conditioned video generation.
*   **Key Mechanism:** A tri-branch architecture trained with a **cross-reconstruction objective** forces visual and action features to reconstruct each other. This anchors kinematics to visual outcomes and filters out noise irrelevant to the physical transition, distilling embodiment-agnostic physical intent.
*   **Empirical Validation:** Demonstrates state-of-the-art performance on the RoboCasa GR1 simulation benchmark, strong real-world deployment on the IRON-R01-1.11 humanoid, and effective cross-embodiment conditioning for world models, validated by t-SNE visualizations showing aligned representations.
*   **Scalable Path:** Provides a data-driven, scalable alternative to manual motion retargeting, enabling the distillation of vast human knowledge into general-purpose humanoid capabilities for both control and simulation.

## Introduction and Theoretical Foundation
Scaling foundation models for humanoids is fundamentally bottlenecked by the scarcity of high-quality robotic data. While massive, low-cost egocentric human motion data offers a scalable alternative rich in physical priors, leveraging it requires bridging a major **cross-embodiment gap** caused by biomechanical and hardware differences (heterogeneous state-action spaces, mismatched degrees of freedom). Traditional methods like motion retargeting are labor-intensive and unscalable.

Existing approaches for unified representations have critical limitations:
*   **Action-Only Methods:** Rely on proprioceptive reconstruction and suffer from severe distribution shifts between humans and robots due to lack of external grounding.
*   **Vision-Only Methods:** Infer intent directly from pixels but entangle with low-level appearance confounders (textures, lighting) and miss fine-grained motor details.
*   **Decoupled Vision-Action Methods:** Employ independent tokenizers for each modality, resulting in disjoint vocabularies without deep representational unification.

**UniT's Theoretical Insight:** While human and humanoid kinematics differ, the **physical outcomes of their intents share a consistent visual representation**. Therefore, visual observations can serve as a **universal anchor** to ground and align disparate kinematic spaces. UniT operationalizes this via **cross-reconstruction**, acting as a cross-modal information bottleneck to distill the underlying, embodiment-agnostic physical intent.

## Methodology

### 3.1 Overview
The goal is to establish a unified physical language bridging human and humanoid action spaces. Given demonstrations as sequences $(o_t, s_t, a_t)$ of observations, states, and actions, UniT learns a discrete latent action representation. This token representation is then deployed in two ways:
1.  **VLA-UniT:** Policy learning by predicting future action chunks via UniT token prediction.
2.  **WM-UniT:** World modeling by predicting future visual observations using UniT tokens as a universal conditioning signal.

### 3.2 UniT: Unified Latent Action Tokenizer via Vision Anchoring
UniT uses a **tri-branch architecture** (visual, action, fusion) with **cross-reconstruction** (Fig. 3).

**Tri-Branch Encoding:**
*   **Visual branch $E_v$:** Inverse Dynamics Model (IDM). Takes frozen DINOv2 features of observation pair $(o_t, o_{t+k})$ to produce a latent representation of the physical transition.
*   **Action branch $E_a$:** Encodes current state $s_t$ and action chunk $a_{t:t+k}$. Raw actions are padded and projected by embodiment-specific MLPs.
*   **Fusion branch $E_m$:** Takes features from vision and action branches to produce a fused visuo-motor latent.
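
The padding and projection step of the action branch can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shared width `MAX_DIM`, the action dimensions, the output size, and the single random linear layer standing in for each embodiment-specific MLP are all assumptions.

```python
import random

MAX_DIM = 8  # hypothetical shared width after padding; not from the paper


def pad_action(action, max_dim=MAX_DIM):
    """Zero-pad a raw action vector to the shared width."""
    if len(action) > max_dim:
        raise ValueError("action is wider than the shared width")
    return list(action) + [0.0] * (max_dim - len(action))


def make_linear(in_dim, out_dim, seed):
    """Random linear layer standing in for an embodiment-specific MLP."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply


# One projector per embodiment (names and sizes are illustrative).
projectors = {
    "human": make_linear(MAX_DIM, 4, seed=0),
    "humanoid": make_linear(MAX_DIM, 4, seed=1),
}


def encode_action_input(embodiment, action):
    """Pad the raw action, then apply that embodiment's projector."""
    return projectors[embodiment](pad_action(action))


human_feat = encode_action_input("human", [0.1, -0.2, 0.3])
robot_feat = encode_action_input("humanoid", [0.5, 0.0, -0.1, 0.2, 0.4])
```

Both embodiments end up with features of the same width, which is what lets a single downstream encoder and codebook consume them.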

**Shared Discrete Quantization:** Continuous latents from all three branches are quantized via a shared Residual Quantization (RQ-VAE) codebook $\mathcal{C}$:
$$
\hat{z}_i = \text{RQ}(z_i; \mathcal{C}), \quad i \in \{v, a, m\}.
$$
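
To make the shared quantization concrete, here is a minimal sketch of residual quantization in plain Python. The two-level toy codebook, the latent values, and the function names are assumptions for illustration; the summary does not state the number of levels or codebook size used in UniT.

```python
def nearest_code(vec, codebook):
    """Return (index, code) of the codebook entry closest to vec (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )
    return best, codebook[best]


def residual_quantize(z, codebooks):
    """Quantize z level by level; each level encodes the remaining residual."""
    residual = list(z)
    quantized = [0.0] * len(z)
    indices = []
    for codebook in codebooks:
        idx, code = nearest_code(residual, codebook)
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Toy two-level codebook (values are illustrative only).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],        # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # fine level
]

# The same codebooks quantize latents from all three branches (v, a, m).
latents = {"v": [0.9, 0.1], "a": [1.2, 0.05], "m": [1.1, 0.0]}
tokens = {name: residual_quantize(z, CODEBOOKS) for name, z in latents.items()}
```

Because all branches index into the same codebooks, nearby visual, action, and fused latents map to the same or similar code sequences, which is the sense in which the vocabulary is shared.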

**Cross-Reconstruction:** Every quantized token $\hat{z}_i$ is decoded by both a shared visual decoder $D_v$ and an embodiment-specific action decoder $D_a$:
$$
\hat{f}^{(i)}_{t+k} = D_v(\hat{z}_i, f_t), \quad \hat{a}^{(i)}_{t:t+k} = D_a(\hat{z}_i, s_t),
$$
where $f_t$ is the DINOv2 feature of $o_t$. This forces the token to capture relative physical change.

**Training Objective:** The total loss aggregates cross-reconstruction and quantization terms:
$$
\mathcal{L} = \sum_{i \in \{v,a,m\}} \left[ \lambda_v \mathcal{L}_{\text{cos}}(\hat{f}^{(i)}_{t+k}, f_{t+k}) + \lambda_a \mathcal{L}_{\text{act}}(\hat{a}^{(i)}_{t:t+k}, a_{t:t+k}) \right] + \mathcal{L}_{\text{RQ}}.
$$
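
The structure of this objective can be sketched in a few lines. The sketch assumes mean squared error for $\mathcal{L}_{\text{act}}$ and unit weights for $\lambda_v$ and $\lambda_a$, neither of which is specified in this summary; the decoder outputs are passed in as plain lists rather than computed by real networks.

```python
import math


def cosine_loss(pred, target):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm


def action_loss(pred, target):
    """Mean squared error; the exact form of L_act is an assumption here."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def total_loss(decoded, f_next, actions, rq_loss, lam_v=1.0, lam_a=1.0):
    """Sum both reconstruction terms over the three branches, then add L_RQ.

    decoded maps a branch name to (predicted feature, predicted action chunk),
    i.e. the outputs of D_v and D_a for that branch's quantized token.
    """
    loss = rq_loss
    for branch in ("v", "a", "m"):
        f_hat, a_hat = decoded[branch]
        loss += lam_v * cosine_loss(f_hat, f_next) + lam_a * action_loss(a_hat, actions)
    return loss


F_NEXT = [1.0, 0.0]     # toy target feature f_{t+k}
ACTIONS = [0.5, -0.5]   # toy flattened action chunk a_{t:t+k}

# Every branch reconstructs both targets perfectly, so only L_RQ remains.
perfect = {b: (F_NEXT, ACTIONS) for b in ("v", "a", "m")}
loss_perfect = total_loss(perfect, F_NEXT, ACTIONS, rq_loss=0.1)
```

The key point is that the vision-only token must still reconstruct actions and the action-only token must still reconstruct the future visual feature, so no branch can ignore the other modality.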

### 3.3 VLA-UniT: Cross Embodiment Policy Learning via UniT
Built upon the GR00T framework with a Qwen2.5-VL backbone, VLA-UniT decomposes policy learning into two steps:
1.  **UniT Token Prediction:** The VLM, conditioned on the observation $o_t$ and language instruction $\ell$, predicts UniT discrete codes $c_t$ via learnable queries $q_t$.
    $$
    \hat{p}_t = f_{\text{VLM}}(o_t, \ell, q_t), \quad \mathcal{L}_{\text{token}} = \text{CE}(\hat{p}_t, c_t).
    $$
2.  **Flow Matching Action Generation:** A lightweight flow head generates embodiment-specific actions $A_t = [a_t, ..., a_{t+H-1}]$ conditioned on VLM features $x_t$:
    $$
    \mathcal{L}_{\text{fm}} = \mathbb{E}_{\tau,\epsilon} \left[ \| V_\theta(A^\tau_t | x_t, \text{Enc}(o_t), \tau) - (A_t - \epsilon) \|^2_2 \right].
    $$
The total objective is $\mathcal{L}_{\text{VLA}} = \mathcal{L}_{\text{token}} + \lambda_{\text{fm}} \mathcal{L}_{\text{fm}}$.
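
The two loss terms can be sketched as below. This is an illustration under stated assumptions: the noised chunk is taken to be the linear interpolation $A^\tau_t = \tau A_t + (1-\tau)\epsilon$ (the summary does not define $A^\tau_t$, but this path is consistent with the target $A_t - \epsilon$), the conditioning inputs $x_t$ and $\text{Enc}(o_t)$ are omitted, and `oracle_velocity` is a hypothetical stand-in for the learned network $V_\theta$.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Token loss CE(p_hat, c) for a single code, computed from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def interpolate(actions, noise, tau):
    """Noised chunk A^tau = tau * A + (1 - tau) * eps (assumed linear path)."""
    return [tau * a + (1.0 - tau) * e for a, e in zip(actions, noise)]


def flow_matching_loss(velocity_fn, actions, rng, num_samples=16):
    """Monte Carlo estimate of E_{tau, eps} ||V(A^tau, tau) - (A - eps)||^2."""
    total = 0.0
    for _ in range(num_samples):
        tau = rng.random()  # uniform in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in actions]
        noised = interpolate(actions, noise, tau)
        target = [a - e for a, e in zip(actions, noise)]
        pred = velocity_fn(noised, tau)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / num_samples


def vla_loss(token_loss, fm_loss, lam_fm=1.0):
    """L_VLA = L_token + lambda_fm * L_fm."""
    return token_loss + lam_fm * fm_loss


ACTIONS = [0.3, -0.7, 1.2]  # toy flattened action chunk


def oracle_velocity(noised, tau):
    """Cheats by knowing ACTIONS: (A - A^tau) / (1 - tau) equals A - eps."""
    return [(a - x) / (1.0 - tau) for a, x in zip(ACTIONS, noised)]


fm = flow_matching_loss(oracle_velocity, ACTIONS, random.Random(0))
```

The oracle drives the flow-matching loss to numerical zero, which confirms that the regression target and the interpolation path are mutually consistent under the stated assumption.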

### 3.4 WM-UniT: Cross Embodiment World Modeling via UniT
WM-UniT is built upon Cosmos Predict 2.5 and uses **UniT action-branch features**, rather than raw actions, as a unified conditioning interface. Given state $s_t$ and action chunk $a_{t:t+k}$, the continuous pre-quantization feature $\tilde{z}^a_t = E_a(s_t, a_{t:t+k})$ is projected and injected via cross-attention. The generation model is trained with flow matching on latent future frames $X$:
$$
\mathcal{L}_{\text{WM}} = \mathbb{E}_{\tau,\epsilon} \left[ \| V_\phi(X^\tau_t | o_t, \text{MLP}(\tilde{z}^a_t), \tau) - (X_t - \epsilon) \|^2_2 \right].
$$
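
The cross-attention injection can be illustrated with a single-head, projection-free sketch. It assumes the projected action feature is presented as a short sequence of conditioning tokens and omits the learned query, key, and value projections; neither detail is given in this summary, and the names `video_queries` and `action_tokens` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs


# Toy video latent tokens (queries) and projected UniT action tokens (keys/values).
video_queries = [[1.0, 0.0], [0.0, 1.0]]
action_tokens = [[1.0, 0.0], [0.0, 1.0]]

conditioned = cross_attention(video_queries, action_tokens, action_tokens)
```

Because the conditioning tokens come from $E_a$ rather than from raw actions, the world model sees the same input interface whether the action originated from a human or a humanoid.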

## Empirical Validation / Results

### 5.1 Unified Representation: Alignment and Robustness
*   **Cross-Embodiment Token Alignment:** t-SNE visualizations (Fig. 7a) show raw human and humanoid actions form separated clusters, while **UniT token embeddings become highly overlapping**, confirming successful projection into a shared manifold.
*   **Robustness to Action Noise:** When Gaussian noise is injected into action trajectories, UniT demonstrates superior denoising capability compared to action-only tokenizers (FAST, Action Tokenizer). At noise level $\sigma = 0.2$, UniT's reconstruction degrades by only **1.7×**, versus **2.7×** for action-only and **10.7×** for FAST (Fig. 8).
*   **Downstream Representation Alignment:** t-SNE of internal features from VLA and WM models shows that **UniT-based models produce highly overlapping cross-embodiment representations**, whereas vanilla baselines maintain separated clusters (Fig. 7b,c).

### 5.2 Policy Learning
#### 5.2.1 Benchmark Performance and Data Efficiency
*   **Overall Performance (RoboCasa GR1, Full Data):** VLA-UniT achieves a **66.7%** overall success rate, outperforming all baselines (Fig. 9). It surpasses the GR00T baseline (47.8%) by **18.9 percentage points**.
*   **Data Efficiency:** With only 10% of the training data (few-shot), VLA-UniT achieves **45.5%** success, approaching the GR00T baseline trained on full data (47.8%), demonstrating a **~10×** reduction in data requirements (Fig. 10 left).
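
As a worked check of these figures (the ratios are derived here from the reported success rates, not quoted from the paper):

$$
66.7 - 47.8 = 18.9 \ \text{points}, \qquad \frac{66.7}{47.8} \approx 1.40, \qquad \frac{45.5}{47.8} \approx 0.95.
$$

In other words, the full-data gain is an absolute difference of 18.9 points (roughly a 40% relative improvement), and VLA-UniT trained on 10% of the data reaches about 95% of the success rate that GR00T achieves with the full dataset.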

#### 5.2.2 Human-to-Humanoid Transfer (Simulation)
Co-training on EgoDex human data and few-shot robot data improves performance:
*   In-domain average increases from **45.5% to 50.0%**.
*   OOD generalization improves across Unseen Appearance, Combinations, and Object Types, with OOD average rising from **34.7% to 38.5%** (Fig. 10 right).

#### 5.2.3 Real-World Generalization (IRON-R01-1.11)
*   **In-Domain Performance:** VLA-UniT with human co-training achieves **78%** (Pick & Place) and **75%** (Pouring) success, significantly outperforming the GR00T baseline (30% and 5%, respectively) (Fig. 11 left).
*   **OOD Generalization:** Across five OOD axes (Geometry, Distractor, Target, Background, Combinational), VLA-UniT with human data consistently achieves the strongest performance, with gains up to **~40%** in some categories (Fig. 11 right).
*   **Zero-Shot Task Transfer:** On an unseen stacking task, VLA-UniT with human co-training achieves **60%** success, transferring both task logic and emergent upper-body coordination (waist rotation, head turning) from human videos (Fig. 12).

### 5.3 World Modeling
#### 5.3.1 Controllable Generation
**Table 1: Controllable generation results on DROID and co-training datasets.**
| Dataset | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FVD ↓ | EPE ↓ |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: |
| **DROID** | Raw Action | 21.02 | 0.820 | 0.097 | 76.38 | 0.2662 |
| **DROID** | WM-Action | 20.86 | 0.819 | 0.102 | 80.30 | 0.2593 |
| **DROID** | **WM-UniT** | **21.32** | **0.823** | **0.095** | 76.44 | **0.2588** |
| **EgoDex** (co-train) | Raw Action | 24.84 | 0.800 | 0.164 | 171.37 | 0.706 |
| **EgoDex** (co-train) | **WM-UniT** | **28.06** | **0.858** | **0.086** | **130.87** | **0.519** |
| **RoboCasa-GR1** (co-train) | Raw Action | 13.45 | 0.590 | 0.259 | 237.13 | 0.558 |
| **RoboCasa-GR1** (co-train) | **WM-UniT** | **17.66** | **0.718** | **0.142** | **166.50** | **0.453** |

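WM-UniT achieves the best controllability (lowest EPE) on every dataset and improves PSNR, SSIM, and LPIPS throughout. The one exception is FVD on DROID, where it is marginally behind Raw Action (76.44 vs. 76.38).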
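
The size of these gains is easier to compare across datasets as deltas. The short script below copies the Raw Action and WM-UniT rows from Table 1 and derives the PSNR gain and relative EPE reduction; the derived quantities and helper names are additions for illustration, not values reported in the source.

```python
# (PSNR, SSIM, LPIPS, FVD, EPE) copied from Table 1.
TABLE1 = {
    "DROID": {
        "raw": (21.02, 0.820, 0.097, 76.38, 0.2662),
        "unit": (21.32, 0.823, 0.095, 76.44, 0.2588),
    },
    "EgoDex": {
        "raw": (24.84, 0.800, 0.164, 171.37, 0.706),
        "unit": (28.06, 0.858, 0.086, 130.87, 0.519),
    },
    "RoboCasa-GR1": {
        "raw": (13.45, 0.590, 0.259, 237.13, 0.558),
        "unit": (17.66, 0.718, 0.142, 166.50, 0.453),
    },
}

PSNR, EPE = 0, 4  # column positions in the tuples above


def psnr_gain(dataset):
    """Absolute PSNR improvement of WM-UniT over Raw Action, in dB."""
    row = TABLE1[dataset]
    return row["unit"][PSNR] - row["raw"][PSNR]


def epe_reduction(dataset):
    """Relative EPE reduction of WM-UniT versus Raw Action (0.1 means 10%)."""
    row = TABLE1[dataset]
    return (row["raw"][EPE] - row["unit"][EPE]) / row["raw"][EPE]


for name in TABLE1:
    print(f"{name}: PSNR +{psnr_gain(name):.2f} dB, EPE -{100 * epe_reduction(name):.1f}%")
```

The co-training datasets show much larger deltas than DROID, which matches the claim that the unified conditioning signal helps most when human and humanoid data are mixed.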

#### 5.3.2 Human-Humanoid Transfer
*   **Co-Training:** WM-UniT outperforms Raw Action conditioning on both human and humanoid subsets when co-trained (Table 1).
*   **Pre-Training:** Pre-training on human data (EgoDex) then fine-tuning on robot data (RoboCasa) improves all metrics, especially controllability (Table 2).
*   **Cross-Embodiment Conditioning:** Qualitative (Fig. 13, 14) and quantitative evaluations (Table 3) show that UniT tokens enable more faithful action transfer between embodiments (human→robot, robot→human), preserving fine-grained semantics, magnitude sensitivity, and temporal coherence better than raw actions.

**Table 3: Cross-embodiment conditioning consistency (MLLM evaluation, 1-5 scale).**
| Method | Semantic ↑ | Temporal ↑ | Geometric ↑ | Overall ↑ |
| :--- | :---: | :---: | :---: | :---: |
| **Robot-to-Human** | | | | |
| Raw Action | 2.96 | 3.12 | 2.74 | 2.92 |
| **WM-UniT** | **3.91** | **3.98** | **3.66** | **3.84** |
| **Human-to-Robot** | | | | |
| Raw Action | 2.98 | 3.16 | 2.72 | 2.95 |
| **WM-UniT** | **3.28** | **3.43** | **3.09** | **3.27** |

### 5.4 Tokenizer Design Ablation
Ablation studies on RoboCasa GR1 (with human co-training) validate UniT's design principles (Fig. 15):
*   **Vision-Action Synergy:** VLA-UniT (OOD avg. **49.9%**) outperforms both single-modality variants: VLA-Vision (45.2%) and VLA-Action (42.1%).
*   **Cross-Reconstruction Necessity:** VLA-UniT w/o Cross-Recon performs worst (**30.3%**), showing multi-modal input alone is insufficient without explicit alignment.
*   **Bidirectional Advantage:** VLA-UniT (**66.8%** in-domain) outperforms VLA-Villa (63.1%), which uses unidirectional vision-to-action reconstruction.

## Theoretical and Practical Implications
*   **Theoretical:** Provides a principled, data-driven framework for cross-embodiment alignment based on the insight of **visual anchoring**. The cross-reconstruction mechanism offers a general method for distilling modality-invariant concepts.
*   **Practical for Robotics:**
    *   **Scalable Human Knowledge Transfer:** Enables effective use of massive, low-cost human data for humanoid policy learning and world modeling, bypassing the need for manual, case-by-case motion retargeting.
    *   **Improved Data Efficiency & Generalization:** The structured, shared latent space allows policies to learn intent more efficiently from limited data and generalize better to OOD scenarios.
    *   **Unified Interface for Embodied AI:** The same UniT token representation serves as a stable interface for both **policy learning (target)** and **world modeling (condition)**, enabling potential closed-loop co-evolution (e.g., planning via latent space search).
*   **Broader AI Impact:** The visual-anchored approach opens a path to leverage **unlabeled internet video** as a source of physical priors. The framework's scalability suggests potential for learning complex coordination and dexterous control directly from diverse human demonstrations.

## Conclusion
UniT establishes a **unified physical language** via visual-anchored cross-reconstruction, effectively bridging the human-humanoid chasm. It demonstrates significant gains in policy learning (data efficiency, OOD generalization, zero-shot transfer) and world modeling (improved controllability, cross-embodiment dynamics transfer). The core design principles—**vision-action synergy** and **bidirectional cross-reconstruction**—are empirically validated as essential. UniT provides a scalable, data-driven path to distill vast human knowledge into general-purpose humanoid capabilities, with promising future directions in leveraging unlabeled video and enabling closed-loop policy-world model co-evolution.
