Summary (Overview)

  • Massive data scaling: Humanoid-GPT is pre-trained on a curated 2B-frame motion corpus, over 200× larger than prior humanoid tracking datasets, enabling unprecedented zero-shot generalization.
  • GPT-style architecture: Adopts a causal Transformer with temporal attention, aligning with online tracking constraints and scaling cleanly with data and model size, unlike MLPs that saturate.
  • Harmonic Motion Embedding (HME): A novel representation learning tool that measures and organizes motion diversity, enabling diversity-aware balanced sampling during training – key for generalization.
  • Scaling laws established: The first systematic scaling law for humanoid motion tracking, showing predictable improvements with data scale and model capacity (80M-parameter models outperform 22M-parameter ones).
  • Zero-shot real-world transfer: Achieves real-time whole-body tracking on the Unitree-G1 humanoid for unseen dance motions and teleoperation tasks without any fine-tuning, with inference latency under 1.5ms.

Introduction and Theoretical Foundation

Embodied AGI requires robust whole-body control under unseen tasks, styles, and environments. In language and vision, scaling (larger data, larger models) is the proven path to generalization. However, humanoid motion tracking has not followed this trajectory: current trackers are shallow MLPs trained on small corpora (≈7.2M frames), leading to an agility–generalization trade-off. Systems like BeyondMimic and ASAP track agile motions well but do not generalize; TWIST and UniTracker generalize modestly but struggle with dynamic actions.

The paper argues this trade-off is not fundamental but a symptom of insufficient scale and mismatched training design. Three decisive questions emerge:

  1. Data: What data to train on, and how to process large, noisy data?
  2. Model structure: What architecture matches online tracking constraints and scales with data?
  3. Training recipe: What recipe stays stable when data grows from millions to billions of frames?

Humanoid-GPT answers these by constructing a diverse 2B-frame corpus, adopting a GPT-style causal Transformer, and introducing Harmonic Motion Embedding (HME) for diversity-aware sampling.

Methodology

Data Curation

The corpus aggregates the major mocap datasets (AMASS, LAFAN1, MotionMillion, PHUMA) plus in-house recordings. Sequences involving object interactions, such as sitting and swimming, are filtered out, and all remaining motions are retargeted to the 29-DoF Unitree-G1 joint space. Time-warping augmentation then expands the data 5× to add temporal variability, yielding a 2B-frame corpus with high fidelity and diversity.
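As a rough illustration of the time-warping step, the sketch below resamples a retargeted clip at several playback speeds with linear interpolation. The speed factors, the function name `time_warp`, and the use of per-DoF linear interpolation are assumptions made for this example; the summary only states that time-warping multiplies the data by 5.

```python
import numpy as np

def time_warp(motion: np.ndarray, speed: float) -> np.ndarray:
    """Resample a (T, D) joint-angle sequence so it plays `speed` times faster."""
    n_frames, n_dof = motion.shape
    n_out = max(2, int(round((n_frames - 1) / speed)) + 1)
    src_t = np.arange(n_frames)
    dst_t = np.linspace(0, n_frames - 1, n_out)
    # Interpolate every degree of freedom independently along the time axis.
    # Assumes angles do not wrap around; wrapped angles would need unwrapping first.
    return np.stack(
        [np.interp(dst_t, src_t, motion[:, d]) for d in range(n_dof)], axis=1
    )

# Hypothetical speed factors: the original clip plus four warped copies (5x data).
SPEEDS = (0.8, 0.9, 1.0, 1.1, 1.2)

clip = np.random.default_rng(0).normal(size=(100, 29))  # 100 frames, 29 DoF
augmented = [time_warp(clip, s) for s in SPEEDS]
```

Slower speeds produce longer clips and faster speeds shorter ones, while the first and last poses are preserved.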

Harmonic Motion Embedding (HME)

To balance coverage and training efficiency, motions are clustered. A Periodic Autoencoder extracts per-joint periodic amplitudes and frequencies. For each sequence, the mean and standard deviation of these features form the HME vector. K-Means clustering on HME embeddings yields ≈300 clusters (1k–2k sequences each) with strong intra-cluster consistency and broad coverage.
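The pipeline above (periodic features, then per-sequence mean and standard deviation, then K-Means) can be sketched as follows. This is not the paper's implementation: a windowed FFT stands in for the learned Periodic Autoencoder, and the window size, frame rate, and helper names (`periodic_features`, `hme_vector`, `kmeans`) are all assumptions for illustration.

```python
import numpy as np

def periodic_features(motion, fps=30.0, window=30, hop=15):
    """FFT stand-in for the Periodic Autoencoder (an assumption, not the paper's model).
    Returns per-window dominant amplitude and frequency per joint: (n_windows, 2 * J)."""
    freqs = np.fft.rfftfreq(window, d=1.0 / fps)
    joints = np.arange(motion.shape[1])
    feats = []
    for start in range(0, motion.shape[0] - window + 1, hop):
        seg = motion[start:start + window]
        seg = seg - seg.mean(axis=0)                       # remove the DC offset
        spec = np.abs(np.fft.rfft(seg, axis=0)) * 2.0 / window
        peak = spec[1:].argmax(axis=0) + 1                 # dominant non-DC bin per joint
        feats.append(np.concatenate([spec[peak, joints], freqs[peak]]))
    return np.asarray(feats)

def hme_vector(motion, **kwargs):
    """Mean and standard deviation of the periodic features over one sequence."""
    f = periodic_features(motion, **kwargs)
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    labels = ((x[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    return labels, centers

# Toy data: five slow (1 Hz) and five fast (4 Hz) three-joint clips.
rng = np.random.default_rng(0)
t = np.arange(120) / 30.0

def toy_clip(freq_hz):
    return np.stack(
        [np.sin(2 * np.pi * freq_hz * t + rng.uniform(0, 6.28)) for _ in range(3)], axis=1
    )

clips = [toy_clip(1.0) for _ in range(5)] + [toy_clip(4.0) for _ in range(5)]
embeddings = np.stack([hme_vector(c) for c in clips])
labels, _ = kmeans(embeddings, k=2)
```

On this toy data the slow and fast clips fall into separate clusters, which is the behavior the clustering step relies on at the scale of ≈300 clusters.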

Motion Experts

For each cluster, a PPO-based policy is trained to track sequences. The policy maps reference joints and proprioceptive states to motor actions. The keypoint-level reward is:

\[
\begin{aligned}
R_{\text{kpt}}(t) &= R_{\text{pos}}(t) + R_{\text{rot}}(t) + R_{\text{vel}}(t) + R_{\text{penal}}(t),\\
R_{\text{pos}}(t) &= \sum_{k \in \mathcal{K}} w_k \exp\!\left(-\alpha_{\text{pos}} \| e^{\text{pos}}_{k,t} \|_1\right),\\
R_{\text{rot}}(t) &= \sum_{k \in \mathcal{K}} w_k \exp\!\left(-\alpha_{\text{rot}}\, \theta_{k,t}\right),\\
R_{\text{vel}}(t) &= \sum_{k \in \mathcal{K}} w_k \exp\!\left(-\alpha_{\text{vel}} \| e^{\text{vel}}_{k,t} \|_1\right).
\end{aligned}
\]

where \(e^{\text{pos}}_{k,t}\) and \(e^{\text{vel}}_{k,t}\) are position/velocity residuals, \(\theta_{k,t}\) is rotation error, and \(w_k\) are keypoint weights.
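A direct NumPy transcription of the keypoint reward may make the structure easier to see. The function name, argument layout, and default α values are assumptions for illustration; the summary does not give the actual coefficients or the form of the penalty term, so the penalty is passed in as a precomputed number.

```python
import numpy as np

def keypoint_reward(pos_err, rot_err, vel_err, weights,
                    a_pos=1.0, a_rot=1.0, a_vel=1.0, penalty=0.0):
    """Keypoint tracking reward at one timestep.

    pos_err, vel_err: (K, 3) residuals per keypoint
    rot_err:          (K,) rotation error (angle) per keypoint
    weights:          (K,) keypoint weights w_k
    The alpha defaults are placeholders, not values from the paper.
    """
    r_pos = np.sum(weights * np.exp(-a_pos * np.abs(pos_err).sum(axis=-1)))  # L1 norm
    r_rot = np.sum(weights * np.exp(-a_rot * rot_err))
    r_vel = np.sum(weights * np.exp(-a_vel * np.abs(vel_err).sum(axis=-1)))  # L1 norm
    return r_pos + r_rot + r_vel + penalty

weights = np.array([0.5, 0.5])
perfect = keypoint_reward(np.zeros((2, 3)), np.zeros(2), np.zeros((2, 3)), weights)
```

With weights that sum to 1, perfect tracking gives the maximum of 3 (one unit per exponential term) before penalties, and each term decays smoothly as its error grows.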

Distillation into Humanoid-GPT

A causal Transformer (GPT-style) is trained via DAgger to distill all expert behaviors. At each timestep \(t\), the input token \(e_t\) concatenates proprioceptive state \(s_t\) and reference pose \(q_t^{\text{ref}}\). A sequence of length \(H\) is fed through the Transformer with causal masking. Actions at all output positions are supervised by expert actions in parallel:

\[
\begin{aligned}
\hat{a}_{t-H+1:t} &= \bigcup_{t_i \in \mathcal{T}} \text{concat}_{k \in [-H+1,0]}\, t_i\big(s^{\text{priv}}_{t-k}, g_{t-k}\big),\\
\mathcal{L} &= \text{SmoothL1Loss}\big(G_\theta(e_{t-H+1:t}), \hat{a}_{t-H+1:t}\big).
\end{aligned}
\]
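The two ingredients of this objective, a causal attention mask and a Smooth L1 loss over every output position, can be sketched in NumPy. The student and expert actions here are random placeholders, and `beta=1.0` follows the common default for Smooth L1 rather than anything stated in the summary.

```python
import numpy as np

def causal_mask(horizon):
    """Boolean (H, H) mask: position i may attend to positions j <= i only."""
    return np.tril(np.ones((horizon, horizon), dtype=bool))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss, averaged over all positions and action dims."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()

H, A = 4, 29                                    # history length, action dimension
rng = np.random.default_rng(0)
student_actions = rng.normal(size=(H, A))       # placeholder for the Transformer output
expert_actions = rng.normal(size=(H, A))        # placeholder for expert labels
loss = smooth_l1(student_actions, expert_actions)
mask = causal_mask(H)
```

Because the mask keeps each position from seeing later tokens, all \(H\) positions can be supervised in one forward pass without leaking future observations.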

During inference, a queue of history tokens is maintained; the last output is the control target.
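A minimal sketch of that inference loop, assuming the policy is any callable that maps a stack of tokens to one action per position. The class name `OnlineTracker` and the toy cumulative-sum policy are hypothetical stand-ins used only to show the queue behavior.

```python
from collections import deque
import numpy as np

class OnlineTracker:
    """Keeps the last `horizon` tokens and returns the action at the final position."""

    def __init__(self, policy, horizon):
        self.policy = policy                      # callable: (n, D) tokens -> (n, A) actions
        self.history = deque(maxlen=horizon)      # oldest token is dropped automatically

    def step(self, state, ref_pose):
        token = np.concatenate([state, ref_pose])   # e_t = [s_t, q_t^ref]
        self.history.append(token)
        tokens = np.stack(list(self.history))       # (n <= H, D)
        actions = self.policy(tokens)
        return actions[-1]                          # only the latest output drives the robot

# Toy causal "policy": running sum of the tokens seen so far.
toy_policy = lambda tokens: np.cumsum(tokens, axis=0)

tracker = OnlineTracker(toy_policy, horizon=3)
outputs = [tracker.step(np.array([float(i)]), np.array([10.0 * i])) for i in range(5)]
```

During the first few steps the queue holds fewer than \(H\) tokens; how the real system pads or warms up the history is not specified in the summary.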

Empirical Validation / Results

Data Diversity Analysis

Three diversity indicators are computed from HME embeddings:

\[
\text{gstd} = \exp\!\left(\frac{1}{D}\sum_{j=1}^{D} \log\sigma_j\right),\quad
\text{log-volume} = \frac{1}{2}\log\det(\Sigma + \epsilon I).
\]
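The two indicators whose formulas are given above can be computed from a matrix of embeddings as follows. This sketch assumes \(\sigma_j\) is the per-dimension standard deviation and \(\Sigma\) the sample covariance of the embeddings; the value of \(\epsilon\) and the small floor on \(\sigma_j\) are choices made for the example.

```python
import numpy as np

def diversity_indicators(emb, eps=1e-6):
    """Return (gstd, log_volume) for an (N, D) array of embeddings."""
    sigma = np.maximum(emb.std(axis=0), 1e-12)          # floor avoids log(0); an assumption
    gstd = np.exp(np.mean(np.log(sigma)))               # geometric mean of per-dim std
    cov = np.cov(emb, rowvar=False)                     # (D, D) sample covariance
    _, logdet = np.linalg.slogdet(cov + eps * np.eye(emb.shape[1]))
    return gstd, 0.5 * logdet

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 4))
gstd_small, vol_small = diversity_indicators(emb)
gstd_large, vol_large = diversity_indicators(2.0 * emb)
```

Doubling the spread of the embeddings doubles gstd and adds roughly \(D \log 2\) to the log-volume, which gives a feel for how the two indicators respond to broader coverage.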

The curated 2B dataset shows ~4–5× increase in log-volume over AMASS (Fig. 3), demonstrating broader latent coverage.

Simulation Results (Table 2)

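Backbone       | #Train Tokens | #Params (M) | SR ↑  | MPJPE ↓ | MPJVE ↓ | RootVelErr ↓ | MPKPE ↓
---------------|---------------|-------------|-------|---------|---------|--------------|--------
MLP (3-layer)  | 2M            | 0.25        | 76.89 | 0.1191  | 0.6081  | 0.2304       | 100.49
TCN (8-layer)  | 2M            | 0.65        | 81.48 | 0.0885  | 0.5716  | 0.2266       | 79.75
Humanoid-GPT-S | 2M            | 5.7         | 83.26 | 0.0853  | 0.5492  | 0.2049       | 62.65
Humanoid-GPT-S | 20M           | 5.7         | 86.02 | 0.0802  | 0.5210  | 0.1868       | 46.49
Humanoid-GPT-B | 200M          | 22.1        | 88.27 | 0.0793  | 0.5076  | 0.1820       | 44.78
Humanoid-GPT-B | 2B            | 22.1        | 90.43 | 0.0768  | 0.4891  | 0.1756       | 41.49
Humanoid-GPT-L | 2B            | 80.4        | 92.58 | 0.0735  | 0.4820  | 0.1785       | 40.99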

Transformers continue to improve with scaling, while MLPs saturate and can even overfit on small data.

Real-world Evaluation (Table 3)

Humanoid-GPT outperforms the baselines (GMT, TWIST, Any2Track) on four unseen dance motions (e.g., "Can Do", "Gokuraku Joudo") without fine-tuning. Its real-world MPJPE and MPJVE closely match the simulation results.

Scaling Laws (Section 6)

  • Data scaling: MPJPE decreases monotonically from 2M to 2B tokens (Fig. 7).
  • Model scaling: Transformers improve steadily; MLPs saturate early (Fig. 8).
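To make the data-scaling trend concrete, the sketch below fits a power law to the MPJPE values of the four Humanoid-GPT rows in Table 2 that differ in training tokens. Note the caveat: those rows mix the S and B backbones, so this fit is only illustrative and is not the scaling law reported in the paper. The function name `predict_mpjpe` is hypothetical.

```python
import numpy as np

# MPJPE from Table 2 (GPT-S at 2M and 20M tokens, GPT-B at 200M and 2B tokens).
tokens = np.array([2e6, 2e7, 2e8, 2e9])
mpjpe = np.array([0.0853, 0.0802, 0.0793, 0.0768])

# Power law MPJPE ~ c * N^slope becomes a straight line in log-log space.
slope, intercept = np.polyfit(np.log10(tokens), np.log10(mpjpe), deg=1)

def predict_mpjpe(n_tokens):
    """Evaluate the fitted power law at a given token count."""
    return 10 ** (intercept + slope * np.log10(n_tokens))
```

A negative `slope` corresponds to error falling as data grows; extrapolating beyond 2B tokens with such a small, mixed sample should be treated with caution.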

Engineering Optimization

Deployment uses an ONNX + TensorRT + C++ pipeline. End-to-end inference latency is under 1.5 ms on an RTX 4090 (≈5× faster than TWIST, Fig. 5).

Theoretical and Practical Implications

  • Scaling works for humanoid control: The paper provides the first systematic evidence that scaling data and model capacity yields predictable improvements in agility and zero-shot generalization, challenging the notion of an inherent trade-off.
  • HME enables balanced diversity: Demonstrates that diversity alone is insufficient – balance (distribution-aware sampling) is necessary to avoid overfitting frequent motion modes.
  • GPT-style structure aligns with online control: Causal attention respects deployment constraints (no future observations) and naturally exploits parallel supervision (DAgger).
  • Real-world feasibility: Careful engineering (ONNX, TensorRT) shows that large Transformer models (80M params) can run in real time on moderate GPU hardware, enabling practical deployment.
  • Foundation for generalist whole-body control: The approach is a step toward embodied foundation models that can execute arbitrary motions without task-specific tuning.
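The balanced-sampling point in the list above can be illustrated with a small sampler that draws clusters before sequences. The summary does not describe the exact sampling rule, so the `temperature` knob and the function name `balanced_sample` are assumptions: temperature 0 weights all clusters equally, and temperature 1 reduces to uniform sampling over sequences.

```python
import numpy as np

def balanced_sample(cluster_ids, n, rng, temperature=0.0):
    """Draw n sequence indices with cluster probability proportional to size**temperature."""
    _, inverse, counts = np.unique(cluster_ids, return_inverse=True, return_counts=True)
    cluster_p = counts.astype(float) ** temperature
    cluster_p /= cluster_p.sum()
    # Spread each cluster's probability evenly over its member sequences.
    seq_p = (cluster_p / counts)[inverse]
    return rng.choice(len(cluster_ids), size=n, p=seq_p)

# Toy corpus: one frequent motion mode (900 clips) and one rare mode (100 clips).
cluster_ids = np.array([0] * 900 + [1] * 100)
rng = np.random.default_rng(0)
balanced = balanced_sample(cluster_ids, 20000, rng, temperature=0.0)
proportional = balanced_sample(cluster_ids, 20000, rng, temperature=1.0)
```

Under balanced sampling the rare mode makes up about half of the draws instead of about a tenth, which is the mechanism that keeps frequent motion modes from dominating training.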

Conclusion

Humanoid-GPT is a zero-shot humanoid motion tracker built by scaling data to 2B frames and model capacity to 80M parameters using a GPT-style causal Transformer. Key innovations include:

  • A large-scale curated and retargeted motion corpus.
  • HME for diversity-aware clustering.
  • Distillation of hundreds of PPO experts into a single Transformer via DAgger.
  • Demonstration of scaling laws for humanoid tracking.

Experiments show robust zero-shot transfer from simulation to real hardware, tracking highly dynamic motions (dance, teleoperation) without fine-tuning.

Future work includes:

  • Incorporating richer modalities (contacts, vision, language).
  • Extending to interactive / multi-agent scenarios.
  • Coupling with longer-horizon planning or VLA-style instruction for general-purpose embodied foundation models.
