Summary of "HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents"
Summary (Overview)
- Model Family: Introduces HY-Embodied-0.5, a family of Vision-Language Models (VLMs) specifically designed for real-world embodied agents, comprising an efficient MoT-2B model (2B activated params) for edge deployment and a powerful MoE-A32B model (32B activated params) for complex reasoning.
- Core Innovations: Proposes a Mixture-of-Transformers (MoT) architecture for modality-adaptive computing, visual latent tokens to bridge vision and language, and an iterative, self-evolving post-training paradigm combining reinforcement learning (RL) and rejection sampling fine-tuning (RFT) to enhance reasoning.
- Training Strategy: Employs a comprehensive data pipeline with over 100M samples spanning visual perception, spatial reasoning, and embodied tasks, followed by on-policy distillation to transfer capabilities from the large to the small model.
- Strong Performance: The MoT-2B model outperforms similarly-sized state-of-the-art models on 16 out of 22 benchmarks. The MoE-A32B model achieves an average score of 67.0%, surpassing frontier models like Gemini 3.0 Pro (63.6%).
- Real-World Validation: The foundation model is successfully adapted into a Vision-Language-Action (VLA) model for robot control, achieving compelling success rates (e.g., 75%-85%) in real-world physical manipulation tasks like packing, stacking, and hanging.
Introduction and Theoretical Foundation
The research is motivated by the need to bridge the gap between general-purpose Vision-Language Models (VLMs) and the specific demands of embodied agents operating in the physical world. While LLMs and VLMs have advanced digital agents, two key deficiencies hinder their application in physical environments:
- Fine-Grained Visual Perception: Existing VLMs lack the precise, granular visual perception required for physical grounding and informed action decisions.
- Embodied Prediction, Interaction, and Planning: Mainstream VLMs, trained on static web data, are not optimized for the dynamic, action-oriented capabilities needed for prediction, interaction, and planning in the physical world.
The theoretical foundation is built on the VLM paradigm, positing that embodied VLMs uniquely bridge LLM agents and physical agents, leveraging open-world knowledge for real-world tasks. The goal is to systematically enhance capabilities in both visual perception and embodied reasoning through innovations in architecture, data, and training.
Methodology
Model Architecture
The architecture is based on a vision encoder (HY-ViT 2.0) and a large language model (Hunyuan-1.8B), with key enhancements:
- HY-ViT 2.0: An efficient native-resolution Vision Transformer (400M params) trained via distillation for accurate, robust perception on edge devices.
- Mixture-of-Transformers (MoT): Introduces modality-specific parameters (QKV, FFN) for visual and textual tokens. Visual tokens use bidirectional (full) attention, while text tokens use causal attention. This improves visual modeling without degrading language capabilities.
- Visual Latent Tokens: Learnable tokens appended to each visual element, supervised by global features from a teacher ViT to improve perceptual representation and bridge modalities.
- Training Objectives: During pre-training, the model is optimized with a combined loss

  $\mathcal{L} = \mathcal{L}_{\text{vis}} + \mathcal{L}_{\text{align}}$

  where:
  - $\mathcal{L}_{\text{vis}}$ is the next-visual-code prediction loss.
  - $\mathcal{L}_{\text{align}}$ aligns the visual latent token with global image semantics.
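The modality-adaptive attention and parameter routing described above can be sketched in a few lines (a minimal NumPy illustration; the function names, mask convention, and per-token routing granularity are our assumptions, not the paper's implementation):

```python
import numpy as np

def mot_mask(mods):
    """Attention mask for a mixed visual/text token sequence.

    mods: sequence of 'v' (visual) or 't' (text), one per position.
    Text tokens attend causally; visual tokens additionally attend
    bidirectionally to every other visual token, as in the MoT design.
    Returns a boolean matrix: mask[i, j] = True means query i may
    attend to key j.
    """
    n = len(mods)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    vis = np.array([m == 'v' for m in mods])
    mask |= np.outer(vis, vis)                   # full attention among visual tokens
    return mask

def mot_ffn(x, mods, ffn_v, ffn_t):
    """Route each token through a modality-specific FFN.

    In a Mixture-of-Transformers layer, visual and textual tokens keep
    separate QKV/FFN parameters; this sketch shows only the FFN routing.
    """
    vis = np.array([m == 'v' for m in mods])
    out = np.empty_like(x)
    out[vis] = ffn_v(x[vis])    # visual-specific parameters
    out[~vis] = ffn_t(x[~vis])  # text-specific parameters
    return out
```

Because the two token types only diverge in their parameters and masks, the layer adds modality specialization without changing the shared attention computation, which is consistent with the reported negligible inference overhead.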
Training Pipeline
The training is a multi-stage process (see Fig. 5):
- Pre-training & Mid-training: Trained on over 600B tokens of mixed general, visual perception, embodied, and spatial data to establish foundational capabilities.
- Post-training (SFT + RL + RFT):
- Supervised Fine-Tuning (SFT): Uses ~100k high-quality Chain-of-Thought (CoT) instances.
- Reinforcement Learning (RL): Uses a Group Relative Policy Optimization (GRPO) objective with task-aware reward designs (see Fig. 6 and Eq. 1-5). Rewards are categorized for grounding, regression, trajectory, and textual tasks.
- Rejection Sampling Fine-Tuning (RFT): Iteratively selects high-quality reasoning traces from the model's capability frontier to consolidate RL discoveries.
- On-Policy Distillation (OPD): Transfers knowledge from the large (A32B) to the small (MoT-2B) model by minimizing the KL divergence between teacher and student distributions on student-generated prefixes (Eq. 6-7).
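The two optimization ideas above can be illustrated with a short sketch: GRPO's baseline-free, group-relative advantage, and the per-token KL used in on-policy distillation (the function names are ours, and the reward design, clipping, and KL direction used in the paper are not reproduced here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within the group of
    rollouts sampled for one prompt. This replaces a learned value
    baseline in the GRPO objective."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def opd_token_kl(student_logp, teacher_logp):
    """Per-token KL(student || teacher) over the vocabulary, evaluated on
    a student-generated prefix. Summed over the sequence this gives an
    on-policy distillation loss; the KL direction here is an assumption,
    since the summary does not pin it down."""
    p_s = np.exp(student_logp)
    return float(np.sum(p_s * (student_logp - teacher_logp)))
```

In this sketch, rollouts with above-group-average reward get positive advantages and are reinforced, while distillation pushes the student's next-token distribution toward the teacher's on states the student actually visits.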
Data Curation
A massive, high-quality dataset of over 100M samples was constructed, categorized into:
- Visual Perception: Omni-Detection (62M), Depth Estimation (36M), Segmentation (5M), Pointing & Counting (11M).
- Embodied-Centric: Grounding, Affordance, Trajectory, Understanding, Planning, and Reasoning data.
- Spatial-Centric: Correspondence, Geometry, Configuration, Measurement, and Dynamics data from 3D datasets.
- General Understanding: Broad-domain data for foundational reasoning.
Empirical Validation / Results
Benchmark Performance
The models were evaluated on a comprehensive suite of 22 benchmarks covering Visual Perception, Embodied Understanding, and Spatial Understanding.
Table 1: Results for HY-Embodied-0.5 MoT-2B (Thinking Mode) vs. Comparable Models
| Capability | Benchmark | HY-Embodied-0.5 MoT-2B | Qwen3-VL-2B | Qwen3-VL-4B | RoboBrain2.5-4B | MiMo1-Embodied-7B |
|---|---|---|---|---|---|---|
| Visual Perception | CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| | DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |
| Embodied Understanding | ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| | EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| | RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| | RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| | RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| | ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| | ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| | Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |
| Spatial Understanding | 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| | All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| | MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| | MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| | RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| | SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| | SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| | SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| | SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| | ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| | VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| | Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |
- MoT-2B achieves the best performance on 16/22 benchmarks and an average score of 58.0%, outperforming Qwen3-VL-4B (47.8%) and RoboBrain2.5-4B (49.4%).
- It also maintains competitive performance on general VLM benchmarks (Fig. 7).
Table 2: Results for HY-Embodied-0.5 MoE-A32B vs. Frontier VLMs
| Capability | Benchmark | HY-Embodied-0.5 MoE-A32B | Kimi K2.5 | Seed 2.0 | Qwen 3.5 A17B | Gemini 3.0 Pro |
|---|---|---|---|---|---|---|
| Visual Perception | CV-Bench | 88.8 | 89.0 | 88.5* | 88.6 | 85.4* |
| | DA-2K | 90.2 | 83.4 | 92.3* | 83.3 | 83.6* |
| Embodied Understanding | ERQA | 62.3 | 59.8 | 61.8* | 61.0 | 65.0* |
| | EmbSpatial-Bench | 84.1 | 81.5 | 81.0* | 83.8 | 83.6* |
| | RoboBench-MCQ | 62.8 | 59.0 | 66.5* | 63.8 | 69.2* |
| | RoboBench-Planning | 59.3 | 60.0 | 60.1* | 56.7 | 60.0* |
| | RoboSpatial-Home | 76.6 | 66.0 | 71.7* | 74.9 | 57.1* |
| | ShareRobot-Aff. | 28.6 | 21.5 | 27.5* | 29.3 | 24.8* |
| | ShareRobot-Traj. | 76.9 | 68.5 | 71.8* | 73.8 | 68.7* |
| | Ego-Plan2 | 51.4 | 47.4 | 56.6* | 55.3 | 60.0* |
| Spatial Understanding | 3DSRBench | 56.6 | 55.9 | 58.2* | 56.6 | 58.3* |
| | All-Angles Bench | 71.8 | 64.8 | 69.3* | 72.1 | 73.4* |
| | MindCube | 69.2 | 57.8 | 55.2* | 59.0 | 66.0* |
| | MMSI-Bench | 39.2 | 36.5 | 47.6* | 43.8 | 48.0* |
| | RefSpatial-Bench | 57.2 | 43.3 | 72.2* | 61.0 | 33.2* |
| | SAT | 87.3 | 79.3 | 86.2* | 86.0 | 88.0* |
| | SIBench-mini | 67.3 | 63.0 | 65.9* | 66.3 | 68.0* |
| | SITE-Bench-Image | 74.7 | 73.8 | 75.6* | 77.1 | 75.4* |
| | SITE-Bench-Video | 72.5 | 71.5 | 68.9* | 72.3 | 69.8* |
| | ViewSpatial | 59.8 | 45.2 | 56.4* | 52.2 | 50.8* |
| | VSIBench | 68.3 | 54.2 | 51.0* | 61.1 | 57.9* |
| | Where2Place | 70.0 | 64.0 | 73.0* | 76.0 | 52.0* |

\*API results collected in March 2026.
- MoE-A32B achieves an average score of 67.0%, outperforming Gemini 3.0 Pro (63.6%), Seed 2.0 (66.2%), Qwen 3.5 A17B (66.1%), and Kimi K2.5 (61.1%).
Qualitative Analysis & Ablations
- Visualizations: The model demonstrates strong performance on fine-grained perception (depth estimation, detection, counting) and embodied tasks (grounding, scene understanding, planning); see Fig. 8 & 9.
- Chain-of-Thought: The models exhibit advanced, self-correcting reasoning processes for complex spatial and embodied problems (Fig. 10).
- Architecture Efficiency: The MoT architecture enables faster convergence than standard transformers with negligible inference overhead (Fig. 11).
- Visual Latent Tokens: Attention visualizations show these tokens effectively link salient visual regions with corresponding semantic concepts (Fig. 12).
Robot Control Results
The MoT-2B foundation was adapted into a VLA model and fine-tuned for real-robot tasks.
Fig. 13: Robot Control Success Rates
| Task | HY-Embodied-0.5 VLA | π0 Baseline | π0.5 Baseline |
|---|---|---|---|
| Precision Plug-in Packing | 85% | 80% | 85% |
| Tableware Stacking | 80% | 60% | 85% |
| Mug Hanging | 75% | 45% | 50% |
The VLA model demonstrates robust and often superior performance in real-world manipulation, validating the transferability of the foundational VLM's capabilities.
Theoretical and Practical Implications
- Theoretical: The work demonstrates that specialized architectural designs (MoT, latent tokens) and training paradigms (iterative RL/RFT, on-policy distillation) can effectively compress advanced embodied reasoning into efficient models, challenging the notion that such capabilities require massive scale alone.
- Practical: The HY-Embodied-0.5 family provides:
- A highly efficient edge model (MoT-2B) that delivers state-of-the-art embodied performance, enabling real-time deployment on physical agents.
- A powerful cloud model (MoE-A32B) that rivals frontier general VLMs on embodied tasks, suitable for complex planning and simulation.
- A validated pathway for translating robust visual-language understanding into effective physical action (VLA), accelerating the development of capable real-world robots.
- Open-sourced models and code to foster further research in embodied AI.
Conclusion
HY-Embodied-0.5 represents a significant step towards bridging digital intelligence and physical-world competence. Through innovations in modality-adaptive architecture, comprehensive embodied data curation, and an iterative self-evolving training pipeline, the model family achieves state-of-the-art performance across a wide spectrum of perception and reasoning benchmarks tailored for embodied agents. The successful application in downstream robot control tasks confirms its practical utility. Future work will focus on further closing the gap between language models and action models to develop a "real-world brain" for complex applications. All code and models are open-sourced.