Summary of "HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents"
Summary (Overview)
- Model Family: Introduces HY-Embodied-0.5, a family of Vision-Language Models (VLMs) specifically designed for real-world embodied agents, comprising an efficient MoT-2B model (2B activated params) for edge deployment and a powerful MoE-A32B model (32B activated params) for complex reasoning.
- Core Innovations: Proposes a Mixture-of-Transformers (MoT) architecture for modality-adaptive computing, visual latent tokens to bridge vision and language, and an iterative, self-evolving post-training paradigm combining reinforcement learning (RL) and rejection sampling fine-tuning (RFT) to enhance reasoning.
- Training Strategy: Employs a comprehensive data pipeline with over 100M samples spanning visual perception, spatial reasoning, and embodied tasks, followed by on-policy distillation to transfer capabilities from the large to the small model.
- Strong Performance: The MoT-2B model outperforms similarly-sized state-of-the-art models on 16 out of 22 benchmarks. The MoE-A32B model achieves an average score of 67.0%, surpassing frontier models like Gemini 3.0 Pro (63.6%).
- Real-World Validation: The foundation model is successfully adapted into a Vision-Language-Action (VLA) model for robot control, achieving compelling success rates (e.g., 75%-85%) in real-world physical manipulation tasks like packing, stacking, and hanging.
Introduction and Theoretical Foundation
The research is motivated by the need to bridge the gap between general-purpose Vision-Language Models (VLMs) and the specific demands of embodied agents operating in the physical world. While LLMs and VLMs have advanced digital agents, two key deficiencies hinder their application in physical environments:
- Fine-Grained Visual Perception: Existing VLMs lack the precise, granular visual perception required for physical grounding and informed action decisions.
- Embodied Prediction, Interaction, and Planning: Mainstream VLMs, trained on static web data, are not optimized for the dynamic, action-oriented capabilities needed for prediction, interaction, and planning in the physical world.
The theoretical foundation is built on the VLM paradigm, positing that embodied VLMs uniquely bridge LLM agents and physical agents, leveraging open-world knowledge for real-world tasks. The goal is to systematically enhance capabilities in both visual perception and embodied reasoning through innovations in architecture, data, and training.
Methodology
Model Architecture
The architecture is based on a vision encoder (HY-ViT 2.0) and a large language model (Hunyuan-1.8B), with key enhancements:
- HY-ViT 2.0: An efficient native-resolution Vision Transformer (400M params) trained via distillation for accurate, robust perception on edge devices.
- Mixture-of-Transformers (MoT): Introduces modality-specific parameters (QKV, FFN) for visual and textual tokens. Visual tokens use bidirectional (full) attention, while text tokens use causal attention. This improves visual modeling without degrading language capabilities.
- Visual Latent Tokens: Learnable tokens appended to each visual element, supervised by global features from a teacher ViT to improve perceptual representation and bridge modalities.
- Training Objectives: During pre-training, the model is optimized with a combined loss

  $\mathcal{L} = \mathcal{L}_{\text{vis}} + \mathcal{L}_{\text{align}}$

  where:
  - $\mathcal{L}_{\text{vis}}$ is the next-visual-code prediction loss.
  - $\mathcal{L}_{\text{align}}$ aligns the visual latent token with global image semantics.
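The modality-adaptive attention and parameter routing described above can be sketched in a few lines (a minimal NumPy illustration; the function names, mask convention, and per-token routing granularity are our assumptions, not the paper's implementation):

```python
import numpy as np

def mot_mask(mods):
    """Attention mask for a mixed visual/text token sequence.

    mods: sequence of 'v' (visual) or 't' (text), one per position.
    Text tokens attend causally; visual tokens additionally attend
    bidirectionally to every other visual token, as in the MoT design.
    Returns a boolean matrix: mask[i, j] = True means query i may
    attend to key j.
    """
    n = len(mods)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    vis = np.array([m == 'v' for m in mods])
    mask |= np.outer(vis, vis)                   # full attention among visual tokens
    return mask

def mot_ffn(x, mods, ffn_v, ffn_t):
    """Route each token through a modality-specific FFN.

    In a Mixture-of-Transformers layer, visual and textual tokens keep
    separate QKV/FFN parameters; this sketch shows only the FFN routing.
    """
    vis = np.array([m == 'v' for m in mods])
    out = np.empty_like(x)
    out[vis] = ffn_v(x[vis])    # visual-specific parameters
    out[~vis] = ffn_t(x[~vis])  # text-specific parameters
    return out
```

Because the two token types only diverge in their parameters and masks, the layer adds modality specialization without changing the shared attention computation, which is consistent with the reported negligible inference overhead.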
Training Pipeline
The training is a multi-stage process (see Fig. 5):
- Pre-training & Mid-training: Trained on over 600B tokens of mixed general, visual perception, embodied, and spatial data to establish foundational capabilities.
- Post-training (SFT + RL + RFT):
- Supervised Fine-Tuning (SFT): Uses ~100k high-quality Chain-of-Thought (CoT) instances.
- Reinforcement Learning (RL): Uses a Group Relative Policy Optimization (GRPO) objective with task-aware reward designs (see Fig. 6 and Eq. 1-5). Rewards are categorized for grounding, regression, trajectory, and textual tasks.
- Rejection Sampling Fine-Tuning (RFT): Iteratively selects high-quality reasoning traces from the model's capability frontier to consolidate RL discoveries.
- On-Policy Distillation (OPD): Transfers knowledge from the large (A32B) to the small (MoT-2B) model by minimizing the KL divergence between teacher and student distributions on student-generated prefixes (Eq. 6-7).
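The two optimization ideas above can be illustrated with a short sketch: GRPO's baseline-free, group-relative advantage, and the per-token KL used in on-policy distillation (the function names are ours, and the reward design, clipping, and KL direction used in the paper are not reproduced here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within the group of
    rollouts sampled for one prompt. This replaces a learned value
    baseline in the GRPO objective."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def opd_token_kl(student_logp, teacher_logp):
    """Per-token KL(student || teacher) over the vocabulary, evaluated on
    a student-generated prefix. Summed over the sequence this gives an
    on-policy distillation loss; the KL direction here is an assumption,
    since the summary does not pin it down."""
    p_s = np.exp(student_logp)
    return float(np.sum(p_s * (student_logp - teacher_logp)))
```

In this sketch, rollouts with above-group-average reward get positive advantages and are reinforced, while distillation pushes the student's next-token distribution toward the teacher's on states the student actually visits.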
Data Curation
A massive, high-quality dataset of over 100M samples was constructed, categorized into:
- Visual Perception: Omni-Detection (62M), Depth Estimation (36M), Segmentation (5M), Pointing & Counting (11M).
- Embodied-Centric: Grounding, Affordance, Trajectory, Understanding, Planning, and Reasoning data.
- Spatial-Centric: Correspondence, Geometry, Configuration, Measurement, and Dynamics data from 3D datasets.
- General Understanding: Broad-domain data for foundational reasoning.
Empirical Validation / Results
Benchmark Performance
The models were evaluated on a comprehensive suite of 22 benchmarks covering Visual Perception, Embodied Understanding, and Spatial Understanding.
Table 1: Results for HY-Embodied-0.5 MoT-2B (Thinking Mode) vs. Comparable Models
| Capability | Benchmark | HY-Embodied-0.5 MoT-2B | Qwen3-VL-2B | Qwen3-VL-4B | RoboBrain2.5-4B | MiMo1-Embodied-7B |
|---|---|---|---|---|---|---|
| Visual Perception | CV-Bench | 89.2 | 80.0 | 85.7 | 86.9 | 88.8 |
| | DA-2K | 92.3 | 69.5 | 76.5 | 79.4 | 72.2 |
| Embodied Understanding | ERQA | 54.5 | 41.8 | 47.3 | 43.3 | 46.8 |
| | EmbSpatial-Bench | 82.8 | 75.9 | 80.7 | 73.8 | 76.2 |
| | RoboBench-MCQ | 49.2 | 36.9 | 45.8 | 44.4 | 43.6 |
| | RoboBench-Planning | 54.2 | 36.2 | 36.4 | 39.2 | 58.7 |
| | RoboSpatial-Home | 55.7 | 45.3 | 63.2 | 62.3 | 61.8 |
| | ShareRobot-Aff. | 26.8 | 19.8 | 25.5 | 25.5 | 9.0 |
| | ShareRobot-Traj. | 73.3 | 41.6 | 62.2 | 81.4 | 50.6 |
| | Ego-Plan2 | 45.5 | 35.5 | 38.8 | 52.6 | 39.9 |
| Spatial Understanding | 3DSRBench | 57.0 | 39.9 | 43.9 | 44.8 | 42.0 |
| | All-Angles Bench | 55.1 | 42.3 | 46.7 | 43.8 | 49.0 |
| | MindCube | 66.3 | 28.4 | 31.0 | 26.9 | 36.2 |
| | MMSI-Bench | 33.2 | 23.6 | 25.1 | 20.5 | 31.9 |
| | RefSpatial-Bench | 45.8 | 28.9 | 45.3 | 56.0 | 48.0 |
| | SAT | 76.7 | 45.3 | 56.7 | 51.3 | 78.7 |
| | SIBench-mini | 58.2 | 42.0 | 50.9 | 47.3 | 53.1 |
| | SITE-Bench-Image | 62.7 | 52.3 | 61.0 | 57.9 | 49.9 |
| | SITE-Bench-Video | 63.5 | 52.2 | 58.0 | 54.8 | 58.9 |
| | ViewSpatial | 53.1 | 37.2 | 41.6 | 36.6 | 36.1 |
| | VSIBench | 60.5 | 48.0 | 55.2 | 41.7 | 48.5 |
| | Where2Place | 68.0 | 45.0 | 59.0 | 65.0 | 63.6 |
- MoT-2B achieves the best performance on 16/22 benchmarks and an average score of 58.0%, outperforming Qwen3-VL-4B (47.8%) and RoboBrain2.5-4B (49.4%).
- It also maintains competitive performance on general VLM benchmarks (Fig. 7).
Table 2: Results for HY-Embodied-0.5 MoE-A32B vs. Frontier VLMs
| Capability | Benchmark | HY-Embodied-0.5 MoE-A32B | Kimi K2.5 | Seed 2.0 | Qwen 3.5 A17B | Gemini 3.0 Pro |
|---|---|---|---|---|---|---|
| Visual Perception | CV-Bench | 88.8 | 89.0 | 88.5* | 88.6 | 85.4* |
| | DA-2K | 90.2 | 83.4 | 92.3* | 83.3 | 83.6* |
| Embodied Understanding | ERQA | 62.3 | 59.8 | 61.8* | 61.0 | 65.0* |
| | EmbSpatial-Bench | 84.1 | 81.5 | 81.0* | 83.8 | 83.6* |
| | RoboBench-MCQ | 62.8 | 59.0 | 66.5* | 63.8 | 69.2* |
| | RoboBench-Planning | 59.3 | 60.0 | 60.1* | 56.7 | 60.0* |
| | RoboSpatial-Home | 76.6 | 66.0 | 71.7* | 74.9 | 57.1* |
| | ShareRobot-Aff. | 28.6 | 21.5 | 27.5* | 29.3 | 24.8* |
| | ShareRobot-Traj. | 76.9 | 68.5 | 71.8* | 73.8 | 68.7* |
| | Ego-Plan2 | 51.4 | 47.4 | 56.6* | 55.3 | 60.0* |
| Spatial Understanding | 3DSRBench | 56.6 | 55.9 | 58.2* | 56.6 | 58.3* |
| | All-Angles Bench | 71.8 | 64.8 | 69.3* | 72.1 | 73.4* |
| | MindCube | 69.2 | 57.8 | 55.2* | 59.0 | 66.0* |
| | MMSI-Bench | 39.2 | 36.5 | 47.6* | 43.8 | 48.0* |
| | RefSpatial-Bench | 57.2 | 43.3 | 72.2* | 61.0 | 33.2* |
| | SAT | 87.3 | 79.3 | 86.2* | 86.0 | 88.0* |
| | SIBench-mini | 67.3 | 63.0 | 65.9* | 66.3 | 68.0* |
| | SITE-Bench-Image | 74.7 | 73.8 | 75.6* | 77.1 | 75.4* |
| | SITE-Bench-Video | 72.5 | 71.5 | 68.9* | 72.3 | 69.8* |
| | ViewSpatial | 59.8 | 45.2 | 56.4* | 52.2 | 50.8* |
| | VSIBench | 68.3 | 54.2 | 51.0* | 61.1 | 57.9* |
| | Where2Place | 70.0 | 64.0 | 73.0* | 76.0 | 52.0* |

\*API results collected in March 2026.
- MoE-A32B achieves an average score of 67.0%, outperforming Gemini 3.0 Pro (63.6%), Seed 2.0 (66.2%), Qwen 3.5 A17B (66.1%), and Kimi K2.5 (61.1%).
Qualitative Analysis & Ablations
- Visualizations: The model demonstrates strong performance on fine-grained perception (depth estimation, detection, counting) and embodied tasks (grounding, scene understanding, planning); see Fig. 8 & 9.
- Chain-of-Thought: The models exhibit advanced, self-correcting reasoning processes for complex spatial and embodied problems (Fig. 10).
- Architecture Efficiency: The MoT architecture enables faster convergence than standard transformers with negligible inference overhead (Fig. 11).
- Visual Latent Tokens: Attention visualizations show these tokens effectively link salient visual regions with corresponding semantic concepts (Fig. 12).
Robot Control Results
The MoT-2B foundation was adapted into a VLA model and fine-tuned for real-robot tasks.
Fig. 13: Robot Control Success Rates
| Task | HY-Embodied-0.5 VLA | π0 Baseline | π0.5 Baseline |
|---|---|---|---|
| Precision Plug-in Packing | 85% | 80% | 85% |
| Tableware Stacking | 80% | 60% | 85% |
| Mug Hanging | 75% | 45% | 50% |
The VLA model demonstrates robust and often superior performance in real-world manipulation, validating the transferability of the foundational VLM's capabilities.
Theoretical and Practical Implications
- Theoretical: The work demonstrates that specialized architectural designs (MoT, latent tokens) and training paradigms (iterative RL/RFT, on-policy distillation) can effectively compress advanced embodied reasoning into efficient models, challenging the notion that such capabilities require massive scale alone.
- Practical: The HY-Embodied-0.5 family provides:
- A highly efficient edge model (MoT-2B) that delivers state-of-the-art embodied performance, enabling real-time deployment on physical agents.
- A powerful cloud model (MoE-A32B) that rivals frontier general VLMs on embodied tasks, suitable for complex planning and simulation.
- A validated pathway for translating robust visual-language understanding into effective physical action (VLA), accelerating the development of capable real-world robots.
- Open-sourced models and code to foster further research in embodied AI.
Conclusion
HY-Embodied-0.5 represents a significant step towards bridging digital intelligence and physical-world competence. Through innovations in modality-adaptive architecture, comprehensive embodied data curation, and an iterative self-evolving training pipeline, the model family achieves state-of-the-art performance across a wide spectrum of perception and reasoning benchmarks tailored for embodied agents. The successful application in downstream robot control tasks confirms its practical utility. Future work will focus on further closing the gap between language models and action models to develop a "real-world brain" for complex applications. All code and models are open-sourced.