Summary of "MolmoAct2: Action Reasoning Models for Real-World Deployment"

Summary (Overview)

  • Fully Open System: MolmoAct2 is a fully open (weights, data, code) Vision-Language-Action (VLA) model built for practical robot deployment, addressing issues of closed frontier models, high latency, and platform dependency.
  • Enhanced Backbone & Data: Introduces Molmo2-ER, a VLM specialized for spatial/embodied reasoning, and releases three new high-quality, filtered robot datasets for low-to-medium cost platforms (bimanual YAM, DROID Franka, SO-100/101).
  • Novel Architecture: Features a redesigned VLA architecture that grafts a flow-matching continuous-action expert onto a discrete-token VLM via a per-layer KV-cache conditioning mechanism.
  • Adaptive Reasoning: Proposes MolmoAct2-Think, a reasoning variant that uses adaptive-depth perception tokens, re-predicting depth only for changed scene regions to retain geometric grounding at a fraction of prior latency.
  • State-of-the-Art Performance: Demonstrates superior performance across 7 simulation and real-world benchmarks, outperforming strong baselines such as π0.5. Molmo2-ER surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks.

Introduction and Theoretical Foundation

The long-standing goal in robotics is a single generalist controller that can handle diverse, unforeseen tasks across different environments and robot bodies (embodiments). Vision-Language-Action (VLA) models, built on web-scale pre-trained Vision-Language Models (VLMs), represent a promising path toward this goal.

However, current VLAs fall short for real-world deployment due to four key issues:

  1. Closed Systems: Frontier models (e.g., from Google, OpenAI) are proprietary, hindering scientific progress and adaptation.
  2. Prohibitive Latency: Explicit reasoning mechanisms (chain-of-thought, predicted images) dominate inference time, making them too slow for closed-loop control.
  3. Platform Dependency: Open-weight alternatives are often tied to expensive, specialized hardware, limiting accessibility.
  4. Brittle Performance: Zero-shot and fine-tuned success rates remain below dependable deployment thresholds.

MolmoAct2 is introduced to address these shortcomings directly. It is designed to be fully open, deployable out-of-the-box on accessible embodiments, performant, and capable of fast, interpretable reasoning. The work builds upon its predecessor, MolmoAct, and advances it along five axes: a stronger embodied-reasoning VLM backbone, new training datasets, an open-source action tokenizer, a redesigned VLA architecture, and a new adaptive reasoning paradigm.

Methodology

1. Molmo2-ER: Specialized Embodied Reasoning Backbone

Built on top of Molmo2, Molmo2-ER is a VLM specialized for the spatial and embodied reasoning skills critical for robot policies (e.g., metric distances, free space, cross-view tracking).

  • Training Data: Curated a new corpus of ~3.3M samples across six "pillars": Image Embodied QA, Image Pointing, Image Detection, Video Embodied QA, Multi-image/Ego-Exo reasoning, and Abstract Embodied Reasoning. Sources include SAT, RoboPoint-QA, RefSpatial, VST-P, VSI-590K, SIMS-VSI, RoboVQA, SenseNova-SI, CLEVR, and GRiD-3D.
  • Training Recipe: A two-stage specialize-then-rehearse approach.
    1. Stage 1 (Embodied Specialization): Fine-tune the Molmo2 checkpoint on the new Molmo2-ER corpus for 20K steps.
    2. Stage 2 (Joint Refinement): Continue training for 1.5K steps on a mixture of the embodied corpus (50%) and Molmo2's original general multimodal data (50%).

2. Data Curation

MolmoAct2 is trained on a diverse mixture of data:

  • New Robot Datasets:
    • MolmoAct2-BimanualYAM Dataset: 720 hours (34.5k demos) of teleoperated bimanual trajectories on a custom, affordable (<$6k) setup. The largest open bimanual dataset to date.
    • MolmoAct2-SO100/101 Dataset: A quality-filtered subset of community-sourced SO-10x data (38k episodes), processed via a four-stage filtering pipeline.
    • MolmoAct2-DROID Dataset: A quality-filtered Franka subset of DROID (74.6k episodes), using extended language annotations and an idle-frame filter (a sketch of such a filter follows this list).
  • Additional Data: Public academic robotics datasets (subset of OXE, MolmoAct Dataset) and a multimodal web data mixture (46% Molmo2-ER data, 46% Molmo2 data, 8% Tulu text data).
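
The summary mentions an idle-frame filter for the DROID subset without detailing it. Below is a minimal sketch of one plausible implementation; the threshold, the use of per-step proprioception deltas, and the function name are assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of an idle-frame filter: drop frames whose robot state barely changes.
# MOTION_EPS and the proprioception layout are illustrative assumptions.
import numpy as np

MOTION_EPS = 1e-3  # assumed per-step motion threshold (units depend on the state channel)

def filter_idle_frames(proprio: np.ndarray) -> np.ndarray:
    """proprio: (T, D) robot state per frame; returns indices of non-idle frames."""
    deltas = np.abs(np.diff(proprio, axis=0)).max(axis=-1)  # per-step max motion across dims
    keep = np.concatenate([[True], deltas > MOTION_EPS])    # always keep the first frame
    return np.nonzero(keep)[0]
```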

3. MolmoAct2 Model Architecture & Training

Follows a three-stage pipeline: Pre-training, Post-training, and Deployment.

3.1 Pre-training (MolmoAct2-Pretrain)

Adapts the Molmo2-ER backbone into a discrete autoregressive robot policy.

  • OpenFAST Tokenizer: An open-weight, open-data action tokenizer trained on 1M sequences across 5 embodiments. It maps a 1-second, 32-dimensional continuous action trajectory to a compact discrete token sequence over a 2048-token vocabulary (a simplified sketch follows this list).
  • Training Recipe: Mixes multimodal (10%) and robot (90%) data. Uses a single next-token prediction objective across text, images, state tokens, and discrete OpenFAST action tokens. Trained for 200K steps.
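
The summary does not describe OpenFAST's internals. Assuming it follows the published FAST recipe (DCT compression of action chunks, quantization, then BPE over the integer codes), a simplified encode/decode sketch might look like the following; the coefficient count, quantization scale, and function names are illustrative assumptions.

```python
# Minimal sketch of a FAST-style action tokenizer (assumption: OpenFAST follows this recipe).
import numpy as np
from scipy.fft import dct, idct

QUANT_SCALE = 10.0  # assumed coefficient quantization scale
NUM_COEFFS = 8      # assumed number of low-frequency DCT coefficients kept per action dim

def encode_chunk(actions: np.ndarray) -> np.ndarray:
    """actions: (T, D) continuous chunk, e.g. T ~ 50 steps (1 s), D = 32 dims."""
    coeffs = dct(actions, norm="ortho", axis=0)[:NUM_COEFFS]  # (NUM_COEFFS, D) low frequencies
    codes = np.round(coeffs * QUANT_SCALE).astype(np.int64)   # quantize to integer codes
    return codes.flatten()  # in FAST these integer codes are further compressed with BPE

def decode_chunk(tokens: np.ndarray, T: int, D: int) -> np.ndarray:
    coeffs = np.zeros((T, D))
    coeffs[:NUM_COEFFS] = tokens.reshape(NUM_COEFFS, D) / QUANT_SCALE
    return idct(coeffs, norm="ortho", axis=0)  # approximate reconstruction of the trajectory
```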

3.2 Post-training (MolmoAct2-Post)

Attaches a continuous action expert to the pre-trained VLM to produce final continuous control.

  • Action Expert & KV Connection: A DiT-style flow-matching expert predicts the velocity field for denoising action trajectories. Its key innovation is a per-layer KV connection to the VLM: for each VLM layer ℓ, the key-value cache $(K_{\text{vlm}}^{\ell}, V_{\text{vlm}}^{\ell})$ is projected and fed as conditioning to the corresponding layer of the action expert via cross-attention:

    $$\tilde{K}_\ell = \mathrm{reshape}\big(P_K K_{\text{vlm}}^{\ell}\big), \qquad \tilde{V}_\ell = \mathrm{reshape}\big(P_V V_{\text{vlm}}^{\ell}\big)$$

    This gives the expert dense access to the VLM's hierarchical visual-semantic features.
  • Training Objective: Combines the discrete autoregressive loss $\mathcal{L}_{\text{LM}}$ with the continuous flow-matching loss $\mathcal{L}_{\text{flow}}$:

    $$\mathcal{L}_{\text{post}} = \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{flow}}$$

    where, for an action chunk $a$ and context $c$, the flow loss is

    $$\mathcal{L}_{\text{flow}}(a, c) = \frac{1}{K} \sum_{i=1}^{K} \big\| m \odot \big(f_\theta(x_{t_i}, t_i, c) - (a - \epsilon_i)\big) \big\|_2^2, \qquad x_t = (1 - t)\,\epsilon + t\,a,$$

    with K = 4 flow samples per chunk. The VLM is isolated from the flow loss via knowledge insulation (a detached KV cache).
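
Below is a minimal PyTorch sketch of the two post-training pieces described above: projecting a VLM layer's detached KV cache into the action expert's cross-attention, and the flow-matching loss with K noise/time samples per chunk. The module names, shapes, and single-layer scope are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KVConditionedExpertLayer(nn.Module):
    """One action-expert layer conditioned on a VLM layer's projected KV cache (sketch)."""
    def __init__(self, d_vlm: int, d_expert: int, n_heads: int = 8):
        super().__init__()
        self.p_k = nn.Linear(d_vlm, d_expert)  # P_K: project VLM keys to expert width
        self.p_v = nn.Linear(d_vlm, d_expert)  # P_V: project VLM values to expert width
        self.cross_attn = nn.MultiheadAttention(d_expert, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_expert, n_heads, batch_first=True)

    def forward(self, h, k_vlm, v_vlm):
        # h: (B, T_act, d_expert) action tokens; k_vlm / v_vlm: (B, T_ctx, d_vlm),
        # detached upstream so the flow loss never backprops into the VLM (knowledge insulation).
        k, v = self.p_k(k_vlm), self.p_v(v_vlm)
        h = h + self.cross_attn(h, k, v, need_weights=False)[0]
        h = h + self.self_attn(h, h, h, need_weights=False)[0]
        return h

def flow_matching_loss(f_theta, a, ctx, mask, K: int = 4):
    """a: (B, H, D) ground-truth action chunk; mask m zeroes padded action dims."""
    B = a.shape[0]
    loss = 0.0
    for _ in range(K):
        eps = torch.randn_like(a)                     # noise sample eps_i
        t = torch.rand(B, 1, 1, device=a.device)      # time sample t_i
        x_t = (1 - t) * eps + t * a                   # interpolant x_t
        pred = f_theta(x_t, t, ctx)                   # predicted velocity field
        loss = loss + (mask * (pred - (a - eps))).pow(2).sum(dim=(-1, -2)).mean()
    return loss / K
```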

3.3 Deployment & Fine-tuning

  • Embodiment-Specific Fine-tuning: Starting from the post-trained checkpoint, models are fine-tuned on robot-only data for specific platforms (YAM, DROID, SO-100/101, LIBERO) with a similar recipe.
  • Inference Optimization: Uses caching of reusable intermediates and CUDA Graphs to capture the fixed-shape flow loop, significantly reducing latency.
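
The paper's inference code is not reproduced here; the following is a hedged sketch of the standard PyTorch pattern for capturing a fixed-shape denoising loop with CUDA Graphs. The expert module, step count, and buffer shapes are hypothetical stand-ins.

```python
import torch

NUM_FLOW_STEPS = 10  # assumed number of Euler denoising steps

class DummyExpert(torch.nn.Module):
    """Hypothetical stand-in for the flow-matching action expert (ignores t and ctx)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(32, 32)
    def forward(self, x, t, ctx):
        return self.net(x)

expert = DummyExpert().cuda()
static_noise = torch.randn(1, 16, 32, device="cuda")   # fixed-shape action chunk buffer
static_ctx = torch.zeros(1, 512, 1024, device="cuda")  # fixed-shape conditioning buffer

def flow_loop(x, ctx):
    dt = 1.0 / NUM_FLOW_STEPS
    for i in range(NUM_FLOW_STEPS):
        t = torch.full((1, 1, 1), i * dt, device="cuda")
        x = x + dt * expert(x, t, ctx)  # Euler integration of the predicted velocity field
    return x

# Warm up on a side stream, then capture the whole fixed-shape loop into one graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        flow_loop(static_noise, static_ctx)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = flow_loop(static_noise, static_ctx)

# At deployment: copy fresh inputs into the static buffers and replay the captured kernels.
# static_noise.copy_(new_noise); static_ctx.copy_(new_ctx); graph.replay(); actions = static_out
```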

4. MolmoAct2-Think: Adaptive Depth Reasoning

Extends MolmoAct2 with an interpretable, depth-aware reasoning step that is adaptive across time to reduce latency.

  • Adaptive Depth Perception: Depth maps are quantized into a 10×10 grid of tokens drawn from a 128-entry depth-code vocabulary. Instead of re-predicting all 100 tokens at every step, the model reuses cached codes for static scene regions and predicts new codes only where the RGB evidence changes (based on a cosine-similarity threshold); a sketch of this reuse logic follows this list.
  • Training: During post-training, the model is trained on three output styles: action-only, depth-only, and depth-and-action (where the action expert conditions on the predicted depth tokens). Fine-tuning includes noise injection on depth tokens and a learned per-layer gate on the depth portion of the KV conditioning.
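
As referenced above, here is a minimal sketch of the adaptive depth-code reuse logic, assuming per-cell RGB features are compared against cached features with a cosine-similarity test. The feature extractor, the predictor callable, and the threshold value are hypothetical assumptions.

```python
import torch
import torch.nn.functional as F

GRID = 10
SIM_THRESHOLD = 0.95  # assumed cosine-similarity threshold for an "unchanged" cell

def update_depth_codes(cell_feats, cached_feats, cached_codes, predict_codes):
    """
    cell_feats, cached_feats: (GRID*GRID, D) per-cell RGB features for the current / previous frame.
    cached_codes: (GRID*GRID,) previously predicted depth-code tokens (values in [0, 128)).
    predict_codes: callable that predicts depth codes for a subset of cell indices.
    """
    sim = F.cosine_similarity(cell_feats, cached_feats, dim=-1)  # (GRID*GRID,)
    changed = (sim < SIM_THRESHOLD).nonzero(as_tuple=True)[0]    # cells needing re-prediction
    codes = cached_codes.clone()
    if changed.numel() > 0:
        codes[changed] = predict_codes(changed)                  # only re-decode changed cells
    return codes, changed
```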

Empirical Validation / Results

The paper presents one of the most extensive evaluations of an open VLA to date.

1. Molmo2-ER Evaluation (Table 3)

Molmo2-ER outperforms all baseline VLMs on 9 of 13 embodied reasoning benchmarks, achieving a state-of-the-art overall average of 63.8%, surpassing Gemini Robotics ER-1.5 Thinking (61.3%) and GPT-5 (57.9%). It improves over its base model, Molmo2, by 17 points.

2. Out-of-the-Box Deployment

  • Simulation (MolmoSpaces & MolmoBot): MolmoAct2-DROID achieves SOTA performance, outperforming π0.5-DROID and others.
    • Table 4 (MolmoSpaces Pick & Place): MolmoAct2-DROID achieves 37.7% average success vs. π0.5-DROID's 34.5%.
    • Table 5 (MolmoBot): MolmoAct2-DROID achieves 20.6% average success vs. π0.5-DROID's 10.0%.
  • Real-World (DROID setup - Table 6): MolmoAct2-DROID achieves 87.1% average success on 5 novel tasks, significantly outperforming π0.5-DROID (45.2%) and MolmoBot (48.4%).
  • Real-World (SO-100 setup - Table 7): MolmoAct2-SO100/101 achieves 56.7% average score, outperforming the open π0-SO100/101 baseline (45.3%).

3. Effective Fine-Tuning

  • LIBERO Benchmark (Table 8): MolmoAct2 achieves 97.2% average success, the highest among compared open-weight methods, improving over its predecessor MolmoAct-7B-D (86.6%). MolmoAct2-Think further improves to 98.1%.
  • RoboEval Benchmark (Figure 6): MolmoAct2 attains a 44.3% success rate, surpassing π0.5 (40.5%). It also generates higher-quality trajectories with shorter completion times, path lengths, and improved smoothness/stability metrics.
  • Real-World Bimanual YAM Tasks: MolmoAct2 fine-tuned on evaluation data outperforms 4 strong baselines by a large margin (15% over the runner-up).

4. Ablation Studies & Inference Speed

  • Systematic Ablations (Tables 9-13): Key findings:
    • The Molmo2-ER backbone provides a +6.0% gain over Molmo2 for action prediction (Table 9).
    • The per-layer KV connection outperforms hidden-state conditioning (Table 10).
    • Using K=8 flow samples per chunk is optimal (Table 11).
    • Co-training discrete and continuous objectives with full fine-tuning works best (Table 12).
    • Depth-token noise injection and a learned depth gate are beneficial for MolmoAct2-Think (Table 13).
  • Inference Speed (Figure 8): With CUDA Graph optimizations, MolmoAct2 reaches 55.79 Hz control rate (2.42x speedup). MolmoAct2-Think reaches 12.71 Hz (1.58x speedup), making its adaptive-depth reasoning practical.

Theoretical and Practical Implications

  • Democratizing Robotics AI: By releasing a fully open stack (model, data, code) that performs at or above the state-of-the-art, MolmoAct2 lowers the barrier to entry for VLA research and enables practitioners to adapt models to their own robots and tasks.
  • Architectural Innovation: The per-layer KV-cache conditioning provides a novel, effective method for tightly coupling a continuous control expert to a discrete-token VLM, preserving rich intermediate representations.
  • Efficient Embodied Reasoning: The adaptive-depth token approach demonstrates that explicit geometric reasoning can be integrated into VLAs without incurring prohibitive latency, by exploiting temporal redundancy in robot trajectories.
  • Path to Real-World Deployment: The focus on accessible platforms (low-to-medium cost), trajectory quality beyond simple success rate, and efficient fine-tuning directly addresses practical deployment concerns in real-world settings like homes, labs, and factories.

Conclusion

MolmoAct2 is a family of fully open action reasoning models built to bridge the gap between academic VLA research and real-world robotic deployment. By introducing a specialized embodied reasoning backbone (Molmo2-ER), new high-quality datasets, a novel VLA architecture with per-layer KV conditioning, and an adaptive-depth reasoning variant (MolmoAct2-Think), it achieves state-of-the-art performance across a wide range of simulation and real-world benchmarks.

The work demonstrates that open models can match or surpass the performance of closed frontier systems while being more adaptable, interpretable, and efficient. By releasing all components openly, the authors aim for MolmoAct2 to be more than an academic foundation model—to be a practical tool that can be deployed in real-world workflows to deliver meaningful social impact. Future directions likely include scaling to more embodiments, tasks, and further improving the efficiency and robustness of the reasoning mechanisms.