Summary (Overview)
- Act2Answer Protocol: Introduces a lightweight embodied evaluation protocol that converts standard VLM knowledge benchmarks into action-based selection tasks. The agent answers multiple-choice questions by placing a cube on one of several candidate images, reducing confounds from control difficulty and long-horizon planning.
- Curated Test Suite: Constructs 1,720 binary-choice items (3,440 episodes after left/right swapping) across 12 knowledge categories (emotion, attribute, state, color, shape, symmetry, counting, time, public info, traffic, celebrity, living world) adapted from established VLM benchmarks.
- Large-Scale VLM-VLA Comparison: Evaluates 7 state-of-the-art VLA models (π₀, OpenVLA, Magma, Xiaomi-Robotics-R0, InternVLA-M1, SmolVLA, SpatialVLA) and 9 VLM baselines. Finds that VLAs perform strongly on simple perceptual categories (color, shape) but show substantial gaps (≈20–40 points) on richer semantic categories relative to their source VLMs.
- Implicit Knowledge Attenuation: Layerwise linear probing reveals that answer-relevant information is present in intermediate VLM backbone layers but attenuates in the action prediction layers, indicating a bottleneck between semantic representation and action generation.
- Training Supervision Matters: VLA models trained with joint vision-language and robotics supervision (e.g., Magma, Xiaomi-Robotics-R0) consistently outperform those trained primarily on robotics data alone, suggesting VQA co-training helps maintain knowledge-sensitive performance.
Introduction and Theoretical Foundation
Embodied agents require a rich understanding of the world—object properties, typical uses, social norms—to act appropriately. Vision–Language–Action (VLA) models, typically derived by fine-tuning powerful Vision–Language Models (VLMs) on robotics data, are widely proposed as open-world generalizable agents. However, current VLA evaluation almost entirely focuses on manipulation-centric task success (LIBERO, CALVIN, BEHAVIOR-1K), leaving it unclear whether the underlying commonsense and factual knowledge is preserved, catastrophically forgotten, or remains inaccessible for action selection.
The paper decomposes embodied task success into four interacting components: perception, knowledge, control, and environment. It argues that end-to-end success rates conflate these factors, making them non-diagnostic for evaluating knowledge retention. To isolate knowledge, the authors propose an action-grounded evaluation that closely mirrors the multiple-choice format of VLM benchmarks, inspired by cognitive science paradigms where knowledge in nonverbal agents is inferred from conditioned actions rather than verbal reports.
The paper defines seven broad knowledge domains relevant for embodied decision-making:
- Physical world knowledge – object properties, states, affordances, mechanics.
- Temporal knowledge – action semantics, event order, duration.
- Quantitative knowledge – counting, magnitudes, rates.
- Biological knowledge – living vs. non-living, vulnerability, food safety.
- Social knowledge – emotions, intentions, roles, cooperation.
- Normative knowledge – moral rules, safety, context appropriateness.
- Cultural knowledge – shared references, celebrities, symbols.
Methodology
Act2Answer Evaluation Protocol
Each episode starts with a VLM-style question (e.g., “Which is dirty, left or right?”) with two candidate images placed at known positions on a tabletop. The agent receives a natural-language instruction and must indicate its answer by moving a cube onto the chosen image. The prediction is correct if the cube lands within a tolerance radius (\epsilon) around the target image center.
The soft success rate (SR) over (N) binary-choice tasks is defined as:
[ \text{SR} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\big( p^{(i)} \in \mathcal{Z}^+ \big), ]
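where (p^{(i)} \in \mathbb{R}^2) is the final cube position in episode (i), (p_+) and (p_-) are the centers of the correct and incorrect images, and the workspace (\mathcal{W}) is partitioned into a target region ((\mathcal{Z}^+)), an incorrect region ((\mathcal{Z}^-)), and an out-of-bounds region ((\mathcal{Z}^\varnothing)):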
[ \begin{aligned} \mathcal{Z}^+ &= \{ p \in \mathcal{W} : \|p - p_+\| \le \epsilon \}, \\ \mathcal{Z}^- &= \{ p \in \mathcal{W} : \|p - p_-\| \le \epsilon \}, \\ \mathcal{Z}^\varnothing &= \mathcal{W} \setminus (\mathcal{Z}^+ \cup \mathcal{Z}^-). \end{aligned} ]
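The region definitions and the success rate above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the function names, the coordinates, and the value of `eps` are hypothetical, the workspace bounds (\mathcal{W}) are not modeled, and the two discs are assumed not to overlap.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def classify_region(p: Point, p_pos: Point, p_neg: Point, eps: float) -> str:
    """Return '+' (target disc), '-' (incorrect disc), or 'out' (everything else).

    Assumes the two discs of radius eps do not overlap.
    """
    if math.dist(p, p_pos) <= eps:
        return "+"
    if math.dist(p, p_neg) <= eps:
        return "-"
    return "out"


def success_rate(finals: Sequence[Point], targets: Sequence[Point],
                 distractors: Sequence[Point], eps: float) -> float:
    """Fraction of episodes whose final cube position lies in the target disc."""
    hits = sum(
        classify_region(p, p_pos, p_neg, eps) == "+"
        for p, p_pos, p_neg in zip(finals, targets, distractors)
    )
    return hits / len(finals)


# Hypothetical layout: images centered 10 cm left and right of the origin.
LEFT, RIGHT, EPS = (-0.10, 0.0), (0.10, 0.0), 0.05
finals = [(-0.09, 0.01), (0.10, 0.02), (0.0, 0.0)]  # hit, wrong image, neither
sr = success_rate(finals, [LEFT] * 3, [RIGHT] * 3, EPS)
```

Only the first of the three example episodes counts as a success, so a cube that is dropped between the images is scored the same as a wrong choice.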
Performance is interpreted in three regimes relative to the chance level of 0.5, using a margin Δ around chance: instruction/perceptual failure (SR < 0.5 – Δ), no reliable usable knowledge (|SR – 0.5| ≤ Δ), and evidence of usable knowledge (SR > 0.5 + Δ). Each example is evaluated in both left/right configurations to reduce positional bias, and the average over the two is reported.
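A short sketch of how the three regimes and the left/right averaging interact is given below. The function names and the margin value `delta=0.05` are assumptions made for illustration; the source does not state the value of Δ here.

```python
def averaged_sr(sr_target_left: float, sr_target_right: float) -> float:
    """Average the success rates of the two mirrored configurations."""
    return 0.5 * (sr_target_left + sr_target_right)


def regime(sr: float, delta: float, chance: float = 0.5) -> str:
    """Map a success rate to one of the three interpretive regimes."""
    if sr > chance + delta:
        return "usable knowledge"
    if sr < chance - delta:
        return "instruction/perceptual failure"
    return "no reliable usable knowledge"


# A policy that always moves the cube to the left image is perfect when the
# target is on the left and always wrong when it is on the right.
always_left = averaged_sr(sr_target_left=1.0, sr_target_right=0.0)
label = regime(always_left, delta=0.05)  # hypothetical margin
```

Under these assumptions a purely positional policy averages to 0.5 and lands in the middle regime, which is the reason for evaluating both configurations.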
Data Curation
Tasks are sourced from five VLM benchmarks (MLLM-CompBench, IconQA, MMBench, OK-VQA, VL-Think) and manually filtered for instruction length and visual clarity. Each item is rewritten by an LLM into a binary-choice question, then wrapped into the Simpler environment. The final suite contains 1,720 unique binary items, yielding 3,440 episodes after spatial swapping.
| VLM Benchmark | Knowledge Domain(s) |
|---|---|
| MLLM-CompBench | Emotion, Attribute, State |
| IconQA | Time, Shape, Symmetry, Counting |
| MMBench | Celebrity |
| OK-VQA | Living World |
| VL-Think | Public Info, Traffic, Color |
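The spatial swapping step described above, which turns 1,720 items into 3,440 episodes, can be sketched as follows. The `Item` and `Episode` types and their field names are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Item:
    question: str
    correct_image: str
    wrong_image: str


@dataclass(frozen=True)
class Episode:
    instruction: str
    left_image: str
    right_image: str
    target_side: str  # "left" or "right"


def expand_with_swap(items: List[Item]) -> List[Episode]:
    """Create two episodes per item, one for each placement of the correct image."""
    episodes: List[Episode] = []
    for it in items:
        episodes.append(Episode(it.question, it.correct_image, it.wrong_image, "left"))
        episodes.append(Episode(it.question, it.wrong_image, it.correct_image, "right"))
    return episodes


# Placeholder items, used only to check the episode count.
items = [Item(f"question {i}", f"pos_{i}.png", f"neg_{i}.png") for i in range(1720)]
episodes = expand_with_swap(items)
```

Because each item appears once with the target on each side, the suite is balanced by construction: exactly half of the episodes have the correct image on the left.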
Layerwise Intent Probing
To localize answer-relevant information, linear classifiers are independently trained on hidden states from every layer (both VLM backbone and Action Expert) of each VLA model. The probes predict the correct answer label (y \in \{0,1\}). The Chance-Normalized Retention metric compares the strongest above-chance signal in the Action Expert to that in the backbone:
[ \text{Retention} = \frac{\max_n (s^{\text{exp}}_n - c)}{\max_n (s^{\text{bb}}_n - c) + \varepsilon}, ]
where (c) is chance-level accuracy and (s^{\text{bb}}_n, s^{\text{exp}}_n) are probe accuracies at backbone and Action Expert layers respectively.
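The metric can be computed directly from per-layer probe accuracies. The sketch below is illustrative only: the function name, the default `eps`, and the example accuracies are hypothetical and are not values from the paper.

```python
from typing import Sequence


def retention(backbone_acc: Sequence[float], expert_acc: Sequence[float],
              chance: float = 0.5, eps: float = 1e-8) -> float:
    """Chance-normalized retention: best above-chance margin in the Action
    Expert divided by the best above-chance margin in the backbone."""
    best_expert = max(s - chance for s in expert_acc)
    best_backbone = max(s - chance for s in backbone_acc)
    return best_expert / (best_backbone + eps)


# Hypothetical probe accuracies (fractions in [0, 1]), one entry per layer.
backbone_layers = [0.52, 0.70, 0.66]
expert_layers = [0.60, 0.55]
r = retention(backbone_layers, expert_layers)  # close to 0.10 / 0.20
```

A value near 1 means the best Action Expert layer is almost as decodable as the best backbone layer. As the formula is written, the value becomes negative when every Action Expert layer is below chance.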
Empirical Validation / Results
Table 2: Knowledge-Sensitive Performance
| Model | Emotion | Attribute | State | Color | Shape | Symmetry | Counting | Time | Public Info | Traffic | Celebrity | Living World |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VLM Baselines (action-free text probe) | | | | | | | | | | | | |
| InternVL3.5-8B | 95% | 68% | 64% | 100% | 89% | 69% | 52% | 99% | 85% | 75% | 99% | 91% |
| Qwen2.5-32B | 99% | 69% | 69% | 100% | 93% | 83% | 61% | 99% | 85% | 86% | 100% | 96% |
| ... | | | | | | | | | | | | |
| VLA Models (Act2Answer action) | | | | | | | | | | | | |
| OpenVLA | 48% | 51% | 49% | 89% | 64% | 45% | 48% | 49% | 49% | 46% | 50% | 52% |
| SpatialVLA | 47% | 48% | 50% | 87% | 83% | 45% | 52% | 46% | 51% | 57% | 55% | 49% |
| Magma | 72% | 63% | 59% | 89% | 81% | 37% | 51% | 77% | 88% | 80% | 94% | 77% |
| Xiaomi-Robotics-R0 | 63% | 52% | 50% | 91% | 82% | 58% | 48% | 52% | 64% | 57% | 68% | 56% |
Key findings by research question:
- RQ1 – Simple primitives (Color, Shape): nearly all VLAs score above 80% (among the rows shown in Table 2, only OpenVLA's 64% on Shape falls short), indicating basic perceptual knowledge is preserved.
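- RQ2 – Complex semantics (Emotion, Attribute, State, Time, Symmetry, Counting): most VLAs remain at or near chance. Magma is the clearest exception, scoring well above chance on Emotion (72%) and Time (77%), although it is also at chance on Counting (51%) and below chance on Symmetry (37%).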
- RQ3 – VLM-VLA gap: source VLMs outperform their VLA counterparts by roughly 20–40 points across most categories, confirming substantial knowledge attenuation after robotics adaptation.
- RQ4 – Probing (Figure 4, Table 3): Probe accuracies are above chance in the middle layers of the VLM backbone but decline in the final Action Expert layers, suggesting knowledge is retained internally but fails to translate into correct actions. The Retention metric (Table 3) is highest for Magma (0.87) and lowest for π₀ (0.36).
- RQ5 – VQA co-training: Models trained with joint vision-language supervision (Magma, Xiaomi-Robotics-R0, InternVLA-M1) perform better on high-level semantic tasks than robotics-only models (OpenVLA, π₀, SpatialVLA).
- RQ6 – Downstream fine-tuning: Additional SFT/RL on a pick-and-place dataset does not consistently improve knowledge-sensitive performance; some categories (State, Color) even drop after fine-tuning.
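As a rough illustration of the RQ2 and RQ5 pattern, the snippet below averages the VLA rows shown in Table 2 over the ten categories other than Color and Shape. The unweighted average is an assumption made here for illustration and is not a statistic reported by the paper. Category sizes may differ, so it should be read only as simple arithmetic on the table.

```python
CATEGORIES = ["Emotion", "Attribute", "State", "Color", "Shape", "Symmetry",
              "Counting", "Time", "Public Info", "Traffic", "Celebrity",
              "Living World"]

# Accuracy in percent, copied from the VLA rows shown in Table 2.
TABLE2 = {
    "OpenVLA":            [48, 51, 49, 89, 64, 45, 48, 49, 49, 46, 50, 52],
    "SpatialVLA":         [47, 48, 50, 87, 83, 45, 52, 46, 51, 57, 55, 49],
    "Magma":              [72, 63, 59, 89, 81, 37, 51, 77, 88, 80, 94, 77],
    "Xiaomi-Robotics-R0": [63, 52, 50, 91, 82, 58, 48, 52, 64, 57, 68, 56],
}

PRIMITIVES = {"Color", "Shape"}


def mean_excluding(scores, excluded):
    """Unweighted mean over the categories not listed in `excluded`."""
    kept = [s for c, s in zip(CATEGORIES, scores) if c not in excluded]
    return sum(kept) / len(kept)


semantic_means = {m: mean_excluding(v, PRIMITIVES) for m, v in TABLE2.items()}
for model, mean in sorted(semantic_means.items(), key=lambda kv: -kv[1]):
    print(f"{model:20s} {mean:5.1f}")
```

On these rows the two co-trained models (Magma, Xiaomi-Robotics-R0) average 69.8 and 56.8, while OpenVLA and SpatialVLA sit at 48.7 and 50.0, close to the 50% chance level.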
Table 3: Layerwise Retention Metrics
| Model | Prefix max accuracy (%) | Action max accuracy (%) | Retention |
|---|---|---|---|
| Magma | 75.23 | 72.60 | 0.8702 |
| Xiaomi-Robotics-R0 | 68.04 | 64.98 | 0.8159 |
| SpatialVLA | 65.70 | 62.60 | 0.7808 |
| OpenVLA | 68.71 | 64.61 | 0.7697 |
| SmolVLA | 63.18 | 57.73 | 0.5809 |
| π₀ | 64.99 | 55.40 | 0.3620 |
Theoretical and Practical Implications
The results challenge the implicit assumption that fine-tuning VLMs on action data preserves world knowledge. The systematic VLM-VLA gap highlights that current training pipelines may prioritize low-level control adaptation at the expense of richer semantic understanding. The layerwise probing reveals that knowledge is not necessarily erased—it remains linearly recoverable from intermediate representations—but becomes disconnected from the action head, pointing to a semantic-to-action bottleneck.
From a practical perspective, a robot that can grasp a cup but cannot distinguish a “dirty” cup from a “clean” one, or a “sad” human from a “neutral” one, has limited usefulness in everyday environments. The findings advocate for:
- Training objectives that explicitly maintain and align semantic knowledge with motor policies (e.g., VQA co-training).
- Architectural designs that preserve answer-relevant information through the action layers.
- Evaluation protocols like Act2Answer that go beyond task success to measure knowledge retention.
Conclusion
The paper introduces Act2Answer, a protocol that adapts VLM knowledge benchmarks to the embodied setting via action-based answer selection. It curates a diverse test suite of 1,720 binary-choice items (3,440 episodes after left/right swapping) across 12 knowledge categories and evaluates 7 VLA models alongside 9 VLM baselines. Key findings include:
- Current VLAs perform well on simple perceptual categories (color, shape) but struggle on richer semantic, temporal, normative, cultural, and biological categories.
- A large performance gap exists between source VLMs and their VLA counterparts, confirming knowledge attenuation during robotics fine-tuning.
- Layerwise probing shows answer-relevant information is present in intermediate layers but attenuates in action layers, indicating a bottleneck between semantic understanding and action generation.
- Continued vision-language supervision (VQA co-training) is associated with better knowledge-sensitive performance; additional downstream fine-tuning does not consistently help and lowers scores in some categories.
These results suggest that simply fine-tuning VLMs on action data is insufficient. The next generation of embodied agents requires architectures and training objectives that maintain and align the backbone’s semantic understanding with learned motor policies, rather than allowing stronger control adaptation to degrade broader knowledge-sensitive capabilities. The Act2Answer benchmark is publicly available at tttonyalpha.github.io/act2answer.