# RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

> RoboMME introduces a large-scale benchmark showing that no single memory design is best, with task-dependent performance where symbolic memory excels at counting and perceptual memory at motion tasks.

- **Source:** [arXiv](https://arxiv.org/abs/2603.04639)
- **Published:** 2026-03-09
- **Permalink:** https://picx.dev/p/OJrD9p
- **Whiteboard:** https://picx.dev/p/OJrD9p/image

## Summary


## Summary (Overview)
*   **Introduces RoboMME**, a large-scale, standardized robotic simulation benchmark designed for systematic evaluation of **memory-augmented manipulation policies**. It comprises **16 long-horizon tasks** (770k timesteps) organized into four suites based on a cognitive taxonomy: **Temporal (Counting), Spatial (Permanence), Object (Reference), and Procedural (Imitation) memory**.
*   **Develops the MME-VLA suite**, a family of **14 memory-augmented Vision-Language-Action (VLA) model variants** based on the $\pi_{0.5}$ backbone. It systematically explores **three memory representations (Symbolic, Perceptual, Recurrent)** integrated via **three mechanisms (Memory-as-Context, -Modulator, -Expert)**.
*   **Key Finding: No single memory design is universally best**. Performance is highly **task-dependent**: Symbolic memory excels at counting and short-horizon reasoning, while Perceptual memory is critical for time-sensitive and motion-centric tasks. **Memory-as-Modulator** is the most effective integration strategy for perceptual memory.
*   **Empirical results** show the **perceptual memory variant `FrameSamp+Modul` achieves the best overall performance (44.51% success)** among non-oracle models. Recurrent memory methods underperformed, likely due to unstable fine-tuning. The benchmark remains challenging, with human performance capping at **90.5%**.
*   **Demonstrates real-world transferability**, with similar performance trends observed in physical robot experiments, confirming the benchmark's relevance.

## Introduction and Theoretical Foundation
Effective robotic manipulation in open-world settings often requires **reasoning over past interactions** (history), a capability broadly termed **memory**. Tasks like counting actions, tracking occluded objects, or imitating demonstrations cannot be solved by relying solely on immediate perception.

**Prior work** incorporates memory through diverse representations: **Symbolic** (e.g., language subgoals, point trajectories), **Perceptual** (e.g., multi-frame visual tokens, memory banks), and **Recurrent** (e.g., RNNs, Mamba models). However, these methods are evaluated on **narrow, non-standardized tasks** with different policy backbones, which makes it difficult to compare memory designs systematically. Existing benchmarks either lack explicit memory demands or lack the scale needed for comprehensive evaluation.

**Theoretical Motivation:** RoboMME's design is grounded in cognitive theories of human memory. It categorizes memory into four dimensions inspired by long-term memory models:
1.  **Temporal Memory**: For event accumulation and ordering (*when*).
2.  **Spatial Memory**: For tracking object locations under occlusion (*where*).
3.  **Object Memory**: For preserving referential identity over time (*what*).
4.  **Procedural Memory**: For reproducing demonstrated motion patterns (*how*).

This taxonomy provides a structured framework for evaluating the diverse memory requirements of long-horizon manipulation.

## Methodology

### 1. The RoboMME Benchmark
*   **Environment:** Built on the ManiSkill simulator with a 7-DOF Franka Panda arm.
*   **Observations:** Multi-view RGB (front and wrist cameras, $256\times256$) and proprioceptive states (joint positions, end-effector (EEF) pose, gripper state).
*   **Actions:** Either 8D joint-space or 7D EEF-space.
*   **Task Design:** 16 intentionally **non-Markovian** tasks divided into four suites, each targeting a primary memory type (see Table 1).
*   **Data Curation:** 1,600 demonstrations (100 per task) generated via keyframe waypoints with injected noise for behavioral diversity. Episodes are long-horizon (avg. 481 steps).
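
As a quick consistency check, the dataset figures quoted above can be recomputed from one another. The sketch below uses only numbers stated in this summary; the variable names are illustrative and do not come from the paper or its code.

```python
# Figures quoted in this summary; the variable names are illustrative.
NUM_TASKS = 16
DEMOS_PER_TASK = 100
AVG_STEPS_PER_EPISODE = 481

# Total demonstrations across all tasks.
total_demos = NUM_TASKS * DEMOS_PER_TASK

# Approximate total timesteps, using the average episode length.
total_timesteps = total_demos * AVG_STEPS_PER_EPISODE

print(total_demos)      # 1600
print(total_timesteps)  # 769600, roughly the 770k timesteps quoted in the overview
```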

**Table 1: Task Summary**
| Task Name | Memory Type | Avg. #Steps | Key Challenge | Brief Description |
| :--- | :--- | :--- | :--- | :--- |
| **Task Suite: Counting** | | | | |
| PickXTimes | T | 538 | count | Pick and place a cube of a given color for a specified number of repetitions. |
| BinFill | T | 604 | count | Place a specified number of cubes of a given color into the bin as the cubes appear over time. |
| SwingXTimes | T | 435 | count, swing-motion | Swing a cube back and forth between two targets for a specified number of cycles. |
| StopCube | T | 317 | count, time-critical | Press a button exactly when a moving cube reaches the target at a specified occurrence. |
| **Task Suite: Permanence** | | | | |
| VideoUnmask | S | 217 | occlusion | Given a video in which all cubes are masked, uncover the cube of a specified color. |
| ButtonUnmask | S | 267 | occlusion | Press the button, during which all cubes are masked, then uncover the cube of a specified color. |
| VideoUnmaskSwap | S | 348 | occlusion, tracking | Given a video in which all cubes are masked and containers dynamically swap positions, uncover the cube of a specified color. |
| ButtonUnmaskSwap | S | 400 | occlusion, tracking | Press the button, during which all cubes are masked and containers dynamically swap positions, then uncover the cube of a specified color. |
| **Task Suite: Reference** | | | | |
| PickHighlight | O | 346 | visual-referential | Pick up all cubes that were briefly highlighted during the interaction. |
| VideoRepick | O+T | 687 | action-referential, tracking, count | Given a video showing a cube being manipulated and relocated, pick up the same cube for a specified number of repetitions. |
| VideoPlaceButton | O+T | 974 | language-referential, long-video, tracking | Given a video with interleaved cube placement and button pressing, place the cube on the target specified by a language-described temporal reference. |
| VideoPlaceOrder | O+T | 1134 | language-referential, long-video, tracking | Given a video showing cube placement across multiple targets, place the cube on the target specified by a language-described ordinal reference. |
| **Task Suite: Imitation** | | | | |
| MoveCube | P | 394 | contact-mode, tool-use | Given a video showing cube transport, replicate the same demonstrated manipulation strategy. |
| InsertPeg | P+O | 479 | precise-motion | Given a video showing peg insertion, grasp the same peg at the same end and insert it into a box following the same demonstrated direction. |
| PatternLock | P | 208 | linear-motion | Given a video showing a linear moving pattern, reproduce the same trajectory on the targets. |
| RouteStick | P | 370 | circular-motion | Given a video showing a circular routing pattern, reproduce the same trajectory around sticks. |
*T: Temporal, S: Spatial, O: Object, P: Procedural*

### 2. Memory-Augmented Policies (MME-VLA Suite)
All models are built upon the $\pi_{0.5}$ VLA backbone. The framework explores combinations of memory representations and integration mechanisms.

**A. Memory Representations**
1.  **Symbolic Memory:** History summarized as language subgoals.
    *   `SimpleSG`: Simple instructions (e.g., "pick up the green cube").
    *   `GroundSG`: Grounded instructions with image coordinates (e.g., "pick up the green cube at [63, 152]").
    *   Subgoals generated by: a fine-tuned **Qwen3-VL-4B** model (`QwenVL`), **Gemini-2.5-Pro** via prompting (`Gemini`), or simulator **ground truth** (`Oracle`).
2.  **Perceptual Memory:** History as a sequence of raw visual tokens from past images.
    *   `TokenDrop`: Removes temporally redundant image patches based on RGB differences.
    *   `FrameSamp`: Uniformly downsamples and concatenates tokens from sampled frames.
3.  **Recurrent Memory:** History compressed into fixed-size latent states.
    *   `TTT`: Test-Time Training, updates fast weights online via a self-supervised loss.
    *   `RMT`: Recurrent Memory Transformer, uses learnable memory slots updated segment-wise.
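
The two perceptual-memory variants above can be illustrated with a minimal sketch. It is not the paper's implementation: the tokens-per-frame value (64), the difference threshold, and all function names are assumptions chosen for illustration, and the summary does not specify the actual pruning criterion or patch size.

```python
def frames_within_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Number of whole frames whose tokens fit in the memory budget."""
    return token_budget // tokens_per_frame


def uniform_frame_indices(num_history_frames: int, num_samples: int) -> list[int]:
    """FrameSamp-style selection: indices spread evenly over the history."""
    if num_history_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_history_frames:
        return list(range(num_history_frames))
    if num_samples == 1:
        return [num_history_frames - 1]  # keep only the most recent frame
    step = (num_history_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def changed_patch_mask(prev_frame, curr_frame, threshold: float) -> list[bool]:
    """TokenDrop-style filter: True where a patch changed enough to keep.

    Each frame is a list of patches; each patch is a flat list of RGB values.
    A patch is kept when its mean absolute difference exceeds the threshold.
    """
    mask = []
    for prev_patch, curr_patch in zip(prev_frame, curr_frame):
        diff = sum(abs(a - b) for a, b in zip(prev_patch, curr_patch)) / len(curr_patch)
        mask.append(diff > threshold)
    return mask


# 512 is the memory budget quoted in this summary; 64 tokens per frame is assumed.
num_frames = frames_within_budget(token_budget=512, tokens_per_frame=64)
indices = uniform_frame_indices(num_history_frames=100, num_samples=num_frames)
print(indices)  # [0, 14, 28, 42, 57, 71, 85, 99]

prev = [[0, 0, 0], [10, 10, 10]]
curr = [[0, 0, 0], [40, 10, 10]]
print(changed_patch_mask(prev, curr, threshold=5.0))  # [False, True]
```

Uniform sampling keeps whole frames, so global spatial layout survives, while difference-based pruning keeps only the patches that moved; this is one plausible reading of why the two variants behave differently.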

**B. Memory Integration Mechanisms** (for Perceptual & Recurrent memory)
1.  **Memory-as-Context:** Memory tokens concatenated with input (image, language, proprioception) tokens.
2.  **Memory-as-Modulator:** Uses adaptive LayerNorm (AdaLN). Action features cross-attend to memory tokens to produce scale ($\gamma$) and shift ($\beta$) parameters that modulate normalized action features.
3.  **Memory-as-Expert:** Adds a dedicated, lightweight memory expert. Experts interact via block-wise causal attention (action expert attends to VLM and memory experts).
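
The Memory-as-Modulator mechanism can be sketched in a few lines of NumPy. This is a single-head illustration under stated assumptions, not the paper's architecture: the weight matrices, their shapes, and the `(1 + gamma)` scaling form (common in AdaLN-style layers) are assumptions, and the function names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_modulate(action_feats, memory_tokens, w_q, w_k, w_v, w_gamma, w_beta):
    """Modulate normalized action features with memory-derived scale and shift.

    action_feats: (A, D) action features; memory_tokens: (M, D) memory tokens.
    """
    q = action_feats @ w_q                            # queries from action features
    k = memory_tokens @ w_k                           # keys from memory
    v = memory_tokens @ w_v                           # values from memory
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (A, M) cross-attention weights
    summary = attn @ v                                # (A, D) memory summary per action token
    gamma = summary @ w_gamma                         # scale
    beta = summary @ w_beta                           # shift
    return layer_norm(action_feats) * (1.0 + gamma) + beta  # assumed AdaLN form


rng = np.random.default_rng(0)
A, M, D = 4, 512, 16  # 512 memory tokens matches the budget quoted in this summary
action_feats = rng.normal(size=(A, D))
memory_tokens = rng.normal(size=(M, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# With zero-initialized scale/shift projections the layer reduces to plain LayerNorm.
w_gamma = np.zeros((D, D))
w_beta = np.zeros((D, D))
out = memory_modulate(action_feats, memory_tokens, w_q, w_k, w_v, w_gamma, w_beta)
print(out.shape)
```

Under the zero-initialization assumed here, the modulated path starts as an identity over the normalized features, which is one way such a design can leave a pretrained backbone undisturbed at the start of fine-tuning.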

**Evaluation Setup:**
*   **Models:** 14 MME-VLA variants + 4 prior methods ($\pi_{0.5}$, $\pi_{0.5}$ w/ past actions, SAM2Act+, MemER).
*   **Memory Budget:** Fixed at **512 tokens** for fair comparison.
*   **Training:** Multi-task setup (single model across all tasks).
*   **Evaluation:** 50 episodes per task, averaged over 9 runs (3 checkpoints × 3 seeds).
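
The evaluation protocol above implies 450 evaluated episodes per task for each model (50 episodes × 9 runs). The following sketch shows one way to aggregate per-run results into a single success rate; the success counts are made up for illustration, and the helper names are hypothetical rather than taken from the benchmark's code.

```python
from statistics import mean

# Protocol figures quoted in this summary.
EPISODES_PER_TASK = 50
NUM_CHECKPOINTS = 3
NUM_SEEDS = 3


def success_rate(num_successes: int, num_episodes: int = EPISODES_PER_TASK) -> float:
    """Success rate of one run, in percent."""
    return 100.0 * num_successes / num_episodes


def aggregate_task(successes_per_run: list[int]) -> float:
    """Average the per-run success rates over all checkpoint x seed runs."""
    expected_runs = NUM_CHECKPOINTS * NUM_SEEDS
    if len(successes_per_run) != expected_runs:
        raise ValueError(f"expected {expected_runs} runs, got {len(successes_per_run)}")
    return mean(success_rate(s) for s in successes_per_run)


# Hypothetical success counts (out of 50) for one task across the 9 runs.
example_counts = [20, 22, 19, 21, 23, 20, 18, 22, 21]

episodes_per_task_total = EPISODES_PER_TASK * NUM_CHECKPOINTS * NUM_SEEDS
print(episodes_per_task_total)                     # 450
print(round(aggregate_task(example_counts), 2))
```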

## Empirical Validation / Results
The main results are excerpted in Table 3. The analysis that follows the table addresses six research questions (Q1-Q6).

**Table 3: Main Results (Success Rates, %) - Excerpt of Key Models**
| Method | BinFill | PickXTimes | StopCube | ... | **AVG** |
| :--- | :---: | :---: | :---: | :--- | :---: |
| **Human Performance** | 96.00 | 100.0 | 78.00 | ... | **90.50** |
| **MME-VLA w/ Symbolic Memory** | | | | | |
| `GroundSG+Oracle` (Upper Bound) | 85.78 | 100.0 | 49.67 | ... | **84.08** |
| `GroundSG+QwenVL` | 52.00 | 92.67 | 0.00 | ... | **32.70** |
| `SimpleSG+QwenVL` | 77.56 | 95.33 | 0.44 | ... | 29.00 |
| **MME-VLA w/ Perceptual Memory** | | | | | |
| `FrameSamp+Modul` **(Best Non-Oracle)** | 39.56 | 87.33 | **42.00** | ... | **44.51** |
| `TokenDrop+Modul` | 34.44 | 83.56 | 5.33 | ... | 38.04 |
| **Other Methods** | | | | | |
| `MemER` | 56.67 | 79.33 | 0.00 | ... | 42.38 |
| $\pi_{0.5}$ (No Memory) | 30.00 | 42.89 | 6.67 | ... | 17.93 |

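*Note: In the paper's full table, red marks the best result in each section and $\blacksquare$ marks the overall best among non-oracle models; that color coding is not reproduced in this excerpt. See the paper for the full table.*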

**Q1: Best performing representation & integration?**
*   **Perceptual memory methods perform best overall.** `FrameSamp+Modul` achieves the highest average success (44.51%).
*   **`FrameSamp` outperforms `TokenDrop`**, likely because aggressive token pruning removes crucial global spatial context.
*   **Memory-as-Modulator** is the most effective integration strategy for perceptual memory, offering a good balance of performance and architectural preservation.

**Q2: Is symbolic reasoning sufficient?**
*   **No.** Although the oracle upper bound `GroundSG+Oracle` solves many tasks (84.08% average), it struggles on **manipulation-intensive** tasks (e.g., `StopCube`, `InsertPeg`) and **cluttered-scene** tasks, where precise visuomotor control is the bottleneck.

**Q3: Human performance?**
*   Humans achieve **90.5%** success via a VideoQA setup with oracle low-level control, but still fail on long-horizon and time-sensitive tasks, confirming RoboMME's inherent challenge.

**Q4: Task-dependent effectiveness?**
*   **Yes, strongly.** As shown in Figure 3, different memory designs excel on different task characteristics:
    *   **Symbolic memory** (`GroundSG+QwenVL`) excels at **short-horizon** and **event-salient** tasks.
    *   **Perceptual memory** (`FrameSamp+Modul`) excels at **motion-centric**, **time-sensitive**, and **long-horizon video reasoning** tasks.
    *   **MemER** (hybrid) excels at **dynamic scene-change** tasks.

**Q5: Efficiency-performance trade-off?**
*   Perceptual memory offers the best balance. As memory budget increases, `FrameSamp+Modul` shows consistent performance gains with modest computational increase. Methods relying on external VLM inference (`GroundSG+QwenVL`, `MemER`) incur **3-5x** higher compute costs.

**Q6: Real-world transfer?**
*   **Yes.** Experiments on 4 physical tasks mirroring RoboMME challenges show similar trends (Table 4).

**Table 4: Real-World Experiment Results (Successes/10 trials)**
| Method | PutFruits (Counting) | TrackCube (Spatial) | RepickBlock (Object) | DrawPattern (Procedural) | **Total** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| $\pi_{0.5}$ | 2 | 1 | 1 | 0 | **4/40** |
| `GroundSG+QwenVL` | **9** | 3 | 5 | 2 | **19/40** |
| `FrameSamp+Modul` | 6 | **5** | **6** | **8** | **25/40** |
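
The totals in Table 4 can be recomputed and expressed as success rates with a short script. The per-task counts are copied from the table; the dictionary layout and the `summarize` helper are illustrative only.

```python
TRIALS_PER_TASK = 10

# Successes out of 10 trials per task, copied from Table 4.
results = {
    "pi_0.5": {"PutFruits": 2, "TrackCube": 1, "RepickBlock": 1, "DrawPattern": 0},
    "GroundSG+QwenVL": {"PutFruits": 9, "TrackCube": 3, "RepickBlock": 5, "DrawPattern": 2},
    "FrameSamp+Modul": {"PutFruits": 6, "TrackCube": 5, "RepickBlock": 6, "DrawPattern": 8},
}


def summarize(per_task: dict[str, int]) -> tuple[int, int, float]:
    """Return (total successes, total trials, success rate in percent)."""
    total = sum(per_task.values())
    trials = TRIALS_PER_TASK * len(per_task)
    return total, trials, 100.0 * total / trials


summary = {method: summarize(per_task) for method, per_task in results.items()}
for method, (total, trials, rate) in summary.items():
    print(f"{method}: {total}/{trials} = {rate:.1f}%")

# Best method per task, which makes the task-dependent pattern explicit.
tasks = results["pi_0.5"].keys()
best_per_task = {task: max(results, key=lambda m: results[m][task]) for task in tasks}
print(best_per_task)
```

Symbolic memory wins only the counting task, while perceptual memory leads on the other three, matching the task-dependent trend reported in simulation.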

## Theoretical and Practical Implications
*   **Theoretical:** Provides a **cognitively-grounded taxonomy** (Temporal, Spatial, Object, Procedural) for structuring research into memory for robotics. Demonstrates that effective memory design is not one-size-fits-all but must be **matched to task demands**.
*   **Practical:** Introduces **RoboMME as a standardized benchmark** to enable systematic comparison and progress measurement for memory-augmented policies. The **MME-VLA suite** serves as a controlled testbed for ablating memory designs. The findings guide practitioners: use **symbolic memory for high-level reasoning and counting**, **perceptual memory for motion-centric and time-sensitive tasks**, and **Memory-as-Modulator for efficient integration**.

## Conclusion
RoboMME establishes a **comprehensive benchmark and evaluation framework** for memory in robotic manipulation. Key conclusions:
1.  **Memory is critical** for long-horizon, history-dependent tasks, and no single memory representation dominates all scenarios.
2.  **Task demands and memory design are interdependent.** Symbolic and perceptual memory offer **complementary strengths**.
3.  The **`FrameSamp+Modul` (perceptual memory)** variant offers the best overall performance-efficiency balance among the tested designs.
4.  **Recurrent memory** underperformed in this study, suggesting a need for deeper architectural integration or recurrence-oriented pretraining.

**Future Work** includes extending RoboMME to mobile manipulation, exploring other VLA backbones, and developing **unified frameworks that integrate multiple complementary memory representations**. RoboMME is positioned as a foundation for advancing reliable, memory-augmented robotic generalist agents.

---

_Markdown view of https://picx.dev/p/OJrD9p, served by PicX — AI-generated visual whiteboard summaries of research papers._