RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies - Summary
Summary (Overview)
- Introduces RoboMME, a large-scale, standardized robotic simulation benchmark designed for systematic evaluation of memory-augmented manipulation policies. It comprises 16 long-horizon tasks (770k timesteps) organized into four suites based on a cognitive taxonomy: Temporal (Counting), Spatial (Permanence), Object (Reference), and Procedural (Imitation) memory.
- Develops the MME-VLA suite, a family of 14 memory-augmented Vision-Language-Action (VLA) model variants built on a shared VLA backbone. It systematically explores three memory representations (Symbolic, Perceptual, Recurrent) integrated via three mechanisms (Memory-as-Context, -Modulator, -Expert).
- Key Finding: No single memory design is universally best. Performance is highly task-dependent: Symbolic memory excels at counting and short-horizon reasoning, while Perceptual memory is critical for time-sensitive and motion-centric tasks. Memory-as-Modulator is the most effective integration strategy for perceptual memory.
- Empirical results show the perceptual memory variant `FrameSamp+Modul` achieves the best overall performance (44.51% success) among non-oracle models. Recurrent memory methods underperformed, likely due to unstable fine-tuning. The benchmark remains challenging, with human performance capping at 90.5%.
- Demonstrates real-world transferability, with similar performance trends observed in physical robot experiments, confirming the benchmark's relevance.
Introduction and Theoretical Foundation
Effective robotic manipulation in open-world settings often requires reasoning over past interactions (history), a capability broadly termed memory. Tasks like counting actions, tracking occluded objects, or imitating demonstrations cannot be solved by relying solely on immediate perception.
Prior work incorporates memory through diverse representations: Symbolic (e.g., language subgoals, point trajectories), Perceptual (e.g., multi-frame visual tokens, memory banks), and Recurrent (e.g., RNNs, Mamba models). However, evaluations are conducted on narrow, non-standardized tasks with different policy backbones, making systematic comparison and understanding of memory designs difficult. Existing benchmarks either lack explicit memory demands or sufficient scale for comprehensive evaluation.
Theoretical Motivation: RoboMME's design is grounded in cognitive theories of human memory. It categorizes memory into four dimensions inspired by long-term memory models:
- Temporal Memory: For event accumulation and ordering (when).
- Spatial Memory: For tracking object locations under occlusion (where).
- Object Memory: For preserving referential identity over time (what).
- Procedural Memory: For reproducing demonstrated motion patterns (how).
This taxonomy provides a structured framework for evaluating the diverse memory requirements of long-horizon manipulation.
Methodology
1. The RoboMME Benchmark
- Environment: Built on the ManiSkill simulator with a 7-DOF Franka Panda arm.
- Observations: Multi-view RGB (front & wrist cameras) and proprioceptive states (joint positions, EEF pose, gripper state).
- Actions: Either 8D joint-space or 7D EEF-space.
- Task Design: 16 intentionally non-Markovian tasks divided into four suites, each targeting a primary memory type (see Table 1).
- Data Curation: 1,600 demonstrations (100 per task) generated via keyframe waypoints with injected noise for behavioral diversity. Episodes are long-horizon (avg. 481 steps).
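The waypoint-noise idea above can be sketched in a few lines. This is an illustrative NumPy snippet, not the paper's actual data-generation code; the function name `perturb_waypoints` and the noise scale are assumptions:

```python
import numpy as np

def perturb_waypoints(waypoints, pos_noise=0.01, seed=None):
    """Inject small Gaussian noise into keyframe waypoints so scripted
    demonstrations vary between episodes (illustrative sketch only).

    waypoints: (K, 3) array of end-effector target positions.
    Returns a perturbed copy of the same shape.
    """
    rng = np.random.default_rng(seed)
    return waypoints + rng.normal(scale=pos_noise, size=waypoints.shape)
```

Re-running the scripted controller through perturbed waypoints yields behaviorally diverse demonstrations for the same task instance.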
Table 1: Task Summary
| Task Name | Memory Type | Avg. #Steps | Key Challenge | Brief Description |
|---|---|---|---|---|
| Task Suite: Counting | ||||
| PickXTimes | T | 538 | count | Pick and place a cube of a given color for a specified number of repetitions. |
| BinFill | T | 604 | count | Place a specified number of cubes of a given color into the bin as the cubes appear over time. |
| SwingXTimes | T | 435 | count, swing-motion | Swing a cube back and forth between two targets for a specified number of cycles. |
| StopCube | T | 317 | count, time-critical | Press a button exactly when a moving cube reaches the target at a specified occurrence. |
| Task Suite: Permanence | ||||
| VideoUnmask | S | 217 | occlusion | Given a video in which all cubes are masked, uncover the cube of a specified color. |
| ButtonUnmask | S | 267 | occlusion | Press the button, during which all cubes are masked, then uncover the cube of a specified color. |
| VideoUnmaskSwap | S | 348 | occlusion, tracking | Given a video in which all cubes are masked and containers dynamically swap positions, uncover the cube of a specified color. |
| ButtonUnmaskSwap | S | 400 | occlusion, tracking | Press the button, during which all cubes are masked and containers dynamically swap positions, then uncover the cube of a specified color. |
| Task Suite: Reference | ||||
| PickHighlight | O | 346 | visual-referential | Pick up all cubes that were visually highlighted in a short time during interaction. |
| VideoRepick | O+T | 687 | action-referential, tracking, count | Given a video showing a cube being manipulated and relocated, pick up the same cube for a specified number of repetitions. |
| VideoPlaceButton | O+T | 974 | language-referential, long-video, tracking | Given a video with interleaved cube placement and button pressing, place the cube on the target specified by a language-described temporal reference. |
| VideoPlaceOrder | O+T | 1134 | language-referential, long-video, tracking | Given a video showing cube placement across multiple targets, place the cube on the target specified by a language-described ordinal reference. |
| Task Suite: Imitation | ||||
| MoveCube | P | 394 | contact-mode, tool-use | Given a video showing cube transport, replicate the same demonstrated manipulation strategy. |
| InsertPeg | P+O | 479 | precise-motion | Given a video showing peg insertion, grasp the same peg at the same end and insert it into a box following the same demonstrated direction. |
| PatternLock | P | 208 | linear-motion | Given a video showing a linear moving pattern, reproduce the same trajectory on the targets. |
| RouteStick | P | 370 | circular-motion | Given a video showing a circular routing pattern, reproduce the same trajectory around sticks. |
| T: Temporal, S: Spatial, O: Object, P: Procedural |
2. Memory-Augmented Policies (MME-VLA Suite)
All models are built upon a shared VLA backbone. The framework explores combinations of memory representations and integration mechanisms.
A. Memory Representations
- Symbolic Memory: History summarized as language subgoals.
  - `SimpleSG`: Simple instructions (e.g., "pick up the green cube").
  - `GroundSG`: Grounded instructions with image coordinates (e.g., "pick up the green cube at [63, 152]").
  - Subgoals generated by a fine-tuned Qwen3-VL-4B model (`QwenVL`), Gemini-2.5-Pro via prompting (`Gemini`), or simulator ground truth (`Oracle`).
- Perceptual Memory: History as a sequence of raw visual tokens from past images.
  - `TokenDrop`: Removes temporally redundant image patches based on RGB differences.
  - `FrameSamp`: Uniformly downsamples and concatenates tokens from sampled frames.
- Recurrent Memory: History compressed into fixed-size latent states.
  - `TTT`: Test-Time Training; updates fast weights online via a self-supervised loss.
  - `RMT`: Recurrent Memory Transformer; uses learnable memory slots updated segment-wise.
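The two perceptual-memory schemes can be illustrated with a minimal NumPy sketch. The function names `frame_sample` and `token_drop` and the change-threshold heuristic are my own simplifications, not the paper's implementation:

```python
import numpy as np

def frame_sample(frames, num_keep):
    """FrameSamp-style: uniformly subsample the frame history and
    concatenate the tokens of the kept frames.

    frames: (T, N, D) array of N visual tokens per frame over T steps.
    Returns a (min(num_keep, T) * N, D) memory token sequence.
    """
    T = frames.shape[0]
    idx = np.linspace(0, T - 1, num=min(num_keep, T)).round().astype(int)
    return frames[idx].reshape(-1, frames.shape[-1])

def token_drop(frames, threshold):
    """TokenDrop-style: keep only patches whose content changed
    noticeably relative to the previous frame; frame 0 is kept in full.

    frames: (T, N, D) token array. Returns an (M, D) array, M <= T * N.
    """
    kept = [frames[0]]                                          # full first frame
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[t - 1]).mean(axis=-1)  # per-patch change
        kept.append(frames[t][diff > threshold])                # drop static patches
    return np.concatenate(kept, axis=0)
```

The trade-off discussed later in the results follows directly: `token_drop` saves tokens on static scenes but discards the unchanged patches that carry global spatial context, while `frame_sample` keeps whole frames at a coarser temporal resolution.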
B. Memory Integration Mechanisms (for Perceptual & Recurrent memory)
- Memory-as-Context: Memory tokens concatenated with input (image, language, proprioception) tokens.
- Memory-as-Modulator: Uses adaptive LayerNorm (AdaLN). Action features cross-attend to memory tokens to produce scale (γ) and shift (β) parameters that modulate normalized action features.
- Memory-as-Expert: Adds a dedicated, lightweight memory expert. Experts interact via block-wise causal attention (action expert attends to VLM and memory experts).
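The Memory-as-Modulator mechanism can be sketched in NumPy under simplifying assumptions (single-head attention, plain linear projections with random illustrative weights; in the real model all of these are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# Hypothetical projection weights for the sketch.
W_q, W_k, W_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
W_gamma, W_beta = (rng.normal(size=(D, D)) * 0.1 for _ in range(2))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def memory_as_modulator(action_feats, memory_tokens):
    """AdaLN-style modulation: action features cross-attend to memory
    tokens, and the attended context is mapped to per-token scale (gamma)
    and shift (beta) applied to the normalized action features.

    action_feats:  (A, D) action-expert features
    memory_tokens: (M, D) memory tokens
    """
    q, k, v = action_feats @ W_q, memory_tokens @ W_k, memory_tokens @ W_v
    attn = softmax(q @ k.T / np.sqrt(D))        # (A, M) attention over memory
    ctx = attn @ v                              # memory context per action token
    gamma, beta = ctx @ W_gamma, ctx @ W_beta   # scale/shift from memory
    return (1.0 + gamma) * layer_norm(action_feats) + beta

out = memory_as_modulator(rng.normal(size=(4, D)), rng.normal(size=(16, D)))
```

Because memory only enters through γ and β, the backbone's own token stream is untouched, which is one plausible reason this mechanism preserves the pretrained architecture well.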
Evaluation Setup:
- Models: 14 MME-VLA variants + 4 prior methods (the base VLA, the base VLA w/ past actions, SAM2Act+, MemER).
- Memory Budget: Fixed at 512 tokens for fair comparison.
- Training: Multi-task setup (single model across all tasks).
- Evaluation: 50 episodes per task, averaged over 9 runs (3 checkpoints × 3 seeds).
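The evaluation protocol above (50 episodes per task, averaged over 3 checkpoints × 3 seeds) amounts to a simple aggregation; the array layout in this sketch is my assumption, not the paper's code:

```python
import numpy as np

def aggregate_success(results):
    """Average per-task success over the protocol: 50 episodes per task,
    repeated for each checkpoint x seed run.

    results: (tasks, checkpoints, seeds, episodes) boolean success array.
    Returns per-task success rates in percent, averaged over all runs.
    """
    per_run = results.mean(axis=-1)           # success rate of each run
    return per_run.mean(axis=(1, 2)) * 100.0  # average over the 9 runs
```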
Empirical Validation / Results
The main results are presented in Table 3. Key analyses address six research questions (Q1-Q6):
Table 3: Main Results (Success Rates, %) - Excerpt of Key Models
| Method | BinFill | PickXTimes | StopCube | ... | AVG |
|---|---|---|---|---|---|
| Human Performance | 96.00 | 100.0 | 78.00 | ... | 90.50 |
| MME-VLA w/ Symbolic Memory | | | | | |
| `GroundSG+Oracle` (Upper Bound) | 85.78 | 100.0 | 49.67 | ... | 84.08 |
| `GroundSG+QwenVL` | 52.00 | 92.67 | 0.00 | ... | 32.70 |
| `SimpleSG+QwenVL` | 77.56 | 95.33 | 0.44 | ... | 29.00 |
| MME-VLA w/ Perceptual Memory | | | | | |
| `FrameSamp+Modul` (Best Overall) | 39.56 | 87.33 | 42.00 | ... | 44.51 |
| `TokenDrop+Modul` | 34.44 | 83.56 | 5.33 | ... | 38.04 |
| Other Methods | | | | | |
| MemER | 56.67 | 79.33 | 0.00 | ... | 42.38 |
| Base VLA (No Memory) | 30.00 | 42.89 | 6.67 | ... | 17.93 |

Note: Red indicates the best result within each section; the overall best non-oracle model is marked separately in the original paper. See paper for full table.
Q1: Best performing representation & integration?
- Perceptual memory methods perform best overall. `FrameSamp+Modul` achieves the highest average success (44.51%).
- `FrameSamp` outperforms `TokenDrop`, likely because aggressive token pruning removes crucial global spatial context.
- Memory-as-Modulator is the most effective integration strategy for perceptual memory, offering a good balance of performance and architectural preservation.
Q2: Is symbolic reasoning sufficient?
- No. While the oracle-bound `GroundSG+Oracle` solves many tasks (84.08% average), it struggles with manipulation-intensive tasks (e.g., `StopCube`, `InsertPeg`) and cluttered scenes, where precise visuomotor control is the bottleneck.
Q3: Human performance?
- Humans achieve 90.5% success via a VideoQA setup with oracle low-level control, but still fail on long-horizon and time-sensitive tasks, confirming RoboMME's inherent challenge.
Q4: Task-dependent effectiveness?
- Yes, strongly. As shown in Figure 3, different memory designs excel on different task characteristics:
  - Symbolic memory (`GroundSG+QwenVL`) excels at short-horizon and event-salient tasks.
  - Perceptual memory (`FrameSamp+Modul`) excels at motion-centric, time-sensitive, and long-horizon video reasoning tasks.
  - MemER (hybrid) excels at dynamic scene-change tasks.
Q5: Efficiency-performance trade-off?
- Perceptual memory offers the best balance. As the memory budget increases, `FrameSamp+Modul` shows consistent performance gains with a modest computational increase. Methods relying on external VLM inference (`GroundSG+QwenVL`, MemER) incur 3-5x higher compute costs.
Q6: Real-world transfer?
- Yes. Experiments on 4 physical tasks mirroring RoboMME challenges show similar trends (Table 4).
Table 4: Real-World Experiment Results (Successes/10 trials)
| Method | PutFruits (Counting) | TrackCube (Spatial) | RepickBlock (Object) | DrawPattern (Procedural) | Total |
|---|---|---|---|---|---|
| Base VLA (no memory) | 2 | 1 | 1 | 0 | 4/40 |
| `GroundSG+QwenVL` | 9 | 3 | 5 | 2 | 19/40 |
| `FrameSamp+Modul` | 6 | 5 | 6 | 8 | 25/40 |
Theoretical and Practical Implications
- Theoretical: Provides a cognitively-grounded taxonomy (Temporal, Spatial, Object, Procedural) for structuring research into memory for robotics. Demonstrates that effective memory design is not one-size-fits-all but must be matched to task demands.
- Practical: Introduces RoboMME as a standardized benchmark to enable systematic comparison and progress measurement for memory-augmented policies. The MME-VLA suite serves as a controlled testbed for ablating memory designs. Findings guide practitioners: use symbolic memory for high-level reasoning and counting, perceptual memory for motion-centric and time-sensitive tasks, and Memory-as-Modulator for efficient integration.
Conclusion
RoboMME establishes a comprehensive benchmark and evaluation framework for memory in robotic manipulation. Key conclusions:
- Memory is critical for long-horizon, history-dependent tasks, and no single memory representation dominates all scenarios.
- Task demands and memory design are interdependent. Symbolic and perceptual memory offer complementary strengths.
- The `FrameSamp+Modul` (perceptual memory) variant offers the best overall performance-efficiency balance among the tested designs.
- Recurrent memory underperformed in this study, suggesting a need for deeper architectural integration or recurrence-oriented pretraining.
Future Work includes extending RoboMME to mobile manipulation, exploring other VLA backbones, and developing unified frameworks that integrate multiple complementary memory representations. RoboMME is positioned as a foundation for advancing reliable, memory-augmented robotic generalist agents.