RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies - Summary

Summary (Overview)

  • Introduces RoboMME, a large-scale, standardized robotic simulation benchmark designed for systematic evaluation of memory-augmented manipulation policies. It comprises 16 long-horizon tasks (770k timesteps) organized into four suites based on a cognitive taxonomy: Temporal (Counting), Spatial (Permanence), Object (Reference), and Procedural (Imitation) memory.
  • Develops the MME-VLA suite, a family of 14 memory-augmented Vision-Language-Action (VLA) model variants based on the π0.5 backbone. It systematically explores three memory representations (Symbolic, Perceptual, Recurrent) integrated via three mechanisms (Memory-as-Context, -Modulator, -Expert).
  • Key Finding: No single memory design is universally best. Performance is highly task-dependent: Symbolic memory excels at counting and short-horizon reasoning, while Perceptual memory is critical for time-sensitive and motion-centric tasks. Memory-as-Modulator is the most effective integration strategy for perceptual memory.
  • Empirical results show the perceptual memory variant FrameSamp+Modul achieves the best overall performance (44.51% success) among non-oracle models. Recurrent memory methods underperformed, likely due to unstable fine-tuning. The benchmark remains challenging, with human performance capping at 90.5%.
  • Demonstrates real-world transferability, with similar performance trends observed in physical robot experiments, confirming the benchmark's relevance.

Introduction and Theoretical Foundation

Effective robotic manipulation in open-world settings often requires reasoning over past interactions (history), a capability broadly termed memory. Tasks like counting actions, tracking occluded objects, or imitating demonstrations cannot be solved by relying solely on immediate perception.

Prior work incorporates memory through diverse representations: Symbolic (e.g., language subgoals, point trajectories), Perceptual (e.g., multi-frame visual tokens, memory banks), and Recurrent (e.g., RNNs, Mamba models). However, evaluations are conducted on narrow, non-standardized tasks with different policy backbones, making systematic comparison of memory designs difficult. Existing benchmarks either lack explicit memory demands or lack the scale needed for comprehensive evaluation.

Theoretical Motivation: RoboMME's design is grounded in cognitive theories of human memory. It categorizes memory into four dimensions inspired by long-term memory models:

  1. Temporal Memory: For event accumulation and ordering (when).
  2. Spatial Memory: For tracking object locations under occlusion (where).
  3. Object Memory: For preserving referential identity over time (what).
  4. Procedural Memory: For reproducing demonstrated motion patterns (how).

This taxonomy provides a structured framework for evaluating the diverse memory requirements of long-horizon manipulation.

Methodology

1. The RoboMME Benchmark

  • Environment: Built on the ManiSkill simulator with a 7-DOF Franka Panda arm.
  • Observations: Multi-view RGB (front & wrist cameras, 256×256) and proprioceptive states (joint positions, EEF pose, gripper state).
  • Actions: Either 8D joint-space or 7D EEF-space.
  • Task Design: 16 intentionally non-Markovian tasks divided into four suites, each targeting a primary memory type (see Table 1).
  • Data Curation: 1,600 demonstrations (100 per task) generated via keyframe waypoints with injected noise for behavioral diversity. Episodes are long-horizon (avg. 481 steps).

Table 1: Task Summary

| Task Name | Memory Type | Avg. #Steps | Key Challenge | Brief Description |
|---|---|---|---|---|
| *Task Suite: Counting* | | | | |
| PickXTimes | T | 538 | count | Pick and place a cube of a given color for a specified number of repetitions. |
| BinFill | T | 604 | count | Place a specified number of cubes of a given color into the bin as the cubes appear over time. |
| SwingXTimes | T | 435 | count, swing-motion | Swing a cube back and forth between two targets for a specified number of cycles. |
| StopCube | T | 317 | count, time-critical | Press a button exactly when a moving cube reaches the target at a specified occurrence. |
| *Task Suite: Permanence* | | | | |
| VideoUnmask | S | 217 | occlusion | Given a video in which all cubes are masked, uncover the cube of a specified color. |
| ButtonUnmask | S | 267 | occlusion | Press the button, during which all cubes are masked, then uncover the cube of a specified color. |
| VideoUnmaskSwap | S | 348 | occlusion, tracking | Given a video in which all cubes are masked and containers dynamically swap positions, uncover the cube of a specified color. |
| ButtonUnmaskSwap | S | 400 | occlusion, tracking | Press the button, during which all cubes are masked and containers dynamically swap positions, then uncover the cube of a specified color. |
| *Task Suite: Reference* | | | | |
| PickHighlight | O | 346 | visual-referential | Pick up all cubes that were visually highlighted in a short time during interaction. |
| VideoRepick | O+T | 687 | action-referential, tracking, count | Given a video showing a cube being manipulated and relocated, pick up the same cube for a specified number of repetitions. |
| VideoPlaceButton | O+T | 974 | language-referential, long-video, tracking | Given a video with interleaved cube placement and button pressing, place the cube on the target specified by a language-described temporal reference. |
| VideoPlaceOrder | O+T | 1134 | language-referential, long-video, tracking | Given a video showing cube placement across multiple targets, place the cube on the target specified by a language-described ordinal reference. |
| *Task Suite: Imitation* | | | | |
| MoveCube | P | 394 | contact-mode, tool-use | Given a video showing cube transport, replicate the same demonstrated manipulation strategy. |
| InsertPeg | P+O | 479 | precise-motion | Given a video showing peg insertion, grasp the same peg at the same end and insert it into a box following the same demonstrated direction. |
| PatternLock | P | 208 | linear-motion | Given a video showing a linear moving pattern, reproduce the same trajectory on the targets. |
| RouteStick | P | 370 | circular-motion | Given a video showing a circular routing pattern, reproduce the same trajectory around sticks. |

T: Temporal, S: Spatial, O: Object, P: Procedural

2. Memory-Augmented Policies (MME-VLA Suite)

All models are built upon the π0.5 VLA backbone. The framework explores combinations of memory representations and integration mechanisms.

A. Memory Representations

  1. Symbolic Memory: History summarized as language subgoals.
    • SimpleSG: Simple instructions (e.g., "pick up the green cube").
    • GroundSG: Grounded instructions with image coordinates (e.g., "pick up the green cube at [63, 152]").
    • Subgoals generated by: a fine-tuned Qwen3-VL-4B model (QwenVL), Gemini-2.5-Pro via prompting (Gemini), or simulator ground truth (Oracle).
  2. Perceptual Memory: History as a sequence of raw visual tokens from past images.
    • TokenDrop: Removes temporally redundant image patches based on RGB differences.
    • FrameSamp: Uniformly downsamples and concatenates tokens from sampled frames.
  3. Recurrent Memory: History compressed into fixed-size latent states.
    • TTT: Test-Time Training, updates fast weights online via a self-supervised loss.
    • RMT: Recurrent Memory Transformer, uses learnable memory slots updated segment-wise.
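
The two perceptual variants trade off token granularity against temporal coverage. A minimal sketch of the FrameSamp idea, keeping uniformly spaced past frames under a fixed token budget (function name and array shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def frame_samp(history_tokens, budget=512):
    """Illustrative FrameSamp sketch: uniformly sample past frames and
    concatenate their visual tokens under a fixed memory budget.

    history_tokens: list of (tokens_per_frame, dim) arrays, oldest first.
    """
    tokens_per_frame = history_tokens[0].shape[0]
    # How many whole frames fit in the budget, capped by history length.
    n_keep = min(max(1, budget // tokens_per_frame), len(history_tokens))
    # Uniformly spaced frame indices across the episode so far.
    idx = np.linspace(0, len(history_tokens) - 1, n_keep).round().astype(int)
    return np.concatenate([history_tokens[i] for i in idx], axis=0)
```

A TokenDrop-style variant would instead keep all frames but prune patches whose RGB difference from the previous frame falls below a threshold; the paper's finding is that such pruning can discard global spatial context that FrameSamp preserves.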

B. Memory Integration Mechanisms (for Perceptual & Recurrent memory)

  1. Memory-as-Context: Memory tokens concatenated with input (image, language, proprioception) tokens.
  2. Memory-as-Modulator: Uses adaptive LayerNorm (AdaLN). Action features cross-attend to memory tokens to produce scale (γ) and shift (β) parameters that modulate normalized action features.
  3. Memory-as-Expert: Adds a dedicated, lightweight memory expert. Experts interact via block-wise causal attention (action expert attends to VLM and memory experts).
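
The Memory-as-Modulator mechanism can be sketched as follows, assuming single-head cross-attention and illustrative weight matrices (the names, shapes, and single-head simplification are assumptions, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def memory_as_modulator(action_feats, memory_tokens,
                        w_q, w_k, w_v, w_gamma, w_beta):
    """AdaLN-style modulation sketch: action features cross-attend to
    memory tokens; the attended summary is projected to per-feature
    scale (gamma) and shift (beta) applied to the normalized features."""
    q = action_feats @ w_q                 # (n_act, d) queries
    k = memory_tokens @ w_k                # (n_mem, d) keys
    v = memory_tokens @ w_v                # (n_mem, d) values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    m = attn @ v                           # per-action memory summary
    gamma, beta = m @ w_gamma, m @ w_beta  # modulation parameters
    return (1.0 + gamma) * layer_norm(action_feats) + beta
```

Because the memory only enters through γ and β, the backbone's token layout is untouched, which is one plausible reason this mechanism preserves the pretrained architecture better than concatenating memory into the context.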

Evaluation Setup:

  • Models: 14 MME-VLA variants + 4 prior methods (π0.5, π0.5 w/ past actions, SAM2Act+, MemER).
  • Memory Budget: Fixed at 512 tokens for fair comparison.
  • Training: Multi-task setup (single model across all tasks).
  • Evaluation: 50 episodes per task, averaged over 9 runs (3 checkpoints × 3 seeds).

Empirical Validation / Results

The main results are presented in Table 3. Key analyses address six research questions (Q1-Q6):

Table 3: Main Results (Success Rates, %) - Excerpt of Key Models

| Method | BinFill | PickXTimes | StopCube | ... | AVG |
|---|---|---|---|---|---|
| Human Performance | 96.00 | 100.0 | 78.00 | ... | 90.50 |
| *MME-VLA w/ Symbolic Memory* | | | | | |
| GroundSG+Oracle (Upper Bound) | 85.78 | 100.0 | 49.67 | ... | 84.08 |
| GroundSG+QwenVL | 52.00 | 92.67 | 0.00 | ... | 32.70 |
| SimpleSG+QwenVL | 77.56 | 95.33 | 0.44 | ... | 29.00 |
| *MME-VLA w/ Perceptual Memory* | | | | | |
| FrameSamp+Modul (Best Overall) | 39.56 | 87.33 | 42.00 | ... | 44.51 |
| TokenDrop+Modul | 34.44 | 83.56 | 5.33 | ... | 38.04 |
| *Other Methods* | | | | | |
| MemER | 56.67 | 79.33 | 0.00 | ... | 42.38 |
| π0.5 (No Memory) | 30.00 | 42.89 | 6.67 | ... | 17.93 |

Note: In the paper's full table, red marks the best result within each section and a filled square marks the overall best among non-oracle models. See paper for full table.

Q1: Best performing representation & integration?

  • Perceptual memory methods perform best overall. FrameSamp+Modul achieves the highest average success (44.51%).
  • FrameSamp outperforms TokenDrop, likely because aggressive token pruning removes crucial global spatial context.
  • Memory-as-Modulator is the most effective integration strategy for perceptual memory, offering a good balance of performance and architectural preservation.

Q2: Is symbolic reasoning sufficient?

  • No. While the oracle-bound GroundSG+Oracle solves many tasks (84.08%), it struggles with manipulation-intensive (e.g., StopCube, InsertPeg) and cluttered scene tasks, where precise visuomotor control is the bottleneck.

Q3: Human performance?

  • Humans achieve 90.5% success via a VideoQA setup with oracle low-level control, but still fail on long-horizon and time-sensitive tasks, confirming RoboMME's inherent challenge.

Q4: Task-dependent effectiveness?

  • Yes, strongly. As shown in Figure 3, different memory designs excel on different task characteristics:
    • Symbolic memory (GroundSG+QwenVL) excels at short-horizon and event-salient tasks.
    • Perceptual memory (FrameSamp+Modul) excels at motion-centric, time-sensitive, and long-horizon video reasoning tasks.
    • MemER (hybrid) excels at dynamic scene-change tasks.

Q5: Efficiency-performance trade-off?

  • Perceptual memory offers the best balance. As memory budget increases, FrameSamp+Modul shows consistent performance gains with modest computational increase. Methods relying on external VLM inference (GroundSG+QwenVL, MemER) incur 3-5x higher compute costs.

Q6: Real-world transfer?

  • Yes. Experiments on 4 physical tasks mirroring RoboMME challenges show similar trends (Table 4).

Table 4: Real-World Experiment Results (Successes/10 trials)

| Method | PutFruits (Counting) | TrackCube (Spatial) | RepickBlock (Object) | DrawPattern (Procedural) | Total |
|---|---|---|---|---|---|
| π0.5 | 2 | 1 | 1 | 0 | 4/40 |
| GroundSG+QwenVL | 9 | 3 | 5 | 2 | 19/40 |
| FrameSamp+Modul | 6 | 5 | 6 | 8 | 25/40 |

Theoretical and Practical Implications

  • Theoretical: Provides a cognitively-grounded taxonomy (Temporal, Spatial, Object, Procedural) for structuring research into memory for robotics. Demonstrates that effective memory design is not one-size-fits-all but must be matched to task demands.
  • Practical: Introduces RoboMME as a standardized benchmark enabling systematic comparison and progress measurement for memory-augmented policies. The MME-VLA suite serves as a controlled testbed for ablating memory designs. Findings guide practitioners: use symbolic memory for high-level reasoning and counting, perceptual memory for motion-centric and time-sensitive tasks, and Memory-as-Modulator for efficient integration.

Conclusion

RoboMME establishes a comprehensive benchmark and evaluation framework for memory in robotic manipulation. Key conclusions:

  1. Memory is critical for long-horizon, history-dependent tasks, and no single memory representation dominates all scenarios.
  2. Task demands and memory design are interdependent. Symbolic and perceptual memory offer complementary strengths.
  3. The FrameSamp+Modul (perceptual memory) variant offers the best overall performance-efficiency balance among the tested designs.
  4. Recurrent memory underperformed in this study, suggesting need for deeper architectural integration or recurrence-oriented pretraining.

Future Work includes extending RoboMME to mobile manipulation, exploring other VLA backbones, and developing unified frameworks that integrate multiple complementary memory representations. RoboMME is positioned as a foundation for advancing reliable, memory-augmented robotic generalist agents.