RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies - Summary
Summary (Overview)
- Introduces RoboMME, a large-scale, standardized robotic simulation benchmark designed for systematic evaluation of memory-augmented manipulation policies. It comprises 16 long-horizon tasks (770k timesteps) organized into four suites based on a cognitive taxonomy: Temporal (Counting), Spatial (Permanence), Object (Reference), and Procedural (Imitation) memory.
- Develops the MME-VLA suite, a family of 14 memory-augmented Vision-Language-Action (VLA) model variants built on a shared VLA backbone. It systematically explores three memory representations (Symbolic, Perceptual, Recurrent) integrated via three mechanisms (Memory-as-Context, -Modulator, -Expert).
- Key Finding: No single memory design is universally best. Performance is highly task-dependent: Symbolic memory excels at counting and short-horizon reasoning, while Perceptual memory is critical for time-sensitive and motion-centric tasks. Memory-as-Modulator is the most effective integration strategy for perceptual memory.
- Empirical results show the perceptual memory variant `FrameSamp+Modul` achieves the best overall performance (44.51% success) among non-oracle models. Recurrent memory methods underperformed, likely due to unstable fine-tuning. The benchmark remains challenging, with human performance capping at 90.5%.
- Demonstrates real-world transferability, with similar performance trends observed in physical robot experiments, confirming the benchmark's relevance.
Introduction and Theoretical Foundation
Effective robotic manipulation in open-world settings often requires reasoning over past interactions (history), a capability broadly termed memory. Tasks like counting actions, tracking occluded objects, or imitating demonstrations cannot be solved by relying solely on immediate perception.
Prior work incorporates memory through diverse representations: Symbolic (e.g., language subgoals, point trajectories), Perceptual (e.g., multi-frame visual tokens, memory banks), and Recurrent (e.g., RNNs, Mamba models). However, evaluations are conducted on narrow, non-standardized tasks with different policy backbones, making systematic comparison and understanding of memory designs difficult. Existing benchmarks either lack explicit memory demands or sufficient scale for comprehensive evaluation.
Theoretical Motivation: RoboMME's design is grounded in cognitive theories of human memory. It categorizes memory into four dimensions inspired by long-term memory models:
- Temporal Memory: For event accumulation and ordering (when).
- Spatial Memory: For tracking object locations under occlusion (where).
- Object Memory: For preserving referential identity over time (what).
- Procedural Memory: For reproducing demonstrated motion patterns (how).
This taxonomy provides a structured framework for evaluating the diverse memory requirements of long-horizon manipulation.
Methodology
1. The RoboMME Benchmark
- Environment: Built on the ManiSkill simulator with a 7-DOF Franka Panda arm.
- Observations: Multi-view RGB (front & wrist cameras) and proprioceptive states (joint positions, EEF pose, gripper state).
- Actions: Either 8D joint-space or 7D EEF-space.
- Task Design: 16 intentionally non-Markovian tasks divided into four suites, each targeting a primary memory type (see Table 1).
- Data Curation: 1,600 demonstrations (100 per task) generated via keyframe waypoints with injected noise for behavioral diversity. Episodes are long-horizon (avg. 481 steps).
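The waypoint-noise idea above can be sketched in a few lines. This is an illustrative NumPy snippet, not the paper's actual data-generation code; the function name `perturb_waypoints` and the noise scale are assumptions:

```python
import numpy as np

def perturb_waypoints(waypoints, pos_noise=0.01, seed=None):
    """Inject small Gaussian noise into keyframe waypoints so scripted
    demonstrations vary between episodes (illustrative sketch only).

    waypoints: (K, 3) array of end-effector target positions.
    Returns a perturbed copy of the same shape.
    """
    rng = np.random.default_rng(seed)
    return waypoints + rng.normal(scale=pos_noise, size=waypoints.shape)
```

Re-running the scripted controller through perturbed waypoints yields behaviorally diverse demonstrations for the same task instance.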
Table 1: Task Summary
| Task Name | Memory Type | Avg. #Steps | Key Challenge | Brief Description |
|---|---|---|---|---|
| Task Suite: Counting | ||||
| PickXTimes | T | 538 | count | Pick and place a cube of a given color for a specified number of repetitions. |
| BinFill | T | 604 | count | Place a specified number of cubes of a given color into the bin as the cubes appear over time. |
| SwingXTimes | T | 435 | count, swing-motion | Swing a cube back and forth between two targets for a specified number of cycles. |
| StopCube | T | 317 | count, time-critical | Press a button exactly when a moving cube reaches the target at a specified occurrence. |
| Task Suite: Permanence | ||||
| VideoUnmask | S | 217 | occlusion | Given a video in which all cubes are masked, uncover the cube of a specified color. |
| ButtonUnmask | S | 267 | occlusion | Press the button, during which all cubes are masked, then uncover the cube of a specified color. |
| VideoUnmaskSwap | S | 348 | occlusion, tracking | Given a video in which all cubes are masked and containers dynamically swap positions, uncover the cube of a specified color. |
| ButtonUnmaskSwap | S | 400 | occlusion, tracking | Press the button, during which all cubes are masked and containers dynamically swap positions, then uncover the cube of a specified color. |
| Task Suite: Reference | ||||
| PickHighlight | O | 346 | visual-referential | Pick up all cubes that were visually highlighted in a short time during interaction. |
| VideoRepick | O+T | 687 | action-referential, tracking, count | Given a video showing a cube being manipulated and relocated, pick up the same cube for a specified number of repetitions. |
| VideoPlaceButton | O+T | 974 | language-referential, long-video, tracking | Given a video with interleaved cube placement and button pressing, place the cube on the target specified by a language-described temporal reference. |
| VideoPlaceOrder | O+T | 1134 | language-referential, long-video, tracking | Given a video showing cube placement across multiple targets, place the cube on the target specified by a language-described ordinal reference. |
| Task Suite: Imitation | ||||
| MoveCube | P | 394 | contact-mode, tool-use | Given a video showing cube transport, replicate the same demonstrated manipulation strategy. |
| InsertPeg | P+O | 479 | precise-motion | Given a video showing peg insertion, grasp the same peg at the same end and insert it into a box following the same demonstrated direction. |
| PatternLock | P | 208 | linear-motion | Given a video showing a linear moving pattern, reproduce the same trajectory on the targets. |
| RouteStick | P | 370 | circular-motion | Given a video showing a circular routing pattern, reproduce the same trajectory around sticks. |
| T: Temporal, S: Spatial, O: Object, P: Procedural |
2. Memory-Augmented Policies (MME-VLA Suite)
All models are built upon a shared VLA backbone. The framework explores combinations of memory representations and integration mechanisms.
A. Memory Representations
- Symbolic Memory: History summarized as language subgoals.
  - `SimpleSG`: Simple instructions (e.g., "pick up the green cube").
  - `GroundSG`: Grounded instructions with image coordinates (e.g., "pick up the green cube at [63, 152]").
  - Subgoals generated by a fine-tuned Qwen3-VL-4B model (`QwenVL`), Gemini-2.5-Pro via prompting (`Gemini`), or simulator ground truth (`Oracle`).
- Perceptual Memory: History as a sequence of raw visual tokens from past images.
  - `TokenDrop`: Removes temporally redundant image patches based on RGB differences.
  - `FrameSamp`: Uniformly downsamples and concatenates tokens from sampled frames.
- Recurrent Memory: History compressed into fixed-size latent states.
  - `TTT`: Test-Time Training; updates fast weights online via a self-supervised loss.
  - `RMT`: Recurrent Memory Transformer; uses learnable memory slots updated segment-wise.
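The two perceptual-memory schemes can be illustrated with a minimal NumPy sketch. The function names `frame_sample` and `token_drop` and the change-threshold heuristic are my own simplifications, not the paper's implementation:

```python
import numpy as np

def frame_sample(frames, num_keep):
    """FrameSamp-style: uniformly subsample the frame history and
    concatenate the tokens of the kept frames.

    frames: (T, N, D) array of N visual tokens per frame over T steps.
    Returns a (min(num_keep, T) * N, D) memory token sequence.
    """
    T = frames.shape[0]
    idx = np.linspace(0, T - 1, num=min(num_keep, T)).round().astype(int)
    return frames[idx].reshape(-1, frames.shape[-1])

def token_drop(frames, threshold):
    """TokenDrop-style: keep only patches whose content changed
    noticeably relative to the previous frame; frame 0 is kept in full.

    frames: (T, N, D) token array. Returns an (M, D) array, M <= T * N.
    """
    kept = [frames[0]]                                          # full first frame
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[t - 1]).mean(axis=-1)  # per-patch change
        kept.append(frames[t][diff > threshold])                # drop static patches
    return np.concatenate(kept, axis=0)
```

The trade-off discussed later in the results follows directly: `token_drop` saves tokens on static scenes but discards the unchanged patches that carry global spatial context, while `frame_sample` keeps whole frames at a coarser temporal resolution.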
B. Memory Integration Mechanisms (for Perceptual & Recurrent memory)
- Memory-as-Context: Memory tokens concatenated with input (image, language, proprioception) tokens.
- Memory-as-Modulator: Uses adaptive LayerNorm (AdaLN). Action features cross-attend to memory tokens to produce scale (γ) and shift (β) parameters that modulate normalized action features.
- Memory-as-Expert: Adds a dedicated, lightweight memory expert. Experts interact via block-wise causal attention (action expert attends to VLM and memory experts).
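The Memory-as-Modulator mechanism can be sketched in NumPy under simplifying assumptions (single-head attention, plain linear projections with random illustrative weights; in the real model all of these are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# Hypothetical projection weights for the sketch.
W_q, W_k, W_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
W_gamma, W_beta = (rng.normal(size=(D, D)) * 0.1 for _ in range(2))

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def memory_as_modulator(action_feats, memory_tokens):
    """AdaLN-style modulation: action features cross-attend to memory
    tokens, and the attended context is mapped to per-token scale (gamma)
    and shift (beta) applied to the normalized action features.

    action_feats:  (A, D) action-expert features
    memory_tokens: (M, D) memory tokens
    """
    q, k, v = action_feats @ W_q, memory_tokens @ W_k, memory_tokens @ W_v
    attn = softmax(q @ k.T / np.sqrt(D))        # (A, M) attention over memory
    ctx = attn @ v                              # memory context per action token
    gamma, beta = ctx @ W_gamma, ctx @ W_beta   # scale/shift from memory
    return (1.0 + gamma) * layer_norm(action_feats) + beta

out = memory_as_modulator(rng.normal(size=(4, D)), rng.normal(size=(16, D)))
```

Because memory only enters through γ and β, the backbone's own token stream is untouched, which is one plausible reason this mechanism preserves the pretrained architecture well.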
Evaluation Setup:
- Models: 14 MME-VLA variants + 4 prior methods (the base VLA, the base VLA w/ past actions, SAM2Act+, MemER).
- Memory Budget: Fixed at 512 tokens for fair comparison.
- Training: Multi-task setup (single model across all tasks).
- Evaluation: 50 episodes per task, averaged over 9 runs (3 checkpoints × 3 seeds).
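The evaluation protocol above (50 episodes per task, averaged over 3 checkpoints × 3 seeds) amounts to a simple aggregation; the array layout in this sketch is my assumption, not the paper's code:

```python
import numpy as np

def aggregate_success(results):
    """Average per-task success over the protocol: 50 episodes per task,
    repeated for each checkpoint x seed run.

    results: (tasks, checkpoints, seeds, episodes) boolean success array.
    Returns per-task success rates in percent, averaged over all runs.
    """
    per_run = results.mean(axis=-1)           # success rate of each run
    return per_run.mean(axis=(1, 2)) * 100.0  # average over the 9 runs
```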
Empirical Validation / Results
The main results are presented in Table 3. Key analyses address six research questions (Q1-Q6):
Table 3: Main Results (Success Rates, %) - Excerpt of Key Models
| Method | BinFill | PickXTimes | StopCube | ... | AVG |
|---|---|---|---|---|---|
| Human Performance | 96.00 | 100.0 | 78.00 | ... | 90.50 |
| MME-VLA w/ Symbolic Memory | | | | | |
| `GroundSG+Oracle` (Upper Bound) | 85.78 | 100.0 | 49.67 | ... | 84.08 |
| `GroundSG+QwenVL` | 52.00 | 92.67 | 0.00 | ... | 32.70 |
| `SimpleSG+QwenVL` | 77.56 | 95.33 | 0.44 | ... | 29.00 |
| MME-VLA w/ Perceptual Memory | | | | | |
| `FrameSamp+Modul` (Best Overall) | 39.56 | 87.33 | 42.00 | ... | 44.51 |
| `TokenDrop+Modul` | 34.44 | 83.56 | 5.33 | ... | 38.04 |
| Other Methods | | | | | |
| MemER | 56.67 | 79.33 | 0.00 | ... | 42.38 |
| Base VLA (No Memory) | 30.00 | 42.89 | 6.67 | ... | 17.93 |

Note: Red indicates the best result within each section; the overall best non-oracle model is marked separately in the original paper. See paper for full table.
Q1: Best performing representation & integration?
- Perceptual memory methods perform best overall. `FrameSamp+Modul` achieves the highest average success (44.51%).
- `FrameSamp` outperforms `TokenDrop`, likely because aggressive token pruning removes crucial global spatial context.
- Memory-as-Modulator is the most effective integration strategy for perceptual memory, offering a good balance of performance and architectural preservation.
Q2: Is symbolic reasoning sufficient?
- No. While the oracle-bound `GroundSG+Oracle` solves many tasks (84.08% average), it struggles with manipulation-intensive tasks (e.g., `StopCube`, `InsertPeg`) and cluttered scenes, where precise visuomotor control is the bottleneck.
Q3: Human performance?
- Humans achieve 90.5% success via a VideoQA setup with oracle low-level control, but still fail on long-horizon and time-sensitive tasks, confirming RoboMME's inherent challenge.
Q4: Task-dependent effectiveness?
- Yes, strongly. As shown in Figure 3, different memory designs excel on different task characteristics:
  - Symbolic memory (`GroundSG+QwenVL`) excels at short-horizon and event-salient tasks.
  - Perceptual memory (`FrameSamp+Modul`) excels at motion-centric, time-sensitive, and long-horizon video reasoning tasks.
  - MemER (hybrid) excels at dynamic scene-change tasks.
Q5: Efficiency-performance trade-off?
- Perceptual memory offers the best balance. As the memory budget increases, `FrameSamp+Modul` shows consistent performance gains with a modest computational increase. Methods relying on external VLM inference (`GroundSG+QwenVL`, MemER) incur 3-5x higher compute costs.
Q6: Real-world transfer?
- Yes. Experiments on 4 physical tasks mirroring RoboMME challenges show similar trends (Table 4).
Table 4: Real-World Experiment Results (Successes/10 trials)
| Method | PutFruits (Counting) | TrackCube (Spatial) | RepickBlock (Object) | DrawPattern (Procedural) | Total |
|---|---|---|---|---|---|
| Base VLA (no memory) | 2 | 1 | 1 | 0 | 4/40 |
| `GroundSG+QwenVL` | 9 | 3 | 5 | 2 | 19/40 |
| `FrameSamp+Modul` | 6 | 5 | 6 | 8 | 25/40 |
Theoretical and Practical Implications
- Theoretical: Provides a cognitively-grounded taxonomy (Temporal, Spatial, Object, Procedural) for structuring research into memory for robotics. Demonstrates that effective memory design is not one-size-fits-all but must be matched to task demands.
- Practical: Introduces RoboMME as a standardized benchmark to enable systematic comparison and progress measurement for memory-augmented policies. The MME-VLA suite serves as a controlled testbed for ablating memory designs. Findings guide practitioners: use symbolic memory for high-level reasoning and counting, perceptual memory for motion-centric and time-sensitive tasks, and Memory-as-Modulator for efficient integration.
Conclusion
RoboMME establishes a comprehensive benchmark and evaluation framework for memory in robotic manipulation. Key conclusions:
- Memory is critical for long-horizon, history-dependent tasks, and no single memory representation dominates all scenarios.
- Task demands and memory design are interdependent. Symbolic and perceptual memory offer complementary strengths.
- The `FrameSamp+Modul` (perceptual memory) variant offers the best overall performance-efficiency balance among the tested designs.
- Recurrent memory underperformed in this study, suggesting a need for deeper architectural integration or recurrence-oriented pretraining.
Future Work includes extending RoboMME to mobile manipulation, exploring other VLA backbones, and developing unified frameworks that integrate multiple complementary memory representations. RoboMME is positioned as a foundation for advancing reliable, memory-augmented robotic generalist agents.