RLDX-1 Technical Report: A Vision-Language-Action Model for Dexterous Manipulation
Summary
- Unified Architecture: RLDX-1 introduces the Multi-Stream Action Transformer (MSAT), a novel architecture that integrates motion awareness, long-term memory, and physical sensing into a single Vision-Language-Action (VLA) model for dexterous manipulation.
- Three-Stage Training: The model is trained through a progressive pipeline: pre-training on diverse multi-embodiment data, mid-training for embodiment-specific functional capabilities, and post-training for task adaptation, optionally enhanced with reinforcement learning.
- Synthetic Data Pipeline: A generation-and-filtering framework produces synthetic robot data to augment rare manipulation scenarios, improving scene and task diversity and enhancing downstream policy performance.
- Inference Optimization: A two-stage optimization (static graph conversion and custom kernel fusion) achieves a >1.6× speedup, reducing per-step latency to ~43.7 ms, enabling real-time deployment.
- State-of-the-Art Performance: RLDX-1 consistently outperforms frontier VLAs (π0.5, GR00T N1.6) across simulation benchmarks and real-world tasks, particularly excelling in tasks requiring motion awareness, long-term memory, and physical sensing.
Introduction and Theoretical Foundation
Creating generalist robot policies capable of human-like dexterous manipulation in real-world environments remains a central challenge in robotics. Vision-Language-Action (VLA) models have made progress by inheriting versatile intelligence (broad scene understanding and language-conditioned generalization) from pre-trained Vision-Language Models (VLMs). However, versatility alone is insufficient for many complex real-world tasks, which demand broader functional capabilities:
- Motion Awareness: To operate in dynamic environments (e.g., interacting with moving objects).
- Long-Term Memory: For sequential and long-horizon tasks requiring reasoning over past interactions.
- Physical Sensing: To infer contact forces under occlusion or subtle visual changes (e.g., grasping deformable objects).
RLDX-1 is designed to address this gap. It is a general-purpose robotic policy built on a unified neural architecture (MSAT) that integrates these heterogeneous modalities. The system combines this architecture with key design choices: a synthetic data generation pipeline for rare scenarios, a specialized three-stage training procedure, and inference optimizations for real-time control. The theoretical foundation rests on extending the capabilities of VLAs beyond static scene understanding to handle the temporal, memory-dependent, and contact-rich nature of real-world manipulation.
Methodology
1. Neural Architecture
RLDX-1 consists of two main components: a temporally-aware Vision-Language Model (VLM) and a multimodal action model.
Vision-Language Model (RLDX-1-VLM):
- Base Model: Built upon Qwen3-VL 8B, fine-tuned on a robot-specific Visual Question Answering (VQA) dataset to improve embodied grounding (spatial relationships, subtask inference, low-level action alignment).
- Cognition Tokens: Learnable query tokens are appended to the input sequence to extract action-relevant cognition features from the VLM's intermediate layers.
- Motion Awareness: Integrated via a motion module that captures temporal dynamics. Given video features $F_\ell \in \mathbb{R}^{T \times N \times d}$ from layer $\ell$, the module computes a space-time self-similarity tensor $S$ with entries $S_{(t,i),(t',j)} = \langle \hat{F}_{t,i}, \hat{F}_{t',j} \rangle$ over normalized features, and updates the features as $F_\ell \leftarrow F_\ell + g_\phi(S)$, where $g_\phi$ embeds each token's similarity profile back into the feature space (see the sketch after this list).
Multi-frame observations are compressed into a single context token via average pooling after the early LLM layers for efficiency.
- Long-Term Memory: An explicit memory module maintains a queue $M_t = \{z_{t-K}, \dots, z_{t-1}\}$ of past cognition features. A Transformer processes $M_t$ to produce memory features $m_t$ that condition the action model.
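To make the motion module concrete, here is a minimal PyTorch sketch of the space-time self-similarity update using the notation above ($F_\ell$, $S$, $g_\phi$). The full-tensor similarity and the two-layer MLP embedding are simplifying assumptions; the report does not specify dimensions or a similarity neighborhood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionModule(nn.Module):
    """Minimal space-time self-similarity (STSS) update, as sketched above.

    Assumptions (not specified in the report): full-frame similarity rather
    than a local neighborhood, and a two-layer MLP as the embedding g_phi.
    """

    def __init__(self, dim: int, num_frames: int, num_tokens: int):
        super().__init__()
        # g_phi: maps each token's similarity profile (T*N scores) back to dim.
        self.g_phi = nn.Sequential(
            nn.Linear(num_frames * num_tokens, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, N, d) video features from an intermediate VLM layer.
        B, T, N, d = feats.shape
        f = F.normalize(feats, dim=-1)                  # cosine-normalized
        flat = f.reshape(B, T * N, d)
        # S[b, i, j] = <f_i, f_j>: space-time self-similarity tensor.
        sim = torch.einsum("bid,bjd->bij", flat, flat)  # (B, T*N, T*N)
        update = self.g_phi(sim).reshape(B, T, N, d)    # residual motion features
        return feats + update
```

Averaging the updated features, e.g. `module(feats).mean(dim=(1, 2))`, would then yield the single motion context token described above.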
Action Model (Multi-Stream Action Transformer - MSAT):
- Formulation: A flow-matching Diffusion Transformer (DiT) that generates a chunk of $H$ future actions $A_t = [a_t, \dots, a_{t+H-1}]$. Given a noisy action chunk $A_t^\tau = \tau A_t + (1-\tau)\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and conditioning inputs $c$ (cognition, memory, and physics features), it learns a velocity field $v_\theta$ via the flow-matching objective:
$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{\tau,\,\epsilon}\big[\,\| v_\theta(A_t^\tau, \tau, c) - (A_t - \epsilon) \|^2\,\big].$$
During inference, actions are generated over $N$ denoising steps using Euler's method, starting from $A_t^0 \sim \mathcal{N}(0, I)$ with step size $\delta = 1/N$:
$$A_t^{\tau+\delta} = A_t^\tau + \delta\, v_\theta(A_t^\tau, \tau, c).$$
- MSAT Design: Extends the Multi-Modal Diffusion Transformer (MM-DiT) to action modeling. Heterogeneous modalities are processed through dedicated streams (Cognition, Action, Physics) coupled via joint self-attention: each stream applies its own normalization and QKV projections, the outputs are concatenated for joint attention, then split back per stream (a minimal sketch follows this list).
- Physics Stream: Handles physical signals (tactile, torque) with an auxiliary objective to predict future physical signals, encouraging the model to internalize physical interaction dynamics.
- Design Choices: Uses Rotary Positional Embeddings (RoPE) on the Action stream, injects the flow-matching timestep as an in-context token, and employs RMSNorm and SwiGLU activation.
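To ground the MSAT design, below is a minimal PyTorch sketch of one joint self-attention step over the three streams, followed by the Euler sampler from the formulation above. Stream names match the text; the widths, head counts, the `v_theta`/`cond` interface, and the omission of RoPE, SwiGLU, and the timestep token are simplifying assumptions, not the report's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointStreamAttention(nn.Module):
    """One MSAT-style joint self-attention step: per-stream norms and QKV
    projections, concatenation for joint attention, then a split back."""

    def __init__(self, dim: int, n_heads: int,
                 streams=("cognition", "action", "physics")):
        super().__init__()
        self.n_heads, self.streams = n_heads, streams
        # The paper uses RMSNorm; LayerNorm keeps this sketch dependency-light.
        self.norm = nn.ModuleDict({s: nn.LayerNorm(dim) for s in streams})
        self.qkv = nn.ModuleDict({s: nn.Linear(dim, 3 * dim) for s in streams})
        self.out = nn.ModuleDict({s: nn.Linear(dim, dim) for s in streams})

    def forward(self, xs: dict) -> dict:
        # xs maps stream name -> (B, L_s, dim); sequence lengths may differ.
        qs, ks, vs, lens = [], [], [], []
        for s in self.streams:
            q, k, v = self.qkv[s](self.norm[s](xs[s])).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v); lens.append(xs[s].shape[1])

        def split_heads(t):  # (B, L, dim) -> (B, n_heads, L, head_dim)
            B, L, D = t.shape
            return t.view(B, L, self.n_heads, D // self.n_heads).transpose(1, 2)

        q, k, v = (split_heads(torch.cat(x, dim=1)) for x in (qs, ks, vs))
        y = F.scaled_dot_product_attention(q, k, v)   # joint attention over all streams
        y = y.transpose(1, 2).flatten(2)              # (B, sum L_s, dim)
        parts = y.split(lens, dim=1)                  # split back per stream
        return {s: xs[s] + self.out[s](p) for s, p in zip(self.streams, parts)}

@torch.no_grad()
def sample_actions(v_theta, cond, horizon, act_dim, n_steps=10, device="cuda"):
    """Euler integration of the learned velocity field (see objective above)."""
    a = torch.randn(1, horizon, act_dim, device=device)   # A^0 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        tau = torch.full((1,), i * dt, device=device)
        a = a + dt * v_theta(a, tau, cond)                # A^{tau+dt} = A^tau + dt*v
    return a
```

In a full block, each stream's output would additionally pass through a stream-specific feed-forward network (SwiGLU in the report) before the next layer.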
2. Training Data
RLDX-1 uses three complementary data sources:
- Public Real-World Data: Curated from datasets like Open-X-Embodiment (OXE), DROID, Galaxea Open-World, Agibot World, Fourier ActionNet, and Humanoid Everyday (~1.5M episodes).
- In-house Real-World Data: Collected on the ALLEX humanoid (48-DoF) and a sensor-augmented Franka Research 3 (FR3) platform with tactile (AnySkin) and torque sensors.
- Synthetic Data: Generated via a pipeline (Fig. 4) to augment rare scenarios:
- Generation: Source demonstrations are diversified via scene augmentation (I2I editing of initial frames) and task augmentation (VLM-generated instructions). Image-to-video (I2V) models generate videos, which are annotated with actions via an Inverse Dynamics Model (IDM).
- Filtering: Two-stage filtering improves quality:
- Video Quality Filtering: A VLM evaluates instruction following and trajectory plausibility.
- Motion-Consistency Filtering: IDM-predicted actions are replayed in a simulator; a consistency classifier compares the rollout video to the synthetic video, retaining only high-scoring samples (the full generate-and-filter loop is sketched below).
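The generation-and-filtering loop can be summarized in sketch form. Every callable and threshold below (`i2i_edit`, `i2v_generate`, `idm`, the scorers, the cutoffs) is a hypothetical stand-in; the report does not name the underlying I2I/I2V/IDM models or score thresholds:

```python
def synthesize_episodes(source_demos, i2i_edit, gen_instructions, i2v_generate,
                        idm, vlm_score, sim_replay, consistency_score,
                        q_thresh=0.8, c_thresh=0.8):
    """Sketch of the synthetic-data pipeline: augment -> generate -> label -> filter."""
    kept = []
    for demo in source_demos:
        first_frame = i2i_edit(demo.frames[0])          # scene augmentation (I2I)
        for instr in gen_instructions(demo):            # task augmentation (VLM)
            video = i2v_generate(first_frame, instr)    # image-to-video generation
            actions = idm(video)                        # IDM action annotation
            # Stage 1: VLM checks instruction following / trajectory plausibility.
            if vlm_score(video, instr) < q_thresh:
                continue
            # Stage 2: replay IDM actions in sim; keep only motion-consistent clips.
            rollout = sim_replay(actions, demo.scene)
            if consistency_score(rollout, video) >= c_thresh:
                kept.append((video, instr, actions))
    return kept
```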
3. Training Procedure
A three-stage pipeline progressively specializes the policy.
Pre-Training: Trained on a large-scale multi-embodiment dataset (Fig. 6) for 100K steps to learn general action-prediction capabilities. Uses embodiment-specific projection layers and an embodiment-agnostic layer for generalization.
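One way to realize embodiment-specific projections around a shared embodiment-agnostic layer is a keyed head per robot; the module below is an illustrative guess at that wiring (the action dimensions, e.g. 8 for FR3, are assumptions), not the report's exact design:

```python
import torch.nn as nn

class EmbodimentProjection(nn.Module):
    """Per-embodiment action heads around one shared (embodiment-agnostic) layer."""

    def __init__(self, embodiments: dict, hidden: int):
        super().__init__()
        # embodiments: name -> action dim, e.g. {"allex": 48, "fr3": 8} (assumed).
        self.shared = nn.Linear(hidden, hidden)  # embodiment-agnostic layer
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, dim) for name, dim in embodiments.items()}
        )

    def forward(self, h, embodiment: str):
        # Route through the shared layer, then the head for this embodiment.
        return self.heads[embodiment](self.shared(h))
```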
Mid-Training: Adapts the pre-trained model to the target platforms (ALLEX, FR3) and injects the functional capabilities (Fig. 7). Combines in-house data with synthetic data, and integrates the motion module, memory module, and physics stream. Training runs for 25K steps with modality dropout and an alignment warmup.
Post-Training: Specializes the model for downstream tasks via:
- Adaptive Data Collection: Iterative protocol starting with a base dataset (balanced consistency/variance) and refining based on observed failure modes.
- Reinforcement Learning (RECAP): Uses a text-based VLM critic that predicts values autoregressively with the VLM's native number tokens, enabling reliable value estimation from limited data (a simplified decoding sketch follows).
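As an illustration of the number-token critic, here is a simplified single-digit sketch using Hugging Face-style model/tokenizer interfaces. RECAP's actual critic decodes multi-token values autoregressively; the prompt format and value range here are assumptions:

```python
import torch

@torch.no_grad()
def expected_value_from_digits(model, tokenizer, prompt: str) -> float:
    """Read a scalar value in [0, 9] as the probability-weighted average over the
    VLM's digit tokens at the next position (single-digit simplification)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]                        # next-token logits
    digit_ids = [tokenizer.convert_tokens_to_ids(str(d)) for d in range(10)]
    probs = torch.softmax(logits[digit_ids], dim=-1)         # renormalize over digits
    return float((probs * torch.arange(10.0)).sum())
```

Reading off an expectation over number tokens, rather than sampling a single textual answer, is what makes the value estimate usable with limited critic training data.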
4. Inference Strategy
Optimizations reduce per-step latency for real-time control.
Graph Capture Optimization: Converts the model into a static graph by precomputing constant tensors (RoPE embeddings, attention masks) and capturing the entire forward pass as a single CUDA Graph, eliminating kernel launch overhead and graph fragmentation.
Kernel Optimization: Designs custom fused kernels for critical operator groups (e.g., RMSNorm + RoPE + Attention) to minimize memory traffic and coordinate data movement within a single kernel, overcoming the fixed fusion patterns of torch.compile.
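Static-graph capture of a policy forward pass can be done with PyTorch's CUDA Graphs API. The snippet below is the generic capture-and-replay pattern, not the report's full pipeline (which additionally precomputes RoPE tables and attention masks and layers fused kernels on top):

```python
import torch

def capture_policy(model, example_inputs, warmup_iters=3):
    """Capture the full forward pass as one CUDA Graph; drive it by copying fresh
    observations into fixed-address input buffers and replaying the graph."""
    static_in = [t.clone() for t in example_inputs]  # CUDA tensors, fixed addresses

    # Warm up on a side stream so lazy initialization doesn't pollute the capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(warmup_iters):
            model(*static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):                    # record all kernels once
        static_out = model(*static_in)

    def run(*inputs):
        for dst, src in zip(static_in, inputs):
            dst.copy_(src)                           # refresh inputs in place
        graph.replay()                               # one launch per control step
        return static_out                            # overwritten each replay; clone if kept

    return run
```

Replaying a recorded graph removes per-kernel launch overhead, which is what dominates latency for the many small kernels of a multi-stream model.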
Empirical Validation / Results
Simulation Benchmarks
RLDX-1 was evaluated on a diverse suite of benchmarks (Fig. 11) and compared against frontier VLAs (π0-FAST, π0, π0.5, GR00T N1.5, GR00T N1.6).
Table 1: Results on Simulation Benchmarks
(a) Classical Simulation Benchmarks
| Method | LIBERO Short | LIBERO Long | LIBERO Avg. | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX |
|---|---|---|---|---|---|---|---|
| π0-FAST | 93.9 | 60.2 | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 |
| π0 | 97.1 | 85.2 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 |
| π0.5 | 98.0 | 92.0 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 |
| GR00T N1.5 | 90.0 | 76.0 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 |
| GR00T N1.6 | 97.4 | 94.4 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 |
| RLDX-1 (Ours) | 98.6 | 95.3 | 97.8 | 86.7 | 81.5 | 77.4 | 71.9 |
(b) Challenging Simulation Benchmarks
| Method | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 Atomic-S | RoboCasa365 Comp.-S | RoboCasa365 Comp.-U | RoboCasa365 Avg. |
|---|---|---|---|---|---|---|
| π0-FAST | 63.6 | - | 51.7 | 8.0 | 1.8 | 21.7 |
| π0 | 62.5 | 13.6 | 34.6 | 6.1 | 1.1 | 14.8 |
| π0.5 | 62.1 | 15.4 | 39.6 | 7.1 | 1.2 | 16.9 |
| GR00T N1.5 | 65.7 | 48.0 | 43.0 | 9.6 | 4.4 | 20.0 |
| GR00T N1.6 | 66.2 | 47.6 | 61.1 | 12.6 | 2.6 | 26.9 |
| RLDX-1 (Ours) | 70.6 | 58.7 | 67.3 | 19.0 | 5.6 | 32.1 |
Key Findings: RLDX-1 consistently outperforms baselines across all benchmarks, demonstrating superior versatility and robustness. The advantage is particularly pronounced on challenging benchmarks (GR-1 Tabletop, RoboCasa365) and under robustness shifts (LIBERO-Plus, SIMPLER Google-VA).
Real-World Experiments
OpenArm Humanoid Benchmark (Versatile Intelligence): Evaluated on tasks requiring basic manipulation, instruction following, and generalization (Fig. 13).
Figure 14: OpenArm Humanoid Benchmark Results
RLDX-1 substantially outperforms baselines across all tasks, showing strong generalization. For example:
- Unseen Object: RLDX-1 achieves 54.2% vs. π0.5's 37.5%.
- Object Grounding: RLDX-1 achieves 87.5% vs. GR00T N1.6's 33.3%.
ALLEX Humanoid Benchmark (Functional Capabilities): Evaluated on tasks requiring motion awareness, long-term memory, and physical sensing (Fig. 15).
Figure 16: ALLEX Humanoid Benchmark Results
| Task Category | π0.5 (%) | GR00T N1.6 (%) | RLDX-1 (Ours) (%) |
|---|---|---|---|
| Conveyor Pick-and-Place (Motion) | 29.2 | 33.3 | 87.5 |
| Object-in-Box Selection (Memory) | 38.5 | 29.2 | 91.7 |
| Card Slide-and-Pick (Physical) | 55.3 | 62.3 | 97.2 |
| Pot-to-Cup Pouring (Physical) | 39.1 | 44.8 | 70.8 |
| Average | 40.5 | 42.4 | 86.8 |
RLDX-1 achieves dramatically higher success rates, demonstrating the effectiveness of its integrated functional capabilities. Baselines struggle with dynamic environments, memory-dependent choices, and contact-rich tasks.
Franka Research 3 Benchmark (Functional Capabilities): Evaluated on similar capability-specific tasks (Fig. 17).
Figure 18: Franka Research 3 Benchmark Results
RLDX-1 again substantially outperforms baselines. For example:
- Spin Tracking (Motion): 97.9% vs. ~30% for baselines.
- Shell Game (Memory): 91.7% vs. ~50% for baselines.
- Plug Insertion (Physical): 33.3% vs. ~20% for baselines.
Ablation and Analysis
VLM Design Choices (Table 2):
- Layer Selection: Extracting cognition features from intermediate layer 18 yields the best performance (60.9% on RoboCasa Kitchen); features from earlier (layer 8) or later (layer 28) layers degrade it.
- Robot-Specific VQA Training: Improves success rate from 57.5% to 60.9%. Attention maps show increased focus on robot embodiment and target objects after training.
Effect of Synthetic Data (Table 3): Pre-training with synthetic GR-1 humanoid data consistently improves downstream performance on the GR-1 Tabletop benchmark:
| Pre-training Data (Real + Synthetic %) | Success Rate (%) |
|---|---|
| Real only (0%) | 41.0 |
| Real + 25% Synthetic | 45.6 |
| Real + 50% Synthetic | 46.6 |
| Real + 100% Synthetic | 50.1 |
Effect of RL Application (Figure 21): On the challenging Light Bulb Twisting task, RECAP-based RL refinement significantly improves over Behavior Cloning (BC):
- Episode Length: RECAP₃ completes in ~353 frames vs. BC's ~1056 frames.
- Attempts: RECAP₃ uses ~4.1 attempts vs. BC's ~12.7.
RECAP even surpasses human teleoperation performance, demonstrating improved speed and robustness.
Inference Optimization (Table 4):
| Inference Stack | w/o physics & memory (ms) | All-modality (ms) | Speedup vs. Eager (all-modality) |
|---|---|---|---|
| PyTorch Eager | 67.0 | 71.2 | 1.00× |
| CUDA Graph + torch.compile | 56.9 | 59.6 | ~1.19× |
| + Static Graph Conversion | 46.2 | 48.9 | ~1.46× |
| + Kernel Optimization | 41.6 | 43.7 | ~1.63× |
The two-stage optimization achieves a >1.6× speedup, reducing latency to ~43.7 ms, enabling real-time control.
Theoretical and Practical Implications
Theoretical Implications:
- Beyond Versatile Intelligence: The work argues that for real-world dexterous manipulation, VLAs must integrate explicit functional capabilities (motion, memory, physical sensing) alongside general scene understanding. RLDX-1 provides a unified architectural framework (MSAT) to achieve this.
- Architectural Integration: MSAT demonstrates how heterogeneous modalities can be effectively integrated via modality-specific streams with joint self-attention, preserving modality-specific representations while enabling cross-modal interaction for action generation.
- Data Scaling: The synthetic data pipeline shows that generative models can effectively augment scarce robot data, particularly for specialized embodiments and tasks, by combining scene/task augmentation with rigorous filtering (quality and motion-consistency).
Practical Implications:
- Real-World Deployment: The inference optimizations (graph capture, kernel fusion) make high-DoF, multi-modal VLAs practical for real-time control, addressing a key bottleneck for deployment.
- Training Efficiency: The three-stage training pipeline (pre-training → mid-training → post-training) provides a structured approach to build generalist policies that can be efficiently specialized for specific embodiments and tasks.
- Performance Gains: The significant performance improvements over state-of-the-art VLAs, especially on functional-capability-specific tasks (e.g., >90% success on ALLEX memory tasks vs. ~30% for baselines), demonstrate the practical value of the integrated design for complex, contact-rich, and dynamic manipulation.
Conclusion
RLDX-1 represents a significant step toward general-purpose robot policies capable of human-like dexterous manipulation in complex real-world environments. By unifying motion awareness, long-term memory, and physical sensing in a single VLA architecture, and combining it with scalable synthetic data generation, a specialized three-stage training pipeline, and real-time inference optimization, RLDX-1 delivers state-of-the-art performance across simulation benchmarks and real-world tasks.