# RLDX-1 Technical Report

> RLDX-1 introduces a unified architecture with motion awareness, long-term memory, and physical sensing, achieving state-of-the-art performance in dexterous manipulation.

- **Source:** [arXiv](https://arxiv.org/abs/2605.03269)
- **Published:** 2026-05-08
- **Permalink:** https://picx.dev/p/qn6FuE
- **Whiteboard:** https://picx.dev/p/qn6FuE/image

# RLDX-1 Technical Report: A Vision-Language-Action Model for Dexterous Manipulation

## Summary (Overview)
* **Unified Architecture:** RLDX-1 introduces the Multi-Stream Action Transformer (MSAT), a novel architecture that integrates motion awareness, long-term memory, and physical sensing into a single Vision-Language-Action (VLA) model for dexterous manipulation.
* **Three-Stage Training:** The model is trained through a progressive pipeline: pre-training on diverse multi-embodiment data, mid-training for embodiment-specific functional capabilities, and post-training for task adaptation, optionally enhanced with reinforcement learning.
* **Synthetic Data Pipeline:** A novel framework generates and filters synthetic robot data to augment rare manipulation scenarios, improving scene and task diversity and enhancing downstream policy performance.
* **Inference Optimization:** A two-stage optimization (static graph conversion and custom kernel fusion) achieves a >1.6× speedup, reducing per-step latency to ~43.7 ms, enabling real-time deployment.
* **State-of-the-Art Performance:** RLDX-1 consistently outperforms frontier VLAs (π0.5, GR00T N1.6) across simulation benchmarks and real-world tasks, particularly excelling in tasks requiring motion awareness, long-term memory, and physical sensing.

## Introduction and Theoretical Foundation
Building generalist robot policies capable of human-like dexterous manipulation in real-world environments remains a central challenge in robotics. Vision-Language-Action models (VLAs) have made progress by inheriting versatile intelligence (broad scene understanding and language-conditioned generalization) from pre-trained Vision-Language Models (VLMs). However, **versatility alone is insufficient** for many complex real-world tasks, which also demand **functional capabilities**:
* **Motion Awareness:** To operate in dynamic environments (e.g., interacting with moving objects).
* **Long-Term Memory:** For sequential and long-horizon tasks requiring reasoning over past interactions.
* **Physical Sensing:** To infer contact forces under occlusion or subtle visual changes (e.g., grasping deformable objects).

RLDX-1 is designed to address this gap. It is a general-purpose robotic policy built on a unified neural architecture (MSAT) that integrates these heterogeneous modalities. The system combines this architecture with key design choices: a synthetic data generation pipeline for rare scenarios, a specialized three-stage training procedure, and inference optimizations for real-time control. The theoretical foundation rests on extending the capabilities of VLAs beyond static scene understanding to handle the temporal, memory-dependent, and contact-rich nature of real-world manipulation.

## Methodology

### 1. Neural Architecture
RLDX-1 consists of two main components: a temporally-aware Vision-Language Model (VLM) and a multimodal action model.

**Vision-Language Model (RLDX-1-VLM):**
* **Base Model:** Built upon Qwen3-VL 8B, fine-tuned on a robot-specific Visual Question Answering (VQA) dataset to improve embodied grounding (spatial relationships, subtask inference, low-level action alignment).
* **Cognition Tokens:** Learnable query tokens $q$ are appended to the input sequence $x = [v_t, l_t, q]$ to extract action-relevant cognition features $h_t$ from the VLM's intermediate layers.
* **Motion Awareness:** Integrated via a motion module that captures temporal dynamics. Given video features $v_t^{(i)}$ from layer $i$, the module computes a space-time self-similarity tensor $S_t$, maps it through a learned function $S_\theta$, and adds the result back to the features as a residual:
$$\tilde{v}_t^{(i)} = v_t^{(i)} + S_\theta(S_t)$$
Multi-frame observations are compressed into a single context token via average pooling after early LLM layers for efficiency.
* **Long-Term Memory:** An explicit memory module maintains a queue $Q_t = [h_{t - n_{mem}H}, \cdots, h_{t-H}]$ of past cognition features. A Transformer $M_\theta$ processes $[Q_t, h_t]$ to produce memory features $m_t$.
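
A minimal sketch of how the memory queue $Q_t$ could be maintained, assuming one cognition feature is pushed per control step and the queue is read back at stride $H$. The class `CognitionMemory` and its methods are hypothetical names, not taken from the report, and the Transformer $M_\theta$ that consumes $[Q_t, h_t]$ is omitted.

```python
from collections import deque

import numpy as np


class CognitionMemory:
    """Hypothetical buffer that returns Q_t = [h_{t-n_mem*H}, ..., h_{t-H}]."""

    def __init__(self, n_mem: int, horizon: int):
        self.n_mem = n_mem
        self.horizon = horizon
        # Enough history to reach back n_mem * H steps; older entries fall off.
        self.buffer = deque(maxlen=n_mem * horizon)

    def push(self, h_t: np.ndarray) -> None:
        """Store the cognition feature of the step that just finished."""
        self.buffer.append(h_t)

    def queue(self) -> list:
        """Return stored features at lags n_mem*H, ..., 2H, H (oldest first)."""
        length = len(self.buffer)
        available = min(self.n_mem, length // self.horizon)
        return [self.buffer[length - k * self.horizon]
                for k in range(available, 0, -1)]


memory = CognitionMemory(n_mem=3, horizon=2)
for step in range(10):                      # push h_0 ... h_9, so the current step is t = 10
    memory.push(np.full(4, float(step)))
lags = [int(h[0]) for h in memory.queue()]  # time indices of the entries in Q_10
```

Early in an episode the queue is simply shorter than $n_{mem}$; how the real model pads or masks missing entries is not stated in this summary.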

**Action Model (Multi-Stream Action Transformer - MSAT):**
* **Formulation:** A flow-matching Diffusion Transformer (DiT) that generates a chunk of $H+1$ future actions $a_{t:t+H}$. Given a noisy action chunk $a_{t:t+H}^\tau = \tau a_{t:t+H} + (1-\tau)\epsilon$ (pure noise $\epsilon$ at $\tau = 0$, clean actions at $\tau = 1$) and conditioning inputs $c_t = [h_t, m_t, s_t, p_t]$, it learns a velocity field $u_\theta$ via the flow-matching objective:
$$L(\theta; t, \tau, \epsilon) = \| u_\theta(a_{t:t+H}^\tau, \tau, c_t) - (a_{t:t+H} - \epsilon) \|_2^2$$
During inference, actions are generated over $T$ denoising steps using Euler's method:
$$a_{t:t+H}^{\tau_{i+1}} = a_{t:t+H}^{\tau_i} + (\tau_{i+1} - \tau_i) u_\theta(a_{t:t+H}^{\tau_i}, \tau_i, c_t), \quad i=1,\ldots,T-1$$
* **MSAT Design:** Extends the Multi-Modal Diffusion Transformer (MM-DiT) to action modeling. It processes heterogeneous modalities through dedicated streams (Cognition $C$, Action $A$, Physics $P$) coupled via **joint self-attention**. Streams apply their own normalization and QKV projections; outputs are concatenated for joint attention then split back.
* **Physics Stream:** Handles physical signals $p_t$ (tactile, torque) with an auxiliary objective to predict future signals $p_{t+1:t+L}$, encouraging the model to internalize physical interaction dynamics.
* **Design Choices:** Uses Rotary Positional Embeddings (RoPE) on the Action stream, injects the flow-matching timestep $\tau$ as an in-context token, and employs RMSNorm and SwiGLU activation.
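
The flow-matching objective and the Euler update above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the report's implementation: the grid is uniform with $\tau_1 = 0$ and $\tau_T = 1$, and `oracle_velocity` is a hypothetical stand-in for $u_\theta$ that receives the target chunk through the `cond` argument purely so the example can be checked.

```python
import numpy as np


def noisy_chunk(actions, eps, tau):
    """Interpolate a^tau = tau * a + (1 - tau) * eps."""
    return tau * actions + (1.0 - tau) * eps


def flow_matching_loss(velocity_fn, actions, eps, tau, cond):
    """Squared error between the predicted velocity and the target a - eps."""
    pred = velocity_fn(noisy_chunk(actions, eps, tau), tau, cond)
    return float(np.sum((pred - (actions - eps)) ** 2))


def euler_sample(velocity_fn, eps, cond, num_points):
    """Integrate from tau = 0 (noise) to tau = 1 (actions) on a uniform grid."""
    taus = np.linspace(0.0, 1.0, num_points)   # tau_1 = 0, ..., tau_T = 1
    x = eps.copy()
    for tau_i, tau_next in zip(taus[:-1], taus[1:]):
        x = x + (tau_next - tau_i) * velocity_fn(x, tau_i, cond)
    return x


# Stand-in for u_theta: the exact velocity toward a known target chunk.
target = np.arange(12, dtype=float).reshape(4, 3)   # H + 1 = 4 steps, 3 action dims


def oracle_velocity(x, tau, cond):
    return (cond - x) / (1.0 - tau)


rng = np.random.default_rng(0)
eps = rng.standard_normal(target.shape)
loss = flow_matching_loss(oracle_velocity, target, eps, tau=0.3, cond=target)
sample = euler_sample(oracle_velocity, eps, cond=target, num_points=5)
```

With the oracle velocity the loss vanishes and the sampler lands on the target chunk; a trained $u_\theta$ would only approximate this behaviour.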

### 2. Training Data
RLDX-1 uses three complementary data sources:
1. **Public Real-World Data:** Curated from datasets like Open-X-Embodiment (OXE), DROID, Galaxea Open-World, Agibot World, Fourier ActionNet, and Humanoid Everyday (~1.5M episodes).
2. **In-house Real-World Data:** Collected on the ALLEX humanoid (48-DoF) and a sensor-augmented Franka Research 3 (FR3) platform with tactile (AnySkin) and torque sensors.
3. **Synthetic Data:** Generated via a pipeline (Fig. 4) to augment rare scenarios:
   * **Generation:** Source demonstrations are diversified via **scene augmentation** (I2I editing of initial frames) and **task augmentation** (VLM-generated instructions). Image-to-video (I2V) models generate videos, which are annotated with actions via an Inverse Dynamics Model (IDM).
   * **Filtering:** Two-stage filtering improves quality:
     * **Video Quality Filtering:** A VLM evaluates instruction following and trajectory plausibility.
     * **Motion-Consistency Filtering:** IDM-predicted actions are replayed in a simulator; a consistency classifier compares the rollout video to the synthetic video, retaining only high-scoring samples.
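
The two filtering stages can be pictured as successive threshold checks. The sketch below assumes each synthetic clip already carries two scores in $[0, 1]$; the field names, the thresholds, and the function `two_stage_filter` are hypothetical and only illustrate the control flow.

```python
def two_stage_filter(samples, quality_min=0.5, consistency_min=0.5):
    """Keep samples that pass the video-quality check, then the motion-consistency check."""
    # Stage 1: VLM judgement of instruction following and trajectory plausibility.
    stage_one = [s for s in samples if s["vlm_quality"] >= quality_min]
    # Stage 2: agreement between the simulator rollout and the generated video.
    return [s for s in stage_one if s["motion_consistency"] >= consistency_min]


candidates = [
    {"name": "clip_a", "vlm_quality": 0.9, "motion_consistency": 0.8},
    {"name": "clip_b", "vlm_quality": 0.3, "motion_consistency": 0.9},  # fails stage one
    {"name": "clip_c", "vlm_quality": 0.8, "motion_consistency": 0.2},  # fails stage two
]
kept = [s["name"] for s in two_stage_filter(candidates)]
```

Running the cheaper check first means the simulator replay is only needed for clips that already look plausible, although the report's actual ordering rationale is not given here.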

### 3. Training Procedure
A three-stage pipeline progressively specializes the policy.

**Pre-Training:** Trained on a large-scale multi-embodiment dataset (Fig. 6) for 100K steps to learn general action-prediction capabilities. Uses embodiment-specific projection layers and an embodiment-agnostic layer for generalization.

**Mid-Training:** Adapts the pre-trained model to target platforms (ALLEX, FR3) and injects functional capabilities (Fig. 7). Combines in-house data with synthetic data. Integrates the motion module, memory module ($n_{mem}=3$), and physics stream. Training: 25K steps with modality dropout and an alignment warmup.
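
Modality dropout is commonly implemented by masking optional inputs at random during training so the policy still works when a sensor stream is absent. The sketch below is one such scheme and rests on assumptions: the summary does not say which streams are dropped, with what probability, or whether they are zeroed or replaced by a learned token. The function name and stream names are hypothetical.

```python
import numpy as np


def modality_dropout(cond, drop_prob, rng, optional=("memory", "physics")):
    """Zero each optional conditioning stream independently with probability drop_prob.

    Returns the (possibly masked) streams and a keep flag per stream.
    """
    out, kept = {}, {}
    for name, feat in cond.items():
        drop = name in optional and rng.random() < drop_prob
        out[name] = np.zeros_like(feat) if drop else feat
        kept[name] = not drop
    return out, kept


rng = np.random.default_rng(0)
cond = {
    "cognition": np.ones((2, 4)),
    "memory": np.ones((3, 4)),
    "physics": np.ones((1, 4)),
}
masked, flags = modality_dropout(cond, drop_prob=1.0, rng=rng)    # always drop optional streams
untouched, all_kept = modality_dropout(cond, drop_prob=0.0, rng=rng)  # never drop
```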

**Post-Training:** Specializes the model for downstream tasks via:
* **Adaptive Data Collection:** Iterative protocol starting with a base dataset (balanced consistency/variance) and refining based on observed failure modes.
* **Reinforcement Learning (RECAP):** Uses a **text-based VLM critic** that predicts values autoregressively using the VLM's native number tokens, enabling reliable value estimation from limited data.
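
To make the idea of a text-based critic concrete, the sketch below encodes a scalar value as a short digit string and decodes it again, which is the kind of target a VLM could emit with its ordinary number tokens. The encoding (values clipped to $[0, 1]$, two digits) and both function names are assumptions for illustration; the summary does not specify the value format used by the critic.

```python
def value_to_text(value: float, digits: int = 2) -> str:
    """Render a value in [0, 1] as a fixed-width digit string."""
    scale = 10 ** digits - 1
    clipped = min(max(value, 0.0), 1.0)
    return f"{round(clipped * scale):0{digits}d}"


def text_to_value(text: str) -> float:
    """Invert value_to_text for a digit string of any fixed width."""
    scale = 10 ** len(text) - 1
    return int(text) / scale


encoded = value_to_text(0.734)
decoded = text_to_value(encoded)
```

The round trip loses at most half a quantization step, so the number of digits trades resolution against the length of the sequence the critic has to generate.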

### 4. Inference Strategy
Optimizations reduce per-step latency for real-time control.

**Graph Capture Optimization:** Converts the model into a **static graph** by precomputing constant tensors (RoPE embeddings, attention masks) and capturing the entire forward pass as a single CUDA Graph. Replaying one graph per step removes the per-kernel launch overhead and avoids graph fragmentation.
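
As an illustration of the "precompute constant tensors" step, the sketch below builds RoPE cosine and sine tables once and reuses them on every call. It assumes the common RoPE convention (base 10000, interleaved feature pairs) and uses NumPy on the CPU; the report's layout, dimensions, and GPU capture code are not given in this summary, and the function names are hypothetical.

```python
import numpy as np


def build_rope_tables(seq_len, head_dim, base=10000.0):
    """Precompute cos/sin tables once so they can be reused as constants."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim / 2,)
    angles = np.outer(np.arange(seq_len), inv_freq)               # (seq_len, head_dim / 2)
    return np.cos(angles), np.sin(angles)


def apply_rope(x, cos, sin):
    """Rotate consecutive feature pairs of x (seq_len, head_dim) by position-dependent angles."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


COS, SIN = build_rope_tables(seq_len=16, head_dim=8)   # built once, outside the control loop
x = np.random.default_rng(0).standard_normal((16, 8))
rotated = apply_rope(x, COS, SIN)
```

Because the tables depend only on sequence length and head dimension, nothing about them changes between control steps, which is what makes them safe to bake into a captured graph.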

**Kernel Optimization:** Designs **custom fused kernels** for critical operator groups (e.g., RMSNorm + RoPE + Attention) that keep intermediate results inside a single kernel and so reduce memory traffic, going beyond the fixed fusion patterns available through `torch.compile`.

## Empirical Validation / Results

### Simulation Benchmarks
RLDX-1 was evaluated on a diverse suite of benchmarks (Fig. 11) and compared against frontier VLAs (π0-FAST, π0, π0.5, GR00T N1.5, GR00T N1.6).

**Table 1: Results on Simulation Benchmarks**

**(a) Classical Simulation Benchmarks**
| Method | LIBERO Short | LIBERO Long | LIBERO Avg. | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX |
|---------|---------------|--------------|--------------|--------------|---------------------|---------------------|------------------|
| π0-FAST | 93.9 | 60.2 | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 |
| π0 | 97.1 | 85.2 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 |
| π0.5 | 98.0 | 92.0 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 |
| GR00T N1.5 | 90.0 | 76.0 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 |
| GR00T N1.6 | 97.4 | 94.4 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 |
| **RLDX-1 (Ours)** | **98.6** | **95.3** | **97.8** | **86.7** | **81.5** | **77.4** | **71.9** |

**(b) Challenging Simulation Benchmarks**
| Method | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 Atomic-S | RoboCasa365 Comp.-S | RoboCasa365 Comp.-U | RoboCasa365 Avg. |
|---------|-------------------|---------------|------------------------|------------------------|------------------------|-------------------|
| π0-FAST | 63.6 | - | 51.7 | 8.0 | 1.8 | 21.7 |
| π0 | 62.5 | 13.6 | 34.6 | 6.1 | 1.1 | 14.8 |
| π0.5 | 62.1 | 15.4 | 39.6 | 7.1 | 1.2 | 16.9 |
| GR00T N1.5 | 65.7 | 48.0 | 43.0 | 9.6 | 4.4 | 20.0 |
| GR00T N1.6 | 66.2 | 47.6 | 61.1 | 12.6 | 2.6 | 26.9 |
| **RLDX-1 (Ours)** | **70.6** | **58.7** | **67.3** | **19.0** | **5.6** | **32.1** |

**Key Findings:** RLDX-1 scores highest in every column of Table 1. The margin over the best baseline is largest on GR-1 Tabletop (+10.7 points), SIMPLER WidowX (+9.9), and SIMPLER Google-VA (+9.0), and it is also clear on RoboCasa365 (+5.2 on average). On LIBERO-Plus the lead over π0.5 is narrow (86.7 vs. 86.5).
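
The gaps to the strongest baseline can be checked directly from Table 1. The snippet below copies a few columns from the table and takes the difference to the highest baseline score; the variable names are illustrative only.

```python
# Baseline scores copied from Table 1, in the order the rows appear
# (the GR-1 Tabletop column has no pi0-FAST entry).
baselines = {
    "GR-1 Tabletop": [13.6, 15.4, 48.0, 47.6],
    "SIMPLER WidowX": [48.3, 27.1, 46.9, 62.0, 57.1],
    "SIMPLER Google-VA": [59.0, 54.8, 68.4, 43.7, 57.1],
    "LIBERO-Plus": [64.2, 54.6, 86.5, 66.3, 72.6],
}
rldx1 = {
    "GR-1 Tabletop": 58.7,
    "SIMPLER WidowX": 71.9,
    "SIMPLER Google-VA": 77.4,
    "LIBERO-Plus": 86.7,
}
# Margin in percentage points over the best baseline in each column.
margins = {name: round(rldx1[name] - max(scores), 1)
           for name, scores in baselines.items()}
```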

### Real-World Experiments

**OpenArm Humanoid Benchmark (Versatile Intelligence):** Evaluated on tasks requiring basic manipulation, instruction following, and generalization (Fig. 13).

**Figure 14: OpenArm Humanoid Benchmark Results**
RLDX-1 substantially outperforms baselines across all tasks, showing strong generalization. For example:
* **Unseen Object:** RLDX-1 achieves 54.2% vs. π0.5's 37.5%.
* **Object Grounding:** RLDX-1 achieves 87.5% vs. GR00T N1.6's 33.3%.

**ALLEX Humanoid Benchmark (Functional Capabilities):** Evaluated on tasks requiring motion awareness, long-term memory, and physical sensing (Fig. 15).

**Figure 16: ALLEX Humanoid Benchmark Results**
| Task Category | π0.5 (%) | GR00T N1.6 (%) | RLDX-1 (Ours) (%) |
|---------------|----------|----------------|-------------------|
| Conveyor Pick-and-Place (Motion) | 29.2 | 33.3 | **87.5** |
| Object-in-Box Selection (Memory) | 38.5 | 29.2 | **91.7** |
| Card Slide-and-Pick (Physical) | 55.3 | 62.3 | **97.2** |
| Pot-to-Cup Pouring (Physical) | 39.1 | 44.8 | **70.8** |
| **Average** | 39.1 | 44.8 | **86.8** |

RLDX-1 averages 86.8% success, roughly double the reported baseline averages (39.1% for π0.5 and 44.8% for GR00T N1.6), which supports the value of its integrated functional capabilities. The baselines struggle with dynamic environments, memory-dependent choices, and contact-rich tasks.

**Franka Research 3 Benchmark (Functional Capabilities):** Evaluated on similar capability-specific tasks (Fig. 17).

**Figure 18: Franka Research 3 Benchmark Results**
RLDX-1 again substantially outperforms baselines. For example:
* **Spin Tracking (Motion):** 97.9% vs. ~30% for baselines.
* **Shell Game (Memory):** 91.7% vs. ~50% for baselines.
* **Plug Insertion (Physical):** 33.3% vs. ~20% for baselines.

### Ablation and Analysis

**VLM Design Choices (Table 2):**
* **Layer Selection:** Using intermediate layer (18) features yields the best performance (60.9% on RoboCasa Kitchen). Earlier (8) or later (28) layers degrade performance.
* **Robot-Specific VQA Training:** Improves success rate from 57.5% to 60.9%. Attention maps show increased focus on robot embodiment and target objects after training.

**Effect of Synthetic Data (Table 3):**
Pre-training with synthetic GR-1 humanoid data consistently improves downstream performance on the GR-1 Tabletop benchmark:
| Pre-training Data (Real + Synthetic %) | Success Rate (%) |
|---------------------------------------|------------------|
| Real only (0%) | 41.0 |
| Real + 25% Synthetic | 45.6 |
| Real + 50% Synthetic | 46.6 |
| Real + 100% Synthetic | **50.1** |

**Effect of RL Application (Figure 21):**
On the challenging **Light Bulb Twisting** task, RECAP-based RL refinement significantly improves over Behavior Cloning (BC):
* **Episode Length:** RECAP₃ completes in ~353 frames vs. BC's ~1056 frames.
* **Attempts:** RECAP₃ uses ~4.1 attempts vs. BC's ~12.7 attempts.

RECAP even surpasses human teleoperation performance, demonstrating improved speed and robustness.
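
Taking the approximate figures above at face value, both metrics improve by about a factor of three relative to BC:

$$\frac{1056}{353} \approx 2.99, \qquad \frac{12.7}{4.1} \approx 3.10$$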

**Inference Optimization (Table 4):**
| Inference Stack | w/o physics & memory (ms) | All-modality (ms) | Speedup vs. Eager (all-modality) |
|-----------------|---------------------------|-------------------|--------------------|
| PyTorch Eager | 67.0 | 71.2 | 1.00× |
| CUDA Graph + `torch.compile` | 56.9 | 59.6 | ~1.19× |
| + Static Graph Conversion | 46.2 | 48.9 | ~1.46× |
| + Kernel Optimization | **41.6** | **43.7** | **~1.63×** |

The two-stage optimization achieves a >1.6× speedup, reducing latency to ~43.7 ms, enabling real-time control.
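
The speedup column follows from the all-modality latencies in Table 4. The snippet below recomputes it and also converts the final latency into an upper bound on policy calls per second, assuming one forward pass per call and no other overhead; that bound is an illustration, not a figure from the report.

```python
# All-modality latencies in milliseconds, copied from Table 4.
latency_ms = {
    "PyTorch Eager": 71.2,
    "CUDA Graph + torch.compile": 59.6,
    "+ Static Graph Conversion": 48.9,
    "+ Kernel Optimization": 43.7,
}
eager = latency_ms["PyTorch Eager"]
speedup = {name: eager / ms for name, ms in latency_ms.items()}

# Upper bound on policy calls per second if inference were the only cost.
max_rate_hz = 1000.0 / latency_ms["+ Kernel Optimization"]
```

Since each call produces a chunk of $H+1$ actions, the rate at which actions are executed can exceed the rate at which the policy is called.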

## Theoretical and Practical Implications

**Theoretical Implications:**
* **Beyond Versatile Intelligence:** The work argues that for real-world dexterous manipulation, VLAs must integrate explicit functional capabilities (motion, memory, physical sensing) alongside general scene understanding. RLDX-1 provides a unified architectural framework (MSAT) to achieve this.
* **Architectural Integration:** MSAT demonstrates how heterogeneous modalities can be effectively integrated via modality-specific streams with joint self-attention, preserving modality-specific representations while enabling cross-modal interaction for action generation.
* **Data Scaling:** The synthetic data pipeline shows that generative models can effectively augment scarce robot data, particularly for specialized embodiments and tasks, by combining scene/task augmentation with rigorous filtering (quality and motion-consistency).
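
The stream-wise design described under Architectural Integration can be sketched as a single-head NumPy example: each stream normalizes and projects its own tokens, the projections are concatenated for one joint attention, and the result is split back per stream. This is a simplified illustration under assumptions (one head, no learned norm gains, no output projection or feed-forward block, random weights); the function names are hypothetical and the real MSAT block is not specified at this level in the summary.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


def joint_attention(streams, weights):
    """streams: name -> (tokens, dim); weights: name -> {'q', 'k', 'v'} of shape (dim, dim)."""
    names = list(streams)
    # Each stream normalizes and projects with its own parameters.
    q = np.concatenate([rms_norm(streams[n]) @ weights[n]["q"] for n in names])
    k = np.concatenate([rms_norm(streams[n]) @ weights[n]["k"] for n in names])
    v = np.concatenate([rms_norm(streams[n]) @ weights[n]["v"] for n in names])
    # One attention over the concatenated sequence lets every token see every stream.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = attn @ v
    # Split the result back into per-stream chunks.
    sizes = np.cumsum([streams[n].shape[0] for n in names])[:-1]
    return dict(zip(names, np.split(mixed, sizes)))


rng = np.random.default_rng(0)
dim = 8
streams = {
    "cognition": rng.standard_normal((5, dim)),
    "action": rng.standard_normal((4, dim)),
    "physics": rng.standard_normal((3, dim)),
}
weights = {n: {p: rng.standard_normal((dim, dim)) / np.sqrt(dim) for p in "qkv"}
           for n in streams}
out = joint_attention(streams, weights)
```

Changing only the physics tokens changes the action-stream output, which is the cross-modal interaction the joint attention is meant to provide while the projections stay stream-specific.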

**Practical Implications:**
* **Real-World Deployment:** The inference optimizations (graph capture, kernel fusion) make high-DoF, multi-modal VLAs practical for real-time control, addressing a key bottleneck for deployment.
* **Training Efficiency:** The three-stage training pipeline (pre-training → mid-training → post-training) provides a structured approach to build generalist policies that can be efficiently specialized for specific embodiments and tasks.
* **Performance Gains:** The large improvements over state-of-the-art VLAs, especially on functional-capability-specific tasks (e.g., 91.7% success on the ALLEX memory task vs. 38.5% for π0.5 and 29.2% for GR00T N1.6), demonstrate the practical value of the integrated design for complex, contact-rich, and dynamic manipulation.

## Conclusion
RLDX-1 represents a significant step toward general-purpose robot policies capable of human-like dexterous manipulation in complex real-world environments. By unifying motion awareness, long-term memory, and physical sensing into a single VLA architecture, and combining it with scalable synthetic data generation, specialized training, and real-time inference optimization, RLDX-1 outperforms frontier VLAs on both simulation benchmarks and real-world tasks.

---

_Markdown view of https://picx.dev/p/qn6FuE, served by PicX — AI-generated visual whiteboard summaries of research papers._