RLDX-1 Technical Report: A Vision-Language-Action Model for Dexterous Manipulation

Summary (Overview)

  • Unified Architecture: RLDX-1 introduces the Multi-Stream Action Transformer (MSAT), a novel architecture that integrates motion awareness, long-term memory, and physical sensing into a single Vision-Language-Action (VLA) model for dexterous manipulation.
  • Three-Stage Training: The model is trained through a progressive pipeline: pre-training on diverse multi-embodiment data, mid-training for embodiment-specific functional capabilities, and post-training for task adaptation, optionally enhanced with reinforcement learning.
  • Synthetic Data Pipeline: A novel framework generates and filters synthetic robot data to augment rare manipulation scenarios, improving scene and task diversity and enhancing downstream policy performance.
  • Inference Optimization: A two-stage optimization (static graph conversion and custom kernel fusion) achieves a >1.6× speedup, reducing per-step latency to ~43.7 ms, enabling real-time deployment.
  • State-of-the-Art Performance: RLDX-1 consistently outperforms frontier VLAs (π0.5, GR00T N1.6) across simulation benchmarks and real-world tasks, particularly excelling in tasks requiring motion awareness, long-term memory, and physical sensing.

Introduction and Theoretical Foundation

The goal of creating generalist robot policies capable of human-like dexterous manipulation in real-world environments remains a central challenge in robotics. Vision-Language-Action models (VLAs) have shown progress by leveraging the versatile intelligence (broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models (VLMs). However, versatility alone is insufficient for many complex real-world tasks, which demand broader functional capabilities:

  • Motion Awareness: To operate in dynamic environments (e.g., interacting with moving objects).
  • Long-Term Memory: For sequential and long-horizon tasks requiring reasoning over past interactions.
  • Physical Sensing: To infer contact forces under occlusion or subtle visual changes (e.g., grasping deformable objects).

RLDX-1 is designed to address this gap. It is a general-purpose robotic policy built on a unified neural architecture (MSAT) that integrates these heterogeneous modalities. The system combines this architecture with key design choices: a synthetic data generation pipeline for rare scenarios, a specialized three-stage training procedure, and inference optimizations for real-time control. The theoretical foundation rests on extending the capabilities of VLAs beyond static scene understanding to handle the temporal, memory-dependent, and contact-rich nature of real-world manipulation.

Methodology

1. Neural Architecture

RLDX-1 consists of two main components: a temporally-aware Vision-Language Model (VLM) and a multimodal action model.

Vision-Language Model (RLDX-1-VLM):

  • Base Model: Built upon Qwen3-VL 8B, fine-tuned on a robot-specific Visual Question Answering (VQA) dataset to improve embodied grounding (spatial relationships, subtask inference, low-level action alignment).
  • Cognition Tokens: Learnable query tokens q are appended to the input sequence x = [v_t, l_t, q] to extract action-relevant cognition features h_t from the VLM's intermediate layers.
  • Motion Awareness: Integrated via a motion module that captures temporal dynamics. Given video features v_t^{(i)} from layer i, the module computes a space-time self-similarity tensor S_t and updates features as:

\tilde{v}_t^{(i)} = v_t^{(i)} + S_\theta(S_t)

Multi-frame observations are compressed into a single context token via average pooling after early LLM layers for efficiency.

  • Long-Term Memory: An explicit memory module maintains a queue Q_t = [h_{t - n_{mem}H}, \ldots, h_{t-H}] of past cognition features. A Transformer M_\theta processes [Q_t, h_t] to produce memory features m_t.
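The queue maintenance behind the memory module can be sketched as follows. This is a minimal illustration, not the report's implementation: `CognitionMemory` is a hypothetical helper name, and the mean pooling is only a stand-in for the actual memory Transformer M_\theta.

```python
from collections import deque
import numpy as np

class CognitionMemory:
    """Fixed-length queue of past cognition features (hypothetical sketch)."""

    def __init__(self, n_mem: int, feat_dim: int):
        self.queue = deque(maxlen=n_mem)  # holds h_{t-n_mem*H}, ..., h_{t-H}
        self.feat_dim = feat_dim

    def update(self, h_t: np.ndarray) -> np.ndarray:
        # Form the context [Q_t, h_t] from queued features plus the current one.
        context = list(self.queue) + [h_t]
        # Stand-in for the memory Transformer M_theta: mean over the context.
        m_t = np.mean(np.stack(context, axis=0), axis=0)
        self.queue.append(h_t)  # enqueue h_t for future steps
        return m_t

mem = CognitionMemory(n_mem=3, feat_dim=4)
for step in range(5):
    m_t = mem.update(np.full(4, float(step)))
# After 5 steps the queue retains only the 3 most recent cognition features.
```

The fixed-length `deque` mirrors the report's choice of a bounded history (n_mem = 3 in mid-training), keeping memory cost constant over long-horizon episodes.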

Action Model (Multi-Stream Action Transformer - MSAT):

  • Formulation: A flow-matching Diffusion Transformer (DiT) that generates a chunk of H+1 future actions a_{t:t+H}. Given a noisy action chunk a_{t:t+H}^\tau = \tau a_{t:t+H} + (1-\tau)\epsilon and conditioning inputs c_t = [h_t, m_t, s_t, p_t], it learns a velocity field u_\theta via the flow-matching objective:

L(\theta; t, \tau, \epsilon) = \| u_\theta(a_{t:t+H}^\tau, \tau, c_t) - (a_{t:t+H} - \epsilon) \|_2^2

During inference, actions are generated over T denoising steps using Euler's method:

a_{t:t+H}^{\tau_{i+1}} = a_{t:t+H}^{\tau_i} + (\tau_{i+1} - \tau_i)\, u_\theta(a_{t:t+H}^{\tau_i}, \tau_i, c_t), \quad i = 1, \ldots, T-1
  • MSAT Design: Extends the Multi-Modal Diffusion Transformer (MM-DiT) to action modeling. It processes heterogeneous modalities through dedicated streams (Cognition C, Action A, Physics P) coupled via joint self-attention. Each stream applies its own normalization and QKV projections; outputs are concatenated for joint attention and then split back.
  • Physics Stream: Handles physical signals p_t (tactile, torque) with an auxiliary objective to predict future signals p_{t+1:t+L}, encouraging the model to internalize physical interaction dynamics.
  • Design Choices: Uses Rotary Positional Embeddings (RoPE) on the Action stream, injects the flow-matching timestep \tau as an in-context token, and employs RMSNorm and SwiGLU activations.
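The flow-matching formulation above has a useful sanity check: because the interpolant a^\tau = \tau a + (1-\tau)\epsilon has constant velocity a - \epsilon, Euler integration of the *exact* velocity field from pure noise recovers the clean action chunk. The sketch below verifies this with an oracle in place of a trained u_\theta; the shapes (H = 7, 14-DoF actions, T = 10 steps) are illustrative, not the report's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 7          # chunk horizon: H + 1 = 8 future actions
act_dim = 14   # illustrative action dimension
T = 10         # number of denoising steps

a_target = rng.normal(size=(H + 1, act_dim))   # "clean" action chunk
eps = rng.normal(size=(H + 1, act_dim))        # Gaussian noise sample

# The interpolant a^tau = tau * a + (1 - tau) * eps has constant velocity
# a - eps, which is exactly the flow-matching regression target.
def u_oracle(a_tau, tau):
    return a_target - eps

taus = np.linspace(0.0, 1.0, T)
a = eps.copy()                                 # start from pure noise (tau = 0)
for i in range(T - 1):                         # Euler's method over T steps
    a = a + (taus[i + 1] - taus[i]) * u_oracle(a, taus[i])

# Integrating the exact (constant) velocity field recovers the target chunk.
assert np.allclose(a, a_target)
```

A trained u_\theta only approximates this field conditioned on c_t, so in practice the sampler trades the number of steps T against accuracy and latency.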

2. Training Data

RLDX-1 uses three complementary data sources:

  1. Public Real-World Data: Curated from datasets like Open-X-Embodiment (OXE), DROID, Galaxea Open-World, Agibot World, Fourier ActionNet, and Humanoid Everyday (~1.5M episodes).
  2. In-house Real-World Data: Collected on the ALLEX humanoid (48-DoF) and a sensor-augmented Franka Research 3 (FR3) platform with tactile (AnySkin) and torque sensors.
  3. Synthetic Data: Generated via a pipeline (Fig. 4) to augment rare scenarios:
    • Generation: Source demonstrations are diversified via scene augmentation (I2I editing of initial frames) and task augmentation (VLM-generated instructions). Image-to-video (I2V) models generate videos, which are annotated with actions via an Inverse Dynamics Model (IDM).
    • Filtering: Two-stage filtering improves quality:
      • Video Quality Filtering: A VLM evaluates instruction following and trajectory plausibility.
      • Motion-Consistency Filtering: IDM-predicted actions are replayed in a simulator; a consistency classifier compares the rollout video to the synthetic video, retaining only high-scoring samples.
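The two-stage filter above can be sketched as a simple cascade. Everything here is hypothetical scaffolding: `vlm_score` and `consistency_score` stand in for the VLM judge and the sim-replay consistency classifier, and the thresholds are placeholders.

```python
# Hypothetical two-stage filter over generated episodes; the score functions
# are stand-ins for the VLM judge and the sim-replay consistency classifier.
def filter_synthetic(episodes, vlm_score, consistency_score,
                     vlm_thresh=0.5, cons_thresh=0.5):
    kept = []
    for ep in episodes:
        # Stage 1: video quality / instruction-following check by a VLM.
        if vlm_score(ep) < vlm_thresh:
            continue
        # Stage 2: replay IDM actions in sim, compare rollout to the video.
        if consistency_score(ep) < cons_thresh:
            continue
        kept.append(ep)
    return kept

episodes = [{"id": i, "q": i / 10} for i in range(10)]
kept = filter_synthetic(episodes,
                        vlm_score=lambda ep: ep["q"],
                        consistency_score=lambda ep: 1.0 - ep["q"] * 0.4)
```

Running stage 1 first is the natural ordering: the VLM check is cheap relative to a simulator rollout, so low-quality videos never reach the expensive stage.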

3. Training Procedure

A three-stage pipeline progressively specializes the policy.

Pre-Training: Trained on a large-scale multi-embodiment dataset (Fig. 6) for 100K steps to learn general action-prediction capabilities. Uses embodiment-specific projection layers and an embodiment-agnostic layer for generalization.

Mid-Training: Adapts the pre-trained model to target platforms (ALLEX, FR3) and injects functional capabilities (Fig. 7). Combines in-house data with synthetic data. Integrates the motion module, memory module (nmem=3n_{mem}=3), and physics stream. Training: 25K steps with modality dropout and an alignment warmup.
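The modality dropout used in mid-training can be sketched as follows. This is an assumed realization (the report does not give the mechanism): zeroing the optional memory and physics conditioning with some probability so the policy remains usable when a modality is absent at deployment.

```python
import numpy as np

def modality_dropout(cond, p_drop=0.3, rng=None):
    """Randomly zero optional conditioning modalities during training
    (assumed sketch; drop set and probability are placeholders)."""
    rng = rng or np.random.default_rng()
    out = dict(cond)
    for name in ("memory", "physics"):   # assumed optional streams
        if name in out and rng.random() < p_drop:
            out[name] = np.zeros_like(out[name])  # cognition is always kept
    return out

rng = np.random.default_rng(0)
cond = {"cognition": np.ones(8), "memory": np.ones(8), "physics": np.ones(8)}
dropped = modality_dropout(cond, p_drop=1.0, rng=rng)
# With p_drop = 1.0 both optional modalities are zeroed; cognition is untouched.
```

Training under such dropout forces the action model not to rely exclusively on any single auxiliary stream, which also makes the "w/o physics & memory" inference configuration (Table 4) well-defined.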

Post-Training: Specializes the model for downstream tasks via:

  • Adaptive Data Collection: Iterative protocol starting with a base dataset (balanced consistency/variance) and refining based on observed failure modes.
  • Reinforcement Learning (RECAP): Uses a text-based VLM critic that predicts values autoregressively using the VLM's native number tokens, enabling reliable value estimation from limited data.
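One way to read a scalar value out of a critic that emits native number tokens is to take the probability-weighted digit at the value position. The sketch below is an assumed decoding scheme, not RECAP's actual procedure; the digit vocabulary and single-digit range are placeholders.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical decoding step: the critic emits logits over the VLM's native
# number tokens "0"-"9"; the value estimate is the expected digit.
number_tokens = list(range(10))          # stand-ins for the "0".."9" token ids

def expected_digit(logits_over_digits):
    p = softmax(logits_over_digits)
    return float(np.dot(p, number_tokens))

# A critic that is confident the value is "7" (on a 0-9 scale):
logits = np.full(10, -4.0)
logits[7] = 6.0
v = expected_digit(logits)   # close to 7
```

Reusing the VLM's own number tokens means the critic inherits the language model's calibrated token distribution instead of learning a fresh regression head, which is what makes value estimation from limited data plausible.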

4. Inference Strategy

Optimizations reduce per-step latency for real-time control.

Graph Capture Optimization: Converts the model into a static graph by precomputing constant tensors (RoPE embeddings, attention masks) and capturing the entire forward pass as a single CUDA Graph, eliminating kernel launch overhead and graph fragmentation.

Kernel Optimization: Designs custom fused kernels for critical operator groups (e.g., RMSNorm + RoPE + Attention) to minimize memory traffic and coordinate data movement within a single kernel, overcoming limitations of Torch Compile's fixed fusion patterns.
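As a reference for what the fused kernel must reproduce, the unfused semantics of the RMSNorm + RoPE portion of that operator group look like this in numpy. This is illustrative math only (the actual kernels are custom CUDA); head layout and the RoPE base are assumptions.

```python
import numpy as np

# Unfused reference for part of the RMSNorm + RoPE + Attention operator group;
# a fused kernel computes the same math in a single pass over memory.
def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1_j, x2_j) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

seq, d = 16, 8
x = np.random.default_rng(0).normal(size=(seq, d))
q = rope(rms_norm(x, gamma=np.ones(d)), pos=np.arange(seq))
```

Run separately, these two ops read and write the full activation tensor twice; fusing them (and the subsequent attention QKV work) keeps the intermediate in registers, which is the memory-traffic saving the report targets.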

Empirical Validation / Results

Simulation Benchmarks

RLDX-1 was evaluated on a diverse suite of benchmarks (Fig. 11) and compared against frontier VLAs (π0-FAST, π0, π0.5, GR00T N1.5, GR00T N1.6).

Table 1: Results on Simulation Benchmarks

(a) Classical Simulation Benchmarks

| Method | LIBERO Short | LIBERO Long | LIBERO Avg. | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX |
|---|---|---|---|---|---|---|---|
| π0-FAST | 93.9 | 60.2 | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 |
| π0 | 97.1 | 85.2 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 |
| π0.5 | 98.0 | 92.0 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 |
| GR00T N1.5 | 90.0 | 76.0 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 |
| GR00T N1.6 | 97.4 | 94.4 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 |
| RLDX-1 (Ours) | 98.6 | 95.3 | 97.8 | 86.7 | 81.5 | 77.4 | 71.9 |

(b) Challenging Simulation Benchmarks

| Method | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 Atomic-S | RoboCasa365 Comp.-S | RoboCasa365 Comp.-U | RoboCasa365 Avg. |
|---|---|---|---|---|---|---|
| π0-FAST | 63.6 | - | 51.7 | 8.0 | 1.8 | 21.7 |
| π0 | 62.5 | 13.6 | 34.6 | 6.1 | 1.1 | 14.8 |
| π0.5 | 62.1 | 15.4 | 39.6 | 7.1 | 1.2 | 16.9 |
| GR00T N1.5 | 65.7 | 48.0 | 43.0 | 9.6 | 4.4 | 20.0 |
| GR00T N1.6 | 66.2 | 47.6 | 61.1 | 12.6 | 2.6 | 26.9 |
| RLDX-1 (Ours) | 70.6 | 58.7 | 67.3 | 19.0 | 5.6 | 32.1 |

Key Findings: RLDX-1 consistently outperforms baselines across all benchmarks, demonstrating superior versatility and robustness. The advantage is particularly pronounced on challenging benchmarks (GR-1 Tabletop, RoboCasa365) and under robustness shifts (LIBERO-Plus, SIMPLER Google-VA).

Real-World Experiments

OpenArm Humanoid Benchmark (Versatile Intelligence): Evaluated on tasks requiring basic manipulation, instruction following, and generalization (Fig. 13).

Figure 14: OpenArm Humanoid Benchmark Results

RLDX-1 substantially outperforms baselines across all tasks, showing strong generalization. For example:

  • Unseen Object: RLDX-1 achieves 54.2% vs. π0.5's 37.5%.
  • Object Grounding: RLDX-1 achieves 87.5% vs. GR00T N1.6's 33.3%.

ALLEX Humanoid Benchmark (Functional Capabilities): Evaluated on tasks requiring motion awareness, long-term memory, and physical sensing (Fig. 15).

Figure 16: ALLEX Humanoid Benchmark Results

| Task Category | π0.5 (%) | GR00T N1.6 (%) | RLDX-1 (Ours) (%) |
|---|---|---|---|
| Conveyor Pick-and-Place (Motion) | 29.2 | 33.3 | 87.5 |
| Object-in-Box Selection (Memory) | 38.5 | 29.2 | 91.7 |
| Card Slide-and-Pick (Physical) | 55.3 | 62.3 | 97.2 |
| Pot-to-Cup Pouring (Physical) | 39.1 | 44.8 | 70.8 |
| Average | 39.1 | 44.8 | 86.8 |

RLDX-1 achieves dramatically higher success rates, demonstrating the effectiveness of its integrated functional capabilities. Baselines struggle with dynamic environments, memory-dependent choices, and contact-rich tasks.

Franka Research 3 Benchmark (Functional Capabilities): Evaluated on similar capability-specific tasks (Fig. 17).

Figure 18: Franka Research 3 Benchmark Results

RLDX-1 again substantially outperforms baselines. For example:

  • Spin Tracking (Motion): 97.9% vs. ~30% for baselines.
  • Shell Game (Memory): 91.7% vs. ~50% for baselines.
  • Plug Insertion (Physical): 33.3% vs. ~20% for baselines.

Ablation and Analysis

VLM Design Choices (Table 2):

  • Layer Selection: Features taken from intermediate layer 18 yield the best performance (60.9% on RoboCasa Kitchen); extracting from earlier (layer 8) or later (layer 28) layers degrades it.
  • Robot-Specific VQA Training: Improves success rate from 57.5% to 60.9%. Attention maps show increased focus on robot embodiment and target objects after training.

Effect of Synthetic Data (Table 3): Pre-training with synthetic GR-1 humanoid data consistently improves downstream performance on the GR-1 Tabletop benchmark:

| Pre-training Data (Real + Synthetic %) | Success Rate (%) |
|---|---|
| Real only (0%) | 41.0 |
| Real + 25% Synthetic | 45.6 |
| Real + 50% Synthetic | 46.6 |
| Real + 100% Synthetic | 50.1 |

Effect of RL Application (Figure 21): On the challenging Light Bulb Twisting task, RECAP-based RL refinement significantly improves over Behavior Cloning (BC):

  • Episode Length: RECAP₃ completes in ~353 frames vs. BC's ~1056 frames.
  • Attempts: RECAP₃ uses ~4.1 attempts vs. BC's ~12.7 attempts.

RECAP even surpasses human teleoperation performance, demonstrating improved speed and robustness.

Inference Optimization (Table 4):

| Inference Stack | w/o physics & memory (ms) | All-modality (ms) | Speedup vs. Eager |
|---|---|---|---|
| PyTorch Eager | 67.0 | 71.2 | 1.00× |
| CUDA Graph + Torch.Compile | 56.9 | 59.6 | ~1.19× |
| + Static Graph Conversion | 46.2 | 48.9 | ~1.46× |
| + Kernel Optimization | 41.6 | 43.7 | ~1.63× |

The two-stage optimization achieves a >1.6× speedup, reducing latency to ~43.7 ms, enabling real-time control.

Theoretical and Practical Implications

Theoretical Implications:

  • Beyond Versatile Intelligence: The work argues that for real-world dexterous manipulation, VLAs must integrate explicit functional capabilities (motion, memory, physical sensing) alongside general scene understanding. RLDX-1 provides a unified architectural framework (MSAT) to achieve this.
  • Architectural Integration: MSAT demonstrates how heterogeneous modalities can be effectively integrated via modality-specific streams with joint self-attention, preserving modality-specific representations while enabling cross-modal interaction for action generation.
  • Data Scaling: The synthetic data pipeline shows that generative models can effectively augment scarce robot data, particularly for specialized embodiments and tasks, by combining scene/task augmentation with rigorous filtering (quality and motion-consistency).

Practical Implications:

  • Real-World Deployment: The inference optimizations (graph capture, kernel fusion) make high-DoF, multi-modal VLAs practical for real-time control, addressing a key bottleneck for deployment.
  • Training Efficiency: The three-stage training pipeline (pre-training → mid-training → post-training) provides a structured approach to build generalist policies that can be efficiently specialized for specific embodiments and tasks.
  • Performance Gains: The significant performance improvements over state-of-the-art VLAs, especially on functional-capability-specific tasks (e.g., >90% success on ALLEX memory tasks vs. ~30% for baselines), demonstrate the practical value of the integrated design for complex, contact-rich, and dynamic manipulation.

Conclusion

RLDX-1 represents a significant step toward general-purpose robot policies capable of human-like dexterous manipulation in complex real-world environments. By unifying motion awareness, long-term memory, and physical sensing into a single VLA architecture, and combining it with scalable synthetic data generation, specialized training, and real-time inference optimization, it achieves consistent gains over frontier VLAs across both simulation and real-world benchmarks.