RLDX-1 Technical Report: A Vision-Language-Action Model for Dexterous Manipulation

Summary (Overview)

  • Unified Architecture: RLDX-1 introduces the Multi-Stream Action Transformer (MSAT), a novel architecture that integrates motion awareness, long-term memory, and physical sensing into a single Vision-Language-Action (VLA) model for dexterous manipulation.
  • Three-Stage Training: The model is trained through a progressive pipeline: pre-training on diverse multi-embodiment data, mid-training for embodiment-specific functional capabilities, and post-training for task adaptation, optionally enhanced with reinforcement learning.
  • Synthetic Data Pipeline: A novel framework generates and filters synthetic robot data to augment rare manipulation scenarios, improving scene and task diversity and enhancing downstream policy performance.
  • Inference Optimization: A two-stage optimization (static graph conversion and custom kernel fusion) achieves a >1.6× speedup, reducing per-step latency to ~43.7 ms, enabling real-time deployment.
  • State-of-the-Art Performance: RLDX-1 consistently outperforms frontier VLAs (π0.5, GR00T N1.6) across simulation benchmarks and real-world tasks, particularly excelling in tasks requiring motion awareness, long-term memory, and physical sensing.

Introduction and Theoretical Foundation

The goal of creating generalist robot policies capable of human-like dexterous manipulation in real-world environments remains a central challenge in robotics. Vision-Language-Action models (VLAs) have shown progress by leveraging the versatile intelligence (broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models (VLMs). However, versatility alone is insufficient for many complex real-world tasks, which demand broader functional capabilities:

  • Motion Awareness: To operate in dynamic environments (e.g., interacting with moving objects).
  • Long-Term Memory: For sequential and long-horizon tasks requiring reasoning over past interactions.
  • Physical Sensing: To infer contact forces under occlusion or subtle visual changes (e.g., grasping deformable objects).

RLDX-1 is designed to address this gap. It is a general-purpose robotic policy built on a unified neural architecture (MSAT) that integrates these heterogeneous modalities. The system combines this architecture with key design choices: a synthetic data generation pipeline for rare scenarios, a specialized three-stage training procedure, and inference optimizations for real-time control. The theoretical foundation rests on extending the capabilities of VLAs beyond static scene understanding to handle the temporal, memory-dependent, and contact-rich nature of real-world manipulation.

Methodology

1. Neural Architecture

RLDX-1 consists of two main components: a temporally-aware Vision-Language Model (VLM) and a multimodal action model.

Vision-Language Model (RLDX-1-VLM):

  • Base Model: Built upon Qwen3-VL 8B, fine-tuned on a robot-specific Visual Question Answering (VQA) dataset to improve embodied grounding (spatial relationships, subtask inference, low-level action alignment).
  • Cognition Tokens: Learnable query tokens q are appended to the input sequence x = [v_t, l_t, q] to extract action-relevant cognition features h_t from the VLM's intermediate layers.
  • Motion Awareness: Integrated via a motion module that captures temporal dynamics. Given video features v_t^{(i)} from layer i, the module computes a space-time self-similarity tensor S_t and updates features as:

\tilde{v}_t^{(i)} = v_t^{(i)} + S_\theta(S_t)

Multi-frame observations are compressed into a single context token via average pooling after early LLM layers for efficiency.

  • Long-Term Memory: An explicit memory module maintains a queue Q_t = [h_{t - n_{mem}H}, \ldots, h_{t-H}] of past cognition features. A Transformer M_\theta processes [Q_t, h_t] to produce memory features m_t.
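The queue maintenance behind the memory module can be sketched as follows. This is a minimal illustration, not the report's implementation: `CognitionMemory` is a hypothetical helper name, and the mean pooling is only a stand-in for the actual memory Transformer M_\theta.

```python
from collections import deque
import numpy as np

class CognitionMemory:
    """Fixed-length queue of past cognition features (hypothetical sketch)."""

    def __init__(self, n_mem: int, feat_dim: int):
        self.queue = deque(maxlen=n_mem)  # holds h_{t-n_mem*H}, ..., h_{t-H}
        self.feat_dim = feat_dim

    def update(self, h_t: np.ndarray) -> np.ndarray:
        # Form the context [Q_t, h_t] from queued features plus the current one.
        context = list(self.queue) + [h_t]
        # Stand-in for the memory Transformer M_theta: mean over the context.
        m_t = np.mean(np.stack(context, axis=0), axis=0)
        self.queue.append(h_t)  # enqueue h_t for future steps
        return m_t

mem = CognitionMemory(n_mem=3, feat_dim=4)
for step in range(5):
    m_t = mem.update(np.full(4, float(step)))
# After 5 steps the queue retains only the 3 most recent cognition features.
```

The fixed-length `deque` mirrors the report's choice of a bounded history (n_mem = 3 in mid-training), keeping memory cost constant over long-horizon episodes.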

Action Model (Multi-Stream Action Transformer - MSAT):

  • Formulation: A flow-matching Diffusion Transformer (DiT) that generates a chunk of H+1 future actions a_{t:t+H}. Given a noisy action chunk a_{t:t+H}^\tau = \tau a_{t:t+H} + (1-\tau)\epsilon and conditioning inputs c_t = [h_t, m_t, s_t, p_t], it learns a velocity field u_\theta via the flow-matching objective:

L(\theta; t, \tau, \epsilon) = \| u_\theta(a_{t:t+H}^\tau, \tau, c_t) - (a_{t:t+H} - \epsilon) \|_2^2

During inference, actions are generated over T denoising steps using Euler's method:

a_{t:t+H}^{\tau_{i+1}} = a_{t:t+H}^{\tau_i} + (\tau_{i+1} - \tau_i)\, u_\theta(a_{t:t+H}^{\tau_i}, \tau_i, c_t), \quad i = 1, \ldots, T-1
  • MSAT Design: Extends the Multi-Modal Diffusion Transformer (MM-DiT) to action modeling. It processes heterogeneous modalities through dedicated streams (Cognition C, Action A, Physics P) coupled via joint self-attention. Each stream applies its own normalization and QKV projections; outputs are concatenated for joint attention and then split back.
  • Physics Stream: Handles physical signals p_t (tactile, torque) with an auxiliary objective to predict future signals p_{t+1:t+L}, encouraging the model to internalize physical interaction dynamics.
  • Design Choices: Uses Rotary Positional Embeddings (RoPE) on the Action stream, injects the flow-matching timestep \tau as an in-context token, and employs RMSNorm and SwiGLU activations.
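The flow-matching formulation above has a useful sanity check: because the interpolant a^\tau = \tau a + (1-\tau)\epsilon has constant velocity a - \epsilon, Euler integration of the *exact* velocity field from pure noise recovers the clean action chunk. The sketch below verifies this with an oracle in place of a trained u_\theta; the shapes (H = 7, 14-DoF actions, T = 10 steps) are illustrative, not the report's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 7          # chunk horizon: H + 1 = 8 future actions
act_dim = 14   # illustrative action dimension
T = 10         # number of denoising steps

a_target = rng.normal(size=(H + 1, act_dim))   # "clean" action chunk
eps = rng.normal(size=(H + 1, act_dim))        # Gaussian noise sample

# The interpolant a^tau = tau * a + (1 - tau) * eps has constant velocity
# a - eps, which is exactly the flow-matching regression target.
def u_oracle(a_tau, tau):
    return a_target - eps

taus = np.linspace(0.0, 1.0, T)
a = eps.copy()                                 # start from pure noise (tau = 0)
for i in range(T - 1):                         # Euler's method over T steps
    a = a + (taus[i + 1] - taus[i]) * u_oracle(a, taus[i])

# Integrating the exact (constant) velocity field recovers the target chunk.
assert np.allclose(a, a_target)
```

A trained u_\theta only approximates this field conditioned on c_t, so in practice the sampler trades the number of steps T against accuracy and latency.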

2. Training Data

RLDX-1 uses three complementary data sources:

  1. Public Real-World Data: Curated from datasets like Open-X-Embodiment (OXE), DROID, Galaxea Open-World, Agibot World, Fourier ActionNet, and Humanoid Everyday (~1.5M episodes).
  2. In-house Real-World Data: Collected on the ALLEX humanoid (48-DoF) and a sensor-augmented Franka Research 3 (FR3) platform with tactile (AnySkin) and torque sensors.
  3. Synthetic Data: Generated via a pipeline (Fig. 4) to augment rare scenarios:
    • Generation: Source demonstrations are diversified via scene augmentation (I2I editing of initial frames) and task augmentation (VLM-generated instructions). Image-to-video (I2V) models generate videos, which are annotated with actions via an Inverse Dynamics Model (IDM).
    • Filtering: Two-stage filtering improves quality:
      • Video Quality Filtering: A VLM evaluates instruction following and trajectory plausibility.
      • Motion-Consistency Filtering: IDM-predicted actions are replayed in a simulator; a consistency classifier compares the rollout video to the synthetic video, retaining only high-scoring samples.
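The two-stage filter above can be sketched as a simple cascade. Everything here is hypothetical scaffolding: `vlm_score` and `consistency_score` stand in for the VLM judge and the sim-replay consistency classifier, and the thresholds are placeholders.

```python
# Hypothetical two-stage filter over generated episodes; the score functions
# are stand-ins for the VLM judge and the sim-replay consistency classifier.
def filter_synthetic(episodes, vlm_score, consistency_score,
                     vlm_thresh=0.5, cons_thresh=0.5):
    kept = []
    for ep in episodes:
        # Stage 1: video quality / instruction-following check by a VLM.
        if vlm_score(ep) < vlm_thresh:
            continue
        # Stage 2: replay IDM actions in sim, compare rollout to the video.
        if consistency_score(ep) < cons_thresh:
            continue
        kept.append(ep)
    return kept

episodes = [{"id": i, "q": i / 10} for i in range(10)]
kept = filter_synthetic(episodes,
                        vlm_score=lambda ep: ep["q"],
                        consistency_score=lambda ep: 1.0 - ep["q"] * 0.4)
```

Running stage 1 first is the natural ordering: the VLM check is cheap relative to a simulator rollout, so low-quality videos never reach the expensive stage.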

3. Training Procedure

A three-stage pipeline progressively specializes the policy.

Pre-Training: Trained on a large-scale multi-embodiment dataset (Fig. 6) for 100K steps to learn general action-prediction capabilities. Uses embodiment-specific projection layers and an embodiment-agnostic layer for generalization.

Mid-Training: Adapts the pre-trained model to target platforms (ALLEX, FR3) and injects functional capabilities (Fig. 7). Combines in-house data with synthetic data. Integrates the motion module, memory module (nmem=3n_{mem}=3), and physics stream. Training: 25K steps with modality dropout and an alignment warmup.
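The modality dropout used in mid-training can be sketched as follows. This is an assumed realization (the report does not give the mechanism): zeroing the optional memory and physics conditioning with some probability so the policy remains usable when a modality is absent at deployment.

```python
import numpy as np

def modality_dropout(cond, p_drop=0.3, rng=None):
    """Randomly zero optional conditioning modalities during training
    (assumed sketch; drop set and probability are placeholders)."""
    rng = rng or np.random.default_rng()
    out = dict(cond)
    for name in ("memory", "physics"):   # assumed optional streams
        if name in out and rng.random() < p_drop:
            out[name] = np.zeros_like(out[name])  # cognition is always kept
    return out

rng = np.random.default_rng(0)
cond = {"cognition": np.ones(8), "memory": np.ones(8), "physics": np.ones(8)}
dropped = modality_dropout(cond, p_drop=1.0, rng=rng)
# With p_drop = 1.0 both optional modalities are zeroed; cognition is untouched.
```

Training under such dropout forces the action model not to rely exclusively on any single auxiliary stream, which also makes the "w/o physics & memory" inference configuration (Table 4) well-defined.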

Post-Training: Specializes the model for downstream tasks via:

  • Adaptive Data Collection: Iterative protocol starting with a base dataset (balanced consistency/variance) and refining based on observed failure modes.
  • Reinforcement Learning (RECAP): Uses a text-based VLM critic that predicts values autoregressively using the VLM's native number tokens, enabling reliable value estimation from limited data.
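One way to read a scalar value out of a critic that emits native number tokens is to take the probability-weighted digit at the value position. The sketch below is an assumed decoding scheme, not RECAP's actual procedure; the digit vocabulary and single-digit range are placeholders.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical decoding step: the critic emits logits over the VLM's native
# number tokens "0"-"9"; the value estimate is the expected digit.
number_tokens = list(range(10))          # stand-ins for the "0".."9" token ids

def expected_digit(logits_over_digits):
    p = softmax(logits_over_digits)
    return float(np.dot(p, number_tokens))

# A critic that is confident the value is "7" (on a 0-9 scale):
logits = np.full(10, -4.0)
logits[7] = 6.0
v = expected_digit(logits)   # close to 7
```

Reusing the VLM's own number tokens means the critic inherits the language model's calibrated token distribution instead of learning a fresh regression head, which is what makes value estimation from limited data plausible.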

4. Inference Strategy

Optimizations reduce per-step latency for real-time control.

Graph Capture Optimization: Converts the model into a static graph by precomputing constant tensors (RoPE embeddings, attention masks) and capturing the entire forward pass as a single CUDA Graph, eliminating kernel launch overhead and graph fragmentation.

Kernel Optimization: Designs custom fused kernels for critical operator groups (e.g., RMSNorm + RoPE + Attention) to minimize memory traffic and coordinate data movement within a single kernel, overcoming limitations of Torch Compile's fixed fusion patterns.
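As a reference for what the fused kernel must reproduce, the unfused semantics of the RMSNorm + RoPE portion of that operator group look like this in numpy. This is illustrative math only (the actual kernels are custom CUDA); head layout and the RoPE base are assumptions.

```python
import numpy as np

# Unfused reference for part of the RMSNorm + RoPE + Attention operator group;
# a fused kernel computes the same math in a single pass over memory.
def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1_j, x2_j) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

seq, d = 16, 8
x = np.random.default_rng(0).normal(size=(seq, d))
q = rope(rms_norm(x, gamma=np.ones(d)), pos=np.arange(seq))
```

Run separately, these two ops read and write the full activation tensor twice; fusing them (and the subsequent attention QKV work) keeps the intermediate in registers, which is the memory-traffic saving the report targets.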

Empirical Validation / Results

Simulation Benchmarks

RLDX-1 was evaluated on a diverse suite of benchmarks (Fig. 11) and compared against frontier VLAs (π0-FAST, π0, π0.5, GR00T N1.5, GR00T N1.6).

Table 1: Results on Simulation Benchmarks

(a) Classical Simulation Benchmarks

| Method | LIBERO Short | LIBERO Long | LIBERO Avg. | LIBERO-Plus | SIMPLER Google-VM | SIMPLER Google-VA | SIMPLER WidowX |
|---|---|---|---|---|---|---|---|
| π0-FAST | 93.9 | 60.2 | 85.5 | 64.2 | 61.9 | 59.0 | 48.3 |
| π0 | 97.1 | 85.2 | 94.1 | 54.6 | 58.8 | 54.8 | 27.1 |
| π0.5 | 98.0 | 92.0 | 96.9 | 86.5 | 72.7 | 68.4 | 46.9 |
| GR00T N1.5 | 90.0 | 76.0 | 86.5 | 66.3 | 52.4 | 43.7 | 62.0 |
| GR00T N1.6 | 97.4 | 94.4 | 96.7 | 72.6 | 76.1 | 57.1 | 57.1 |
| RLDX-1 (Ours) | 98.6 | 95.3 | 97.8 | 86.7 | 81.5 | 77.4 | 71.9 |

(b) Challenging Simulation Benchmarks

| Method | RoboCasa Kitchen | GR-1 Tabletop | RoboCasa365 Atomic-S | RoboCasa365 Comp.-S | RoboCasa365 Comp.-U | RoboCasa365 Avg. |
|---|---|---|---|---|---|---|
| π0-FAST | 63.6 | - | 51.7 | 8.0 | 1.8 | 21.7 |
| π0 | 62.5 | 13.6 | 34.6 | 6.1 | 1.1 | 14.8 |
| π0.5 | 62.1 | 15.4 | 39.6 | 7.1 | 1.2 | 16.9 |
| GR00T N1.5 | 65.7 | 48.0 | 43.0 | 9.6 | 4.4 | 20.0 |
| GR00T N1.6 | 66.2 | 47.6 | 61.1 | 12.6 | 2.6 | 26.9 |
| RLDX-1 (Ours) | 70.6 | 58.7 | 67.3 | 19.0 | 5.6 | 32.1 |

Key Findings: RLDX-1 consistently outperforms baselines across all benchmarks, demonstrating superior versatility and robustness. The advantage is particularly pronounced on challenging benchmarks (GR-1 Tabletop, RoboCasa365) and under robustness shifts (LIBERO-Plus, SIMPLER Google-VA).

Real-World Experiments

OpenArm Humanoid Benchmark (Versatile Intelligence): Evaluated on tasks requiring basic manipulation, instruction following, and generalization (Fig. 13).

Figure 14: OpenArm Humanoid Benchmark Results

RLDX-1 substantially outperforms baselines across all tasks, showing strong generalization. For example:

  • Unseen Object: RLDX-1 achieves 54.2% vs. π0.5's 37.5%.
  • Object Grounding: RLDX-1 achieves 87.5% vs. GR00T N1.6's 33.3%.

ALLEX Humanoid Benchmark (Functional Capabilities): Evaluated on tasks requiring motion awareness, long-term memory, and physical sensing (Fig. 15).

Figure 16: ALLEX Humanoid Benchmark Results

| Task Category | π0.5 (%) | GR00T N1.6 (%) | RLDX-1 (Ours) (%) |
|---|---|---|---|
| Conveyor Pick-and-Place (Motion) | 29.2 | 33.3 | 87.5 |
| Object-in-Box Selection (Memory) | 38.5 | 29.2 | 91.7 |
| Card Slide-and-Pick (Physical) | 55.3 | 62.3 | 97.2 |
| Pot-to-Cup Pouring (Physical) | 39.1 | 44.8 | 70.8 |
| Average | 39.1 | 44.8 | 86.8 |

RLDX-1 achieves dramatically higher success rates, demonstrating the effectiveness of its integrated functional capabilities. Baselines struggle with dynamic environments, memory-dependent choices, and contact-rich tasks.

Franka Research 3 Benchmark (Functional Capabilities): Evaluated on similar capability-specific tasks (Fig. 17).

Figure 18: Franka Research 3 Benchmark Results

RLDX-1 again substantially outperforms baselines. For example:

  • Spin Tracking (Motion): 97.9% vs. ~30% for baselines.
  • Shell Game (Memory): 91.7% vs. ~50% for baselines.
  • Plug Insertion (Physical): 33.3% vs. ~20% for baselines.

Ablation and Analysis

VLM Design Choices (Table 2):

  • Layer Selection: Features taken from intermediate layer 18 yield the best performance (60.9% on RoboCasa Kitchen); extracting from earlier (layer 8) or later (layer 28) layers degrades it.
  • Robot-Specific VQA Training: Improves success rate from 57.5% to 60.9%. Attention maps show increased focus on robot embodiment and target objects after training.

Effect of Synthetic Data (Table 3): Pre-training with synthetic GR-1 humanoid data consistently improves downstream performance on the GR-1 Tabletop benchmark:

| Pre-training Data (Real + Synthetic %) | Success Rate (%) |
|---|---|
| Real only (0%) | 41.0 |
| Real + 25% Synthetic | 45.6 |
| Real + 50% Synthetic | 46.6 |
| Real + 100% Synthetic | 50.1 |

Effect of RL Application (Figure 21): On the challenging Light Bulb Twisting task, RECAP-based RL refinement significantly improves over Behavior Cloning (BC):

  • Episode Length: RECAP₃ completes in ~353 frames vs. BC's ~1056 frames.
  • Attempts: RECAP₃ uses ~4.1 attempts vs. BC's ~12.7 attempts.

RECAP even surpasses human teleoperation performance, demonstrating improved speed and robustness.

Inference Optimization (Table 4):

| Inference Stack | w/o physics & memory (ms) | All-modality (ms) | Speedup vs. Eager |
|---|---|---|---|
| PyTorch Eager | 67.0 | 71.2 | 1.00× |
| CUDA Graph + Torch.Compile | 56.9 | 59.6 | ~1.19× |
| + Static Graph Conversion | 46.2 | 48.9 | ~1.46× |
| + Kernel Optimization | 41.6 | 43.7 | ~1.63× |

The two-stage optimization achieves a >1.6× speedup, reducing latency to ~43.7 ms, enabling real-time control.

Theoretical and Practical Implications

Theoretical Implications:

  • Beyond Versatile Intelligence: The work argues that for real-world dexterous manipulation, VLAs must integrate explicit functional capabilities (motion, memory, physical sensing) alongside general scene understanding. RLDX-1 provides a unified architectural framework (MSAT) to achieve this.
  • Architectural Integration: MSAT demonstrates how heterogeneous modalities can be effectively integrated via modality-specific streams with joint self-attention, preserving modality-specific representations while enabling cross-modal interaction for action generation.
  • Data Scaling: The synthetic data pipeline shows that generative models can effectively augment scarce robot data, particularly for specialized embodiments and tasks, by combining scene/task augmentation with rigorous filtering (quality and motion-consistency).

Practical Implications:

  • Real-World Deployment: The inference optimizations (graph capture, kernel fusion) make high-DoF, multi-modal VLAs practical for real-time control, addressing a key bottleneck for deployment.
  • Training Efficiency: The three-stage training pipeline (pre-training → mid-training → post-training) provides a structured approach to build generalist policies that can be efficiently specialized for specific embodiments and tasks.
  • Performance Gains: The significant performance improvements over state-of-the-art VLAs, especially on functional-capability-specific tasks (e.g., >90% success on ALLEX memory tasks vs. ~30% for baselines), demonstrate the practical value of the integrated design for complex, contact-rich, and dynamic manipulation.

Conclusion

RLDX-1 represents a significant step toward general-purpose robot policies capable of human-like dexterous manipulation in complex real-world environments. By unifying motion awareness, long-term memory, and physical sensing into a single VLA architecture, and combining it with scalable synthetic data generation, specialized training, and real-time inference optimization, it achieves consistent gains over frontier VLAs across both simulation and real-world benchmarks.