FASTER: Rethinking Real-Time Flow VLAs - Summary
Summary (Overview)
- Identifies Reaction Latency Bottleneck: Demonstrates that the constant timestep schedule used in flow-based VLAs forces the entire multi-step denoising process to complete before any action can be dispatched, creating the primary bottleneck for real-time responsiveness.
- Proposes Horizon-Aware Schedule (HAS): Introduces FASTER, a plug-and-play method that uses an adaptive schedule to prioritize the sampling of near-term, latency-critical actions, compressing their generation into a single step while preserving long-horizon trajectory quality.
- Enables Streaming Output & Early Stopping: Implements a streaming client-server interface where early actions are dispatched immediately upon generation, and an early-stopping strategy skips unnecessary sampling steps, jointly reducing Time to First Action (TTFA) and increasing inference frequency.
- Achieves Significant Real-World Speedup: Empirically shows FASTER achieves up to 10× faster sampling for immediate actions and substantially reduces expected reaction time (e.g., 2.31×-2.62× speedup for X-VLA), enabling highly dynamic tasks like table tennis on consumer-grade GPUs.
Introduction and Theoretical Foundation
The deployment of Vision-Language-Action (VLA) models in the physical world demands real-time execution. While existing asynchronous inference methods optimize for trajectory smoothness by eliminating inter-chunk pauses, they critically overlook reaction latency—the delay in responding to environmental changes.
The paper provides a systematic analysis revealing that reaction time is not a constant but a random variable following a uniform distribution, determined jointly by the inference latency and the interval between inference-execution cycles: its lower bound is the inference latency itself, and its upper bound is the inference latency plus the inference interval.
A key insight is that the standard practice in flow-based VLAs of applying a constant timestep schedule across the entire action chunk is inefficient. It forces the system to complete all denoising steps before any movement can start. The authors hypothesize that near-term actions are easier to predict (lie in a narrower solution space) and thus should require fewer sampling steps. This leads to the introduction of Time to First Action (TTFA) as the precise metric for measuring reactivity, analogous to Time to First Token (TTFT) in LLMs.
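The uniform reaction-time model above can be sketched numerically. This is a minimal illustration; the function names and the 80 ms / 120 ms figures are assumptions for the example, not values from the paper:

```python
import random

def expected_reaction_time(t_inf: float, t_int: float) -> float:
    # Reaction time ~ Uniform(t_inf, t_inf + t_int), so its mean is
    # t_inf + t_int / 2. Symbol names here are illustrative.
    return t_inf + t_int / 2.0

def sample_reaction_time(t_inf: float, t_int: float) -> float:
    # An environmental change lands at a uniformly random phase of the
    # inference-execution cycle, then waits one full inference pass.
    return t_inf + random.uniform(0.0, t_int)

# Example: 80 ms inference latency, 120 ms inference interval
# -> expected reaction time of 140 ms.
mean_rt = expected_reaction_time(0.080, 0.120)
```

Shrinking either the latency (faster sampling) or the interval (more frequent inference) lowers both the bound and the mean, which is exactly the lever FASTER pulls.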
Methodology
1. Preliminaries: Flow-Based VLAs
The adopted flow-based VLA structure learns a velocity field $v_\theta$ via conditional flow matching. The training objective is:

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau,\, A,\, \epsilon}\left[\left\| v_\theta(A^\tau, \tau, o) - (A - \epsilon) \right\|^2\right],$$

where $A^\tau = \tau A + (1 - \tau)\,\epsilon$ is a linear interpolation between noise $\epsilon \sim \mathcal{N}(0, I)$ and ground-truth actions $A$, and $o$ is the observation.

During inference, actions are generated by integrating the learned velocity field from $\tau = 0$ to $\tau = 1$ using an ODE solver (e.g., the Euler method):

$$A^{\tau + \Delta\tau} = A^\tau + \Delta\tau\, v_\theta(A^\tau, \tau, o), \qquad \Delta\tau = 1/N,$$

where $N$ is the number of sampling steps (typically 10).
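A minimal sketch of this constant-schedule Euler loop, showing why nothing can be dispatched until all steps finish. The toy velocity field and all names are illustrative stand-ins; the real $v_\theta$ is a trained network:

```python
import numpy as np

def euler_sample(velocity_fn, horizon, action_dim, n_steps=10, rng=None):
    # One forward pass with the standard constant schedule: every action
    # in the chunk shares the same timestep tau, so the whole chunk is
    # only ready after all n_steps updates.
    if rng is None:
        rng = np.random.default_rng(0)
    actions = rng.standard_normal((horizon, action_dim))  # start from noise
    dt = 1.0 / n_steps
    tau = 0.0
    for _ in range(n_steps):
        actions = actions + dt * velocity_fn(actions, tau)
        tau += dt
    return actions

# Toy velocity field whose ODE transports noise at tau=0 to `target` at
# tau=1 along the linear interpolation path (rectified-flow style).
target = np.ones((8, 4))
v = lambda x, tau: (target - x) / (1.0 - tau + 1e-8)
out = euler_sample(v, horizon=8, action_dim=4, n_steps=10)
```

For this linear field the Euler iterates track the interpolation path exactly, so `out` lands on `target`; a learned field only approximates this.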
2. Horizon-Aware Schedule (HAS)
FASTER replaces the constant schedule with an index-dependent timestep vector $(\tau_1, \dots, \tau_H)$, where $i \in \{1, \dots, H\}$ is the action index within a chunk of length $H$.
- Hit Time: Each action $i$ has a predefined "hit time" $t_i$ (the global timestep at which it is fully denoised), determined by a monotone mapping of the action index, e.g. $t_i = t_1 + (1 - t_1)\left(\frac{i-1}{H-1}\right)^{\gamma}$. Here, $t_1$ is the hit time for the first action (set to one global step, $1/N$, to ensure one-step completion), and $\gamma$ controls the schedule's aggressiveness (larger $\gamma$ accelerates early actions more).
- Local Timestep Calculation: Given the global sampling progress $t$ at step $k$, the local timestep for action $i$ is $\tau_i = \min\left(t / t_i,\ 1\right)$.
Under this schedule, actions are finalized and can be dispatched progressively as $t$ reaches their respective hit times $t_i$.
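The schedule can be sketched as follows. The power-law mapping from action index to hit time is a plausible reconstruction under the constraints stated above (first action done in one step, later actions later), not necessarily the paper's exact formula:

```python
import numpy as np

def hit_times(horizon, n_steps=10, gamma=2.0):
    # Per-action hit times: the first action finishes after a single
    # global step (t_1 = 1/n_steps); later actions finish progressively
    # later, with gamma controlling how aggressively early actions are
    # front-loaded. The exact mapping is an illustrative assumption.
    t1 = 1.0 / n_steps
    idx = np.arange(horizon)
    return t1 + (1.0 - t1) * (idx / max(horizon - 1, 1)) ** gamma

def local_timesteps(global_t, hits):
    # Map global progress t in [0, 1] to per-action local timesteps:
    # each action advances faster the smaller its hit time, clamped at 1
    # (fully denoised) once t reaches hits[i].
    return np.minimum(global_t / hits, 1.0)

hits = hit_times(horizon=8, n_steps=10, gamma=2.0)
taus = local_timesteps(0.1, hits)  # progress after one global step
# taus[0] is already 1.0: the first action is denoised and dispatchable.
```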
3. Fine-tuning with Mixed Schedule
To maintain robustness and prevent distribution shift, fine-tuning uses a mixed strategy: with probability $p$, training uses HAS; with probability $1 - p$, it uses the original constant schedule.
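The per-sample schedule choice amounts to a single Bernoulli draw; the function name and default $p$ below are assumptions for illustration:

```python
import random

def pick_schedule(p_has: float = 0.5) -> str:
    # Mixed fine-tuning: with probability p_has use the horizon-aware
    # schedule, otherwise the original constant schedule, so the model
    # stays calibrated on both timestep distributions.
    return "HAS" if random.random() < p_has else "constant"
```

During fine-tuning this choice would be made per training batch (or per sample) before constructing the timestep vector.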
4. System Integration: Streaming & Early Stopping
- Streaming Client-Server: The server dispatches actions immediately upon completion, while the client robot executes them without waiting for the full chunk.
- Early Stopping: Once all actions within the execution horizon are finalized, the remaining sampling steps are skipped, reducing overall latency and allowing a smaller, more reactive s_min.
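The streaming and early-stopping logic can be sketched as a single loop. The hit times, execution horizon, and dispatch bookkeeping below are illustrative stand-ins, not the paper's implementation:

```python
def stream_actions(hits, n_steps=10, exec_horizon=4):
    # After each global Euler step, dispatch every newly finalized
    # action (in the real system, send it to the robot client), and
    # stop early once every action inside the execution horizon is done.
    dispatched = []
    dt = 1.0 / n_steps
    t = 0.0
    for step in range(n_steps):
        t += dt
        # ...one Euler update of the still-noisy actions would go here...
        for i, hit in enumerate(hits):
            if i not in dispatched and t >= hit - 1e-9:
                dispatched.append(i)  # dispatch(actions[i]) immediately
        if all(i in dispatched for i in range(exec_horizon)):
            return dispatched, step + 1  # early stop: skip remaining steps
    return dispatched, n_steps

# Assumed hit times: first action after one step, rest staggered.
hits = [0.1, 0.2, 0.4, 0.7, 1.0, 1.0, 1.0, 1.0]
dispatched, steps_used = stream_actions(hits, n_steps=10, exec_horizon=4)
```

With these hit times the loop returns after 7 of 10 steps, since only the first `exec_horizon` actions will actually be executed before the next inference cycle.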
Empirical Validation / Results
1. Reaction Speed Analysis
Experiments on RTX 4090 and RTX 4060 GPUs with the π 0.5 and X-VLA models show that FASTER significantly improves reactivity.
Table 2: Comparison of reaction capability on RTX 4090 and RTX 4060 GPUs.
| Model | Method | TTFA ↓ | s_min ↓ |
|---|---|---|---|
| π 0.5 | Sync | 80.0 ± 1.6 ms | 3 |
| | Async | 80.0 ± 1.6 ms | 3 |
| | FASTER | 62.1 ± 3.1 ms | 3 |
| | Speedup | 1.29× | – |
| X-VLA | Sync | 113.7 ± 0.8 ms | 4 |
| | Async | 113.7 ± 0.8 ms | 4 |
| | FASTER | 44.8 ± 0.3 ms | 2 |
| | Speedup | 2.54× | 2× |
FASTER reduces TTFA and expected reaction time, and allows for a smaller s_min (especially for X-VLA), increasing inference frequency.
Table 3: Probabilistic comparison of reaction speed.
| Model | Method | P(faster) vs. Sync | P(faster) vs. Async |
|---|---|---|---|
| π 0.5 | Async | 0.72 | – |
| | FASTER | 0.81 | 0.66 |
| X-VLA | Async | 0.73 | – |
| | FASTER | 1.00 | 1.00 |
FASTER is deterministically faster than baselines for X-VLA, and has a high probability of being faster for π 0.5.
2. Real-World Robot Experiments
In a highly dynamic table tennis task, FASTER enables successful ball returns where synchronous inference fails. Qualitative results show FASTER allows earlier racket adjustment, leading to better contact angles and more powerful hits.
Fig. 5/6: Real-world task scores.
- Table Tennis (RTX 4060): FASTER (Score: 0.47) outperforms Sync (0.00), Naive Async (0.20), and Training-time RTC (0.30).
- Additional Tasks (Pick Beverage, Fold Towel): FASTER achieves superior or comparable success scores while significantly reducing task completion duration compared to synchronous baselines.
3. Simulation Benchmark Performance
Table 4: Performance on LIBERO and CALVIN benchmarks.
| Method | LIBERO (Avg.) | CALVIN (Avg.) | ABC → D Len |
|---|---|---|---|
| π 0.5 | 96.9 | 83.2 | 4.313 |
| π 0.5 + FASTER | 96.5 | 81.9 | 4.292 |
| X-VLA | 98.0 | 77.0 | 4.151 |
| X-VLA + FASTER | 97.0 | 72.1 | 4.058 |
FASTER maintains competitive performance with only marginal degradation, confirming it preserves core task-solving capabilities.
Theoretical and Practical Implications
- Theoretical: Provides a formal analysis modeling reaction time as a uniform random variable and identifies the constant sampling schedule as a fundamental latency bottleneck in flow-based policies. Introduces TTFA as a crucial metric for embodied AI responsiveness.
- Practical: FASTER is a plug-and-play solution requiring no architectural changes or extra training cost. It enables real-time VLA deployment on consumer-grade GPUs (e.g., RTX 4060), making advanced robotic control more accessible. The method is orthogonal and complementary to other efficiency techniques like model compression or quantization.
Conclusion
FASTER addresses the critical reaction latency bottleneck in flow-based VLAs by rethinking the action sampling schedule. Through a Horizon-Aware Schedule that prioritizes immediate actions, coupled with a streaming execution pipeline, it achieves order-of-magnitude faster sampling for latency-critical actions. Real-robot experiments validate its superior responsiveness in dynamic tasks. While aggressive sampling may slightly impact long-horizon accuracy in static benchmarks, FASTER establishes a more favorable trade-off for real-world robotic manipulation, offering a general path toward genuinely real-time embodied intelligence.