# FASTER: Rethinking Real-Time Flow VLAs

> FASTER introduces a horizon-aware schedule that prioritizes near-term actions, enabling up to 10x faster sampling for immediate actions and significantly reducing reaction time in flow-based VLAs.

- **Source:** [arXiv](https://arxiv.org/abs/2603.19199)
- **Published:** 2026-03-21
- **Permalink:** https://picx.dev/p/0jiBGu
- **Whiteboard:** https://picx.dev/p/0jiBGu/image

## Summary

*   **Identifies Reaction Latency Bottleneck:** Demonstrates that the constant timestep schedule used in flow-based VLAs forces the entire multi-step denoising process to complete before any action can be dispatched, creating the primary bottleneck for real-time responsiveness.
*   **Proposes Horizon-Aware Schedule (HAS):** Introduces FASTER, a plug-and-play method that uses an adaptive schedule to prioritize the sampling of near-term, latency-critical actions, compressing their generation into a single step while preserving long-horizon trajectory quality.
*   **Enables Streaming Output & Early Stopping:** Implements a streaming client-server interface where early actions are dispatched immediately upon generation, and an early-stopping strategy skips unnecessary sampling steps, jointly reducing Time to First Action (TTFA) and increasing inference frequency.
*   **Achieves Significant Real-World Speedup:** Empirically shows FASTER achieves up to **10x faster sampling** for immediate actions and substantially reduces expected reaction time (e.g., **2.31x-2.62x** speedup for X-VLA), enabling highly dynamic tasks like table tennis on consumer-grade GPUs.

## Introduction and Theoretical Foundation
Deploying Vision-Language-Action (VLA) models in the physical world demands real-time execution. Existing asynchronous inference methods optimize for trajectory smoothness by eliminating inter-chunk pauses, but they overlook **reaction latency**: the delay in responding to environmental changes.

The paper provides a systematic analysis, revealing that reaction time $\Delta t_{react}$ is not a constant but a **random variable following a uniform distribution**, determined jointly by inference latency and the interval between inference-execution cycles. The lower bound is the inference latency $\Delta t_{infer}$, and the upper bound is $\Delta t_{infer}$ plus the inference interval.
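Because the reaction time is uniform between those two bounds, its mean is the inference latency plus half the inference interval. The sketch below works that out under assumptions that are not stated in this summary: a 30 Hz control loop, a synchronous interval equal to inference time plus the time to execute $s$ actions, and an asynchronous interval equal to the execution time alone. The function name and parameters are hypothetical.

```python
def expected_reaction_ms(infer_ms: float, s: int, mode: str,
                         control_hz: float = 30.0) -> float:
    """Mean of a reaction time that is uniform on [infer, infer + interval]."""
    # Time the robot needs to execute s actions (control rate is an assumption).
    exec_ms = s * 1000.0 / control_hz
    if mode == "sync":
        # Assumed: inference and execution alternate, so a cycle contains both.
        interval_ms = infer_ms + exec_ms
    elif mode == "async":
        # Assumed: inference overlaps execution, so a cycle is execution only.
        interval_ms = exec_ms
    else:
        raise ValueError(f"unknown mode: {mode}")
    return infer_ms + 0.5 * interval_ms


print(round(expected_reaction_ms(80.0, 3, "sync"), 1))   # 170.0
print(round(expected_reaction_ms(80.0, 3, "async"), 1))  # 130.0
```

With an 80 ms inference latency and $s = 3$, these assumptions give 170 ms and 130 ms, which agrees with the Sync and Async entries for the RTX 4090 in Table 2.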

A key insight is that the standard practice in flow-based VLAs of applying a **constant timestep schedule** across the entire action chunk is inefficient. It forces the system to complete all denoising steps before any movement can start. The authors hypothesize that near-term actions are easier to predict (lie in a narrower solution space) and thus should require fewer sampling steps. This leads to the introduction of **Time to First Action (TTFA)** as the precise metric for measuring reactivity, analogous to Time to First Token (TTFT) in LLMs.

## Methodology

### 1. Preliminaries: Flow-Based VLAs
The adopted flow-based VLA structure learns a velocity field via conditional flow matching. The training objective is:
$$L(\theta) = \mathbb{E}_{\tau \sim \mathcal{U}(0,1)} \left[ \| v_{\theta}(o_t, A^{\tau}_t, \tau) - (\epsilon - \hat{A}_t) \|^2 \right] \tag{1}$$
where $A^{\tau}_t = \tau \epsilon + (1-\tau) \hat{A}_t$ is a linear interpolation between noise $\epsilon \sim \mathcal{N}(0, I)$ and ground-truth actions $\hat{A}_t$.
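As a minimal sketch of Eq. (1), the function below evaluates the squared error for a single sample at a single timestep, with actions flattened into a plain list. The name `flow_matching_loss` and the callable `v_theta(obs, noisy_actions, tau)` are hypothetical stand-ins for a real model, and the expectation over $\tau$ and the data is left to the caller.

```python
def flow_matching_loss(v_theta, obs, actions, noise, tau):
    """Squared error of Eq. (1) for one sample and one timestep tau."""
    # Linear interpolation: pure noise at tau = 1, ground-truth actions at tau = 0.
    noisy = [tau * e + (1.0 - tau) * a for e, a in zip(noise, actions)]
    # Regression target is the constant velocity (noise - actions).
    target = [e - a for e, a in zip(noise, actions)]
    pred = v_theta(obs, noisy, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def zero_velocity(obs, noisy_actions, tau):
    """Toy model that always predicts zero velocity."""
    return [0.0 for _ in noisy_actions]


loss = flow_matching_loss(zero_velocity, obs=None,
                          actions=[0.0, 0.0], noise=[1.0, 0.0], tau=0.5)
print(loss)  # 1.0
```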

During inference, actions are generated by integrating the learned velocity field from $\tau=1$ to $\tau=0$ using an ODE solver (e.g., Euler method):
$$A^{\tau + \Delta\tau}_t = A^{\tau}_t + v_{\theta}(o_t, A^{\tau}_t, \tau) \Delta\tau, \quad \Delta\tau = -1/N \tag{2}$$
where $N$ is the number of sampling steps (typically 10).
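The Euler update in Eq. (2) can be sketched in a few lines. The velocity field here is a toy stand-in, not a VLA: it assumes every trajectory ends at one fixed target, for which the exact velocity is $(A^{\tau} - \hat{A})/\tau$. All function names are hypothetical.

```python
def euler_sample(v_theta, obs, noise, num_steps=10):
    """Integrate the velocity field from tau = 1 to tau = 0 with Euler steps."""
    actions = list(noise)          # start from pure noise at tau = 1
    d_tau = -1.0 / num_steps       # constant schedule: same step for every action
    for j in range(num_steps):
        tau = 1.0 - j / num_steps
        velocity = v_theta(obs, actions, tau)
        actions = [a + v * d_tau for a, v in zip(actions, velocity)]
    return actions


TARGET = [0.5, -0.2]


def toy_velocity(obs, actions, tau):
    """Exact velocity when the data distribution is the single point TARGET."""
    return [(a - t) / tau for a, t in zip(actions, TARGET)]


sample = euler_sample(toy_velocity, obs=None, noise=[1.0, 2.0], num_steps=10)
print(sample)  # close to TARGET, up to floating-point error
```

Note that no action is available until the loop finishes, which is the latency issue the next section addresses.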

### 2. Horizon-Aware Schedule (HAS)
FASTER replaces the constant schedule with an index-dependent timestep vector $\tau = \{\tau_i\}$, where $i \in [0, H-1]$ is the action index and $H$ is the number of actions in the chunk.

*   **Hit Time:** Each action $i$ has a predefined "hit time" $u_i$ (the global timestep $\rho$ at which it is fully denoised), determined by:
$$u_i = \left(1 - \frac{i}{H-1}\right)^{\alpha} u_0, \quad i \in [1, H-1] \tag{5}$$
Here, $u_0$ is the hit time for the first action (set to $(N-1)/N$ to ensure one-step completion), and $\alpha \in (0,1]$ controls the schedule's aggressiveness ($\alpha<1$ accelerates early actions more).
*   **Local Timestep Calculation:** Given the global sampling progress $\rho_j$ at step $j$, the local timestep for action $i$ is:
$$\tau^j_i = \max\left(0, (\rho_j - u_i)/(1 - u_i)\right) \tag{6}$$
Under this schedule, actions are finalized and can be dispatched progressively as $\rho_j$ reaches their respective $u_i$.
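Eqs. (5) and (6) translate directly into code. The sketch below uses hypothetical function names and illustrative values ($H = 5$, $N = 10$, $\alpha = 1$) that are not taken from the paper, and it applies Eq. (5) at $i = 0$ as well, where it reduces to $u_0$.

```python
def hit_times(horizon, num_steps, alpha):
    """Eq. (5): global timestep at which each action is fully denoised."""
    u0 = (num_steps - 1) / num_steps   # first action finishes after one step
    return [(1.0 - i / (horizon - 1)) ** alpha * u0 for i in range(horizon)]


def local_timesteps(rho, hits):
    """Eq. (6): per-action timestep given global progress rho (1 -> 0)."""
    return [max(0.0, (rho - u) / (1.0 - u)) for u in hits]


hits = hit_times(horizon=5, num_steps=10, alpha=1.0)
print(hits)                        # decreasing from u_0 = 0.9 down to 0.0
print(local_timesteps(1.0, hits))  # every action starts at tau = 1
print(local_timesteps(0.9, hits))  # after one step the first action is at tau = 0
```

Later actions keep a positive local timestep after the first step, so they continue to be refined while the first action is already dispatchable.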

### 3. Fine-tuning with Mixed Schedule
To maintain robustness and prevent distribution shift, fine-tuning uses a mixed strategy: with probability $p$, training uses HAS; with probability $1-p$, it uses the original constant schedule.
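One way to picture the mixed strategy is as a per-sample coin flip over which timestep vector is fed to the loss. The sketch below is an assumption about the mechanics, not the paper's recipe: it draws a global progress value uniformly and, for the HAS branch, maps it through Eqs. (5) and (6). The function name and arguments are hypothetical.

```python
import random


def sample_training_timesteps(horizon, num_steps, alpha, p, rng):
    """Return one timestep per action for a single training sample."""
    rho = rng.random()                      # global progress in [0, 1)
    if rng.random() < p:
        # HAS branch: each action gets its own local timestep.
        u0 = (num_steps - 1) / num_steps
        hits = [(1.0 - i / (horizon - 1)) ** alpha * u0 for i in range(horizon)]
        return [max(0.0, (rho - u) / (1.0 - u)) for u in hits]
    # Constant branch: the original schedule, one shared timestep.
    return [rho] * horizon


rng = random.Random(0)
print(sample_training_timesteps(5, 10, 0.5, p=1.0, rng=rng))  # non-decreasing in index
print(sample_training_timesteps(5, 10, 0.5, p=0.0, rng=rng))  # all entries equal
```

In the HAS branch, near-term actions always sit at a timestep no larger than that of later actions, which mirrors what the model sees at inference time.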

### 4. System Integration: Streaming & Early Stopping
*   **Streaming Client-Server:** The server dispatches actions immediately upon completion, while the client robot executes them without waiting for the full chunk.
*   **Early Stopping:** Once all actions within the execution horizon $s$ are finalized, the remaining sampling steps are skipped. This reduces inference latency and permits a smaller minimum execution horizon $s_{min}$, so inference can run more often.
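The two mechanisms above can be sketched as a generator that yields action indices as soon as they are finalized and returns once the execution horizon is covered. This covers the scheduling logic only: a real server would run one model forward pass per step and send the action values. The name `stream_finalized`, its arguments, and the tolerance `eps` are hypothetical.

```python
def stream_finalized(horizon, num_steps, alpha, exec_horizon, eps=1e-12):
    """Yield (step, finalized_indices) and stop early once exec_horizon is covered."""
    u0 = (num_steps - 1) / num_steps
    hits = [(1.0 - i / (horizon - 1)) ** alpha * u0 for i in range(horizon)]
    sent = 0  # hit times decrease with index, so actions finish in index order
    for j in range(1, num_steps + 1):
        rho = (num_steps - j) / num_steps   # global progress after step j
        ready = []
        while sent < horizon and rho <= hits[sent] + eps:
            ready.append(sent)              # action is fully denoised: dispatch it
            sent += 1
        if ready:
            yield j, ready
        if sent >= exec_horizon:
            return                          # early stop: skip remaining steps


for step, indices in stream_finalized(horizon=5, num_steps=10, alpha=1.0, exec_horizon=2):
    print(step, indices)
# 1 [0]
# 4 [1]
```

In this illustrative setting the first action is dispatched after 1 of 10 steps, and sampling stops after step 4 because both actions in the execution horizon are done.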

## Empirical Validation / Results

### 1. Reaction Speed Analysis
Experiments on RTX 4090 and RTX 4060 GPUs with the $\pi_{0.5}$ and X-VLA models show that FASTER improves reactivity on both GPUs, with TTFA reduced by 1.27x to 3.09x.

**Table 2: Comparison of reaction capability on RTX 4090 and RTX 4060 GPUs.**
| Model | Method | RTX 4090 TTFA ↓ | RTX 4090 $s_{min}$ ↓ | RTX 4090 $E[\Delta t_{react}]$ ↓ | RTX 4060 TTFA ↓ | RTX 4060 $s_{min}$ ↓ | RTX 4060 $E[\Delta t_{react}]$ ↓ |
| :------ | :-------- | :------ | :------ | :------ | :------ | :------ | :------ |
| **$\pi_{0.5}$** | Sync | 80.0 ± 1.6 ms | 3 | 170.0 ms | 303.3 ± 0.8 ms | 10 | 621.6 ms |
|         | Async | 80.0 ± 1.6 ms | 3 | 130.0 ms | 303.3 ± 0.8 ms | 10 | 470.0 ms |
|         | **FASTER** | **62.1 ± 3.1 ms** | **3** | **112.1 ms** | **238.6 ± 1.9 ms** | **8** | **371.9 ms** |
|         | Speedup | 1.29× | – | 1.16× | 1.27× | 1.25× | 1.26× |
| **X-VLA** | Sync | 113.7 ± 0.8 ms | 4 | 237.2 ms | 399.5 ± 8.5 ms | 12 | 799.2 ms |
|         | Async | 113.7 ± 0.8 ms | 4 | 180.4 ms | 399.5 ± 8.5 ms | 12 | 599.5 ms |
|         | **FASTER** | **44.8 ± 0.3 ms** | **2** | **78.1 ms** | **129.2 ± 2.4 ms** | **6** | **229.2 ms** |
|         | Speedup | 2.54× | 2× | 2.31× | 3.09× | 2× | 2.62× |

*FASTER reduces TTFA and expected reaction time and allows a smaller `s_min` (especially for X-VLA), which increases inference frequency. The Speedup rows compare FASTER against Async.*

**Table 3: Probabilistic comparison of reaction speed.**
| Model | Method | RTX 4090 vs. Sync | RTX 4090 vs. Async | RTX 4060 vs. Sync | RTX 4060 vs. Async |
| :------ | :-------- | :------ | :------ | :------ | :------ |
| **$\pi_{0.5}$** | Async | 0.72 | – | 0.74 | – |
|         | **FASTER** | **0.81** | **0.66** | **0.88** | **0.77** |
| **X-VLA** | Async | 0.73 | – | 0.75 | – |
|         | **FASTER** | **1.00** | **1.00** | **1.00** | **1.00** |

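*Each entry is the probability that the row method reacts faster than the column baseline. For X-VLA, FASTER is faster than both baselines with probability 1.00; for $\pi_{0.5}$ the probability ranges from 0.66 to 0.88.*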
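Entries of this kind can be approximated by treating each method's reaction time as a uniform random variable and computing $P(X < Y)$ numerically. The sketch below is an illustration rather than the paper's procedure: the helper `prob_faster` is hypothetical, and the intervals are built from the Table 2 values for $\pi_{0.5}$ on the RTX 4090 assuming a 30 Hz control loop (so 3 actions take 100 ms).

```python
def prob_faster(a, b, n=20_000):
    """P(X < Y) for independent X ~ U(a) and Y ~ U(b), midpoint rule over X."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    total = 0.0
    for k in range(n):
        x = a_lo + (k + 0.5) * (a_hi - a_lo) / n
        # CDF of Y at x, clamped to [0, 1] outside Y's support.
        cdf_y = min(1.0, max(0.0, (x - b_lo) / (b_hi - b_lo)))
        total += 1.0 - cdf_y               # P(Y > x)
    return total / n


# Assumed intervals in ms: [latency, latency + inference interval].
faster = (62.1, 162.1)      # TTFA 62.1, interval 100
async_ = (80.0, 180.0)      # latency 80, interval 100
sync = (80.0, 260.0)        # latency 80, interval 80 + 100

print(round(prob_faster(faster, sync), 2))    # 0.81
print(round(prob_faster(faster, async_), 2))  # 0.66
```

When the two intervals do not overlap, as the Table 2 numbers imply for X-VLA, the same computation returns exactly 1.0.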

### 2. Real-World Robot Experiments
In a highly dynamic **table tennis task**, FASTER enables successful ball returns where synchronous inference fails. Qualitative results show FASTER allows earlier racket adjustment, leading to better contact angles and more powerful hits.

**Fig. 5/6: Real-world task scores.**
*   **Table Tennis (RTX 4060):** FASTER (Score: 0.47) outperforms Sync (0.00), Naive Async (0.20), and Training-time RTC (0.30).
*   **Additional Tasks (Pick Beverage, Fold Towel):** FASTER achieves superior or comparable success scores while significantly reducing task completion duration compared to synchronous baselines.

### 3. Simulation Benchmark Performance
**Table 4: Performance on LIBERO and CALVIN benchmarks.**
| Method                | LIBERO (Avg.) | CALVIN (Avg.) | ABC → D Len |
| :-------------------- | :------------ | :------------ | :---------- |
| $\pi_{0.5}$           | 96.9          | 83.2          | 4.313       |
| $\pi_{0.5}$ + **FASTER** | **96.5**   | **81.9**      | **4.292**   |
| X-VLA                 | 98.0          | 77.0          | 4.151       |
| X-VLA + **FASTER**    | **97.0**      | **72.1**      | **4.058**   |

*FASTER stays close to the baselines: most scores drop by at most 1.3 points, and the largest drop is 4.9 points (X-VLA on CALVIN), indicating that core task-solving capability is largely preserved.*

## Theoretical and Practical Implications
*   **Theoretical:** Provides a formal analysis modeling reaction time as a uniform random variable and identifies the constant sampling schedule as a fundamental latency bottleneck in flow-based policies. Introduces TTFA as a crucial metric for embodied AI responsiveness.
*   **Practical:** FASTER is a **plug-and-play** solution requiring no architectural changes or extra training cost. It enables **real-time VLA deployment on consumer-grade GPUs** (e.g., RTX 4060), making advanced robotic control more accessible. The method is orthogonal and complementary to other efficiency techniques like model compression or quantization.

## Conclusion
FASTER addresses the critical reaction latency bottleneck in flow-based VLAs by rethinking the action sampling schedule. Through a Horizon-Aware Schedule that prioritizes immediate actions, coupled with a streaming execution pipeline, it achieves order-of-magnitude faster sampling for latency-critical actions. Real-robot experiments validate its superior responsiveness in dynamic tasks. While aggressive sampling may slightly impact long-horizon accuracy in static benchmarks, FASTER establishes a more favorable trade-off for real-world robotic manipulation, offering a general path toward genuinely real-time embodied intelligence.

---

_Markdown view of https://picx.dev/p/0jiBGu, served by PicX — AI-generated visual whiteboard summaries of research papers._
