FASTER: Rethinking Real-Time Flow VLAs - Summary

Summary (Overview)

  • Identifies Reaction Latency Bottleneck: Demonstrates that the constant timestep schedule used in flow-based VLAs forces the entire multi-step denoising process to complete before any action can be dispatched, creating the primary bottleneck for real-time responsiveness.
  • Proposes Horizon-Aware Schedule (HAS): Introduces FASTER, a plug-and-play method that uses an adaptive schedule to prioritize the sampling of near-term, latency-critical actions, compressing their generation into a single step while preserving long-horizon trajectory quality.
  • Enables Streaming Output & Early Stopping: Implements a streaming client-server interface where early actions are dispatched immediately upon generation, and an early-stopping strategy skips unnecessary sampling steps, jointly reducing Time to First Action (TTFA) and increasing inference frequency.
  • Achieves Significant Real-World Speedup: Empirically shows FASTER achieves up to 10x faster sampling for immediate actions and substantially reduces expected reaction time (e.g., 2.31x-2.62x speedup for X-VLA), enabling highly dynamic tasks like table tennis on consumer-grade GPUs.

Introduction and Theoretical Foundation

The deployment of Vision-Language-Action (VLA) models in the physical world demands real-time execution. While existing asynchronous inference methods optimize for trajectory smoothness by eliminating inter-chunk pauses, they critically overlook reaction latency—the delay in responding to environmental changes.

The paper provides a systematic analysis, revealing that the reaction time $\Delta t_{react}$ is not a constant but a random variable following a uniform distribution, determined jointly by inference latency and the interval between inference-execution cycles. Its lower bound is the inference latency $\Delta t_{infer}$, and its upper bound is $\Delta t_{infer}$ plus the inference interval.

A key insight is that the standard practice in flow-based VLAs of applying a constant timestep schedule across the entire action chunk is inefficient. It forces the system to complete all denoising steps before any movement can start. The authors hypothesize that near-term actions are easier to predict (lie in a narrower solution space) and thus should require fewer sampling steps. This leads to the introduction of Time to First Action (TTFA) as the precise metric for measuring reactivity, analogous to Time to First Token (TTFT) in LLMs.
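The uniform reaction-time model above can be made concrete with a short Monte-Carlo sketch. This is a minimal illustration, not the paper's code, and the 80 ms / 200 ms figures are illustrative placeholders rather than reported measurements:

```python
import random

def sample_reaction_time(t_infer, t_interval, n=100_000, seed=0):
    """Monte-Carlo view of the reaction-time model: an external event
    lands at a uniformly random point inside an inference-execution
    cycle, so the delay until an action can reflect it is
    t_infer + U(0, t_interval)."""
    rng = random.Random(seed)
    samples = [t_infer + rng.uniform(0.0, t_interval) for _ in range(n)]
    return min(samples), max(samples), sum(samples) / n

# Illustrative numbers only (not from the paper): 80 ms inference
# latency and a 200 ms inference interval.
lo, hi, mean = sample_reaction_time(0.080, 0.200)
# lo   -> close to 0.080 (lower bound: the inference latency itself)
# hi   -> close to 0.280 (upper bound: latency + inference interval)
# mean -> close to 0.180 (midpoint of the uniform interval)
```

The estimated bounds match the analysis: the best case is the inference latency alone, and the expected reaction time sits halfway through the inference interval above it.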

Methodology

1. Preliminaries: Flow-Based VLAs

The adopted flow-based VLA structure learns a velocity field via conditional flow matching. The training objective is:

$$L(\theta) = \mathbb{E}_{\tau \sim \mathcal{U}(0,1)} \left[ \left\| v_{\theta}(o_t, A_t^{\tau}, \tau) - (\epsilon - \hat{A}_t) \right\|^2 \right] \tag{1}$$

where $A_t^{\tau} = \tau \epsilon + (1-\tau) \hat{A}_t$ is a linear interpolation between noise $\epsilon \sim \mathcal{N}(0, I)$ and ground-truth actions $\hat{A}_t$.

During inference, actions are generated by integrating the learned velocity field from $\tau = 1$ to $\tau = 0$ using an ODE solver (e.g., the Euler method):

$$A_t^{\tau + \Delta\tau} = A_t^{\tau} + v_{\theta}(o_t, A_t^{\tau}, \tau)\,\Delta\tau, \quad \Delta\tau = -1/N \tag{2}$$

where $N$ is the number of sampling steps (typically 10).
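The Euler sampling loop of Eq. (2) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; `velocity_fn` stands in for the trained network $v_\theta$, and the sanity check at the bottom uses a closed-form toy field rather than a real policy:

```python
import numpy as np

def euler_sample(velocity_fn, obs, horizon, act_dim, n_steps=10, seed=0):
    """Euler integration of the learned flow (Eq. 2): start from
    Gaussian noise at tau = 1 and step back to tau = 0 with a fixed
    step size of -1/N. `velocity_fn(obs, actions, tau)` stands in for
    the trained network v_theta."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, act_dim))  # A_t at tau = 1
    d_tau = -1.0 / n_steps
    tau = 1.0
    for _ in range(n_steps):
        actions = actions + velocity_fn(obs, actions, tau) * d_tau
        tau += d_tau
    return actions  # A_t at tau = 0: the denoised action chunk

# Sanity check with a closed-form field: for the linear path of Eq. 1,
# the conditional velocity given a known target is (A - target) / tau,
# and Euler integration recovers the target exactly at tau = 0.
target = np.full((8, 7), 0.5)  # hypothetical chunk: H = 8 steps, 7 DoF
chunk = euler_sample(lambda o, a, t: (a - target) / t, None, 8, 7)
```

With the constant schedule shown here, every one of the `horizon` actions shares the same `tau` at every step, which is exactly the property FASTER relaxes.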

2. Horizon-Aware Schedule (HAS)

FASTER replaces the constant schedule with an index-dependent timestep vector $\tau = \{\tau_i\}$, where $i \in [0, H-1]$ is the action index.

  • Hit Time: Each action $i$ has a predefined "hit time" $u_i$ (the global timestep $\rho$ at which it is fully denoised), determined by:

$$u_i = \left(1 - \frac{i}{H-1}\right)^{\alpha} u_0, \quad i \in [1, H-1] \tag{5}$$

Here, $u_0$ is the hit time for the first action (set to $(N-1)/N$ to ensure one-step completion), and $\alpha \in (0,1]$ controls the schedule's aggressiveness ($\alpha < 1$ accelerates early actions more).

  • Local Timestep Calculation: Given the global sampling progress $\rho_j$ at step $j$, the local timestep for action $i$ is:

$$\tau_i^j = \max\!\left(0, \frac{\rho_j - u_i}{1 - u_i}\right) \tag{6}$$

Under this schedule, actions are finalized and can be dispatched progressively as $\rho_j$ reaches their respective $u_i$.
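Equations (5) and (6) can be sketched directly. This is a minimal illustration under the definitions above; the function names and the example values ($H = 8$, $N = 10$, $\alpha = 0.5$) are my own choices, not the paper's configuration:

```python
import numpy as np

def hit_times(H, N, alpha=0.5):
    """Hit time u_i per action index (Eq. 5): the global progress rho
    at which action i is fully denoised. u_0 = (N-1)/N makes the first
    action complete after a single step; alpha < 1 pulls near-term hit
    times even earlier."""
    u0 = (N - 1) / N
    i = np.arange(H)
    return (1.0 - i / (H - 1)) ** alpha * u0

def local_timesteps(rho, u):
    """Local timestep tau_i for every action at global progress rho
    (Eq. 6); an action with tau_i == 0 is finalized and dispatchable."""
    return np.maximum(0.0, (rho - u) / (1.0 - u))

u = hit_times(H=8, N=10, alpha=0.5)
# After one global Euler step, rho = (N-1)/N = 0.9, so the first
# action already has tau_0 = 0 and can be streamed out while the
# later actions continue denoising toward their smaller hit times.
taus = local_timesteps(0.9, u)
```

Note that hit times decrease monotonically with the action index, so actions always finalize front-to-back, which is what makes streaming dispatch well defined.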

3. Fine-tuning with Mixed Schedule

To maintain robustness and prevent distribution shift, fine-tuning uses a mixed strategy: with probability $p$, training uses HAS; with probability $1-p$, it uses the original constant schedule.
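A minimal sketch of the mixing rule, assuming a per-sample draw; the helper name and batch-wise framing are my own, and only the probability-$p$ mix itself comes from the paper:

```python
import random

def choose_schedules(batch_size, p_has, rng=None):
    """Mixed fine-tuning strategy: for each training sample, use the
    Horizon-Aware Schedule with probability p and the original
    constant schedule with probability 1 - p, so the model remains
    usable under both schedules at inference time."""
    rng = rng or random.Random()
    return ["HAS" if rng.random() < p_has else "constant"
            for _ in range(batch_size)]
```

Setting `p_has = 1.0` recovers pure HAS training and `p_has = 0.0` recovers standard flow-matching fine-tuning.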

4. System Integration: Streaming & Early Stopping

  • Streaming Client-Server: The server dispatches actions immediately upon completion, while the client robot executes them without waiting for the full chunk.
  • Early Stopping: Once all actions within the execution horizon $s$ are finalized, the remaining sampling steps are skipped, reducing overall latency and allowing a smaller, more reactive $s_{min}$.
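The streaming and early-stopping logic can be sketched as a single dispatch loop. This is an illustrative skeleton, not the paper's client-server code; `dispatch` stands in for the server-to-robot send call, and the hit-time values in the example are made up:

```python
def stream_with_early_stop(u, rho_schedule, s_exec, dispatch):
    """Walk the global progress values rho_j, dispatch each action the
    moment its hit time u_i is reached (streaming), and stop sampling
    once the first s_exec actions -- the execution horizon -- are all
    finalized (early stopping). Returns the number of steps used."""
    sent = set()
    for j, rho in enumerate(rho_schedule):
        for i, ui in enumerate(u):
            if i not in sent and rho <= ui:
                dispatch(i)          # action i is fully denoised
                sent.add(i)
        if all(i in sent for i in range(s_exec)):
            return j + 1             # remaining steps are skipped
    return len(rho_schedule)

# Hypothetical hit times for a 4-action chunk and a 10-step schedule:
# with s_exec = 2, sampling stops after 4 of 10 steps.
order = []
steps = stream_with_early_stop(
    u=[0.9, 0.6, 0.3, 0.0],
    rho_schedule=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
    s_exec=2,
    dispatch=order.append,
)
```

Both mechanisms compound: streaming cuts TTFA, while early stopping raises the achievable inference frequency by not denoising actions that will never be executed.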

Empirical Validation / Results

1. Reaction Speed Analysis

Experiments on RTX 4090 and RTX 4060 GPUs with the π 0.5 and X-VLA models show FASTER significantly improves reactivity.

Table 2: Comparison of reaction capability on RTX 4090 and RTX 4060 GPUs.

| Model | Method | TTFA ↓ | s_min ↓ |
| --- | --- | --- | --- |
| π 0.5 | Sync | 80.0 ± 1.6 ms | 3 |
| π 0.5 | Async | 80.0 ± 1.6 ms | 3 |
| π 0.5 | FASTER | 62.1 ± 3.1 ms | 3 |
| π 0.5 | Speedup | 1.29× | — |
| X-VLA | Sync | 113.7 ± 0.8 ms | 4 |
| X-VLA | Async | 113.7 ± 0.8 ms | 4 |
| X-VLA | FASTER | 44.8 ± 0.3 ms | 2 |
| X-VLA | Speedup | 2.54× | — |

FASTER reduces TTFA and expected reaction time, and allows for a smaller s_min (especially for X-VLA), increasing inference frequency.

Table 3: Probabilistic comparison of reaction speed.

| Model | Method | vs. Sync | vs. Async |
| --- | --- | --- | --- |
| π 0.5 | Async | 0.72 | — |
| π 0.5 | FASTER | 0.81 | 0.66 |
| X-VLA | Async | 0.73 | — |
| X-VLA | FASTER | 1.00 | 1.00 |

For X-VLA, FASTER is faster than both baselines with probability 1.00 (deterministically faster); for π 0.5, it is faster with high probability.

2. Real-World Robot Experiments

In a highly dynamic table tennis task, FASTER enables successful ball returns where synchronous inference fails. Qualitative results show FASTER allows earlier racket adjustment, leading to better contact angles and more powerful hits.

Fig. 5/6: Real-world task scores.

  • Table Tennis (RTX 4060): FASTER (Score: 0.47) outperforms Sync (0.00), Naive Async (0.20), and Training-time RTC (0.30).
  • Additional Tasks (Pick Beverage, Fold Towel): FASTER achieves superior or comparable success scores while significantly reducing task completion duration compared to synchronous baselines.

3. Simulation Benchmark Performance

Table 4: Performance on LIBERO and CALVIN benchmarks.

| Method | LIBERO (Avg.) | CALVIN (Avg.) | ABC → D Len |
| --- | --- | --- | --- |
| π 0.5 | 96.9 | 83.2 | 4.313 |
| π 0.5 + FASTER | 96.5 | 81.9 | 4.292 |
| X-VLA | 98.0 | 77.0 | 4.151 |
| X-VLA + FASTER | 97.0 | 72.1 | 4.058 |

FASTER maintains competitive performance with only marginal degradation, confirming it preserves core task-solving capabilities.

Theoretical and Practical Implications

  • Theoretical: Provides a formal analysis modeling reaction time as a uniform random variable and identifies the constant sampling schedule as a fundamental latency bottleneck in flow-based policies. Introduces TTFA as a crucial metric for embodied AI responsiveness.
  • Practical: FASTER is a plug-and-play solution requiring no architectural changes or extra training cost. It enables real-time VLA deployment on consumer-grade GPUs (e.g., RTX 4060), making advanced robotic control more accessible. The method is orthogonal and complementary to other efficiency techniques like model compression or quantization.

Conclusion

FASTER addresses the critical reaction latency bottleneck in flow-based VLAs by rethinking the action sampling schedule. Through a Horizon-Aware Schedule that prioritizes immediate actions, coupled with a streaming execution pipeline, it achieves order-of-magnitude faster sampling for latency-critical actions. Real-robot experiments validate its superior responsiveness in dynamic tasks. While aggressive sampling may slightly impact long-horizon accuracy in static benchmarks, FASTER establishes a more favorable trade-off for real-world robotic manipulation, offering a general path toward genuinely real-time embodied intelligence.