FASTER: Rethinking Real-Time Flow VLAs

¹The University of Hong Kong  ²ACE Robotics
FASTER teaser

Real-time reaction in VLAs is constrained not only by inference latency, but also by how action chunks are generated and executed. FASTER introduces a new paradigm for fast action sampling under asynchronous execution. By compressing the sampling process for immediate reaction into a single step, FASTER achieves 10x acceleration over π0.5 and X-VLA, enabling real-time responsiveness in highly dynamic tasks such as table tennis.

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action-chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant timestep schedule in flow-based VLAs is inefficient: it forces the system to complete all sampling steps before any movement can start, making sampling the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the roughly ten denoising steps used by models such as π0.5 and X-VLA into a single step for the immediate action, while preserving the quality of long-horizon trajectories. Coupled with a streaming client-server pipeline, FASTER substantially reduces effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, show that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate, smooth trajectories.

Why Current Real-Time VLAs Still React Slowly

Temporal Pipelines of Sync and Async Inference

Synchronous and asynchronous inference

In the conventional synchronous pipeline, the robot pauses until the policy server completes inference, resulting in unavoidable execution gaps. In contrast, the asynchronous pipeline allows the next inference request to be launched before the current action chunk is fully executed, thereby eliminating inter-chunk pauses and improving motion continuity. However, existing asynchronous inference methods mainly focus on trajectory smoothness, while overlooking another essential dimension of real-time embodied intelligence: reaction.

Rethinking Reaction Capability

| Mode | Infer Interval | \(\Delta t_{\text{react}} \sim D_{\text{react}}\) | \(\mathbb{E}[\Delta t_{\text{react}}]\) | \(s_{\min}\) |
| --- | --- | --- | --- | --- |
| Sync | \(\Delta t_{\text{infer}} + \Delta t_{\text{exec}}\) | \(\mathcal{U}\!\left(\Delta t_{\text{infer}},\, 2\Delta t_{\text{infer}} + \Delta t_{\text{exec}}\right)\) | \(1.5\,\Delta t_{\text{infer}} + 0.5\,\Delta t_{\text{exec}}\) | \(1\) |
| Async | \(\Delta t_{\text{exec}}\) | \(\mathcal{U}\!\left(\Delta t_{\text{infer}},\, \Delta t_{\text{infer}} + \Delta t_{\text{exec}}\right)\) | \(\Delta t_{\text{infer}} + 0.5\,\Delta t_{\text{exec}}\) | \(\left\lceil \Delta t_{\text{infer}} / \Delta t_{\text{ctrl}} \right\rceil\) |

We show that reaction time is not a trivial constant determined solely by inference latency. Instead, it should be modeled as a random variable following a uniform distribution, due to the stochastic timing of external events relative to the robot controller. We further demonstrate that existing asynchronous methods are inherently limited, and that truly responsive behavior requires joint improvements in both perception-execution latency and the frequency of the inference-execution cycle.
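
Since \(\Delta t_{\text{react}}\) is uniform, the expectations in the table above are just midpoints of the corresponding intervals. A minimal sketch of that arithmetic (the function name and units are illustrative, not from the paper's code):

```python
def expected_reaction(dt_infer, dt_exec, mode):
    """Expected reaction time E[dt_react] under the uniform-distribution
    model in the table above (all times in the same unit, e.g. ms)."""
    if mode == "sync":
        # dt_react ~ U(dt_infer, 2*dt_infer + dt_exec)
        lo, hi = dt_infer, 2 * dt_infer + dt_exec
    elif mode == "async":
        # dt_react ~ U(dt_infer, dt_infer + dt_exec)
        lo, hi = dt_infer, dt_infer + dt_exec
    else:
        raise ValueError(mode)
    return (lo + hi) / 2  # mean of a uniform distribution

# Example: 80 ms inference, 100 ms chunk execution
print(expected_reaction(80, 100, "sync"))   # 170.0
print(expected_reaction(80, 100, "async"))  # 130.0
```

With an 80 ms inference time, these midpoints reproduce the 170.0 ms (sync) and 130.0 ms (async) expectations reported for π0.5 on the RTX 4090 below.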

FASTER: Fast Action Sampling for Immediate Reaction

Pilot Study on Action Sampling

Pilot study strategy
Pilot study visualization

Existing flow-based VLAs typically employ a constant timestep schedule over the entire action chunk, allocating the same number of sampling steps to each action. Under this design, the full multi-step denoising process must be completed before any action can be executed, substantially increasing reaction latency. However, the causal structure of physical interaction makes near-term actions more strongly constrained by current observations, such that they typically reside in a much narrower solution space. Our pilot study validates this intuition: early actions exhibit straighter interpolation trajectories and can recover accurate clean actions within only a few sampling steps.
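
To make the constant-schedule baseline concrete, here is a minimal Euler-integration flow-sampling loop. The velocity field, `target`, and `noise0` are toy stand-ins for the VLA's learned velocity head, chosen so the loop is runnable in isolation:

```python
def flow_sample_constant(velocity_fn, noise, num_steps):
    """Flow sampling with a constant timestep schedule: every action in
    the chunk shares the same flow time t, so no action is ready until
    all num_steps Euler steps have finished."""
    x = list(noise)                        # noisy action chunk (one scalar per action here)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                         # shared flow time for the whole chunk
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for the learned velocity head: the constant field x1 - x0,
# which Euler integration recovers exactly.
noise0 = [1.0, -1.0, 0.5, -0.5]            # x0: pure noise
target = [0.1,  0.2, 0.3,  0.4]            # x1: "clean" action chunk
def toy_velocity(x, t):
    return [x1 - x0 for x1, x0 in zip(target, noise0)]

sampled = flow_sample_constant(toy_velocity, noise0, num_steps=10)
print([round(a, 3) for a in sampled])      # ≈ target
```

Note that the first action of the chunk, despite being the easiest to predict, still waits through all ten steps before it can be executed.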

If earlier actions are easier to predict, can flow-based VLAs assign fewer sampling steps to these latency-critical actions and thereby enable immediate reaction?

Horizon-Aware Schedule

Horizon-aware schedule

FASTER introduces a Horizon-Aware Schedule (HAS) that prioritizes immediate actions during flow sampling. Specifically, HAS allocates a more aggressive sampling schedule to near-term actions while maintaining a slower schedule for long-horizon ones. As a result, the model can produce the immediate action with as few as one sampling step, without compromising long-horizon trajectory quality.
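
One way to realize such a schedule is to let each action's flow time advance at its own rate. The linear step budget below is an illustrative choice for this sketch, not necessarily the paper's exact schedule:

```python
def horizon_aware_times(horizon, total_steps, min_steps=1):
    """Sketch of a Horizon-Aware Schedule (HAS): per-action flow time
    advances faster for near-term actions, so action 0 is fully denoised
    (t = 1.0) after `min_steps` steps, while the last action spreads its
    denoising over all `total_steps` steps."""
    times = []
    for k in range(total_steps + 1):
        row = []
        for h in range(horizon):
            # per-action step budget, interpolated from min_steps to total_steps
            steps_h = min_steps + (total_steps - min_steps) * h / max(1, horizon - 1)
            row.append(min(1.0, k / steps_h))  # flow time t in [0, 1]
        times.append(row)
    return times

sched = horizon_aware_times(horizon=4, total_steps=10)
print(sched[1])   # after one step: first action at t=1.0, last only at t=0.1
```

Under this schedule the immediate action can be dispatched after a single step, while far-horizon actions keep the full denoising budget.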

Streaming Client-Server Interface

Beyond algorithmic acceleration, FASTER further incorporates a streaming client-server interface, through which early actions can be dispatched to the robot controller as soon as they become available. While the robot executes these initial actions, the VLA model continues refining subsequent actions in parallel and progressively replenishes the client's action buffer.
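
A minimal sketch of such a streaming interface, using a thread-safe queue as the action buffer (all names are illustrative; a real deployment would put a network transport between the two sides):

```python
import queue
import threading
import time

def streaming_server(action_buffer, chunk, step_latency=0.01):
    """Hypothetical policy server: pushes each action into the client's
    buffer as soon as it is ready, instead of waiting for the full chunk."""
    for action in chunk:
        time.sleep(step_latency)          # stand-in for per-action sampling cost
        action_buffer.put(action)
    action_buffer.put(None)               # sentinel: chunk complete

def streaming_client(action_buffer):
    """Client: consumes actions the moment they arrive."""
    executed = []
    while (action := action_buffer.get()) is not None:
        executed.append(action)           # stand-in for sending to the controller
    return executed

buf = queue.Queue()
chunk = ["a0", "a1", "a2", "a3"]
server = threading.Thread(target=streaming_server, args=(buf, chunk))
server.start()
result = streaming_client(buf)
server.join()
print(result)  # ['a0', 'a1', 'a2', 'a3']
```

The client starts executing `a0` while the server is still producing `a1`-`a3`, which is exactly the overlap that hides the remaining sampling steps behind execution.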

Experiments

Reaction Speed Analysis

| Model | Method | \(\mathrm{TTFA}\downarrow\) (RTX 4090) | \(s_{\min}\downarrow\) (RTX 4090) | \(\mathbb{E}[\Delta t_{\text{react}}]\downarrow\) (RTX 4090) | \(\mathrm{TTFA}\downarrow\) (RTX 4060) | \(s_{\min}\downarrow\) (RTX 4060) | \(\mathbb{E}[\Delta t_{\text{react}}]\downarrow\) (RTX 4060) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| π0.5 | Sync | \(80.0_{\pm 1.6}\text{ms}\) | \(3\) | \(170.0\text{ms}\) | \(303.3_{\pm 0.8}\text{ms}\) | \(10\) | \(621.6\text{ms}\) |
| π0.5 | Async | \(80.0_{\pm 1.6}\text{ms}\) | \(3\) | \(130.0\text{ms}\) | \(303.3_{\pm 0.8}\text{ms}\) | \(10\) | \(470.0\text{ms}\) |
| π0.5 | FASTER | \(\mathbf{62.1}_{\pm 3.1}\text{ms}\) | \(3\) | \(\mathbf{112.1}\text{ms}\) | \(\mathbf{238.6}_{\pm 1.9}\text{ms}\) | \(\mathbf{8}\) | \(\mathbf{371.9}\text{ms}\) |
| π0.5 | Speedup | \(\mathbf{1.29\times}\) | - | \(\mathbf{1.16\times}\) | \(\mathbf{1.27\times}\) | \(\mathbf{1.25\times}\) | \(\mathbf{1.26\times}\) |
| X-VLA | Sync | \(113.7_{\pm 0.8}\text{ms}\) | \(4\) | \(237.2\text{ms}\) | \(399.5_{\pm 8.5}\text{ms}\) | \(12\) | \(799.2\text{ms}\) |
| X-VLA | Async | \(113.7_{\pm 0.8}\text{ms}\) | \(4\) | \(180.4\text{ms}\) | \(399.5_{\pm 8.5}\text{ms}\) | \(12\) | \(599.5\text{ms}\) |
| X-VLA | FASTER | \(\mathbf{44.8}_{\pm 0.3}\text{ms}\) | \(\mathbf{2}\) | \(\mathbf{78.1}\text{ms}\) | \(\mathbf{129.2}_{\pm 2.4}\text{ms}\) | \(\mathbf{6}\) | \(\mathbf{229.2}\text{ms}\) |
| X-VLA | Speedup | \(\mathbf{2.54\times}\) | \(\mathbf{2\times}\) | \(\mathbf{2.31\times}\) | \(\mathbf{3.09\times}\) | \(\mathbf{2\times}\) | \(\mathbf{2.62\times}\) |

FASTER delivers substantial improvements in reaction speed across all scenarios, especially when deployed on consumer-grade GPUs. In particular, it achieves a 3.09× speedup in Time to First Action (TTFA) for X-VLA on an RTX 4060, highlighting its practical advantage under resource-constrained deployment.

Playing Table Tennis

Success rate (↑) on the table tennis task:

| GPU | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| RTX 4090 | 0.00 | 0.20 | 0.53 | **0.80** |
| RTX 4060 | 0.00 | 0.20 | 0.30 | **0.47** |

The table tennis task highlights the importance of immediate reaction under fast dynamics. FASTER consistently tracks and responds more effectively than the baselines across different GPU settings, demonstrating that faster action sampling directly translates into stronger real-world responsiveness. To the best of our knowledge, FASTER is the first method to enable a generalist VLA model to play table tennis on an RTX 4060.

Real-World Task Performance

Completion score (↑):

| Task | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| Pick Beverage | 0.879 | 0.957 | 0.950 | 0.957 |
| Fold Towel | 0.788 | 0.825 | 0.888 | **0.963** |

Rollout duration in seconds (↓):

| Task | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| Pick Beverage | 13.0 | 12.5 | 11.9 | 12.0 |
| Fold Towel | 24.7 | 24.0 | 20.7 | **20.5** |

On everyday tasks such as Pick Beverage and Fold Towel, FASTER achieves better or comparable completion scores and rollout durations relative to the baselines. These results highlight an important insight: task performance is determined not only by action accuracy, but also by the quality of real-time interaction with the physical world. Although accelerated sampling may introduce slight degradation in action prediction, FASTER strikes a more effective balance between responsiveness and accuracy.

Simulation Benchmarks

| Method | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | LIBERO Avg. | CALVIN 1 | CALVIN 2 | CALVIN 3 | CALVIN 4 | CALVIN 5 | CALVIN Avg. Len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 | 94.2 | 88.7 | 85.7 | 83.2 | 79.5 | 4.313 |
| π0.5+FASTER | 98.6 | 97.8 | 97.8 | 91.6 | 96.5 | 95.1 | 89.1 | 85.0 | 81.9 | 78.1 | 4.292 |
| X-VLA | 97.8 | 99.4 | 97.8 | 96.8 | 98.0 | 95.7 | 89.8 | 82.4 | 77.0 | 70.2 | 4.151 |
| X-VLA+FASTER | 99.0 | 99.2 | 97.6 | 92.0 | 97.0 | 97.7 | 91.1 | 81.2 | 72.1 | 63.7 | 4.058 |

CALVIN columns 1-5 report success rates over consecutive tasks in the ABC→D setting.

Although simulation benchmarks are not directly affected by inference latency, they provide a useful testbed for measuring any capability loss introduced by fast sampling. We find that FASTER preserves the original model performance with only marginal degradation on both benchmarks, indicating that HAS remains competitive despite its aggressive action-sampling strategy.

Citation


@article{lu2026faster,
  title={FASTER: Rethinking Real-Time Flow VLAs}, 
  author={Yuxiang Lu and Zhe Liu and Xianzhe Fan and Zhenya Yang and Jinghua Hou and Junyi Li and Kaixin Ding and Hengshuang Zhao},
  year={2026},
  journal={arXiv preprint arXiv:2603.19199}
}