FASTER: Rethinking Real-Time Flow VLAs

¹The University of Hong Kong  ²ACE Robotics
FASTER teaser

Real-time reaction in VLAs is constrained not only by inference latency, but also by how action chunks are generated and executed. FASTER introduces a new paradigm for fast action sampling under asynchronous execution. By compressing the sampling process for immediate reaction into a single step, FASTER achieves 10x acceleration over π0.5 and X-VLA, enabling real-time responsiveness in highly dynamic tasks such as table tennis.

Abstract

Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action-chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant timestep schedule in flow-based VLAs is inefficient: it forces the system to complete all sampling steps before any movement can start, making sampling the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the roughly ten denoising steps used by models such as π0.5 and X-VLA into a single step for the immediate action, while preserving the quality of long-horizon trajectories. Coupled with a streaming client-server pipeline, FASTER substantially reduces effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, show that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate, smooth trajectories.

Why Current Real-Time VLAs Still React Slowly

Temporal Pipelines of Sync and Async Inference

Synchronous and asynchronous inference

In the conventional synchronous pipeline, the robot pauses until the policy server completes inference, resulting in unavoidable execution gaps. In contrast, the asynchronous pipeline allows the next inference request to be launched before the current action chunk is fully executed, thereby eliminating inter-chunk pauses and improving motion continuity. However, existing asynchronous inference methods mainly focus on trajectory smoothness, while overlooking another essential dimension of real-time embodied intelligence: reaction.

Rethinking Reaction Capability

| Mode | Infer Interval | \(\Delta t_{\text{react}} \sim D_{\text{react}}\) | \(\mathbb{E}[\Delta t_{\text{react}}]\) | \(s_{\min}\) |
| --- | --- | --- | --- | --- |
| Sync | \(\Delta t_{\text{infer}} + \Delta t_{\text{exec}}\) | \(\mathcal{U}\!\left(\Delta t_{\text{infer}},\, 2\Delta t_{\text{infer}} + \Delta t_{\text{exec}}\right)\) | \(1.5\,\Delta t_{\text{infer}} + 0.5\,\Delta t_{\text{exec}}\) | \(1\) |
| Async | \(\Delta t_{\text{exec}}\) | \(\mathcal{U}\!\left(\Delta t_{\text{infer}},\, \Delta t_{\text{infer}} + \Delta t_{\text{exec}}\right)\) | \(\Delta t_{\text{infer}} + 0.5\,\Delta t_{\text{exec}}\) | \(\left\lceil \Delta t_{\text{infer}} / \Delta t_{\text{ctrl}} \right\rceil\) |

We show that reaction time is not a trivial constant determined solely by inference latency. Instead, it should be modeled as a random variable following a uniform distribution, due to the stochastic timing of external events relative to the robot controller. We further demonstrate that existing asynchronous methods are inherently limited, and that truly responsive behavior requires joint improvements in both perception-execution latency and the frequency of the inference-execution cycle.
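
Since \(\Delta t_{\text{react}}\) is uniform, the expectations in the table above are just midpoints of the corresponding intervals. A minimal sketch of that arithmetic (the function name and units are illustrative, not from the paper's code):

```python
def expected_reaction(dt_infer, dt_exec, mode):
    """Expected reaction time E[dt_react] under the uniform-distribution
    model in the table above (all times in the same unit, e.g. ms)."""
    if mode == "sync":
        # dt_react ~ U(dt_infer, 2*dt_infer + dt_exec)
        lo, hi = dt_infer, 2 * dt_infer + dt_exec
    elif mode == "async":
        # dt_react ~ U(dt_infer, dt_infer + dt_exec)
        lo, hi = dt_infer, dt_infer + dt_exec
    else:
        raise ValueError(mode)
    return (lo + hi) / 2  # mean of a uniform distribution

# Example: 80 ms inference, 100 ms chunk execution
print(expected_reaction(80, 100, "sync"))   # 170.0
print(expected_reaction(80, 100, "async"))  # 130.0
```

With an 80 ms inference time, these midpoints reproduce the 170.0 ms (sync) and 130.0 ms (async) expectations reported for π0.5 on the RTX 4090 below.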

FASTER: Fast Action Sampling for Immediate Reaction

Pilot Study on Action Sampling

Pilot study strategy
Pilot study visualization

Existing flow-based VLAs typically employ a constant timestep schedule over the entire action chunk, allocating the same number of sampling steps to each action. Under this design, the full multi-step denoising process must be completed before any action can be executed, substantially increasing reaction latency. However, the causal structure of physical interaction makes near-term actions more strongly constrained by current observations, such that they typically reside in a much narrower solution space. Our pilot study validates this intuition: early actions exhibit straighter interpolation trajectories and can recover accurate clean actions within only a few sampling steps.
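
To make the constant-schedule baseline concrete, here is a minimal Euler-integration flow-sampling loop. The velocity field, `target`, and `noise0` are toy stand-ins for the VLA's learned velocity head, chosen so the loop is runnable in isolation:

```python
def flow_sample_constant(velocity_fn, noise, num_steps):
    """Flow sampling with a constant timestep schedule: every action in
    the chunk shares the same flow time t, so no action is ready until
    all num_steps Euler steps have finished."""
    x = list(noise)                        # noisy action chunk (one scalar per action here)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                         # shared flow time for the whole chunk
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for the learned velocity head: the constant field x1 - x0,
# which Euler integration recovers exactly.
noise0 = [1.0, -1.0, 0.5, -0.5]            # x0: pure noise
target = [0.1,  0.2, 0.3,  0.4]            # x1: "clean" action chunk
def toy_velocity(x, t):
    return [x1 - x0 for x1, x0 in zip(target, noise0)]

sampled = flow_sample_constant(toy_velocity, noise0, num_steps=10)
print([round(a, 3) for a in sampled])      # ≈ target
```

Note that the first action of the chunk, despite being the easiest to predict, still waits through all ten steps before it can be executed.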

If earlier actions are easier to predict, can flow-based VLAs assign fewer sampling steps to these latency-critical actions and thereby enable immediate reaction?

Horizon-Aware Schedule

Horizon-aware schedule

FASTER introduces a Horizon-Aware Schedule (HAS) that prioritizes immediate actions during flow sampling. Specifically, HAS allocates a more aggressive sampling schedule to near-term actions while maintaining a slower schedule for long-horizon ones. As a result, the model can produce the immediate action with as few as one sampling step, without compromising long-horizon trajectory quality.
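
One way to realize such a schedule is to let each action's flow time advance at its own rate. The linear step budget below is an illustrative choice for this sketch, not necessarily the paper's exact schedule:

```python
def horizon_aware_times(horizon, total_steps, min_steps=1):
    """Sketch of a Horizon-Aware Schedule (HAS): per-action flow time
    advances faster for near-term actions, so action 0 is fully denoised
    (t = 1.0) after `min_steps` steps, while the last action spreads its
    denoising over all `total_steps` steps."""
    times = []
    for k in range(total_steps + 1):
        row = []
        for h in range(horizon):
            # per-action step budget, interpolated from min_steps to total_steps
            steps_h = min_steps + (total_steps - min_steps) * h / max(1, horizon - 1)
            row.append(min(1.0, k / steps_h))  # flow time t in [0, 1]
        times.append(row)
    return times

sched = horizon_aware_times(horizon=4, total_steps=10)
print(sched[1])   # after one step: first action at t=1.0, last only at t=0.1
```

Under this schedule the immediate action can be dispatched after a single step, while far-horizon actions keep the full denoising budget.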

Streaming Client-Server Interface

Beyond algorithmic acceleration, FASTER further incorporates a streaming client-server interface, through which early actions can be dispatched to the robot controller as soon as they become available. While the robot executes these initial actions, the VLA model continues refining subsequent actions in parallel and progressively replenishes the client's action buffer.
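
A minimal sketch of such a streaming interface, using a thread-safe queue as the action buffer (all names are illustrative; a real deployment would put a network transport between the two sides):

```python
import queue
import threading
import time

def streaming_server(action_buffer, chunk, step_latency=0.01):
    """Hypothetical policy server: pushes each action into the client's
    buffer as soon as it is ready, instead of waiting for the full chunk."""
    for action in chunk:
        time.sleep(step_latency)          # stand-in for per-action sampling cost
        action_buffer.put(action)
    action_buffer.put(None)               # sentinel: chunk complete

def streaming_client(action_buffer):
    """Client: consumes actions the moment they arrive."""
    executed = []
    while (action := action_buffer.get()) is not None:
        executed.append(action)           # stand-in for sending to the controller
    return executed

buf = queue.Queue()
chunk = ["a0", "a1", "a2", "a3"]
server = threading.Thread(target=streaming_server, args=(buf, chunk))
server.start()
result = streaming_client(buf)
server.join()
print(result)  # ['a0', 'a1', 'a2', 'a3']
```

The client starts executing `a0` while the server is still producing `a1`-`a3`, which is exactly the overlap that hides the remaining sampling steps behind execution.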

Experiments

Reaction Speed Analysis

| Model | Method | \(\mathrm{TTFA}\downarrow\) (RTX 4090) | \(s_{\min}\downarrow\) (RTX 4090) | \(\mathbb{E}[\Delta t_{\text{react}}]\downarrow\) (RTX 4090) | \(\mathrm{TTFA}\downarrow\) (RTX 4060) | \(s_{\min}\downarrow\) (RTX 4060) | \(\mathbb{E}[\Delta t_{\text{react}}]\downarrow\) (RTX 4060) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| π0.5 | Sync | \(80.0_{\pm 1.6}\text{ms}\) | \(3\) | \(170.0\text{ms}\) | \(303.3_{\pm 0.8}\text{ms}\) | \(10\) | \(621.6\text{ms}\) |
| π0.5 | Async | \(80.0_{\pm 1.6}\text{ms}\) | \(3\) | \(130.0\text{ms}\) | \(303.3_{\pm 0.8}\text{ms}\) | \(10\) | \(470.0\text{ms}\) |
| π0.5 | FASTER | \(\mathbf{62.1}_{\pm 3.1}\text{ms}\) | \(3\) | \(\mathbf{112.1}\text{ms}\) | \(\mathbf{238.6}_{\pm 1.9}\text{ms}\) | \(\mathbf{8}\) | \(\mathbf{371.9}\text{ms}\) |
| π0.5 | Speedup | \(\mathbf{1.29\times}\) | - | \(\mathbf{1.16\times}\) | \(\mathbf{1.27\times}\) | \(\mathbf{1.25\times}\) | \(\mathbf{1.26\times}\) |
| X-VLA | Sync | \(113.7_{\pm 0.8}\text{ms}\) | \(4\) | \(237.2\text{ms}\) | \(399.5_{\pm 8.5}\text{ms}\) | \(12\) | \(799.2\text{ms}\) |
| X-VLA | Async | \(113.7_{\pm 0.8}\text{ms}\) | \(4\) | \(180.4\text{ms}\) | \(399.5_{\pm 8.5}\text{ms}\) | \(12\) | \(599.5\text{ms}\) |
| X-VLA | FASTER | \(\mathbf{44.8}_{\pm 0.3}\text{ms}\) | \(\mathbf{2}\) | \(\mathbf{78.1}\text{ms}\) | \(\mathbf{129.2}_{\pm 2.4}\text{ms}\) | \(\mathbf{6}\) | \(\mathbf{229.2}\text{ms}\) |
| X-VLA | Speedup | \(\mathbf{2.54\times}\) | \(\mathbf{2\times}\) | \(\mathbf{2.31\times}\) | \(\mathbf{3.09\times}\) | \(\mathbf{2\times}\) | \(\mathbf{2.62\times}\) |

FASTER delivers substantial improvements in reaction speed across all scenarios, especially when deployed on consumer-grade GPUs. In particular, it achieves a 3.09× speedup in Time to First Action (TTFA) for X-VLA on an RTX 4060, highlighting its practical advantage under resource-constrained deployment.

Playing Table Tennis

Success rate (↑) on the table tennis task:

| GPU | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| RTX 4090 | 0.00 | 0.20 | 0.53 | **0.80** |
| RTX 4060 | 0.00 | 0.20 | 0.30 | **0.47** |

The table tennis task highlights the importance of immediate reaction under fast dynamics. FASTER consistently tracks and responds more effectively than the baselines across different GPU settings, demonstrating that faster action sampling directly translates into stronger real-world responsiveness. To the best of our knowledge, FASTER is the first method to enable a generalist VLA model to play table tennis on an RTX 4060.

Real-World Task Performance

Completion score (↑):

| Task | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| Pick Beverage | 0.879 | 0.957 | 0.950 | 0.957 |
| Fold Towel | 0.788 | 0.825 | 0.888 | **0.963** |

Rollout duration in seconds (↓):

| Task | Sync | Naive Async | Training-time RTC | FASTER |
| --- | --- | --- | --- | --- |
| Pick Beverage | 13.0 | 12.5 | 11.9 | 12.0 |
| Fold Towel | 24.7 | 24.0 | 20.7 | **20.5** |

On everyday tasks such as Pick Beverage and Fold Towel, FASTER achieves better or comparable completion scores and rollout durations relative to the baselines. These results highlight an important insight: task performance is determined not only by action accuracy, but also by the quality of real-time interaction with the physical world. Although accelerated sampling may introduce slight degradation in action prediction, FASTER strikes a more effective balance between responsiveness and accuracy.

Simulation Benchmarks

| Method | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | LIBERO Avg. | CALVIN 1 | CALVIN 2 | CALVIN 3 | CALVIN 4 | CALVIN 5 | CALVIN Avg. Len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 | 94.2 | 88.7 | 85.7 | 83.2 | 79.5 | 4.313 |
| π0.5+FASTER | 98.6 | 97.8 | 97.8 | 91.6 | 96.5 | 95.1 | 89.1 | 85.0 | 81.9 | 78.1 | 4.292 |
| X-VLA | 97.8 | 99.4 | 97.8 | 96.8 | 98.0 | 95.7 | 89.8 | 82.4 | 77.0 | 70.2 | 4.151 |
| X-VLA+FASTER | 99.0 | 99.2 | 97.6 | 92.0 | 97.0 | 97.7 | 91.1 | 81.2 | 72.1 | 63.7 | 4.058 |

CALVIN columns 1-5 report success rates over consecutive tasks in the ABC→D setting.

Although simulation benchmarks are not directly affected by inference latency, they provide a useful testbed for measuring any capability loss introduced by fast sampling. We find that FASTER preserves the original model performance with only marginal degradation on both benchmarks, indicating that HAS remains competitive despite its aggressive action-sampling strategy.

Citation


@article{lu2026faster,
  title={FASTER: Rethinking Real-Time Flow VLAs}, 
  author={Yuxiang Lu and Zhe Liu and Xianzhe Fan and Zhenya Yang and Jinghua Hou and Junyi Li and Kaixin Ding and Hengshuang Zhao},
  year={2026},
  journal={arXiv preprint arXiv:2603.19199}
}