1 Tsinghua University, 2 Xiaomi EV, 3 University of Macau
(*) Equal contribution. (‡) Project leader.
While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT. Through a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via Reinforcement Learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance, LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.
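The GRPO refinement stage mentioned above scores groups of sampled trajectories and normalizes their rewards within each group, replacing a learned value baseline with the group statistics. A minimal sketch of that group-relative advantage computation (the function name and the example reward values are illustrative assumptions, not the paper's actual reward design):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each trajectory's reward
    minus the group mean, scaled by the group standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: scalar rewards for 4 candidate trajectories sampled from the
# policy (e.g., combining collision, drivable-area, and comfort scores).
advs = group_relative_advantages([0.9, 0.7, 0.3, 0.1])
```

Advantages computed this way sum to roughly zero within each group, so trajectories are reinforced only relative to their sampled peers.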
Overview of the framework. (a) Model Architecture: The model constructs a Latent CoT by aligning hidden states with dynamic and geometric priors distilled from foundation models (Cosmos and VGGT) via specialized adapters. (b) Progressive Training Strategy: The pipeline features a two-stage SFT phase that utilizes structured causal masking to enforce physically grounded reasoning, followed by RL fine-tuning to directly optimize the policy for driving safety and compliance.
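The dual-feature alignment in (a) can be sketched as a distillation objective that pushes designated latent hidden states, through small adapters, toward teacher features from a geometry model (VGGT) and a world model (Cosmos). A minimal NumPy sketch, where the adapter shapes, the cosine objective, and all variable names are assumptions for illustration rather than the released implementation:

```python
import numpy as np

def cosine_align_loss(student, teacher, eps=1e-8):
    """1 - cosine similarity between an adapted latent and a teacher feature."""
    s = student / (np.linalg.norm(student) + eps)
    t = teacher / (np.linalg.norm(teacher) + eps)
    return 1.0 - float(s @ t)

rng = np.random.default_rng(0)
d_hidden, d_geo, d_dyn = 64, 32, 48

# Linear adapters projecting VLA hidden states into each teacher's feature space.
W_geo = rng.standard_normal((d_hidden, d_geo)) * 0.1  # -> geometric (VGGT) space
W_dyn = rng.standard_normal((d_hidden, d_dyn)) * 0.1  # -> dynamic (Cosmos) space

h = rng.standard_normal(d_hidden)   # a latent CoT hidden state
f_geo = rng.standard_normal(d_geo)  # stand-in for a VGGT geometric feature
f_dyn = rng.standard_normal(d_dyn)  # stand-in for a Cosmos dynamic feature

# Dual alignment: sum of the two distillation losses.
loss = cosine_align_loss(h @ W_geo, f_geo) + cosine_align_loss(h @ W_dyn, f_dyn)
```

Each term is bounded in [0, 2], so the combined loss stays in [0, 4]; in training this objective would be minimized jointly with the trajectory-generation loss of the SFT stage.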
Qualitative visualization comparing the Textual CoT baseline (Red) and LaST-VLA (Green). (a) Drivable Area Compliance (DAC): Our method maintains precise lane adherence, whereas the baseline violates spatial boundaries. (b) Time-to-Collision (TTC): Our method accurately anticipates dynamics to avoid rear-end collisions, while the baseline fails to brake effectively.
- LaST-VLA Inference Code
- LaST-VLA Checkpoint
- LaST-VLA Training Code
- Training Dataset
We borrowed code from NAVSIM and ms-swift. Thanks for their contributions to the community.
If you find LaST-VLA useful in your research or application, please cite it using this BibTeX:
@article{luo2026last,
title={LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving},
author={Luo, Yuechen and Li, Fang and Xu, Shaoqing and Ji, Yang and Zhang, Zehan and Wang, Bing and Shen, Yuannan and Cui, Jianwei and Chen, Long and Chen, Guang and others},
journal={arXiv preprint arXiv:2603.01928},
year={2026}
}