LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Yuechen Luo1*, Fang Li2*, Shaoqing Xu2*,‡, Yang Ji2, Zehan Zhang2, Bing Wang2, Yuannan Shen2, Jianwei Cui2, Long Chen2, Guang Chen2, Hangjun Ye2, Zhi-xin Yang3, Fuxi Wen1

1 Tsinghua University, 2 Xiaomi EV, 3 University of Macau

(*) Equal contribution. (‡) Project leader.

Paper PDF

📖 Abstract

While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework that shifts the reasoning paradigm from discrete symbolic processing to a physically grounded Latent Spatio-Temporal CoT. Through a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a progressive SFT training strategy that transitions from feature alignment to trajectory generation, and refined via reinforcement learning with Group Relative Policy Optimization (GRPO) to ensure safety and rule compliance, LaST-VLA sets a new record on NAVSIM v1 (91.3 PDMS) and NAVSIM v2 (87.1 EPDMS), while excelling in spatio-temporal reasoning on the SURDS and NuDynamics benchmarks.
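The GRPO refinement mentioned above scores a group of sampled trajectories and normalizes each reward against the group's statistics, so no learned value function is needed. The following is a minimal, illustrative sketch of that group-relative advantage computation only; the reward values and group size are hypothetical, not taken from this repository.

```python
# Minimal sketch of GRPO-style group-relative advantage normalization.
# The rewards below are hypothetical driving-rule scores, not from LaST-VLA.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled trajectories from one prompt, scored by a reward model.
advantages = group_relative_advantages([0.9, 0.7, 0.2, 0.6])
```

In a full GRPO loop these advantages would weight the policy-gradient update for each sampled trajectory; trajectories above the group mean are reinforced, those below are suppressed.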

🚀 Overview


Overview of the framework. (a) Model Architecture: The model constructs a Latent CoT by aligning hidden states with dynamic and geometric priors distilled from foundation models (Cosmos and VGGT) via specialized adapters. (b) Progressive Training Strategy: The pipeline features a two-stage SFT phase that utilizes structured causal masking to enforce physically grounded reasoning, followed by RL fine-tuning to directly optimize the policy for driving safety and compliance.
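The dual-feature alignment described above pulls the model's latent CoT tokens toward two teacher signals: geometric features (VGGT-style) and dynamic features (Cosmos-style). As a rough, hedged sketch of what such an objective could look like, here is a cosine-distance alignment loss in pure Python; the function names, equal weighting, and loss form are assumptions for illustration, not the repository's actual implementation.

```python
# Hedged sketch of a dual-feature alignment loss: a latent token is pulled
# toward a geometric teacher feature and a dynamic teacher feature via
# cosine distance. Weights and names are illustrative assumptions.
import math

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + eps)

def dual_alignment_loss(latent, geo_feat, dyn_feat, w_geo=0.5, w_dyn=0.5):
    """Weighted sum of alignment to geometric and dynamic teacher features."""
    return (w_geo * cosine_distance(latent, geo_feat)
            + w_dyn * cosine_distance(latent, dyn_feat))
```

A real implementation would operate on batched tensors and backpropagate this loss through the adapters during the first SFT stage, before switching to trajectory generation.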

🖼️ Visualization


Qualitative visualization comparing the Textual CoT baseline (Red) and LaST-VLA (Green). (a) Drivable Area Compliance (DAC): Our method maintains precise lane adherence, whereas the baseline violates spatial boundaries. (b) Time-to-Collision (TTC): Our method accurately anticipates dynamics to avoid rear-end collisions, while the baseline fails to brake effectively.

Currently Supported Features

  • LaST-VLA Inference Code
  • LaST-VLA Checkpoint
  • LaST-VLA Training Code
  • Training Dataset

Acknowledgements

We borrow code from NAVSIM and ms-swift. Thanks for their contributions to the community.

Citation

If you find LaST-VLA useful in your research or application, please cite it using this BibTeX entry:

@article{luo2026last,
  title={LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving},
  author={Luo, Yuechen and Li, Fang and Xu, Shaoqing and Ji, Yang and Zhang, Zehan and Wang, Bing and Shen, Yuannan and Cui, Jianwei and Chen, Long and Chen, Guang and others},
  journal={arXiv preprint arXiv:2603.01928},
  year={2026}
}

About

Repo of "LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving"
