Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song
Tested on Ubuntu 22.04. We recommend Mambaforge for faster installation:
cd vision-in-action
mamba env create -f environment.yaml
mamba activate via

Install ROS
conda config --env --add channels conda-forge
conda config --env --add channels robostack-staging
conda config --env --remove channels defaults
mamba install ros-noetic-desktop-full
After completing the ROS installation, open a new terminal and run roscore. If the ROS master starts without errors, the installation was successful.
Build ARX robot SDK
cd arx5-sdk
mkdir build && cd build
cmake ..
make -j

👀 Async Point Cloud Rendering. Stream RGB-D frames from an iPhone mounted as the robot camera, and render the resulting point cloud in VR on a Vision Pro.
⚙️ ARX Arm Setup. Make sure you can run the single-arm test scripts after the USB-CAN setup.
📍 Data Collection & Processing.
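At its core, the point cloud rendering step above back-projects each depth pixel through the camera intrinsics into a 3D point. A minimal NumPy sketch of that operation (the function name and the toy intrinsics are illustrative placeholders, not part of the ViA codebase or the Record3D API):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map in meters into an (H*W, 3) point cloud."""
    h, w = depth.shape
    # Pixel coordinate grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Standard pinhole back-projection through the intrinsics.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example: a 2x2 depth map at a constant 1 m, with made-up intrinsics.
pts = depth_to_point_cloud(np.ones((2, 2)), fx=1.0, fy=1.0, cx=1.0, cy=1.0)
print(pts.shape)  # (4, 3)
```

In practice the depth map and intrinsic matrix would come from the live iPhone stream, and the resulting points would be shipped asynchronously to the VR renderer.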
This project would not have been possible without the open-source contributions and support from the community.
Our robot controller code is powered by the ARX-SDK. The VR code is supported by Vuer and OpenTeleVision. Arm teleoperation is enabled by Gello. The data processing code is adapted from BiDex. The mobile base is provided by Tidybot++, and iPhone camera streaming uses the Record3D app. Model training is based on Diffusion Policy.
This work was supported in part by the Toyota Research Institute, NSF awards #2143601, #2037101, and #2132519, the Sloan Foundation, Stanford Human-Centered AI Institute, and Intrinsic. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
If you find this codebase useful, consider citing:
@article{xiong2025via,
title = {Vision in Action: Learning Active Perception from Human Demonstrations},
author = {Haoyu Xiong and Xiaomeng Xu and Jimmy Wu and Yifan Hou and Jeannette Bohg and Shuran Song},
journal = {arXiv preprint arXiv:2506.15666},
year = {2025}
}