VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
🎓 Paper | 🌐 Project Website | 🤗 Hugging Face
- 2025/3/25 Released the standard evaluation episodes and the primitive-task finetuning dataset.
- 2025/2/26 Released the reference evaluation pipeline.
- 2025/2/14 Released the scripts for trajectory generation.
- 2024/12/25 The preview version of VLABench has been released! It showcases most of the designed tasks and structure, but its functionality is still being organized and tested.
- Prepare the conda environment

```shell
conda create -n vlabench python=3.10
conda activate vlabench
git clone https://github.com/OpenMOSS/VLABench.git
cd VLABench
pip install -r requirements.txt
pip install -e .
```
- Download the assets

```shell
python scripts/download_assets.py
```

The script will automatically download the necessary assets and unzip them into the correct directory.

- (Optional) Initialize submodules

```shell
git submodule update --init --recursive
```

This will update other policy repos such as openpi.
We provide a brief tutorial in tutorials/2.auto_trajectory_generate.ipynb, and the full code is in scripts/trajectory_generation.py. Trajectory generation can be sped up several times by running multiple processes. A simple way to do this is:

```shell
sh data_generation.sh
```
Currently, this version does not support multi-process environments within the code itself. We will improve collection efficiency as much as possible in future updates. After running the script, each trajectory is stored as an HDF5 file in the directory you specify.
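Since the generator itself is single-process, the shell script above amounts to launching several independent runs side by side. Below is a minimal Python sketch of that pattern; the script path and the `--task`/`--save-dir` flags are assumptions for illustration, so adapt them to the actual CLI of scripts/trajectory_generation.py:

```python
import subprocess

def generate_in_parallel(tasks, save_dir, python="python",
                         script="scripts/trajectory_generation.py"):
    """Spawn one single-process generator per task and wait for all of them.

    The script path and flags are illustrative assumptions; check the
    argparse options of scripts/trajectory_generation.py before use.
    """
    procs = [
        subprocess.Popen([python, script, "--task", task, "--save-dir", save_dir])
        for task in tasks
    ]
    return [p.wait() for p in procs]  # exit codes, one per task
```

Because each worker is a separate OS process, this sidesteps the single-process limitation without any changes to the generation code.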
Because frameworks such as Octo and OpenVLA train on data in the RLDS format, we follow the process from rlds_dataset_builder to provide an example of converting the HDF5 dataset above into RLDS format. First, run

```shell
python scripts/convert_to_rlds.py --task [list] --save_dir /your/path/to/dataset
```

This will create a Python file containing the task's RLDS dataset builder in that directory. Then run

```shell
cd /your/path/to/dataset/task
tfds build
```
This process takes a long time with a single process; we are still testing a multithreaded method, as the code in the original repository appears to have some bugs.
Following openpi's processing of the Libero dataset, we offer a simple way to convert HDF5 data files into the LeRobot format. Run the script with

```shell
python scripts/convert_to_lerobot.py --dataset-name [your-dataset-name] --dataset-path /your/path/to/dataset --max-files 100
```
The processed LeRobot dataset is stored by default in $HF_HOME/lerobot/dataset-name.
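The `--max-files` cap is handy for smoke-testing a conversion before committing to the full dataset. A sketch of the selection logic, assuming the converter simply takes the first N `.hdf5` files in sorted order (an assumption about its behaviour, not a reading of the actual script):

```python
from pathlib import Path

def select_hdf5_files(dataset_path, max_files=100):
    """Collect .hdf5 trajectory files under dataset_path, capped at max_files.

    Sorting makes the subset deterministic across runs; whether the real
    converter does the same is an assumption.
    """
    files = sorted(Path(dataset_path).rglob("*.hdf5"))
    return files[:max_files]
```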
- Organize the functional code sections.
- Rebuild an efficient, user-friendly, and comprehensive evaluation framework.
- Manage the automatic data workflow for existing tasks.
- Improve the DSL of the skill library.
- Release the trajectory and evaluation scripts.
- Test the interface of humanoid and dual-arm manipulation.
- Release the remaining tasks not included in the preview version.
- Leaderboard of VLAs and VLMs in the standard evaluation
- Release standard evaluation datasets/episodes across different dimensions and difficulty levels.
- Release the standard finetuning dataset.
- Integrate commonly used VLA models to facilitate replication. (Continuously updated)
VLABench adopts a flexible modular framework for task construction, offering high adaptability. You can follow the process outlined in tutorial 6.
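A modular task framework of this kind typically revolves around a task registry: each task class registers itself under a name, and the benchmark instantiates tasks by name from configs. The sketch below illustrates that common pattern only; the names are hypothetical, not VLABench's actual API (see tutorial 6 for the real process):

```python
# Illustrative task-registry pattern; class/function names are hypothetical.
TASK_REGISTRY = {}

def register_task(name):
    """Class decorator that makes a task constructible by name."""
    def wrap(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return wrap

@register_task("select_fruit")
class SelectFruitTask:
    def build_instruction(self):
        return "Put the apple into the basket."

def make_task(name, **kwargs):
    """Look a task up by name and instantiate it."""
    return TASK_REGISTRY[name](**kwargs)
```

The registry keeps task definitions decoupled from the code that schedules or evaluates them, which is what makes adding a new task a local change.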
VLABench currently provides standard benchmark datasets, focusing on generalization across multiple dimensions. In the VLABench/configs/evaluation/tracks directory, we have set up multiple benchmark sets across different dimensions. These configs ensure that different models can be fairly compared under the same episodes on different machines.
| Track | Description |
|---|---|
| track_1_in_distribution | Evaluates the policy's task-learning ability: it must fit in-domain episodes from a small but diverse dataset. |
| track_2_cross_category | Evaluates the policy's generalization at the object category and instance levels, requiring visual generalization capability. |
| track_3_common_sense | Evaluates the policy's application of common sense: the target is described in a way that requires common-sense understanding. |
| track_4_semantic_instruction | Evaluates the policy's understanding of complex semantics, with instructions rich in contextual or semantic information. |
| track_5_cross_task | Evaluates the policy's ability to transfer skills across tasks; this setting is kept open so users can choose training and evaluation tasks to suit their needs. |
| track_6_unseen_texture | Evaluates the policy's visual robustness, with episodes using different backgrounds and table textures. |
NOTICE: Evaluation can also be done by sampling episodes directly from the environment. This method is more flexible, but it risks improperly initialized episodes. We recommend using the 'evaluation_tracks' method for evaluation.
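However the episodes are obtained, evaluation reduces to rolling the policy in each episode and aggregating success. A hedged sketch of that loop; the `env`/`policy` interfaces below are stand-ins, not VLABench's actual classes:

```python
def evaluate(policy, episodes, max_steps=500):
    """Return the success rate of `policy` over a list of episode environments.

    Assumed (illustrative) interfaces: env.reset() -> obs,
    env.step(action) -> (obs, done, success), policy(obs) -> action.
    """
    successes = 0
    for env in episodes:
        obs = env.reset()
        for _ in range(max_steps):
            obs, done, success = env.step(policy(obs))
            if done:
                successes += int(success)
                break
    return successes / len(episodes)
```

Fixed evaluation tracks make this score comparable across models because every policy sees the same episode initializations.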
- Evaluate OpenVLA

Before evaluating your finetuned OpenVLA, please compute the norm_stat on your dataset and place it in VLABench/configs/model/openvla_config.json. Run the evaluation script with

```shell
python scripts/evaluate_policy.py --n-sample 20 --model openvla --model_ckpt xx --lora_ckpt xx --eval_track track_1_in_distribution --tasks task1, task2 ...
```
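For reference, normalization statistics are per-dimension summaries of the action channels in your finetuning data. A stdlib sketch of computing such statistics; the exact keys OpenVLA expects in openvla_config.json are not reproduced here, so treat the schema below as an assumption and match it to a stat file produced by your OpenVLA finetuning run:

```python
import statistics

def compute_norm_stat(actions):
    """Per-dimension statistics over a list of equal-length action vectors.

    The key names are illustrative assumptions; mirror the schema your
    OpenVLA checkpoint was actually trained with.
    """
    dims = list(zip(*actions))  # transpose: one tuple per action dimension
    return {
        "mean": [statistics.fmean(d) for d in dims],
        "std": [statistics.pstdev(d) for d in dims],
        "min": [min(d) for d in dims],
        "max": [max(d) for d in dims],
    }
```

Using statistics computed on the same data the model was finetuned with is essential: a mismatch silently rescales every predicted action.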
- Evaluate openpi

Please run `git submodule update --init --recursive` to ensure that the repositories for the other models are installed correctly. For openpi, you should create a virtual environment with `uv` and run the policy server; then you can evaluate the finetuned openpi on VLABench. Please refer here for an example.
- Continuously integrating more policies...
When you encounter an issue, first refer to the documentation. Feel free to open a new issue if needed.
```bibtex
@misc{zhang2024vlabench,
      title={VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks},
      author={Shiduo Zhang and Zhe Xu and Peiju Liu and Xiaopeng Yu and Yuan Li and Qinghui Gao and Zhaoye Fei and Zhangyue Yin and Zuxuan Wu and Yu-Gang Jiang and Xipeng Qiu},
      year={2024},
      eprint={2412.18194},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.18194},
}
```