Improve model card: Add metadata and update links
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,19 +1,30 @@
|
| 1 |
<p align="center">
|
| 2 |
<img src="rynnvla-002/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
|
| 3 |
<p>
|
| 4 |
|
| 5 |
-
<h3 align="center"><a href="" style="color:#9C276A">
|
| 6 |
RynnVLA-002: A Unified Vision-Language-Action and World Model</a></h3>
|
| 7 |
-
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </
|
| 8 |
|
| 9 |
|
| 10 |
<h5 align="center">
|
| 11 |
|
| 12 |
-
[](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002)
|
| 14 |
-
[](./LICENSE)
|
| 15 |
</h5>
|
| 16 |
|
| 17 |
|
| 18 |
## 🌟 Introduction
|
| 19 |
RynnVLA-002 is an autoregressive action world model that unifies action and image understanding and generation. RynnVLA-002 integrates a Vision-Language-Action (VLA) model (action model) and a world model in a single framework. Compared to WorldVLA, RynnVLA-002 adds a continuous Action Transformer, wrist camera input and generation, and state input. RynnVLA-002 achieves a 97.4% success rate on the LIBERO benchmark.
|
|
@@ -23,7 +34,39 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
|
|
| 23 |
</div>
|
| 24 |
<br>
|
| 25 |
|
| 26 |
-
## Model
|
| 27 |
### VLA Model (256 * 256)
|
| 28 |
|
| 29 |
| Model | HF Link | Continuous Action SR (%) | Discrete Action SR (%) |
|
|
@@ -32,6 +75,7 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
|
|
| 32 |
| LIBERO-Object | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_object) | 99.8 | 96.8 |
|
| 33 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_goal) | 96.4 | 94.6 |
|
| 34 |
| LIBERO-Long | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_10) | 94.4 | 87.6 |
|
|
|
|
| 35 |
|
| 36 |
### World Model (512 * 512)
|
| 37 |
|
|
@@ -56,6 +100,288 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
|
|
| 56 |
| Action World Model | [Alibaba-DAMO-Academy/RynnVLA-002/Action_World_model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/Action_World_model_512/libero_10) | 427.86 | 19.36 | 72.19 | 27.78 |
|
| 57 |
|
| 58 |
|
| 59 |
## License <a name="license"></a>
|
| 60 |
|
| 61 |
All assets and code are under the [Apache 2.0 license](./LICENSE) unless specified otherwise.
|
|
@@ -63,10 +389,10 @@ All assets and code are under the [Apache 2.0 license](./LICENSE) unless specifi
|
|
| 63 |
## Citation <a name="citation"></a>
|
| 64 |
If you find the project helpful for your research, please consider citing our paper:
|
| 65 |
```bibtex
|
| 66 |
-
@article{
|
| 67 |
-
title={
|
| 68 |
-
author={Cen, Jun and
|
| 69 |
-
journal={arXiv preprint arXiv:},
|
| 70 |
year={2025}
|
| 71 |
}
|
| 72 |
```
|
|
@@ -76,7 +402,7 @@ If you find the project helpful for your research, please consider citing our pa
|
|
| 76 |
<!-- may -->
|
| 77 |
> [**RynnVLA-001: A Vision-Language-Action Model Boosted by Generative Priors**](https://github.com/alibaba-damo-academy/RynnVLA-001) <br>
|
| 78 |
> Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Jiayan Guo, Kexiang Wang, Kehan Li, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li <br>
|
| 79 |
-
[](https://github.com/alibaba-damo-academy/RynnVLA-001) [](https://github.com/alibaba-damo-academy/RynnVLA-001) [ <br>
|
| 82 |
> Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li <br>
|
|
@@ -89,4 +415,4 @@ If you find the project helpful for your research, please consider citing our pa
|
|
| 89 |
</p></details>
|
| 90 |
|
| 91 |
## Acknowledgment <a name="acknowledgment"></a>
|
| 92 |
-
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: robotics
|
| 4 |
+
library_name: transformers
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
<p align="center">
|
| 8 |
<img src="rynnvla-002/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
|
| 9 |
<p>
|
| 10 |
|
| 11 |
+
<h3 align="center"><a href="https://huggingface.co/papers/2511.17502" style="color:#9C276A">
|
| 12 |
RynnVLA-002: A Unified Vision-Language-Action and World Model</a></h3>
|
| 13 |
+
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>
|
| 14 |
|
| 15 |
|
| 16 |
<h5 align="center">
|
| 17 |
|
| 18 |
+
[📚 Paper](https://huggingface.co/papers/2511.17502) - [💻 Code](https://github.com/alibaba-damo-academy/RynnVLA-002) - [🏠 Project Page](https://rynnvla.github.io) - [](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002) - [](./LICENSE)
|
|
|
|
|
|
|
| 19 |
</h5>
|
| 20 |
|
| 21 |
+
<div align="center"><video src="https://github.com/user-attachments/assets/a09f6b8b-7707-4478-b069-2de1629c0b83" width="800" autoplay loop muted></video></div>
|
| 22 |
+
|
| 23 |
+
## 📰 News
|
| 24 |
+
|
| 25 |
+
* **[2025.11.10]** Upgrade WorldVLA to RynnVLA-002. Release models, training code and evaluation code on LIBERO simulation benchmark and real-world LeRobot experiments.
|
| 26 |
+
* **[2025.06.23]** Release models, training code and evaluation code on LIBERO action generation benchmark of WorldVLA.
|
| 27 |
+
|
| 28 |
|
| 29 |
## 🌟 Introduction
|
| 30 |
RynnVLA-002 is an autoregressive action world model that unifies action and image understanding and generation. RynnVLA-002 integrates a Vision-Language-Action (VLA) model (action model) and a world model in a single framework. Compared to WorldVLA, RynnVLA-002 adds a continuous Action Transformer, wrist camera input and generation, and state input. RynnVLA-002 achieves a 97.4% success rate on the LIBERO benchmark.
|
|
|
|
| 34 |
</div>
|
| 35 |
<br>
|
| 36 |
|
| 37 |
+
### VLA Model Results (Text + Image -> Action)
|
| 38 |
+
The VLA model generates actions given a text instruction and image observations.
|
| 39 |
+
|
|
| 43 |
+
<br>
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
### World Model Results (Action + Image -> Image)
|
| 48 |
+
The world model generates the next frame given the current frame and an action control.
|
| 49 |
+
|
|
| 54 |
+
<br>
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
## 🛠️ Requirements and Installation
|
| 58 |
+
```bash
|
| 59 |
+
git clone https://github.com/alibaba-damo-academy/RynnVLA-002.git
|
| 60 |
+
cd RynnVLA-002
|
| 61 |
+
pip install -r requirements.txt
|
| 62 |
+
pip install flash-attn --no-build-isolation
|
| 63 |
+
pip install -e .
|
| 64 |
+
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
|
| 65 |
+
cd LIBERO
|
| 66 |
+
pip install -e .
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
## :earth_americas: Model Zoo
|
| 70 |
### VLA Model (256 * 256)
|
| 71 |
|
| 72 |
| Model | HF Link | Continuous Action SR (%) | Discrete Action SR (%) |
|
|
|
|
| 75 |
| LIBERO-Object | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_object) | 99.8 | 96.8 |
|
| 76 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_goal) | 96.4 | 94.6 |
|
| 77 |
| LIBERO-Long | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_10) | 94.4 | 87.6 |
|
| 78 |
+
<br>
|
| 79 |
|
| 80 |
### World Model (512 * 512)
|
| 81 |
|
|
|
|
| 100 |
| Action World Model | [Alibaba-DAMO-Academy/RynnVLA-002/Action_World_model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/Action_World_model_512/libero_10) | 427.86 | 19.36 | 72.19 | 27.78 |
|
| 101 |
|
| 102 |
|
| 103 |
+
## 🗝️ VLA Model Training on LIBERO
|
| 104 |
+
We evaluate on four task suites of the LIBERO benchmark: [spatial, object, goal, 10]. The steps below use LIBERO goal at 256 resolution as an example.
|
| 105 |
+
|
| 106 |
+
We offer two types of training pipelines:
|
| 107 |
+
|
| 108 |
+
- `Pretokenize`: This pipeline tokenizes all training data before training begins.
|
| 109 |
+
- `NoPretokenize`: This pipeline performs tokenization dynamically during the training process.
|
| 110 |
+
|
| 111 |
+
Both pipelines begin by filtering out no-operation actions, as in [OpenVLA](https://github.com/openvla/openvla).
|
| 112 |
+
|
| 113 |
+
```bash
|
| 114 |
+
cd rynnvla-002/libero_util
|
| 115 |
+
python regenerate_libero_dataset_filter_no_op.py \
|
| 116 |
+
--libero_task_suite libero_goal \
|
| 117 |
+
--libero_raw_data_dir ../processed_data/Libero/libero_goal \
|
| 118 |
+
--libero_target_dir ../processed_data/libero_goal_no_noops_t_256 \
|
| 119 |
+
--image_resolution 256
|
| 120 |
+
```
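As a rough illustration of what filtering no-operation actions can mean, here is a minimal sketch. It assumes each action is a list of seven values (six motion deltas followed by one gripper command) and treats a step as a no-op when the motion part is near zero and the gripper command repeats the last kept one. The action layout, the threshold, and the function names are assumptions for illustration; they are not taken from `regenerate_libero_dataset_filter_no_op.py`.

```python
import math

def is_noop(action, prev_action=None, threshold=1e-4):
    """Return True if `action` barely moves the arm and keeps the gripper as before.

    Assumed layout (hypothetical): action[:-1] are motion deltas,
    action[-1] is the gripper command.
    """
    motion_norm = math.sqrt(sum(v * v for v in action[:-1]))
    if prev_action is None:
        # No action kept yet: only the motion part can be checked.
        return motion_norm < threshold
    return motion_norm < threshold and action[-1] == prev_action[-1]

def filter_noops(actions, threshold=1e-4):
    """Drop no-op steps; the reference action is the last one that was kept."""
    kept = []
    prev = None
    for action in actions:
        if is_noop(action, prev, threshold):
            continue
        kept.append(action)
        prev = action
    return kept

trajectory = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0],  # idle before moving: dropped
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0],  # arm moves: kept
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0],  # idle, gripper unchanged: dropped
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # gripper command changes: kept
]
print(len(filter_noops(trajectory)))
```

Keeping steps where only the gripper command changes matters because opening or closing the gripper involves no arm motion but is still a meaningful action.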
|
| 121 |
+
|
| 122 |
+
After filtering, you can choose between the `Pretokenize` or `NoPretokenize` training pipeline. The `Pretokenize` pipeline offers faster training speeds, while the `NoPretokenize` option eliminates the need for preprocessing.
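The trade-off between the two pipelines can be sketched with two toy dataset classes. This is only a conceptual illustration: `toy_tokenize` is a hypothetical stand-in for the real tokenizer, and the class names are not part of the repository.

```python
def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): maps each word to an integer id."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]

class PretokenizedDataset:
    """Tokenizes every sample once, up front; reads during training are lookups."""

    def __init__(self, samples, tokenize):
        self.tokens = [tokenize(s) for s in samples]

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, index):
        return self.tokens[index]

class OnTheFlyDataset:
    """Keeps raw samples and tokenizes on each access; no preprocessing pass."""

    def __init__(self, samples, tokenize):
        self.samples = samples
        self.tokenize = tokenize

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.tokenize(self.samples[index])

samples = ["pick up the bowl", "open the middle drawer"]
pre = PretokenizedDataset(samples, toy_tokenize)
lazy = OnTheFlyDataset(samples, toy_tokenize)
print(pre[0] == lazy[0])
```

Both classes return the same tokens; the difference is when the tokenization cost is paid, once before training or repeatedly during it.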
|
| 123 |
+
|
| 124 |
+
#### Step 0: Download the Chameleon weights
|
| 125 |
+
Download the Chameleon [tokenizer](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/chameleon/tokenizer), [base-model](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/base_model), and [starting point](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/chameleon/starting_point) weights, and place them under `rynnvla-002/ckpts/chameleon/tokenizer`, `rynnvla-002/ckpts/chameleon/base_model`, and `rynnvla-002/ckpts/starting_point`, respectively.
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
### Pipeline1: Pretokenize
|
| 129 |
+
|
| 130 |
+
#### Step 1: Libero Data Preparation
|
| 131 |
+
|
| 132 |
+
After filtering out no-operation actions, save all images and actions.
|
| 133 |
+
```bash
|
| 134 |
+
python regenerate_libero_dataset_save_img_action_state_wrist.py \
|
| 135 |
+
--libero_task_suite libero_goal \
|
| 136 |
+
--image_resolution 256 \
|
| 137 |
+
--raw_data_dir ../processed_data/libero_goal_no_noops_t_256 \
|
| 138 |
+
--save_dir ../processed_data/libero_goal_image_state_action_t_256
|
| 139 |
+
```
|
| 140 |
+
Next, generate the conversation data for the Chameleon model. The VLA model conversations use the following format:
|
| 141 |
+
```json
|
| 142 |
+
{
|
| 143 |
+
"conversations": [
|
| 144 |
+
{
|
| 145 |
+
"from": "human",
|
| 146 |
+
"value": "What action should the robot take to open the middle drawer of the cabinet?<|state|><|image|><|image|><|image|><|image|>"
|
| 147 |
+
},
|
| 148 |
+
{
|
| 149 |
+
"from": "gpt",
|
| 150 |
+
"value": "<|action|><|action|><|action|><|action|><|action|>"
|
| 151 |
+
}
|
| 152 |
+
],
|
| 153 |
+
"image": [
|
| 154 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_0.png",
|
| 155 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_0.png",
|
| 156 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_1.png",
|
| 157 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_1.png"
|
| 158 |
+
],
|
| 159 |
+
"action": [
|
| 160 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_1.npy",
|
| 161 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_2.npy",
|
| 162 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_3.npy",
|
| 163 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_4.npy",
|
| 164 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_5.npy"
|
| 165 |
+
],
|
| 166 |
+
"state": [
|
| 167 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/eef_gripper_state/eef_gripper_state_1.npy"
|
| 168 |
+
]
|
| 169 |
+
}
|
| 170 |
+
```
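To make the indexing in the example above explicit, the following sketch rebuilds the same record from a trajectory directory. The helper `build_vla_conversation` and its parameters are hypothetical; the index convention (history frames ending at `step`, actions and state starting at `step`) is inferred from the single example shown and may not match `action_state_model_conv_generation.py` in every case.

```python
def build_vla_conversation(base_dir, task, traj, step, his=2, len_action=5,
                           img_names=("imgs_third_view", "imgs_wrist")):
    """Build one record shaped like the example above (hypothetical helper).

    `step` is the index of the most recent observed frame. The record uses the
    `his` frames ending at `step` and the `len_action` actions starting at `step`.
    """
    traj_dir = f"{base_dir}/{task}/{traj}"
    frames = range(step - his + 1, step + 1)
    # For each frame, list the third-view image first, then the wrist image.
    images = [f"{traj_dir}/{name}/image_{i}.png" for i in frames for name in img_names]
    actions = [f"{traj_dir}/action/action_{i}.npy" for i in range(step, step + len_action)]
    states = [f"{traj_dir}/eef_gripper_state/eef_gripper_state_{step}.npy"]
    instruction = task.replace("_", " ")
    prompt = (f"What action should the robot take to {instruction}?"
              + "<|state|>" * len(states) + "<|image|>" * len(images))
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": "<|action|>" * len(actions)},
        ],
        "image": images,
        "action": actions,
        "state": states,
    }

record = build_vla_conversation(
    "../processed_data/libero_goal_image_state_action_t_256",
    "open_the_middle_drawer_of_the_cabinet", "trj_0", step=1)
print(record["conversations"][0]["value"])
```

With `his=2` and `len_action=5`, the record carries four image placeholders (two views for each of two frames), one state placeholder, and five action placeholders, matching the example.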
|
| 171 |
+
The world model conversations are in the following format:
|
| 172 |
+
```json
|
| 173 |
+
{
|
| 174 |
+
"conversations": [
|
| 175 |
+
{
|
| 176 |
+
"from": "human",
|
| 177 |
+
"value": "Generate the next image based on the provided sequence of historical images and corresponding actions.<|image|><|image|><|action|>"
|
| 178 |
+
},
|
| 179 |
+
{
|
| 180 |
+
"from": "gpt",
|
| 181 |
+
"value": "<|image|><|image|>"
|
| 182 |
+
}
|
| 183 |
+
],
|
| 184 |
+
"image": [
|
| 185 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_0.png",
|
| 186 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_0.png",
|
| 187 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_1.png",
|
| 188 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_1.png"
|
| 189 |
+
],
|
| 190 |
+
"action": [
|
| 191 |
+
"../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_0.npy"
|
| 192 |
+
]
|
| 193 |
+
},
|
| 194 |
+
```
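In both conversation formats shown above, the number of `<|image|>`, `<|action|>`, and `<|state|>` placeholders matches the number of files listed under the corresponding key. A small sanity check along those lines might look like the sketch below; the function names are hypothetical and the shortened file names are placeholders, not real dataset paths.

```python
def count_placeholders(record):
    """Count <|image|>, <|action|> and <|state|> tokens over all conversation turns."""
    text = "".join(turn["value"] for turn in record["conversations"])
    return {key: text.count(f"<|{key}|>") for key in ("image", "action", "state")}

def check_record(record):
    """Return a list of mismatches between placeholder counts and attached files."""
    problems = []
    for key, n_tokens in count_placeholders(record).items():
        n_files = len(record.get(key, []))
        if n_tokens != n_files:
            problems.append(f"{key}: {n_tokens} placeholders but {n_files} files")
    return problems

world_model_record = {
    "conversations": [
        {"from": "human",
         "value": "Generate the next image based on the provided sequence of "
                  "historical images and corresponding actions."
                  "<|image|><|image|><|action|>"},
        {"from": "gpt", "value": "<|image|><|image|>"},
    ],
    # Shortened, illustrative file names.
    "image": ["third_0.png", "wrist_0.png", "third_1.png", "wrist_1.png"],
    "action": ["action_0.npy"],
}
print(check_record(world_model_record))
```

Running such a check over the generated JSON files before tokenization can catch records whose file lists were truncated or misordered.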
|
| 195 |
+
To validate world model performance, we split the LIBERO dataset into train/val_ind/val_ood JSON files.
|
| 196 |
+
```bash
|
| 197 |
+
cd rynnvla-002/data
|
| 198 |
+
python action_state_model_conv_generation.py \
|
| 199 |
+
--base_dir ../processed_data/libero_goal_image_state_action_t_256 \
|
| 200 |
+
--his 2 \
|
| 201 |
+
--len_action 5 \
|
| 202 |
+
--task_name goal \
|
| 203 |
+
--resolution 256 \
|
| 204 |
+
--with_state \
|
| 205 |
+
--img_names imgs_third_view imgs_wrist \
|
| 206 |
+
--output_dir ../processed_data/convs
|
| 207 |
+
python world_model_bi_views_conv_generation.py \
|
| 208 |
+
--base_dir ../processed_data/libero_goal_image_state_action_t_256 \
|
| 209 |
+
--his 1 \
|
| 210 |
+
--task_name goal \
|
| 211 |
+
--resolution 256 \
|
| 212 |
+
--output_dir ../processed_data/convs
|
| 213 |
+
```
|
| 214 |
+
Finally, tokenize all conversations and save the resulting tokens.
|
| 215 |
+
```bash
|
| 216 |
+
cd rynnvla-002/data
|
| 217 |
+
python pretoken_state_action_model.py --task goal --resolution 256 --with_state --img_names imgs_third_view imgs_wrist --his 2 --len_action 5 --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
|
| 218 |
+
python pretoken_world_model.py --task goal --resolution 256 --img_name imgs_third_view imgs_wrist --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
|
| 219 |
+
bash concate_record_libero.sh
|
| 220 |
+
python concate_action_world_model_data_libero.py --source_dir_patterns libero_goal_his_2_{}_third_view_wrist_w_state_5_256 libero_goal_his_1_{}_third_view_wrist_a2i_256 --all_patterns libero_goal_his_2_third_view_wrist_w_state_5_256_abiw
|
| 221 |
+
```
|
| 222 |
+
|
| 223 |
+
#### Step 2: Prepare data configs
|
| 224 |
+
Set the correct data path in the config file `rynnvla-002/configs/libero_goal/his_2_third_view_wrist_w_state_5_256_pretokenize.yaml`.
|
| 225 |
+
|
| 226 |
+
#### Step 3: Start training
|
| 227 |
+
Now you can start training with your training scripts:
|
| 228 |
+
```bash
|
| 229 |
+
# Libero goal, 256 resolution
|
| 230 |
+
cd rynnvla-002/exps_pretokenize
|
| 231 |
+
bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
|
| 232 |
+
```
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
|
| 236 |
+
|
| 237 |
+
### Pipeline2: NoPretokenize
|
| 238 |
+
#### Step 1: Prepare data configs
|
| 239 |
+
Set the correct data path in the config file `rynnvla-002/configs/libero_goal/his_2_third_view_wrist_w_state_5_256_nopretokenize.yaml`.
|
| 240 |
+
|
| 241 |
+
#### Step 2: Start training
|
| 242 |
+
```bash
|
| 243 |
+
# Libero goal, 256 resolution
|
| 244 |
+
cd rynnvla-002/exps_nopretokenize
|
| 245 |
+
bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
|
| 246 |
+
```
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
## ✅ VLA Model Evaluation on LIBERO
|
| 250 |
+
### Step 1: Prepare evaluation scripts
|
| 251 |
+
Set `checkpoint_path` in the bash files under `rynnvla-002/evals_libero/` to the model path. You can download our trained checkpoints from the Model Zoo or train your own.
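One possible way to fetch a single Model Zoo folder is `snapshot_download` from the `huggingface_hub` package, restricted with `allow_patterns`. The sketch below is an assumption about workflow, not an official instruction: the `ckpts` target directory and the helper names are hypothetical, while the repository id and folder names come from the Model Zoo links.

```python
def checkpoint_pattern(model_dir, suite):
    """Glob pattern selecting one checkpoint folder of the model repository."""
    return f"{model_dir}/{suite}/*"

def download_checkpoint(model_dir="VLA_model_256", suite="libero_goal", local_dir="ckpts"):
    """Download one Model Zoo folder and return the local snapshot path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="Alibaba-DAMO-Academy/RynnVLA-002",
        allow_patterns=[checkpoint_pattern(model_dir, suite)],
        local_dir=local_dir,  # hypothetical target directory
    )

if __name__ == "__main__":
    print(checkpoint_pattern("VLA_model_256", "libero_goal"))
    # Uncomment to fetch the weights (needs network access and disk space):
    # print(download_checkpoint())
```

With these defaults the files would land under `ckpts/VLA_model_256/libero_goal`, which is then the value to use for `checkpoint_path`.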
|
| 252 |
+
|
| 253 |
+
### Step 2: Start evaluation
|
| 254 |
+
```bash
|
| 255 |
+
# Libero goal, 256 resolution, continuous
|
| 256 |
+
cd rynnvla-002/evals_libero
|
| 257 |
+
bash eval_libero_goal_his_2_third_view_wrist_w_state_5_256_abiw_continous.sh
|
| 258 |
+
# Libero goal, 256 resolution, discrete
|
| 259 |
+
cd rynnvla-002/evals_libero
|
| 260 |
+
bash eval_libero_goal_his_2_third_view_wrist_w_state_5_256_abiw_discrete.sh
|
| 261 |
+
```
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
## 🗝️ World Model Training on LIBERO
|
| 265 |
+
For world model training, we set image_resolution to 512.
|
| 266 |
+
### Pipeline1: Pretokenize
|
| 267 |
+
#### Step 1: Libero Data Preparation
|
| 268 |
+
|
| 269 |
+
Preprocess the dataset as described above, ensuring the `image_resolution` is set to 512.
|
| 270 |
+
Finally, run the following command to concatenate tokens:
|
| 271 |
+
```bash
|
| 272 |
+
python concate_action_world_model_data_libero.py --source_dir_patterns libero_goal_his_1_train_third_view_wrist_a2i_512 --all_patterns libero_goal_his_1_train_third_view_wrist_a2i_512
|
| 273 |
+
```
|
| 274 |
+
|
| 275 |
+
|
| 276 |
+
#### Step 2: Prepare data configs
|
| 277 |
+
Set the correct data path in the config file `rynnvla-002/configs/libero_goal/his_1_third_view_wrist_512_only_worldmodel_pretokenize.yaml`.
|
| 278 |
+
|
| 279 |
+
#### Step 3: Start Training
|
| 280 |
+
```bash
|
| 281 |
+
cd rynnvla-002/exps_libero_world_model
|
| 282 |
+
bash libero_goal_his_1_third_view_wrist_512_pretokenize.sh
|
| 283 |
+
```
|
| 284 |
+
|
| 285 |
+
### Pipeline2: NoPretokenize
|
| 286 |
+
First, set the correct data path in the config file `rynnvla-002/configs/libero_goal/his_1_train_third_view_wrist_512_only_worldmodel_nopretokenize.yaml`.
|
| 287 |
+
|
| 288 |
+
Then, start training:
|
| 289 |
+
```bash
|
| 290 |
+
cd rynnvla-002/exps_libero_world_model
|
| 291 |
+
bash libero_goal_his_1_third_view_wrist_512_nopretokenize.sh
|
| 292 |
+
```
|
| 293 |
+
|
| 294 |
+
## ✅ World Model Evaluation on LIBERO
|
| 295 |
+
We evaluate world model performance on the validation set, whose trajectory paths are stored in `rynnvla-002/exps_libero_world_model/goal_val_ind_trajectory_paths.json`. If those paths differ from yours, use `rynnvla-002/exps_libero_world_model/extract_world_model_val_ind_trj.py` to generate a new file. Then run the evaluation:
|
| 296 |
+
```bash
|
| 297 |
+
cd rynnvla-002/exps_libero_world_model
|
| 298 |
+
bash eval_world_model_goal.sh
|
| 299 |
+
```
|
| 300 |
+
Then calculate the generation performance of the world model and the action world model:
|
| 301 |
+
```bash
|
| 302 |
+
python calculate_world_model_performance.py \
|
| 303 |
+
--folder_world_model "" \
|
| 304 |
+
--folder_action_world_model ""
|
| 305 |
+
```
|
| 306 |
+
|
| 307 |
+
## 🗝️ Training on LeRobot
|
| 308 |
+
|
| 309 |
+
#### Step 1: Lerobot to HDF5
|
| 310 |
+
|
| 311 |
+
We use HDF5-format data. If you collected data in LeRobot format, convert it to HDF5 with the following command:
|
| 312 |
+
```bash
|
| 313 |
+
cd rynnvla-002/data_lerobot
|
| 314 |
+
python lerobot_to_hdf5.py \
|
| 315 |
+
--lerobot_input_dir {lerobot_input_dir} \
|
| 316 |
+
--hdf5_output_dir {hdf5_output_dir}
|
| 317 |
+
```
|
| 318 |
+
|
| 319 |
+
#### Step 2: HDF5 to raw data
|
| 320 |
+
List all HDF5 files in a JSON file; see `rynnvla-002/data_lerobot/modified_data_final.json` for an example. Then extract the raw front camera data, wrist camera data, state data, and action data, and save them:
|
| 321 |
+
```bash
|
| 322 |
+
cd rynnvla-002/data_lerobot
|
| 323 |
+
python extract_all_data.py \
|
| 324 |
+
--json_path {json_path} \
|
| 325 |
+
--output_dir {raw_data_output_dir} \
|
| 326 |
+
--num_processes {num_processes to accelerate}
|
| 327 |
+
```
|
| 328 |
+
|
| 329 |
+
#### Step 3: Generate conversation files
|
| 330 |
+
Generate the VLA model conversation file and world model conversation file:
|
| 331 |
+
```bash
|
| 332 |
+
cd rynnvla-002/lerobot_util
|
| 333 |
+
python action_model_conv_generation_w_2_abs_state_all_data.py --input_dir {raw_data_output_dir} --his 1 --len_action 20 --task_name vla_data --output_dir {conv_output_dir}
|
| 334 |
+
python world_model_conv_generation_w_2_abs_front_all_data.py --input_dir {raw_data_output_dir} --his 1 --task_name world_model_data --output_dir {conv_output_dir}
|
| 335 |
+
python world_model_conv_generation_w_2_abs_wrist_all_data.py --input_dir {raw_data_output_dir} --his 1 --task_name world_model_data --output_dir {conv_output_dir}
|
| 336 |
+
```
|
| 337 |
+
|
| 338 |
+
#### Step 4: Tokenize raw data based on conversation files
|
| 339 |
+
First, calculate the min and max value of action data and state data:
|
| 340 |
+
```bash
|
| 341 |
+
cd rynnvla-002/data_lerobot
|
| 342 |
+
python calculate_min_max_all_data_state.py {raw_data_output_dir}
|
| 343 |
+
python calculate_min_max_all_data_action.py {raw_data_output_dir}
|
| 344 |
+
```
|
| 345 |
+
Put the results at the beginning of `rynnvla-002/data_lerobot/item_processor.py`.
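Per-dimension minimum and maximum values are typically used to rescale actions and states to a fixed range before discretizing or regressing them. The sketch below shows one common form, a linear map to [-1, 1]. The target range and the function names are assumptions; how `item_processor.py` actually consumes these statistics is not shown here.

```python
def min_max(vectors):
    """Per-dimension minimum and maximum over a list of equal-length vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]

def normalize(vector, lows, highs):
    """Map each dimension linearly from [low, high] to [-1, 1].

    Dimensions with no variation (low == high) map to 0 to avoid division by zero.
    """
    out = []
    for v, lo, hi in zip(vector, lows, highs):
        out.append(0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0)
    return out

# Toy two-dimensional "actions" for illustration only.
actions = [[0.0, -2.0], [0.5, 0.0], [1.0, 2.0]]
lows, highs = min_max(actions)
print(lows, highs)
print(normalize([0.5, 1.0], lows, highs))
```

Because the statistics are computed once over the whole dataset, they need to be recomputed whenever the collected data changes.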
|
| 346 |
+
Then, tokenize all training data and concatenate the records:
|
| 347 |
+
```bash
|
| 348 |
+
python pretoken_lerobot_state.py \
|
| 349 |
+
--input_file {conv_output_dir}/libero_vla_data_his_1_train_img_state_abs_ck_1_256.json \
|
| 350 |
+
--output_dir {raw_data_output_dir}/tokens/vla_data \
|
| 351 |
+
--resolution 256 \
|
| 352 |
+
--tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
|
| 353 |
+
python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/vla_data --save_path {raw_data_output_dir}/tokens/vla_data/record.json
|
| 354 |
+
python pretoken_lerobot.py \
|
| 355 |
+
--input_file {conv_output_dir}/libero_world_model_data_his_1_train_a2i_512_abs_front_all_data.json \
|
| 356 |
+
--output_dir {raw_data_output_dir}/tokens/world_model_data_front \
|
| 357 |
+
--resolution 256 \
|
| 358 |
+
--tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
|
| 359 |
+
python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/world_model_data_front --save_path {raw_data_output_dir}/tokens/world_model_data_front/record.json
|
| 360 |
+
python pretoken_lerobot.py \
|
| 361 |
+
--input_file {conv_output_dir}/libero_world_model_data_his_1_train_a2i_512_abs_wrist_all_data.json \
|
| 362 |
+
--output_dir {raw_data_output_dir}/tokens/world_model_data_wrist \
|
| 363 |
+
--resolution 256 \
|
| 364 |
+
--tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
|
| 365 |
+
python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/world_model_data_wrist --save_path {raw_data_output_dir}/tokens/world_model_data_wrist/record.json
|
| 366 |
+
python concate_multi_record.py \
|
| 367 |
+
--input_files {raw_data_output_dir}/tokens/vla_data/record.json {raw_data_output_dir}/tokens/world_model_data_front/record.json {raw_data_output_dir}/tokens/world_model_data_wrist/record.json \
|
| 368 |
+
--output_file {raw_data_output_dir}/concate_tokens/lerobot_all.json
|
| 369 |
+
```
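Conceptually, the last command above merges several record files into one training index. The sketch below illustrates that idea under the assumption that each `record.json` holds a JSON list of entries; the real record format and the behavior of `concate_multi_record.py` are not shown in this document, and `merge_records` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path

def merge_records(input_files, output_file):
    """Concatenate JSON lists from `input_files` into one list in `output_file`."""
    merged = []
    for path in input_files:
        with open(path, "r", encoding="utf-8") as f:
            entries = json.load(f)
        if not isinstance(entries, list):
            raise ValueError(f"{path} does not contain a JSON list")
        merged.extend(entries)
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f)
    return len(merged)

# Self-contained demo using temporary files instead of real token records.
with tempfile.TemporaryDirectory() as tmp:
    vla = Path(tmp) / "vla.json"
    world = Path(tmp) / "world.json"
    vla.write_text(json.dumps([{"id": 0}, {"id": 1}]))
    world.write_text(json.dumps([{"id": 2}]))
    total = merge_records([vla, world], Path(tmp) / "concate_tokens" / "all.json")
    print(total)
```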
|
| 370 |
+
|
| 371 |
+
#### Step 5: Prepare data configs
|
| 372 |
+
Set the correct data path in the config file `rynnvla-002/configs/lerobot/his_1_third_view_wrist_w_state_20_256_pretokenize.yaml`.
|
| 373 |
+
|
| 374 |
+
#### Step 6: Start training
|
| 375 |
+
Now you can start training with your training scripts:
|
| 376 |
+
```bash
|
| 377 |
+
cd rynnvla-002/exps_pretokenize
|
| 378 |
+
bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
|
| 379 |
+
```
|
| 380 |
+
|
| 381 |
+
## ✅ Inference using LeRobot
|
| 382 |
+
We provide the action generation function in `rynnvla-002/eval_solver_lerobot_action_head_state.py` and the initialization script in `rynnvla-002/evals_lerobot/eval_7B_lerobot_action_head.sh`.
|
| 383 |
+
|
| 384 |
+
|
| 385 |
## License <a name="license"></a>
|
| 386 |
|
| 387 |
All assets and code are under the [Apache 2.0 license](./LICENSE) unless specified otherwise.
|
|
|
|
| 389 |
## Citation <a name="citation"></a>
|
| 390 |
If you find the project helpful for your research, please consider citing our paper:
|
| 391 |
```bibtex
|
| 392 |
+
@article{cen2025rynnvla,
|
| 393 |
+
title={RynnVLA-002: A Unified Vision-Language-Action and World Model},
|
| 394 |
+
author={Cen, Jun and Huang, Siteng and Yuan, Yuqian and Yuan, Hangjie and Yu, Chaohui and Jiang, Yuming and Guo, Jiayan and Li, Kehan and Luo, Hao and Wang, Fan and Li, Xin and Zhao, Deli and Chen, Hao},
|
| 395 |
+
journal={arXiv preprint arXiv:2511.17502},
|
| 396 |
year={2025}
|
| 397 |
}
|
| 398 |
```
|
|
|
|
| 402 |
<!-- may -->
|
| 403 |
> [**RynnVLA-001: A Vision-Language-Action Model Boosted by Generative Priors**](https://github.com/alibaba-damo-academy/RynnVLA-001) <br>
|
| 404 |
> Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Jiayan Guo, Kexiang Wang, Kehan Li, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li <br>
|
| 405 |
+
[](https://github.com/alibaba-damo-academy/RynnVLA-001) [](https://github.com/alibaba-damo-academy/RynnVLA-001) [](https://arxiv.org/abs/2509.15212)<br>
|
| 406 |
|
| 407 |
> [**RynnEC: Bringing MLLMs into Embodied World**](https://github.com/alibaba-damo-academy/RynnEC) <br>
|
| 408 |
> Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li <br>
|
|
|
|
| 415 |
</p></details>
|
| 416 |
|
| 417 |
## Acknowledgment <a name="acknowledgment"></a>
|
| 418 |
+
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.
|