Safetensors

Improve model card: Add metadata and update links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +338 -12
README.md CHANGED
@@ -1,19 +1,30 @@
1
  <p align="center">
2
  <img src="rynnvla-002/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
3
  <p>
4
 
5
- <h3 align="center"><a href="" style="color:#9C276A">
6
  RynnVLA-002: A Unified Vision-Language-Action and World Model</a></h3>
7
- <h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h2>
8
 
9
 
10
  <h5 align="center">
11
 
12
- [![arXiv](https://img.shields.io/badge/Arxiv-2501.13106-AD1C18.svg?logo=arXiv)]()
13
- [![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002)
14
- [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)
15
  </h5>
16
 
17
 
18
  ## 🌟 Introduction
19
  RynnVLA-002 is an autoregressive action world model that unifies action and image understanding and generation. RynnVLA-002 integrates a Vision-Language-Action (VLA) model (action model) and a world model in a single framework. Compared to WorldVLA, RynnVLA-002 adds a continuous Action Transformer, wrist camera input and generation, and state input. RynnVLA-002 achieves a 97.4% success rate on the LIBERO benchmark.
@@ -23,7 +34,39 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
23
  </div>
24
  <br>
25
 
26
- ## Model Zoo
27
  ### VLA Model (256 * 256)
28
 
29
  | Model | HF Link | Continuous Action SR (%) | Discrete Action SR (%) |
@@ -32,6 +75,7 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
32
  | LIBERO-Object | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_object) | 99.8 | 96.8 |
33
  | LIBERO-Goal | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_goal) | 96.4 | 94.6 |
34
  | LIBERO-Long | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_10) | 94.4 | 87.6 |
 
35
 
36
  ### World Model (512 * 512)
37
 
@@ -56,6 +100,288 @@ RynnVLA-002 is an autoregressive action world model that unifies action and imag
56
  | Action World Model | [Alibaba-DAMO-Academy/RynnVLA-002/Action_World_model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/Action_World_model_512/libero_10) | 427.86 | 19.36 | 72.19 | 27.78 |
57
 
58
 
59
  ## License <a name="license"></a>
60
 
61
  All assets and code are under the [Apache 2.0 license](./LICENSE) unless specified otherwise.
@@ -63,10 +389,10 @@ All assets and code are under the [Apache 2.0 license](./LICENSE) unless specifi
63
  ## Citation <a name="citation"></a>
64
  If you find the project helpful for your research, please consider citing our paper:
65
  ```bibtex
66
- @article{cen2025WorldVLA,
67
- title={WorldVLA: Towards Autoregressive Action World Model},
68
- author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and Zhao, Deli and Chen, Hao},
69
- journal={arXiv preprint arXiv:},
70
  year={2025}
71
  }
72
  ```
@@ -76,7 +402,7 @@ If you find the project helpful for your research, please consider citing our pa
76
  <!-- may -->
77
  > [**RynnVLA-001: A Vision-Language-Action Model Boosted by Generative Priors**](https://github.com/alibaba-damo-academy/RynnVLA-001) <br>
78
  > Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Jiayan Guo, Kexiang Wang, Kehan Li, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li <br>
79
- [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/alibaba-damo-academy/RynnVLA-001) [![github](https://img.shields.io/github/stars/alibaba-damo-academy/RynnVLA-001.svg?style=social)](https://github.com/alibaba-damo-academy/RynnVLA-001) [![arXiv](https://img.shields.io/badge/Arxiv-2508.14160-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2509.15212)<be>
80
 
81
  > [**RynnEC: Bringing MLLMs into Embodied World**](https://github.com/alibaba-damo-academy/RynnEC) <br>
82
  > Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li <br>
@@ -89,4 +415,4 @@ If you find the project helpful for your research, please consider citing our pa
89
  </p></details>
90
 
91
  ## Acknowledgment <a name="acknowledgment"></a>
92
- This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chemeleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.
 
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: robotics
4
+ library_name: transformers
5
+ ---
6
+
7
  <p align="center">
8
  <img src="rynnvla-002/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
9
  <p>
10
 
11
+ <h3 align="center"><a href="https://huggingface.co/papers/2511.17502" style="color:#9C276A">
12
  RynnVLA-002: A Unified Vision-Language-Action and World Model</a></h3>
13
+ <h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>
14
 
15
 
16
  <h5 align="center">
17
 
18
+ [📚 Paper](https://huggingface.co/papers/2511.17502) - [💻 Code](https://github.com/alibaba-damo-academy/RynnVLA-002) - [🏠 Project Page](https://rynnvla.github.io) - [![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002) - [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](./LICENSE)
 
 
19
  </h5>
20
 
21
+ <div align="center"><video src="https://github.com/user-attachments/assets/a09f6b8b-7707-4478-b069-2de1629c0b83" width="800" autoplay loop muted></video></div>
22
+
23
+ ## 📰 News
24
+
25
+ * **[2025.11.10]** Upgrade WorldVLA to RynnVLA-002. Release models, training code and evaluation code on LIBERO simulation benchmark and real-world LeRobot experiments.
26
+ * **[2025.06.23]** Release models, training code and evaluation code on LIBERO action generation benchmark of WorldVLA.
27
+
28
 
29
  ## 🌟 Introduction
30
  RynnVLA-002 is an autoregressive action world model that unifies action and image understanding and generation. RynnVLA-002 integrates a Vision-Language-Action (VLA) model (action model) and a world model in a single framework. Compared to WorldVLA, RynnVLA-002 adds a continuous Action Transformer, wrist camera input and generation, and state input. RynnVLA-002 achieves a 97.4% success rate on the LIBERO benchmark.
 
34
  </div>
35
  <br>
36
 
37
+ ### VLA Model Results (Text + Image -> Action)
38
+ The VLA model generates actions given a text instruction and image observations.
39
+
40
+ | | | |
41
+ | :-----: | :-----: | :-----: |
42
+ | ![Open drawer](rynnvla-002/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif) | ![Pick up soup](rynnvla-002/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif) | ![Pick up bowl](rynnvla-002/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif) |
43
+ <br>
44
+
45
+
46
+
47
+ ### World Model Results (Action + Image -> Image)
48
+ The world model generates the next frame given the current frame and the action control.
49
+
50
+ | | | |
51
+ | :-----: | :-----: | :-----: |
52
+ | ![Pick up the black bowl and place it on the plate (front view)](rynnvla-002/assets/pickuptheblackbowlandplaceitontheplate_front.gif) | ![Put the cream cheese box in the basket (front view)](rynnvla-002/assets/put_the_cream_cheese_box_in_the_basket_front.gif) | ![Put the bowl on top of the cabinet (front view)](rynnvla-002/assets/putthebowlontopofthecabinet_front.gif) |
53
+ | ![Pick up the black bowl and place it on the plate (wrist view)](rynnvla-002/assets/pickuptheblackbowlandplaceitontheplate_wrist.gif) | ![Put the cream cheese box in the basket (wrist view)](rynnvla-002/assets/put_the_cream_cheese_box_in_the_basket_wrist.gif) | ![Put the bowl on top of the cabinet (wrist view)](rynnvla-002/assets/putthebowlontopofthecabinet_wrist.gif) |
54
+ <br>
55
+
56
+
57
+ ## 🛠️ Requirements and Installation
58
+ ```
59
+ git clone https://github.com/alibaba-damo-academy/RynnVLA-002.git
60
+ cd RynnVLA-002
61
+ pip install -r requirements.txt
62
+ pip install flash-attn --no-build-isolation
63
+ pip install -e .
64
+ git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
65
+ cd LIBERO
66
+ pip install -e .
67
+ ```
68
+
69
+ ## :earth_americas: Model Zoo
70
  ### VLA Model (256 * 256)
71
 
72
  | Model | HF Link | Continuous Action SR (%) | Discrete Action SR (%) |
 
75
  | LIBERO-Object | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_object) | 99.8 | 96.8 |
76
  | LIBERO-Goal | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_goal) | 96.4 | 94.6 |
77
  | LIBERO-Long | [Alibaba-DAMO-Academy/RynnVLA-002/VLA_model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/VLA_model_256/libero_10) | 94.4 | 87.6 |
78
+ <br>
79
 
80
  ### World Model (512 * 512)
81
 
 
100
  | Action World Model | [Alibaba-DAMO-Academy/RynnVLA-002/Action_World_model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/RynnVLA-002/tree/main/Action_World_model_512/libero_10) | 427.86 | 19.36 | 72.19 | 27.78 |
101
 
102
 
103
+ ## 🗝️ VLA Model Training on LIBERO
104
+ We evaluate on four task suites of the LIBERO benchmark: [spatial, object, goal, 10]. Here we take LIBERO goal at 256 resolution as an example.
105
+
106
+ We offer two types of training pipelines:
107
+
108
+ - `Pretokenize`: This pipeline preprocesses all the training data by tokenizing it into tokens before the training begins.
109
+ - `NoPretokenize`: This pipeline performs tokenization dynamically during the training process.
110
+
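The trade-off between the two pipelines can be illustrated with a toy sketch. This is not the repository's implementation: `toy_tokenize`, `PretokenizedDataset`, and `OnTheFlyDataset` are hypothetical names, and whitespace splitting stands in for the real text, image, and action tokenizer.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(sample: str) -> List[str]:
    # Stand-in tokenizer (assumption): the real pipeline tokenizes
    # text, images, and actions, which is far more expensive.
    return sample.split()


class PretokenizedDataset:
    """Tokenizes every sample once, before training starts."""

    def __init__(self, samples: List[str], tokenize: Tokenizer) -> None:
        # All tokenization cost is paid here, up front.
        self.tokens = [tokenize(s) for s in samples]

    def __len__(self) -> int:
        return len(self.tokens)

    def __getitem__(self, idx: int) -> List[str]:
        # Lookup only: no tokenization during training.
        return self.tokens[idx]


class OnTheFlyDataset:
    """Keeps raw samples and tokenizes each one when it is requested."""

    def __init__(self, samples: List[str], tokenize: Tokenizer) -> None:
        self.samples = samples
        self.tokenize = tokenize

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> List[str]:
        # Tokenization cost is paid on every access, but no
        # preprocessing pass or token storage is needed.
        return self.tokenize(self.samples[idx])
```

Both classes return the same tokens for the same index; they differ only in when the tokenization cost is paid.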
111
+ Both pipelines begin by filtering out no-operation actions, following [OpenVLA](https://github.com/openvla/openvla).
112
+
113
+ ```bash
114
+ cd rynnvla-002/libero_util
115
+ python regenerate_libero_dataset_filter_no_op.py \
116
+ --libero_task_suite libero_goal \
117
+ --libero_raw_data_dir ../processed_data/Libero/libero_goal \
118
+ --libero_target_dir ../processed_data/libero_goal_no_noops_t_256 \
119
+ --image_resolution 256
120
+ ```
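As a rough illustration of what no-operation filtering means, the sketch below drops actions whose motion components are near zero and whose gripper command is unchanged from the last kept action. The criterion, the `threshold` value, and the names `is_noop` and `filter_noops` are assumptions modeled on OpenVLA-style filtering; the actual logic lives in `regenerate_libero_dataset_filter_no_op.py` and may differ.

```python
import math
from typing import List, Optional, Sequence


def is_noop(
    action: Sequence[float],
    prev_action: Optional[Sequence[float]] = None,
    threshold: float = 1e-4,
) -> bool:
    """Return True if the action barely moves the arm and keeps the gripper as-is.

    Assumed layout: all entries except the last are motion deltas,
    and the last entry is the gripper command.
    """
    motion_norm = math.sqrt(sum(x * x for x in action[:-1]))
    if prev_action is None:
        # No previous action to compare the gripper command against.
        return motion_norm < threshold
    return motion_norm < threshold and action[-1] == prev_action[-1]


def filter_noops(actions: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only actions that are not no-ops relative to the last kept action."""
    kept: List[Sequence[float]] = []
    for action in actions:
        prev = kept[-1] if kept else None
        if not is_noop(action, prev):
            kept.append(action)
    return kept
```

For example, a zero-motion step that flips the gripper command is kept, because the gripper state changes even though the arm does not move.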
121
+
122
+ After filtering, you can choose between the `Pretokenize` or `NoPretokenize` training pipeline. The `Pretokenize` pipeline offers faster training speeds, while the `NoPretokenize` option eliminates the need for preprocessing.
123
+
124
+ #### Step 0: Download the Chameleon weights
125
+ Download the Chameleon [tokenizer](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/chameleon/tokenizer), [base-model](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/base_model) and [starting point](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/chameleon/starting_point) weights, and put them under `rynnvla-002/ckpts/chameleon/tokenizer`, `rynnvla-002/ckpts/chameleon/base_model`, and `rynnvla-002/ckpts/starting_point`, respectively.
126
+
127
+
128
+ ### Pipeline1: Pretokenize
129
+
130
+ #### Step 1: Libero Data Preparation
131
+
132
+ After filtering out no-operation actions, save all images and actions.
133
+ ```bash
134
+ python regenerate_libero_dataset_save_img_action_state_wrist.py \
135
+ --libero_task_suite libero_goal \
136
+ --image_resolution 256 \
137
+ --raw_data_dir ../processed_data/libero_goal_no_noops_t_256 \
138
+ --save_dir ../processed_data/libero_goal_image_state_action_t_256
139
+ ```
140
+ Next, generate the conversation data for the Chameleon model. The VLA model conversations are in the following format:
141
+ ```json
142
+ {
143
+ "conversations": [
144
+ {
145
+ "from": "human",
146
+ "value": "What action should the robot take to open the middle drawer of the cabinet?<|state|><|image|><|image|><|image|><|image|>"
147
+ },
148
+ {
149
+ "from": "gpt",
150
+ "value": "<|action|><|action|><|action|><|action|><|action|>"
151
+ }
152
+ ],
153
+ "image": [
154
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_0.png",
155
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_0.png",
156
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_1.png",
157
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_1.png"
158
+ ],
159
+ "action": [
160
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_1.npy",
161
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_2.npy",
162
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_3.npy",
163
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_4.npy",
164
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_5.npy"
165
+ ],
166
+ "state": [
167
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/eef_gripper_state/eef_gripper_state_1.npy"
168
+ ]
169
+ }
170
+ ```
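The indexing in the record above can be reproduced with a short sketch. The function `build_vla_record` and its parameter `t` (the current step) are hypothetical, and the indexing rule is inferred from the single example shown: `his` history frames ending at step `t`, each with a third-view and a wrist image, one state at step `t`, and `len_action` actions starting at step `t`. The repository's `action_state_model_conv_generation.py` is the authoritative source.

```python
from typing import Dict, List, Sequence


def build_vla_record(
    traj_dir: str,
    task: str,
    t: int,
    his: int = 2,
    len_action: int = 5,
    img_names: Sequence[str] = ("imgs_third_view", "imgs_wrist"),
) -> Dict[str, object]:
    """Build one VLA conversation record for step t (inferred layout, not official)."""
    # History frames t-his+1 .. t, with every camera view per frame.
    frames = range(t - his + 1, t + 1)
    images: List[str] = [
        f"{traj_dir}/{view}/image_{i}.png" for i in frames for view in img_names
    ]
    # Action chunk of length len_action starting at the current step.
    actions = [f"{traj_dir}/action/action_{i}.npy" for i in range(t, t + len_action)]
    # A single proprioceptive state at the current step.
    states = [f"{traj_dir}/eef_gripper_state/eef_gripper_state_{t}.npy"]

    prompt = (
        f"What action should the robot take to {task}?"
        + "<|state|>"
        + "<|image|>" * len(images)
    )
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": "<|action|>" * len(actions)},
        ],
        "image": images,
        "action": actions,
        "state": states,
    }
```

Calling it with `t=1`, `his=2`, and `len_action=5` on the `trj_0` directory yields the same image, action, and state paths as the JSON example.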
171
+ The world model conversations are in the following format:
172
+ ```json
173
+ {
174
+ "conversations": [
175
+ {
176
+ "from": "human",
177
+ "value": "Generate the next image based on the provided sequence of historical images and corresponding actions.<|image|><|image|><|action|>"
178
+ },
179
+ {
180
+ "from": "gpt",
181
+ "value": "<|image|><|image|>"
182
+ }
183
+ ],
184
+ "image": [
185
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_0.png",
186
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_0.png",
187
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_third_view/image_1.png",
188
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/imgs_wrist/image_1.png"
189
+ ],
190
+ "action": [
191
+ "../processed_data/libero_goal_image_state_action_t_256/open_the_middle_drawer_of_the_cabinet/trj_0/action/action_0.npy"
192
+ ]
193
+ }
194
+ ```
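A matching sketch for the world model record follows. It covers only the `his = 1` case shown in the example, and `build_world_model_record` is a hypothetical helper: the current frame (both views) and the action at step `t` form the prompt, and the next frame (both views) is the target. How longer histories interleave images and actions is not specified here, so the sketch does not attempt it.

```python
from typing import Dict, Sequence

PROMPT = (
    "Generate the next image based on the provided sequence of "
    "historical images and corresponding actions."
)


def build_world_model_record(
    traj_dir: str,
    t: int,
    img_names: Sequence[str] = ("imgs_third_view", "imgs_wrist"),
) -> Dict[str, object]:
    """Build one world-model record for step t with a single history frame."""
    # Input: every camera view of the current frame.
    current = [f"{traj_dir}/{view}/image_{t}.png" for view in img_names]
    # Target: every camera view of the next frame.
    following = [f"{traj_dir}/{view}/image_{t + 1}.png" for view in img_names]

    return {
        "conversations": [
            {
                "from": "human",
                "value": PROMPT + "<|image|>" * len(current) + "<|action|>",
            },
            {"from": "gpt", "value": "<|image|>" * len(following)},
        ],
        # Input images first, then target images, as in the example record.
        "image": current + following,
        "action": [f"{traj_dir}/action/action_{t}.npy"],
    }
```

With `t=0` on the `trj_0` directory, the image and action paths match the JSON example.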
195
+ To validate world model performance, we split the LIBERO dataset into train/val_ind/val_ood JSON files.
196
+ ```bash
197
+ cd rynnvla-002/data
198
+ python action_state_model_conv_generation.py \
199
+ --base_dir ../processed_data/libero_goal_image_state_action_t_256 \
200
+ --his 2 \
201
+ --len_action 5 \
202
+ --task_name goal \
203
+ --resolution 256 \
204
+ --with_state \
205
+ --img_names imgs_third_view imgs_wrist \
206
+ --output_dir ../processed_data/convs
207
+ python world_model_bi_views_conv_generation.py \
208
+ --base_dir ../processed_data/libero_goal_image_state_action_t_256 \
209
+ --his 1 \
210
+ --task_name goal \
211
+ --resolution 256 \
212
+ --output_dir ../processed_data/convs
213
+ ```
214
+ Finally, tokenize all the conversations and save the tokens.
215
+ ```bash
216
+ cd rynnvla-002/data
217
+ python pretoken_state_action_model.py --task goal --resolution 256 --with_state --img_names imgs_third_view imgs_wrist --his 2 --len_action 5 --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
218
+ python pretoken_world_model.py --task goal --resolution 256 --img_name imgs_third_view imgs_wrist --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
219
+ bash concate_record_libero.sh
220
+ python concate_action_world_model_data_libero.py --source_dir_patterns libero_goal_his_2_{}_third_view_wrist_w_state_5_256 libero_goal_his_1_{}_third_view_wrist_a2i_256 --all_patterns libero_goal_his_2_third_view_wrist_w_state_5_256_abiw
221
+ ```
222
+
223
+ #### Step 2: Prepare data configs
224
+ Set the correct data path in the config files in `rynnvla-002/configs/libero_goal/his_2_third_view_wrist_w_state_5_256_pretokenize.yaml`.
225
+
226
+ #### Step 3: Start training
227
+ Now you can start training with your training scripts:
228
+ ```bash
229
+ # Libero goal, 256 resolution
230
+ cd rynnvla-002/exps_pretokenize
231
+ bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
232
+ ```
233
+
234
+
235
+
236
+
237
+ ### Pipeline2: NoPretokenize
238
+ #### Step 1: Prepare data configs
239
+ Set the correct data path in the config files in `rynnvla-002/configs/libero_goal/his_2_third_view_wrist_w_state_5_256_nopretokenize.yaml`.
240
+
241
+ #### Step 2: Start training
242
+ ```bash
243
+ # Libero goal, 256 resolution
244
+ cd rynnvla-002/exps_nopretokenize
245
+ bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
246
+ ```
247
+
248
+
249
+ ## ✅ VLA Model Evaluation on LIBERO
250
+ ### Step 1: Prepare evaluation scripts
251
+ Set the `checkpoint_path` in the bash files in `rynnvla-002/evals_libero/` to the model path. You can download our trained models from the Model Zoo or train your own.
252
+
253
+ ### Step 2: Start evaluation
254
+ ```bash
255
+ # Libero goal, 256 resolution, continuous
256
+ cd rynnvla-002/evals_libero
257
+ bash eval_libero_goal_his_2_third_view_wrist_w_state_5_256_abiw_continous.sh
258
+ # Libero goal, 256 resolution, discrete
259
+ cd rynnvla-002/evals_libero
260
+ bash eval_libero_goal_his_2_third_view_wrist_w_state_5_256_abiw_discrete.sh
261
+ ```
262
+
263
+
264
+ ## 🗝️ World Model Training on LIBERO
265
+ For world model training, we set `image_resolution` to 512.
266
+ ### Pipeline1: Pretokenize
267
+ #### Step 1: Libero Data Preparation
268
+
269
+ Preprocess the dataset as described above, ensuring the `image_resolution` is set to 512.
270
+ Finally, run the following command to concatenate tokens:
271
+ ```bash
272
+ python concate_action_world_model_data_libero.py --source_dir_patterns libero_goal_his_1_train_third_view_wrist_a2i_512 --all_patterns libero_goal_his_1_train_third_view_wrist_a2i_512
273
+ ```
274
+
275
+
276
+ #### Step 2: Prepare data configs
277
+ Set the correct data path in the config files in `rynnvla-002/configs/libero_goal/his_1_third_view_wrist_512_only_worldmodel_pretokenize.yaml`.
278
+
279
+ #### Step 3: Start Training
280
+ ```
281
+ cd rynnvla-002/exps_libero_world_model
282
+ bash libero_goal_his_1_third_view_wrist_512_pretokenize.sh
283
+ ```
284
+
285
+ ### Pipeline2: NoPretokenize
286
+ First, set the correct data path in the config files in `rynnvla-002/configs/libero_goal/his_1_train_third_view_wrist_512_only_worldmodel_nopretokenize.yaml`.
287
+
288
+ Then, start training:
289
+ ```
290
+ cd rynnvla-002/exps_libero_world_model
291
+ bash libero_goal_his_1_third_view_wrist_512_nopretokenize.sh
292
+ ```
293
+
294
+ ## ✅ World Model Evaluation on LIBERO
295
+ We evaluate the world model on the validation set, whose trajectory paths are stored in `rynnvla-002/exps_libero_world_model/goal_val_ind_trajectory_paths.json`. If the paths do not match your setup, use `rynnvla-002/exps_libero_world_model/extract_world_model_val_ind_trj.py` to generate a new file. Then run the evaluation:
296
+ ```
297
+ cd rynnvla-002/exps_libero_world_model
298
+ bash eval_world_model_goal.sh
299
+ ```
300
+ Then calculate the generation performance of the world model and the action world model:
301
+ ```
302
+ python calculate_world_model_performance.py \
303
+ --folder_world_model "" \
304
+ --folder_action_world_model ""
305
+ ```
306
+
307
+ ## 🗝️ Training on LeRobot
308
+
309
+ #### Step 1: Lerobot to HDF5
310
+
311
+ We use data in HDF5 format. If you collected data in LeRobot format, convert it to HDF5 with the following command:
312
+ ```
313
+ cd rynnvla-002/data_lerobot
314
+ python lerobot_to_hdf5.py \
315
+ --lerobot_input_dir {lerobot_input_dir} \
316
+ --hdf5_output_dir {hdf5_output_dir}
317
+ ```
318
+
319
+ #### Step 2: HDF5 to raw data
320
+ List all HDF5 files in a JSON file; see `rynnvla-002/data_lerobot/modified_data_final.json` for an example. Then extract the raw front camera data, wrist camera data, state data, and action data, and save them:
321
+ ```
322
+ cd rynnvla-002/data_lerobot
323
+ python extract_all_data.py \
324
+ --json_path {json_path} \
325
+ --output_dir {raw_data_output_dir} \
326
+ --num_processes {num_processes to accelerate}
327
+ ```
328
+
329
+ #### Step 3: Generate conversation files
330
+ Generate the VLA model conversation file and world model conversation file:
331
+ ```
332
+ cd rynnvla-002/lerobot_util
333
+ python action_model_conv_generation_w_2_abs_state_all_data.py --input_dir {raw_data_output_dir} --his 1 --len_action 20 --task_name vla_data --output_dir {conv_output_dir}
334
+ python world_model_conv_generation_w_2_abs_front_all_data.py --input_dir {raw_data_output_dir} --his 1 --task_name world_model_data --output_dir {conv_output_dir}
335
+ python world_model_conv_generation_w_2_abs_wrist_all_data.py --input_dir {raw_data_output_dir} --his 1 --task_name world_model_data --output_dir {conv_output_dir}
336
+ ```
337
+
338
+ #### Step 4: Tokenize raw data based on conversation files
339
+ First, calculate the min and max value of action data and state data:
340
+ ```
341
+ cd rynnvla-002/data_lerobot
342
+ python calculate_min_max_all_data_state.py {raw_data_output_dir}
343
+ python calculate_min_max_all_data_action.py {raw_data_output_dir}
344
+ ```
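A common use for such extrema is per-dimension min-max normalization, sketched below. This is an assumption about why the values are needed, not a description of `item_processor.py`: the helper names and the `[-1, 1]` target range are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_max_per_dim(vectors: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return per-dimension minima and maxima over a set of action or state vectors."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return lows, highs


def normalize(
    vector: Sequence[float],
    lows: Sequence[float],
    highs: Sequence[float],
) -> List[float]:
    """Rescale each dimension linearly so that [low, high] maps to [-1, 1]."""
    out: List[float] = []
    for x, low, high in zip(vector, lows, highs):
        span = high - low
        # A constant dimension carries no information; map it to 0.
        out.append(0.0 if span == 0 else 2.0 * (x - low) / span - 1.0)
    return out
```

Computing the extrema once over the whole dataset, rather than per trajectory, keeps the mapping consistent between training and inference.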
345
+ Put the results at the beginning of `rynnvla-002/data_lerobot/item_processor.py`.
346
+ Then, tokenize all training data and concatenate the resulting records:
347
+ ```
348
+ python pretoken_lerobot_state.py \
349
+ --input_file {conv_output_dir}/libero_vla_data_his_1_train_img_state_abs_ck_1_256.json \
350
+ --output_dir {raw_data_output_dir}/tokens/vla_data \
351
+ --resolution 256 \
352
+ --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
353
+ python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/vla_data --save_path {raw_data_output_dir}/tokens/vla_data/record.json
354
+ python pretoken_lerobot.py \
355
+ --input_file {conv_output_dir}/libero_world_model_data_his_1_train_a2i_512_abs_front_all_data.json \
356
+ --output_dir {raw_data_output_dir}/tokens/world_model_data_front \
357
+ --resolution 256 \
358
+ --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
359
+ python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/world_model_data_front --save_path {raw_data_output_dir}/tokens/world_model_data_front/record.json
360
+ python pretoken_lerobot.py \
361
+ --input_file {conv_output_dir}/libero_world_model_data_his_1_train_a2i_512_abs_wrist_all_data.json \
362
+ --output_dir {raw_data_output_dir}/tokens/world_model_data_wrist \
363
+ --resolution 256 \
364
+ --tokenizer_path ../ckpts/models--Alpha-VLLM--Lumina-mGPT-7B-768/snapshots/9624463a82ea5ce814af9b561dcd08a31082c3af
365
+ python -u concate_record.py --sub_record_dir {raw_data_output_dir}/tokens/world_model_data_wrist --save_path {raw_data_output_dir}/tokens/world_model_data_wrist/record.json
366
+ python concate_multi_record.py \
367
+ --input_files {raw_data_output_dir}/tokens/vla_data/record.json {raw_data_output_dir}/tokens/world_model_data_front/record.json {raw_data_output_dir}/tokens/world_model_data_wrist/record.json \
368
+ --output_file {raw_data_output_dir}/concate_tokens/lerobot_all.json
369
+ ```
370
+
371
+ #### Step 5: Prepare data configs
372
+ Set the correct data path in the config files in `rynnvla-002/configs/lerobot/his_1_third_view_wrist_w_state_20_256_pretokenize.yaml`.
373
+
374
+ #### Step 6: Start training
375
+ Now you can start training with your training scripts:
376
+ ```bash
377
+ cd rynnvla-002/exps_pretokenize
378
+ bash libero_goal_his_2_third_view_wrist_w_state_5_256_abiw.sh
379
+ ```
380
+
381
+ ## ✅ Inference using LeRobot
382
+ We provide the action generation function in `rynnvla-002/eval_solver_lerobot_action_head_state.py` and the initialization script in `rynnvla-002/evals_lerobot/eval_7B_lerobot_action_head.sh`.
383
+
384
+
385
  ## License <a name="license"></a>
386
 
387
  All assets and code are under the [Apache 2.0 license](./LICENSE) unless specified otherwise.
 
389
  ## Citation <a name="citation"></a>
390
  If you find the project helpful for your research, please consider citing our paper:
391
  ```bibtex
392
+ @article{cen2025rynnvla,
393
+ title={RynnVLA-002: A Unified Vision-Language-Action and World Model},
394
+ author={Cen, Jun and Huang, Siteng and Yuan, Yuqian and Yuan, Hangjie and Yu, Chaohui and Jiang, Yuming and Guo, Jiayan and Li, Kehan and Luo, Hao and Wang, Fan and Li, Xin and Zhao, Deli and Chen, Hao},
395
+ journal={arXiv preprint arXiv:2511.17502},
396
  year={2025}
397
  }
398
  ```
 
402
  <!-- may -->
403
  > [**RynnVLA-001: A Vision-Language-Action Model Boosted by Generative Priors**](https://github.com/alibaba-damo-academy/RynnVLA-001) <br>
404
  > Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Jiayan Guo, Kexiang Wang, Kehan Li, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li <br>
405
+ [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/alibaba-damo-academy/RynnVLA-001) [![github](https://img.shields.io/github/stars/alibaba-damo-academy/RynnVLA-001.svg?style=social)](https://github.com/alibaba-damo-academy/RynnVLA-001) [![arXiv](https://img.shields.io/badge/Arxiv-2509.15212-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2509.15212)<br>
406
 
407
  > [**RynnEC: Bringing MLLMs into Embodied World**](https://github.com/alibaba-damo-academy/RynnEC) <br>
408
  > Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li <br>
 
415
  </p></details>
416
 
417
  ## Acknowledgment <a name="acknowledgment"></a>
418
+ This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.