---
license: cc-by-nc-4.0
---

[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://compvis.github.io/dismo/)
[![Paper](https://img.shields.io/badge/arXiv-paper-b31b1b)](https://openreview.net/forum?id=jneVld5iZw)
[![Weights](https://img.shields.io/badge/HuggingFace-Weights-orange)](https://huggingface.co/CompVis/dismo)
<h2 align="center">
<i>DisMo</i>: Disentangled Motion Representations<br/>for Open-World Motion Transfer
</h2>
<div align="center">
<a href="https://www.linkedin.com/in/thomas-ressler-494758133/" target="_blank">Thomas Ressler-Antal</a> ·
<a href="https://ffundel.de/" target="_blank">Frank Fundel</a><sup>*</sup> ·
<a href="https://www.linkedin.com/in/malek-ben-alaya/" target="_blank">Malek Ben Alaya</a><sup>*</sup>
<br>
<a href="https://stefan-baumann.eu/" target="_blank">Stefan A. Baumann</a> ·
<a href="https://www.linkedin.com/in/felixmkrause/" target="_blank">Felix Krause</a> ·
<a href="https://www.linkedin.com/in/ming-gui-87b76a16b/" target="_blank">Ming Gui</a> ·
<a href="https://ommer-lab.com/people/ommer/" target="_blank">Björn Ommer</a>
</div>
<p align="center">
<b>CompVis @ LMU Munich, MCML</b>
<br/>
<i>* equal contribution</i>
<br/>
NeurIPS 2025 Spotlight
</p>

![DisMo learns abstract motion representations that enable open-world motion transfer](https://compvis.github.io/dismo/docs/static/images/teaser.png)

## 📋 Overview
We present <b>DisMo</b>, a paradigm that learns a semantic motion representation space from videos that is disentangled from static content information such as appearance, structure, viewing angle, and even object category. We leverage this invariance and condition off-the-shelf video models on the extracted motion embeddings. This setup achieves state-of-the-art performance on open-world motion transfer, with a high degree of transferability in cross-category and cross-viewpoint settings. Beyond that, DisMo's learned representations are suitable for downstream tasks such as zero-shot action classification.

## 🛠️ Setup
We have tested our setup on `Ubuntu 22.04.4 LTS`.

First, clone the repository into your desired location:
```shell
git clone git@github.com:CompVis/dismo.git
cd dismo
```

We recommend using a package manager, <i>e.g.,</i> [Miniconda](https://www.anaconda.com/docs/getting-started/miniconda/install). Once installed, you can create and activate a new environment:
```shell
conda create -n dismo python=3.11
conda activate dismo
```

Afterwards, install PyTorch. We have tested this setup with `PyTorch 2.7.1` and `CUDA 12.6`:
```shell
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```
If you need to install an alternative version, <i>e.g.,</i> due to incompatible CUDA versions, see the [official instructions](https://pytorch.org/get-started/locally/).

Finally, install all other packages:
```shell
pip install -r requirements.txt
```

<i>(Optional)</i> We use the [torchcodec](https://github.com/meta-pytorch/torchcodec) package for data loading, which expects `ffmpeg` to be installed. If you plan to train DisMo yourself and don't have an ffmpeg version installed yet, an easy way is to use `conda`:
```shell
conda install ffmpeg
```
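
torchcodec is only needed for training-time data loading, but it can be useful to verify that it picks up your `ffmpeg` installation. The minimal decoding sketch below uses torchcodec's `VideoDecoder`; the video path is a placeholder and the snippet is not part of the DisMo codebase:
```python
from torchcodec.decoders import VideoDecoder

# Decode a frame from a local video to check that ffmpeg/torchcodec work together.
decoder = VideoDecoder("/path/to/some/video.mp4")  # placeholder path
print(decoder.metadata)        # codec, resolution, number of frames, ...
first_frame = decoder[0]       # uint8 tensor, [C, H, W] by default
print(first_frame.shape)
```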

## 🚀 Usage
To use DisMo for motion transfer, we provide the code and LoRA weights of an adapted CogVideoX-5B-I2V video model, conditioned on motion embeddings and text prompts. The simplest way to use it is via `torch.hub`:
```python
import torch

cogvideox = torch.hub.load("CompVis/DisMo", "cogvideox5b_i2v_large")
```

Alternatively, you can also instantiate and load the model yourself:
```python
import torch

from dismo.video_model_finetuning.cogvideox import CogVideoXMotionAdapter_5B_TI2V_Large

cogvideox = CogVideoXMotionAdapter_5B_TI2V_Large()
state_dict = torch.load("/path/to/finetuned/cogvideox/checkpoint/cogvideox5b_i2v_large.pt")
cogvideox.load_state_dict(state_dict, strict=False)
cogvideox.requires_grad_(False)
cogvideox.eval()
```

You can then use the model's `sample` function to generate new videos by transferring motion from `motion_videos` to `images`. Since CogVideoX is a text-to-video model at its core, we recommend additionally providing descriptive `prompts` alongside the target images for better generation results:
```python
generated_videos = cogvideox.sample(
    motion_videos=driving_videos,
    images=target_images,
    prompts=target_text_prompts,
)
```
The `sample` function comes with some other arguments (e.g., classifier-free text guidance); please have a look at the code for more details.
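
As an illustration only, a driving video could be turned into a tensor roughly as follows. The `[B, T, H, W, C]` layout in the `(-1, 1)` range mirrors what the motion extractor below documents; the file path is a placeholder, and the exact input formats `sample` expects (including how `images` and `prompts` are passed) should be checked in the code:
```python
from torchvision.io import read_video
from torchvision.transforms.functional import resize

# Read a driving video as uint8 frames in [T, H, W, C] order (placeholder path).
frames, _, _ = read_video("driving_video.mp4", output_format="THWC", pts_unit="sec")

# Resize to 256x256 and rescale from [0, 255] to the (-1, 1) range.
frames = resize(frames.permute(0, 3, 1, 2), [256, 256]).permute(0, 2, 3, 1)
driving_videos = frames.float().div(127.5).sub(1.0).unsqueeze(0)  # [1, T, 256, 256, 3]
```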

### Motion Extraction
During motion transfer, the video model internally uses DisMo's pre-trained motion extractor to encode input videos into motion embeddings. However, the motion extractor can also be used as a standalone model to extract sequences of motion embeddings from input videos. This might be useful for video analysis or other downstream tasks. Once again, the easiest way to load the model is via `torch.hub`:
```python
motion_extractor = torch.hub.load("CompVis/DisMo", "motion_extractor_large")
```

Similarly, you can also manually instantiate and load the model:
```python
from dismo.model import MotionExtractor_Large

motion_extractor = MotionExtractor_Large()
state_dict = torch.load("/path/to/motion/extractor/checkpoint/motion_extractor_large.pt")
motion_extractor.load_state_dict(state_dict)
motion_extractor.requires_grad_(False)
motion_extractor.eval()
```

To extract motion sequences from arbitrarily long videos, we provide the `forward_sliding` function, which extracts embeddings consecutively in a sliding-window fashion. This is necessary since DisMo only saw video clips of length 8 during training:
```python
import torch

# example batch size and clip length
B, num_frames = 2, 16

# videos are expected to have shape [B, T, H, W, C] in (-1, 1) range
dummy_video = torch.rand((B, num_frames, 256, 256, 3)).mul(2).sub(1)

# we get a motion embedding for each frame, except for the last 4
motion_embeddings = motion_extractor.forward_sliding(dummy_video)
```
Note that the resulting motion embeddings have a temporal length of `num_frames - 4`, since the longest possible prediction distance was set to 4 during training.
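
As a sketch of one possible downstream use (e.g., the zero-shot action classification mentioned above), the per-frame embeddings can be pooled into a single descriptor and matched against class prototypes. The `[B, T', D]` output shape and the random prototypes below are assumptions made purely for illustration; this is not the evaluation protocol from the paper:
```python
import torch
import torch.nn.functional as F

# Pool the per-frame embeddings into one descriptor per video
# (assumes motion_embeddings from above is a [B, T', D] tensor).
descriptor = F.normalize(motion_embeddings.mean(dim=1), dim=-1)  # [B, D]

# Hypothetical class prototypes, e.g., mean descriptors of a few labelled clips
# per action class; random here purely for illustration.
num_classes = 10
prototypes = F.normalize(torch.randn(num_classes, descriptor.shape[-1]), dim=-1)

scores = descriptor @ prototypes.T      # cosine similarities, [B, num_classes]
predicted_action = scores.argmax(dim=-1)
```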

## 🔥 Training
If you want to train DisMo yourself, we provide a training script that is suitable for multi-GPU training. Please note that the script instantiates DisMo with default parameters; to train other variants (e.g., changing the width, depth, etc.), you must modify `train.py` accordingly. The same holds true for video model adaptation.

### Data Preparation
DisMo needs unlabelled videos for training. This repository takes advantage of the [webdataset](https://github.com/webdataset/webdataset) library and format for efficient and scalable data loading. Please refer to their page for further instructions on how to shard your video files accordingly.
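
As a rough illustration, packing raw video files into shards with the webdataset library could look like the sketch below; the paths, shard size, and the `mp4` sample key are placeholder assumptions, so please check the data-loading code for the keys it actually expects:
```python
import glob
import os

import webdataset as wds

# Pack raw .mp4 files into .tar shards that can be passed to train.py via --data_paths.
video_files = sorted(glob.glob("/path/to/videos/*.mp4"))  # placeholder location

with wds.ShardWriter("/path/to/preprocessed/shards/shard-%06d.tar", maxcount=256) as sink:
    for path in video_files:
        key = os.path.splitext(os.path.basename(path))[0]  # unique sample key
        with open(path, "rb") as f:
            sink.write({"__key__": key, "mp4": f.read()})
```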

### Launching Training
Single-GPU training can be launched via
```shell
python train.py --data_paths /path/to/preprocessed/shards --out_dir output/test --compile True
```
Similarly, multi-GPU training, e.g., on 2 GPUs, can be launched using torchrun:
```shell
torchrun --nnodes 1 --nproc-per-node 2 train.py [...]
```
Training can be continued from a previous checkpoint by specifying, e.g., `--load_checkpoint output/test/checkpoints/checkpoint_0100000.pt`.
Remove `--compile True` for significantly faster startup time at the cost of slower training and significantly increased VRAM usage.

## 🤖 Models
We release the weights of our pre-trained motion extractor and the LoRA weights of an adapted CogVideoX-5B-I2V model via [HuggingFace](https://huggingface.co/CompVis/DisMo) under the [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en) license. If you are interested in using our model weights commercially, please <a href="mailto:[email protected]">contact us</a>. We will release other model variants in the future, <i>e.g.,</i> more sophisticated fine-tuned video models. Due to legal concerns, we do not release the weights of the frame generator that was trained alongside the motion extractor.

## Code Credit
- Some code is adapted from [flow-poke-transformer](https://github.com/CompVis/flow-poke-transformer) by Stefan A. Baumann et al. (LMU), which in turn adapts some code from [k-diffusion](https://github.com/crowsonkb/k-diffusion) by Katherine Crowson (MIT)
- The code for fine-tuning CogVideoX models is adapted from [CogKit](https://github.com/THUDM/CogKit) (Apache 2.0)
- The DINOv2 code is adapted from [minDinoV2](https://github.com/cloneofsimo/minDinoV2) by Simo Ryu, which is based on the [official implementation](https://github.com/facebookresearch/dinov2/) by Oquab et al. (Apache 2.0)

## 🎓 Citation
If you find our work useful, please cite our paper:
```bibtex
@inproceedings{resslerdismo,
  title={DisMo: Disentangled Motion Representations for Open-World Motion Transfer},
  author={Ressler-Antal, Thomas and Fundel, Frank and Alaya, Malek Ben and Baumann, Stefan Andreas and Krause, Felix and Gui, Ming and Ommer, Bj{\"o}rn},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}
```