Improve model card for VORTA: Add paper/code links, abstract, and detailed usage
This PR significantly enhances the model card for VORTA by adding:
- An explicit link to the official Hugging Face paper page.
- A direct link to the GitHub repository for easy access to the code.
- The comprehensive abstract of the paper, providing a detailed overview of the research.
- Detailed installation instructions and a "Sample Usage (Inference)" section with code snippets for running inference, directly adapted from the official GitHub repository. This makes it easier for users to get started.
- Acknowledgements and a BibTeX citation, ensuring proper attribution.
These additions make the model card more informative and user-friendly, providing more complete documentation of the artifact.
README.md (CHANGED)

---
base_model:
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- hunyuanvideo-community/HunyuanVideo
library_name: diffusers
license: mit
pipeline_tag: text-to-video
---

# VORTA: Efficient Video Diffusion via Routing Sparse Attention

[📚 Paper](https://huggingface.co/papers/2505.18809) | [💻 Code](https://github.com/wenhao728/VORTA)

> TL;DR - VORTA accelerates video diffusion transformers by sparse attention and dynamic routing, achieving speedup with negligible quality loss.

## Abstract

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings.

## Installation

Install PyTorch. We have tested the code with PyTorch 2.6.0 and CUDA 12.6, but it should work with other versions as well. You can install PyTorch using the following command:

```bash
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
```

Install the dependencies:

```bash
python -m pip install -r requirements.txt
```
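
The `requirements.txt` file ships with the code rather than with this model repository, so a minimal setup sketch, assuming you start from a fresh clone of the GitHub repository, would be:

```bash
# Sketch: clone the code base first; requirements.txt is assumed to live at the repo root
git clone https://github.com/wenhao728/VORTA.git
cd VORTA
python -m pip install -r requirements.txt
```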

## Sample Usage (Inference)

We use general scripts to demonstrate the usage of our method. You can find the detailed scripts for each model in the `scripts` folder of the [VORTA GitHub repository](https://github.com/wenhao728/VORTA):
- HunyuanVideo: `scripts/hunyuan/inference.sh`
- Wan 2.1: `scripts/wan/inference.sh`

First, download the ready-to-use router weights. The commands below assume this model repository is cloned locally into a folder named `VORTA`:
```bash
git lfs install
git clone git@hf.co:anonymous728/VORTA
# mv VORTA/<model_name> results/, where <model_name> is wan-14B or hunyuan; e.g.
mkdir -p results  # make sure the target folder exists
mv VORTA/wan-14B results/
```

Run the video DiTs with VORTA for acceleration (example for the `wan` model):

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/wan/inference.py \
    --pretrained_model_path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --val_data_json_file prompt.json \
    --output_dir results/wan-14B/vorta \
    --resume_dir results/wan-14B/train \
    --resume ckpt/step-000100 \
    --enable_cpu_offload \
    --seed 1234
```

For the `hunyuan` model, replace `wan` with `hunyuan` in the script path and output directory, and use `hunyuanvideo-community/HunyuanVideo` as the `--pretrained_model_path`.
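
As a concrete sketch of that substitution (the `--output_dir`, `--resume_dir`, and `--resume` values below are assumptions that simply mirror the `wan` example, not verified paths):

```bash
# Hypothetical HunyuanVideo variant of the command above; paths mirror the wan example
CUDA_VISIBLE_DEVICES=0 python scripts/hunyuan/inference.py \
    --pretrained_model_path hunyuanvideo-community/HunyuanVideo \
    --val_data_json_file prompt.json \
    --output_dir results/hunyuan/vorta \
    --resume_dir results/hunyuan/train \
    --resume ckpt/step-000100 \
    --enable_cpu_offload \
    --seed 1234
```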

You can edit `prompt.json`, or point the `--val_data_json_file` option to a different file, to change the text prompts. See the source code in `scripts/<model_name>/inference.py`, or run `python scripts/<model_name>/inference.py --help`, for more detailed explanations of the arguments.
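
For example, to print the full argument list of the Wan 2.1 entry point before editing anything:

```bash
# Show every supported flag of the Wan 2.1 inference script
python scripts/wan/inference.py --help
```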

## Acknowledgements

Thanks to the authors of the following repositories for their great work and for open-sourcing their code and models: [Diffusers](https://github.com/huggingface/diffusers), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [Wan 2.1](https://github.com/Wan-Video/Wan2.1), and [FastVideo](https://github.com/hao-ai-lab/FastVideo).

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@article{wenhao728_2025_vorta,
  author  = {Wenhao and Li, Wenhao and Wang, Yanan and Zhao, Jizhao and Zheng, Wei},
  title   = {VORTA: Efficient Video Diffusion via Routing Sparse Attention},
  journal = {arXiv preprint arXiv:2505.18809},
  year    = {2025}
}
```