Improve model card for VORTA: Add paper/code links, abstract, and detailed usage
This PR significantly enhances the model card for VORTA by adding:
- An explicit link to the official Hugging Face paper page.
- A direct link to the GitHub repository for easy access to the code.
- The comprehensive abstract of the paper, providing a detailed overview of the research.
- Detailed installation instructions and a "Sample Usage (Inference)" section with code snippets for running inference, directly adapted from the official GitHub repository. This makes it easier for users to get started.
- Acknowledgements and a BibTeX citation, ensuring proper attribution.
These additions make the model card more informative and user-friendly, providing more complete documentation of the artifact.
README.md (CHANGED)

---
base_model:
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- hunyuanvideo-community/HunyuanVideo
library_name: diffusers
license: mit
pipeline_tag: text-to-video
---

# VORTA: Efficient Video Diffusion via Routing Sparse Attention

[📚 Paper](https://huggingface.co/papers/2505.18809) | [💻 Code](https://github.com/wenhao728/VORTA)

> TL;DR - VORTA accelerates video diffusion transformers by sparse attention and dynamic routing, achieving speedup with negligible quality loss.

## Abstract

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves a $1.76\times$ end-to-end speedup without loss of quality on VBench. Furthermore, it can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings.

## Installation

Install PyTorch. We have tested the code with PyTorch 2.6.0 and CUDA 12.6, but it should work with other versions as well. You can install PyTorch using the following command:

```bash
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
```

Install the dependencies:

```bash
python -m pip install -r requirements.txt
```
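
The `requirements.txt` file ships with the code rather than with this model repository, so a minimal setup sketch, assuming you start from a fresh clone of the GitHub repository, would be:

```bash
# Sketch: clone the code base first; requirements.txt is assumed to live at the repo root
git clone https://github.com/wenhao728/VORTA.git
cd VORTA
python -m pip install -r requirements.txt
```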

## Sample Usage (Inference)

We use general scripts to demonstrate the usage of our method. You can find the detailed scripts for each model in the `scripts` folder of the [VORTA GitHub repository](https://github.com/wenhao728/VORTA):
- HunyuanVideo: `scripts/hunyuan/inference.sh`
- Wan 2.1: `scripts/wan/inference.sh`

First, download the ready-to-use router weights. The commands below assume this model repository is cloned locally into a folder named `VORTA`:
```bash
git lfs install
git clone git@hf.co:anonymous728/VORTA
# mv VORTA/<model_name> results/, where <model_name> is wan-14B or hunyuan; e.g.
mkdir -p results  # make sure the target folder exists
mv VORTA/wan-14B results/
```

Run the video DiTs with VORTA for acceleration (example for the `wan` model):

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/wan/inference.py \
    --pretrained_model_path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --val_data_json_file prompt.json \
    --output_dir results/wan-14B/vorta \
    --resume_dir results/wan-14B/train \
    --resume ckpt/step-000100 \
    --enable_cpu_offload \
    --seed 1234
```

For the `hunyuan` model, replace `wan` with `hunyuan` in the script path and output directory, and use `hunyuanvideo-community/HunyuanVideo` as the `--pretrained_model_path`.
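
As a concrete sketch of that substitution (the `--output_dir`, `--resume_dir`, and `--resume` values below are assumptions that simply mirror the `wan` example, not verified paths):

```bash
# Hypothetical HunyuanVideo variant of the command above; paths mirror the wan example
CUDA_VISIBLE_DEVICES=0 python scripts/hunyuan/inference.py \
    --pretrained_model_path hunyuanvideo-community/HunyuanVideo \
    --val_data_json_file prompt.json \
    --output_dir results/hunyuan/vorta \
    --resume_dir results/hunyuan/train \
    --resume ckpt/step-000100 \
    --enable_cpu_offload \
    --seed 1234
```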

You can edit `prompt.json`, or point the `--val_data_json_file` option to a different file, to change the text prompts. See the source code in `scripts/<model_name>/inference.py`, or run `python scripts/<model_name>/inference.py --help`, for more detailed explanations of the arguments.
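
For example, to print the full argument list of the Wan 2.1 entry point before editing anything:

```bash
# Show every supported flag of the Wan 2.1 inference script
python scripts/wan/inference.py --help
```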

## Acknowledgements

Thanks to the authors of the following repositories for their great work and for open-sourcing their code and models: [Diffusers](https://github.com/huggingface/diffusers), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [Wan 2.1](https://github.com/Wan-Video/Wan2.1), and [FastVideo](https://github.com/hao-ai-lab/FastVideo).

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@article{wenhao728_2025_vorta,
  author  = {Wenhao and Li, Wenhao and Wang, Yanan and Zhao, Jizhao and Zheng, Wei},
  title   = {VORTA: Efficient Video Diffusion via Routing Sparse Attention},
  journal = {arXiv preprint arXiv:2505.18809},
  year    = {2025}
}
```