---
base_model:
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- hunyuanvideo-community/HunyuanVideo
library_name: diffusers
license: mit
pipeline_tag: text-to-video
---

# VORTA: Efficient Video Diffusion via Routing Sparse Attention

[📚 Paper](https://huggingface.co/papers/2505.18809) | [💻 Code](https://github.com/wenhao728/VORTA)

> TL;DR - VORTA accelerates video diffusion transformers with sparse attention and dynamic routing, achieving speedups with negligible quality loss.

## Abstract

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup of $1.76\times$ without quality loss on VBench. Furthermore, it integrates seamlessly with other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings.

## Installation

Install PyTorch. We have tested the code with PyTorch 2.6.0 and CUDA 12.6, but it should work with other versions as well. You can install PyTorch using the following command:

```
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
```

Install the dependencies:

```
python -m pip install -r requirements.txt
```

## Sample Usage (Inference)

We use general scripts to demonstrate the usage of our method. You can find the detailed scripts for each model in the `scripts` folder of the [VORTA GitHub repository](https://github.com/wenhao728/VORTA):
- HunyuanVideo: `scripts/hunyuan/inference.sh`
- Wan 2.1: `scripts/wan/inference.sh`

First, download the ready-to-use router weights. Assuming this repository is cloned as `VORTA` from the GitHub repository:

```bash
git lfs install
git clone git@hf.co:anonymous728/VORTA
# mv VORTA/<model> results/, <model>: wan-14B, hunyuan; e.g. mv VORTA/wan-14B results/
```

Run the video DiTs with VORTA for acceleration (example for the `wan` model):

```bash
CUDA_VISIBLE_DEVICES=0 python scripts/wan/inference.py \
    --pretrained_model_path Wan-AI/Wan2.1-T2V-14B-Diffusers \
    --val_data_json_file prompt.json \
    --output_dir results/wan-14B/vorta \
    --resume_dir results/wan-14B/train \
    --resume ckpt/step-000100 \
    --enable_cpu_offload \
    --seed 1234
```

For the `hunyuan` model, replace `wan` with `hunyuan` in the script path and output directory, and use `hunyuanvideo-community/HunyuanVideo` as the `--pretrained_model_path`.

You can edit `prompt.json` or the `--val_data_json_file` option to change the text prompts. See the source code `scripts/<model>/inference.py` or run `python scripts/<model>/inference.py --help` for more detailed explanations of the arguments.
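For reference, the sketch below generates a video from the Wan 2.1 base model with the plain `diffusers` API, without VORTA's sparse attention or routing. It can serve as an environment check or an unaccelerated baseline to compare against the VORTA outputs. It assumes a recent `diffusers` release that ships `WanPipeline`; the prompt and generation parameters (resolution, `num_frames`, `guidance_scale`) are illustrative, not values prescribed by VORTA.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Load the Wan 2.1 T2V base model (no VORTA acceleration applied here).
model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # analogous to --enable_cpu_offload in the script above

# Illustrative prompt and settings; adjust to match the prompts in your prompt.json.
output = pipe(
    prompt="A cat walks on the grass, realistic style.",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "baseline.mp4", fps=16)
```

The accelerated path goes through the VORTA inference scripts shown above, which load the router weights on top of the base model; this baseline is only for comparison.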
## Acknowledgements

Thanks to the authors of the following repositories for their great work and for open-sourcing their code and models: [Diffusers](https://github.com/huggingface/diffusers), [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), [Wan 2.1](https://github.com/Wan-Video/Wan2.1), [FastVideo](https://github.com/hao-ai-lab/FastVideo).

## Citation

If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@article{wenhao728_2025_vorta,
  author  = {Wenhao and Li, Wenhao and Wang, Yanan and Zhao, Jizhao and Zheng, Wei},
  title   = {VORTA: Efficient Video Diffusion via Routing Sparse Attention},
  journal = {arXiv preprint arXiv:2505.18809},
  year    = {2025}
}
```