SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
Abstract
SVG-EAR is a parameter-free method for video diffusion transformers that recovers the contributions of skipped attention blocks through centroid-based approximation and error-aware routing, improving efficiency without additional training.
Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, which introduces training overhead and can shift the output distribution. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, the keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses these centroids to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention score, which indicates where the model places its attention mass but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, exact attention is computed only for the blocks with the highest error-to-cost ratio, and the remaining blocks are compensated with centroids. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77× and 1.93× speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
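The abstract names two ingredients: a centroid-based linear compensation for skipped blocks and an error-aware routing rule that spends the exact-attention budget where the approximation is expected to fail. The sketch below shows one way these pieces could fit together; the chunk-mean "clustering", the clustering-residual error probe, and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def block_centroids(x, num_centroids):
    """Summarize a (block_len, d) tensor with `num_centroids` rows.

    A cheap stand-in for semantic clustering: split the block into
    contiguous chunks and average each one, keeping the chunk sizes
    (cluster weights) and the within-chunk residual as a quality probe.
    """
    chunks = x.chunk(num_centroids, dim=0)
    centroids = torch.stack([c.mean(dim=0) for c in chunks])          # (m, d)
    weights = torch.tensor([c.shape[0] for c in chunks],
                           dtype=x.dtype, device=x.device)            # (m,)
    residual = sum(((c - c.mean(dim=0)) ** 2).sum() for c in chunks)  # scalar
    return centroids, weights, residual


def sparse_attention_with_compensation(q, k_blocks, v_blocks, budget, num_centroids=4):
    """q: (Lq, d); k_blocks / v_blocks: lists of (Lb, d) key / value blocks.

    Exact attention is spent on the `budget` blocks with the highest
    estimated error-to-cost ratio; every other block still contributes
    through its centroid approximation, so no attention mass is dropped.
    """
    scale = q.shape[-1] ** -0.5

    summaries = []
    for kb, vb in zip(k_blocks, v_blocks):
        ck, wk, k_res = block_centroids(kb, num_centroids)
        cv, _, v_res = block_centroids(vb, num_centroids)
        # Lightweight probe: per-row clustering residual as a proxy for the
        # compensation error of this block, divided by its (uniform) cost.
        score = (k_res + v_res) / kb.shape[0]
        summaries.append((ck, wk, cv, score))

    # Error-aware routing: pick the blocks where centroid compensation is
    # expected to be worst and compute them exactly.
    scores = torch.stack([s[3] for s in summaries])
    exact_ids = set(scores.topk(min(budget, len(summaries))).indices.tolist())

    logits, values = [], []
    for i, (kb, vb) in enumerate(zip(k_blocks, v_blocks)):
        if i in exact_ids:
            logits.append(q @ kb.T * scale)                 # exact block
            values.append(vb)
        else:
            ck, wk, cv, _ = summaries[i]
            # Each centroid stands in for `wk` keys, hence the log-count
            # offset so the softmax mass matches the cluster size.
            logits.append(q @ ck.T * scale + wk.log())      # compensated block
            values.append(cv)

    probs = F.softmax(torch.cat(logits, dim=-1), dim=-1)
    return probs @ torch.cat(values, dim=0)
```

The routing score here is a clustering residual, echoing the paper's stated bound that ties attention reconstruction error to clustering quality; an actual kernel would fuse these steps rather than materialize per-block logits as this sketch does.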
Community
New breakthrough for applying sparse attention to video diffusion!
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers (2026)
- Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention (2026)
- SLA2: Sparse-Linear Attention with Learnable Routing and QAT (2026)
- MonarchRT: Efficient Attention for Real-Time Video Generation (2026)
- Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention (2026)
- Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers (2026)
- SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer (2026)