---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---
# Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM
## Introduction
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherent sequential decoding limits inference efficiency**.
We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-1.5B-Instruct**) into a diffusion-style decoder for **parallel text generation**.
Our approach introduces a novel decoding recipe that combines a complementary attention mask with a block diffusion mechanism, enabling blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further speed up inference, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks.
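The blockwise attention pattern described above can be made concrete with a small sketch: tokens attend bidirectionally to every position inside their own block and causally to all earlier blocks. This is an illustrative construction only; the helper name `block_causal_mask` and the block size are made up for the example, and the exact complementary masking scheme used for training lives in the released code.
```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if position i may attend to j.

    Positions see every token in their own block (bidirectional) and all
    tokens in earlier blocks (block-level causality).
    """
    block_ids = torch.arange(seq_len) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(block_causal_mask(seq_len=8, block_size=4).int())
# Positions 0-3 attend only within block 0; positions 4-7 attend to blocks 0 and 1.
```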
### Key Innovations
- **Block Diffusion Mechanism + Complementary Attention Mask**
  Enables **blockwise bidirectional context modeling** without sacrificing AR objectives.
- **Hierarchical Caching**
  - **Block-level cache**: Stores historical context representations across blocks.
  - **Sub-block cache**: Supports parallel decoding within partially generated blocks.
- **Token Shift Mechanism**
  Retains autoregressive characteristics while supporting bidirectional context within blocks.
- **Parallel Decoding Pipeline**
  Achieves up to **2.5× speedup** over standard AR decoding **without compromising quality** (see the toy sketch below).
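As a rough illustration of threshold-based parallel decoding, the sketch below fills one block by predicting all masked positions at once and committing those whose top probability clears a threshold. `MASK_ID` and the `predict_logits` callable are placeholders for the example, not part of the released API.
```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] placeholder, for illustration only

def fill_block(predict_logits, block: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Toy threshold-based parallel decoding for a single block.

    predict_logits(block) -> [block_len, vocab_size] logits (any callable).
    Each pass predicts all masked positions at once and commits those whose
    top probability clears `threshold`; at least one token is committed per
    pass so the loop always terminates.
    """
    block = block.clone()
    while (block == MASK_ID).any():
        probs = torch.softmax(predict_logits(block), dim=-1)
        probs[:, MASK_ID] = 0.0                  # never predict the mask placeholder itself
        conf, pred = probs.max(dim=-1)           # per-position confidence and argmax token
        masked = block == MASK_ID
        accept = masked & (conf >= threshold)
        if not accept.any():                     # fallback: commit the single most confident position
            best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
            accept[best] = True
        block[accept] = pred[accept]
    return block
```
In this toy loop, lowering the threshold commits more tokens per pass (faster but riskier), while a threshold near 1.0 degenerates toward one token per pass.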
> Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning, a **500× reduction** vs. full-attention diffusion LLMs (Dream: 580B tokens), while **matching or surpassing AR baselines** in accuracy.

---
## Model Overview
- **Type**: Block Diffusion Language Model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-1.5B-Instruct`
- **Architecture**: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied embeddings
- **Params**: 1.54B (non-embedding: 1.31B)
- **Layers**: 28
- **Attention Heads**: 12 (Q), 2 (KV, GQA)
- **Key Feature**: Parallel **block-wise decoding** + **hierarchical caching**
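The architecture figures above can be sanity-checked against the shipped config. A minimal check, assuming the checkpoint exposes standard Qwen2-style config fields (as the Qwen2.5 base model does):
```python
from transformers import AutoConfig

# trust_remote_code=True lets any custom config class from the repo load
cfg = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_1.5B",
    trust_remote_code=True,
)
print(cfg.num_hidden_layers)     # expected: 28
print(cfg.num_attention_heads)   # expected: 12 query heads
print(cfg.num_key_value_heads)   # expected: 2 KV heads (GQA)
```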
---
## Installation
You will need `transformers`, `torch`, and `numpy`; the **custom generation function** ships with the model repository and is loaded via `trust_remote_code=True`:
```bash
pip install transformers torch numpy
```
---
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

# trust_remote_code=True loads the custom Fast-dLLM v2 generation code from the repo
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,   # sub-block size for parallel decoding
    threshold=0.9,        # confidence threshold for accepting tokens in parallel
)
response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
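The two extra arguments to `generate`, `small_block_size` and `threshold`, are taken from the snippet above; judging by the names and the hierarchical-cache design, they control the sub-block size used for parallel decoding and the confidence needed to commit tokens in a pass (their exact semantics are defined in the repository code). Continuing from the Quickstart variables, a quick illustrative way to see how the threshold affects speed, assuming the custom `generate` returns token ids like the standard Hugging Face `generate` (wall-clock numbers vary by GPU and are not official benchmarks):
```python
import time

# Sweep the confidence threshold and measure rough throughput.
# Lower thresholds typically accept more tokens per pass, trading quality for speed.
for threshold in (0.5, 0.9, 0.99):
    start = time.perf_counter()
    out = model.generate(
        inputs["input_ids"],
        tokenizer=tokenizer,
        max_new_tokens=256,
        small_block_size=8,
        threshold=threshold,
    )
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"threshold={threshold}: {new_tokens} tokens in {elapsed:.2f}s "
          f"({new_tokens / elapsed:.1f} tok/s)")
```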
---
## Performance & Benchmarks
### Real-time Throughput
Fast-dLLM v2 offers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.

---
### Benchmark Results
We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:
HumanEval and MBPP (code), GSM8K and MATH (math reasoning), IFEval (instruction following), MMLU and GPQA (knowledge QA).
- **1B group**: Fast-dLLM v2 (1.5B) achieves **best average score: 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves **best average score: 60.3**, surpassing LLaDA and Dream models.

---
## Citation
If you use Fast-dLLM v2 in your research or products, please cite:
```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328},
}
```
---
## License
Released under **Apache 2.0**, following the base Qwen2.5 license.
---
## Resources
- [Paper](https://arxiv.org/abs/2509.26328)
- [Code](https://github.com/NVlabs/Fast-dLLM)
- [Hugging Face Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)