---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

## 📖 Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherent sequential decoding limits inference efficiency**.

We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-1.5B-Instruct**) into a diffusion-style decoder for **parallel text generation**.

Our approach introduces a novel decoding recipe that combines a complementary attention mask with a block diffusion mechanism; together they enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further improve inference speed, we design a hierarchical caching mechanism: a block-level cache stores historical context representations across blocks, and a sub-block-level cache supports efficient parallel decoding within partially generated blocks.

### ✨ Key Innovations
- **Block Diffusion Mechanism + Complementary Attention Mask**  
  Enables **blockwise bidirectional context modeling** without sacrificing AR objectives (see the mask sketch after this list).
- **Hierarchical Caching**  
  - **Block-level cache**: Stores historical context representations across blocks.
  - **Sub-block cache**: Parallel decoding within partially generated blocks.
- **Token Shift Mechanism**  
  Retains autoregressive characteristics while supporting bidirectional context within blocks.
- **Parallel Decoding Pipeline**  
  Achieves up to **2.5× speedup** over standard AR decoding **without compromising quality**.
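
The complementary mask is easiest to picture as a blockwise pattern: tokens attend bidirectionally within their own block, while each block as a whole attends only to earlier blocks. Below is a didactic sketch of that pattern (our own illustration, not the model's actual mask-construction code):

```python
import torch

def blockwise_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask: entry [q, k] is True if query q may attend to key k.

    Within a block: full bidirectional attention.
    Across blocks: causal (a block sees only itself and earlier blocks).
    """
    block_ids = torch.arange(seq_len) // block_size  # block index per position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(blockwise_mask(seq_len=8, block_size=4).int())
# Each 4-token block attends to itself fully and to all preceding blocks.
```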

> 🚀 Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning, a **500× reduction** versus full-attention diffusion LLMs (Dream: 580B tokens), while **matching or surpassing AR baselines** in accuracy.


![Generation Process](assets/visualization_animation.gif)

---

## 🛠 Model Overview
- **Type**: Block Diffusion Language Model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-1.5B-Instruct`
- **Architecture**: Transformer w/ RoPE, SwiGLU, RMSNorm, Attention QKV bias, tied embeddings
- **Params**: 1.54B (non-embedding: 1.31B)
- **Layers**: 28
- **Attention Heads**: 12 (Q), 2 (KV, GQA)
- **Key Feature**: Parallel **block-wise decoding** + **hierarchical caching** (see the config check below)
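
These numbers can be read straight off the model config. A quick sanity check (assumes Hub access; the field names follow the standard Qwen2-style config, which we assume the remote code preserves):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_1.5B",
    trust_remote_code=True,
)

# Standard Qwen2-style config fields (assumed unchanged by the remote code)
print("layers:     ", config.num_hidden_layers)    # expected 28
print("query heads:", config.num_attention_heads)  # expected 12
print("KV heads:   ", config.num_key_value_heads)  # expected 2 (GQA)
```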

---

## 📦 Installation
You will need `transformers` and `torch`; the model's **custom generation function** ships with the checkpoint and is loaded via `trust_remote_code=True`:

```bash
# accelerate backs device_map="auto" in the Quickstart below
pip install transformers torch numpy accelerate
```

---

## 🚀 Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding (custom generate() loaded via trust_remote_code)
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)
print(response)
```
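
`small_block_size` and `threshold` are the two decoding knobs exposed by the custom `generate`. Based on the paper's confidence-aware parallel decoding, `small_block_size` should control how many positions are decoded together within a block and `threshold` should be the confidence cutoff for accepting tokens in parallel; treat these readings, and the values below, as illustrative assumptions rather than official guidance:

```python
# Same model/tokenizer/inputs as above. Smaller sub-blocks and a stricter
# threshold trade speed for more conservative decoding (illustrative values).
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=4,   # fewer positions drafted in parallel
    threshold=0.95,       # accept parallel tokens only at higher confidence
)
print(tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
))
```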

---

## 📊 Performance & Benchmarks

### ▶ Real-time Throughput
Fast-dLLM v2 delivers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.

![Throughput Comparison](assets/throughput.png)
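
To get a rough number on your own hardware, time the Quickstart call and divide new tokens by wall-clock seconds (a minimal sketch; throughput varies with GPU, batch size, and the decoding knobs):

```python
import time

import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()  # finish pending GPU work before timing
start = time.perf_counter()
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

n_new = gen_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"{n_new} tokens in {elapsed:.2f}s -> {n_new / elapsed:.1f} tok/s")
```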

---

### πŸ† Benchmark Results
We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:  
HumanEval, MBPP (code), GSM8K, Math (reasoning), IFEval (instruction), MMLU, GPQA (knowledge QA).

- **1B group**: Fast-dLLM v2 (1.5B) achieves the **best average score of 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves the **best average score of 60.3**, surpassing the LLaDA and Dream models.

![Benchmark Results](assets/benchmark_results.png)

---

## 📜 Citation

If you use Fast-dLLM v2 in your research or products, please cite:

```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
  title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
  author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
  year={2025},
  eprint={2509.26328},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.26328},
}
```

---

## 📄 License
Released under **Apache 2.0**, following the base Qwen2.5 license.

---

## 🔗 Resources
- 📄 [Paper](https://arxiv.org/abs/2509.26328)
- 💻 [Code](https://github.com/NVlabs/Fast-dLLM)
- 🤗 [Hugging Face Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)