---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

## 📖 Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherent sequential decoding limits inference efficiency**.

We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-1.5B-Instruct**) into a diffusion-style decoder for **parallel text generation**.

Our approach introduces a novel decoding recipe that combines a complementary attention mask with a block diffusion mechanism; together they enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further improve inference speed, we design a hierarchical caching mechanism: a block-level cache stores historical context representations across blocks, and a sub-block-level cache supports efficient parallel decoding within partially generated blocks.

### ✨ Key Innovations
- **Block Diffusion Mechanism + Complementary Attention Mask**  
  Enables **blockwise bidirectional context modeling** without sacrificing AR objectives (see the mask sketch after this list).
- **Hierarchical Caching**  
  - **Block-level cache**: Stores historical context representations across blocks.
  - **Sub-block cache**: Parallel decoding within partially generated blocks.
- **Token Shift Mechanism**  
  Retains autoregressive characteristics while supporting bidirectional context within blocks.
- **Parallel Decoding Pipeline**  
  Achieves up to **2.5× speedup** over standard AR decoding **without compromising quality**.
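
The complementary mask is easiest to picture as a blockwise pattern: tokens attend bidirectionally within their own block, while each block as a whole attends only to earlier blocks. Below is a didactic sketch of that pattern (our own illustration, not the model's actual mask-construction code):

```python
import torch

def blockwise_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask: entry [q, k] is True if query q may attend to key k.

    Within a block: full bidirectional attention.
    Across blocks: causal (a block sees only itself and earlier blocks).
    """
    block_ids = torch.arange(seq_len) // block_size  # block index per position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

print(blockwise_mask(seq_len=8, block_size=4).int())
# Each 4-token block attends to itself fully and to all preceding blocks.
```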

> 🚀 Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning, a **500× reduction** versus full-attention diffusion LLMs (Dream: 580B tokens), while **matching or surpassing AR baselines** in accuracy.


![Generation Process](assets/visualization_animation.gif)

---

## 🛠 Model Overview
- **Type**: Block Diffusion Language Model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-1.5B-Instruct`
- **Architecture**: Transformer w/ RoPE, SwiGLU, RMSNorm, Attention QKV bias, tied embeddings
- **Params**: 1.54B (non-embedding: 1.31B)
- **Layers**: 28
- **Attention Heads**: 12 (Q), 2 (KV, GQA)
- **Key Feature**: Parallel **block-wise decoding** + **hierarchical caching** (see the config check below)
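
These numbers can be read straight off the model config. A quick sanity check (assumes Hub access; the field names follow the standard Qwen2-style config, which we assume the remote code preserves):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_1.5B",
    trust_remote_code=True,
)

# Standard Qwen2-style config fields (assumed unchanged by the remote code)
print("layers:     ", config.num_hidden_layers)    # expected 28
print("query heads:", config.num_attention_heads)  # expected 12
print("KV heads:   ", config.num_key_value_heads)  # expected 2 (GQA)
```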

---

## 📦 Installation
You will need `transformers` and `torch`; the model's **custom generation function** ships with the checkpoint and is loaded via `trust_remote_code=True`:

```bash
# accelerate backs device_map="auto" in the Quickstart below
pip install transformers torch numpy accelerate
```

---

## 🚀 Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding (custom generate() loaded via trust_remote_code)
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)
print(response)
```
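
`small_block_size` and `threshold` are the two decoding knobs exposed by the custom `generate`. Based on the paper's confidence-aware parallel decoding, `small_block_size` should control how many positions are decoded together within a block and `threshold` should be the confidence cutoff for accepting tokens in parallel; treat these readings, and the values below, as illustrative assumptions rather than official guidance:

```python
# Same model/tokenizer/inputs as above. Smaller sub-blocks and a stricter
# threshold trade speed for more conservative decoding (illustrative values).
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=4,   # fewer positions drafted in parallel
    threshold=0.95,       # accept parallel tokens only at higher confidence
)
print(tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
))
```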

---

## 📊 Performance & Benchmarks

### ▶ Real-time Throughput
Fast-dLLM v2 delivers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.

![Throughput Comparison](assets/throughput.png)
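
To get a rough number on your own hardware, time the Quickstart call and divide new tokens by wall-clock seconds (a minimal sketch; throughput varies with GPU, batch size, and the decoding knobs):

```python
import time

import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()  # finish pending GPU work before timing
start = time.perf_counter()
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

n_new = gen_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"{n_new} tokens in {elapsed:.2f}s -> {n_new / elapsed:.1f} tok/s")
```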

---

### πŸ† Benchmark Results
We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:  
HumanEval, MBPP (code), GSM8K, Math (reasoning), IFEval (instruction), MMLU, GPQA (knowledge QA).

- **1B group**: Fast-dLLM v2 (1.5B) achieves the **best average score of 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves the **best average score of 60.3**, surpassing the LLaDA and Dream models.

![Benchmark Results](assets/benchmark_results.png)

---

## 📜 Citation

If you use Fast-dLLM v2 in your research or products, please cite:

```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
  title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
  author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
  year={2025},
  eprint={2509.26328},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.26328},
}
```

---

## 📄 License
Released under **Apache 2.0**, following the base Qwen2.5 license.

---

## 🔗 Resources
- 📄 [Paper](https://arxiv.org/abs/2509.26328)
- 💻 [Code](https://github.com/NVlabs/Fast-dLLM)
- 🤗 [Hugging Face Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B)