---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
---

# Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

## Introduction

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherently sequential decoding limits inference efficiency**.

We present **Fast-dLLM v2**, a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-1.5B-Instruct**) into a diffusion-style decoder for **parallel text generation**.

Our approach introduces a novel decoding recipe that combines a complementary attention mask with a block diffusion mechanism, together enabling blockwise bidirectional context modeling while preserving the original AR training objective and performance. To further accelerate inference, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block-level cache that supports efficient parallel decoding within partially generated blocks.
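
To make the attention pattern concrete, here is an illustrative PyTorch sketch (not the model's internal implementation; `block_diffusion_mask` is a hypothetical helper) of a block-causal mask in which tokens attend bidirectionally within their own block and causally to all earlier blocks:

```python
import torch

def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask (True = may attend): bidirectional inside a block,
    causal across blocks."""
    block_idx = torch.arange(seq_len) // block_size
    # position i may attend to position j iff j's block is not later than i's
    return block_idx.unsqueeze(1) >= block_idx.unsqueeze(0)

# Example: two blocks of 4 tokens each.
print(block_diffusion_mask(seq_len=8, block_size=4).int())
```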

### Key Innovations

- **Block Diffusion Mechanism + Complementary Attention Mask**

  Enables **blockwise bidirectional context modeling** without sacrificing the AR objective.

- **Hierarchical Caching**

  - **Block-level cache**: stores historical context representations across blocks.

  - **Sub-block cache**: enables efficient parallel decoding within partially generated blocks.

- **Token Shift Mechanism**

  Retains autoregressive characteristics while supporting bidirectional context within blocks.

- **Parallel Decoding Pipeline**

  Achieves up to a **2.5× speedup** over standard AR decoding **without compromising quality** (a sketch follows this list).
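
As a rough illustration of confidence-thresholded parallel decoding (a minimal sketch under assumed semantics, not the shipped implementation; `parallel_unmask_step` is a hypothetical helper): positions in a block whose prediction confidence clears a threshold are committed in the same step, and the rest stay masked for the next iteration.

```python
import torch

def parallel_unmask_step(logits: torch.Tensor,
                         still_masked: torch.Tensor,
                         threshold: float = 0.9):
    """One decoding step over a block.
    logits: (block_len, vocab_size); still_masked: (block_len,) bool.
    Returns predicted ids and a bool mask of positions committed this step."""
    probs = logits.softmax(dim=-1)
    confidence, predicted = probs.max(dim=-1)
    accept = still_masked & (confidence >= threshold)
    if still_masked.any() and not accept.any():
        # Guarantee progress: commit the single most confident masked token.
        best = confidence.masked_fill(~still_masked, -1.0).argmax()
        accept[best] = True
    return predicted, accept

# Toy usage: one block of 8 positions, all still masked.
logits = torch.randn(8, 32000)
masked = torch.ones(8, dtype=torch.bool)
ids, committed = parallel_unmask_step(logits, masked, threshold=0.9)
```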

> Fast-dLLM v2 uses **only ~1B tokens** of fine-tuning data, a **500× reduction** versus full-attention diffusion LLMs such as Dream (580B tokens), while **matching or surpassing AR baselines** in accuracy.

![Fast-dLLM v2 model overview](assets/model.png)

---

## Model Overview

- **Type**: Block Diffusion Language Model (dLLM)
- **Base Model**: `Qwen/Qwen2.5-1.5B-Instruct`
- **Architecture**: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied embeddings
- **Params**: 1.54B (non-embedding: 1.31B)
- **Layers**: 28
- **Attention Heads**: 12 query heads, 2 key/value heads (GQA)
- **Key Feature**: parallel **block-wise decoding** + **hierarchical caching**
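
To sanity-check these numbers locally, you can read them off the checkpoint's config (a sketch assuming the standard Qwen2-style config field names):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_1.5B", trust_remote_code=True
)
print(cfg.num_hidden_layers)    # layers: expected 28
print(cfg.num_attention_heads)  # query heads: expected 12
print(cfg.num_key_value_heads)  # KV heads (GQA): expected 2
```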

---

## Installation

You will need `transformers`, `torch`, and our **custom generation function**, which ships with the checkpoint and is loaded via `trust_remote_code=True` in the Quickstart below:

```bash
pip install transformers torch numpy
```

---

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_1.5B"

# trust_remote_code=True loads the custom block-diffusion generation logic.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding.
# small_block_size: sub-block size used by the hierarchical cache.
# threshold: confidence cutoff for accepting tokens in parallel.
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

# Strip the prompt tokens and decode only the newly generated text.
response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
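
A lower `threshold` commits more tokens per step, trading some output quality for speed (this reading of the parameter is our assumption, consistent with the parallel decoding pipeline described above). A quick sweep, reusing the objects defined in the Quickstart:

```python
# Hypothetical sweep over acceptance thresholds; reuses `model`,
# `tokenizer`, and `inputs` from the Quickstart above.
for threshold in (0.5, 0.7, 0.9):
    out = model.generate(
        inputs["input_ids"],
        tokenizer=tokenizer,
        max_new_tokens=128,
        small_block_size=8,
        threshold=threshold,
    )
    text = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"threshold={threshold}: {text[:80]!r}")
```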

---

## Performance & Benchmarks

### Real-time Throughput

Fast-dLLM v2 delivers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.

![Real-time throughput comparison](assets/throughput.png)

---

### Benchmark Results

We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks: HumanEval and MBPP (code), GSM8K and MATH (math reasoning), IFEval (instruction following), and MMLU and GPQA (knowledge QA).

- **1B group**: Fast-dLLM v2 (1.5B) achieves the **best average score: 45.0**.
- **7B group**: Fast-dLLM v2 (7B) achieves the **best average score: 60.3**, surpassing the LLaDA and Dream models.

![Benchmark results](assets/benchmark.png)

---

## Citation

If you use Fast-dLLM v2 in your research or products, please cite:

```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328},
}
```

---

## License

Released under **Apache 2.0**, following the license of the base Qwen2.5 model.
--- |
|
|
|
|
|
## π Resources |
|
|
- π [Paper](https://arxiv.org/abs/2509.26328) |
|
|
- π» [Code](https://github.com/NVlabs/Fast-dLLM) |
|
|
- π€ [HuggingFace Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_1.5B) |