# DeepSeek-V3.2-NVFP4
NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with a reference CPU inference implementation.
---
## Model Description
DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format, using roughly 8x fewer bits per weight than FP32 and shrinking the checkpoint from about 642 GB to 391 GB.
### Quantization Details
| Property | Value |
|----------|-------|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8 to NVFP4 converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | ~8x per weight vs FP32; ~1.6x vs FP8 source |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |
### Preserved Components (Not Quantized)
The following sensitive components are preserved in their original precision to maintain model quality:
- Embeddings (model.embed_tokens)
- Output head (lm_head)
- MoE router gates (*.mlp.gate)
- Layer norms and RMS norms
- DSA indexer weights (indexer.weights_proj, indexer.k_norm)
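For illustration, a name-based filter along these lines could reproduce this skip list (the helper and the exact substring patterns below are assumptions, not the converter's actual code):

```python
# Hypothetical sketch: decide which tensors are converted to NVFP4 versus
# kept in their original precision (patterns mirror the list above).
SKIP_PATTERNS = (
    "model.embed_tokens",    # embeddings
    "lm_head",               # output head
    ".mlp.gate.",            # MoE router gates
    "norm",                  # layer norms / RMS norms
    "indexer.weights_proj",  # DSA indexer
    "indexer.k_norm",
)

def should_quantize(name: str) -> bool:
    """Return True if a weight tensor should be converted to NVFP4."""
    if not name.endswith(".weight"):
        return False
    return not any(pattern in name for pattern in SKIP_PATTERNS)
```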
---
## Reference Implementation
The `inference/` directory contains a functional reference implementation for CPU inference:
### Quick Start
```bash
cd inference
# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py
# Run forward pass test (10-15 minutes)
python test_forward_pass.py
# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10
```
### Implementation Details
| File | Description |
|------|-------------|
| `model.py` | DeepSeek V3.2 architecture with NVFP4 support |
| `generate.py` | Text generation and inference pipeline |
| `nvfp4_kernel.py` | NVFP4 CPU dequantization kernels |
| `kernel.py` | FP8 runtime kernels with CPU fallbacks |
| `encoding_dsv32.py` | DeepSeek V3.2 message encoding |
| `test_*.py` | Comprehensive test suite |
See `inference/README.md` for complete documentation.
---
## Hardware Requirements
### CPU Inference (Reference Implementation)
- RAM: Minimum 400 GB
- CPU: Multi-core recommended
- Performance: Approximately 2-5 minutes per token
### GPU Inference (Future)
- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU
---
## NVFP4 Format Specification
### E2M1 Floating Point
- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: 2 FP4 values packed per uint8 byte
### Dual-Level Scaling
- Per-block scale: FP8 E4M3, 16 elements per block
- Global scale: FP32 scalar
- Formula: `value = fp4_value * weight_scale * weight_scale_2`, where `fp4_value` is the E2M1 value unpacked from its uint8 byte
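A minimal NumPy sketch of this dequantization (a reference sketch, not the repository's `nvfp4_kernel.py`; the low-nibble-first order and the assumption that block scales arrive already decoded to FP32 are mine):

```python
import numpy as np

# The 16 E2M1 code points, indexed by the 4-bit code (sign bit, 2 exponent
# bits, 1 mantissa bit). Codes 8-15 are the negative counterparts of 0-7.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4_row(packed, weight_scale, weight_scale_2):
    """Dequantize one row.

    packed:         uint8 array [N/2], two FP4 codes per byte
    weight_scale:   per-block scales [N/16], assumed already decoded to float
    weight_scale_2: global FP32 scalar
    returns:        FP32 array [N]
    """
    lo = packed & 0x0F                     # assumed: first value in the low nibble
    hi = packed >> 4                       # second value in the high nibble
    codes = np.stack([lo, hi], axis=-1).reshape(-1)
    vals = E2M1_LUT[codes]                 # decoded E2M1 values, [N]
    block_scales = np.repeat(weight_scale.astype(np.float32), 16)
    return vals * block_scales * np.float32(weight_scale_2)
```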
---
## Architecture Notes
### Multi-head Latent Attention (MLA)
- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency
### Sparse Attention (DSA)
- Indexer class computes attention pattern selection
- Top-k sparse pattern for efficient long-context
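As a generic illustration of the selection step only (score computation, masking, and caching in the actual `Indexer` differ):

```python
import torch

def sparse_attention_select(index_scores, keys, values, top_k):
    """Keep only the top-k cached positions per query before attention.
    index_scores: [q_len, kv_len]; keys/values: [kv_len, d]."""
    k = min(top_k, index_scores.shape[-1])
    idx = index_scores.topk(k, dim=-1).indices   # [q_len, k]
    return keys[idx], values[idx]                # each [q_len, k, d]
```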
### Mixture of Experts (MoE)
- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing
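For orientation, a generic top-k routing sketch (DeepSeek's actual gate uses its own scoring and load-balancing bias terms; only the select-and-normalize pattern is shown):

```python
import torch

def route_tokens(hidden, gate_weight, top_k=8):
    """Score every expert per token, keep the top-k, and normalize the kept
    scores into mixing weights. hidden: [tokens, d]; gate_weight: [n_experts, d]."""
    scores = torch.sigmoid(hidden @ gate_weight.t())            # [tokens, n_experts]
    top_scores, top_idx = scores.topk(top_k, dim=-1)            # [tokens, top_k]
    weights = top_scores / top_scores.sum(dim=-1, keepdim=True)
    return top_idx, weights
```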
---
## Conversion Process
This model was converted using a custom FP8 to NVFP4 streaming converter:
1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
2. Compute NVFP4 scales:
   - Global scale: `scale_2 = amax / (6.0 * 448.0)` (6.0 is the E2M1 max, 448.0 the E4M3 max)
   - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
4. Pack: Two FP4 values per uint8 byte
Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.
See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.
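For orientation, a simplified NumPy sketch of steps 2-4 on a single FP32 weight (the real converter streams shards, casts block scales to E4M3, and handles the shared-`weight_scale_2` case noted above; the low-nibble-first packing order is an assumption):

```python
import numpy as np

# Positive E2M1 grid; the sign is handled separately via the code's high bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w_fp32, block=16):
    """Sketch of steps 2-4 for one 2-D FP32 weight [M, N] (N divisible by 16).
    Simplification: block scales stay in FP32 instead of being cast to E4M3."""
    amax = np.abs(w_fp32).max()
    scale_2 = np.float32(amax / (6.0 * 448.0))               # global scale
    blocks = w_fp32.reshape(w_fp32.shape[0], -1, block)      # [M, N/16, 16]
    block_amax = np.abs(blocks).max(axis=-1)
    scale = np.maximum(block_amax / (6.0 * scale_2), 1e-12)  # per-block scale
    scaled = blocks / (scale[..., None] * scale_2)           # values now within [-6, 6]
    # Round each magnitude to the nearest E2M1 grid point, then add the sign bit.
    mag_idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    codes = np.where(scaled < 0, mag_idx + 8, mag_idx).astype(np.uint8)
    codes = codes.reshape(w_fp32.shape[0], -1)               # [M, N]
    packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)  # [M, N/2]
    return packed, scale.astype(np.float32), scale_2
```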
### Tensor Format
For each quantized weight:
- `*.weight`: Packed uint8 [M, N/2]
- `*.weight_scale`: FP8 E4M3 per-block scale [M, N/16]
- `*.weight_scale_2`: FP32 global scale [1]
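As a usage example, the three tensors for one weight can be inspected with `safetensors` (the shard and tensor names below are placeholders; reading the E4M3 scale with `framework="pt"` requires a PyTorch build with float8 support):

```python
from safetensors import safe_open

shard = "model-00001-of-00073.safetensors"     # placeholder shard name
name = "model.layers.0.mlp.experts.0.up_proj"  # placeholder module name

with safe_open(shard, framework="pt") as f:
    w  = f.get_tensor(f"{name}.weight")          # uint8, [M, N/2]
    s  = f.get_tensor(f"{name}.weight_scale")    # FP8 E4M3, [M, N/16]
    s2 = f.get_tensor(f"{name}.weight_scale_2")  # FP32, [1]
    print(w.shape, s.shape, s2.item())
```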
---
## Validation
Comprehensive testing completed:
- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391 GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: Coherent, semantically correct responses
See `conversion_report.json` for detailed conversion statistics.
---
## Acknowledgments
- Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
- NVFP4 format based on NVIDIA TensorRT Model Optimizer
---
## License
This model inherits the MIT License from the original DeepSeek-V3.2 model.
---
## Citation
```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```
---
## Contact
For issues with the quantized version or reference implementation, please open an issue.
For questions about the original model, contact DeepSeek AI.