# DeepSeek-V3.2-NVFP4

NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with reference CPU inference implementation.

---

## Model Description

DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format, reducing the checkpoint from roughly 642 GB (FP8) to 391 GB.

### Quantization Details

| Property | Value |
|----------|-------|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8-to-NVFP4 streaming converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | ~1.6x vs the FP8 source; ~7x vs an FP32-equivalent checkpoint |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |

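
As a back-of-the-envelope check on these figures, the nominal cost per quantized element follows from the format itself; the observed ~1.6x ratio is slightly below the nominal ~1.78x because the components listed in the next section are kept at higher precision:

```python
# Nominal storage per NVFP4-quantized element (the per-tensor FP32 global
# scale adds a negligible constant and is ignored here).
fp4_bits = 4                  # one packed E2M1 value
block_scale_bits = 8 / 16     # one FP8 E4M3 scale shared by 16 elements
bits_per_weight = fp4_bits + block_scale_bits   # 4.5 bits per weight

print(8 / bits_per_weight)    # ~1.78x vs the FP8 source format
print(32 / bits_per_weight)   # ~7.1x vs an FP32-equivalent checkpoint
print(642 / 391)              # ~1.64x actually observed for this checkpoint
```
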
### Preserved Components (Not Quantized)

The following sensitive components are preserved in their original precision to maintain model quality:

- Embeddings (`model.embed_tokens`)
- Output head (`lm_head`)
- MoE router gates (`*.mlp.gate`)
- Layer norms and RMS norms
- DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)

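
A hypothetical name filter illustrating these skip rules (the authoritative patterns live in the converter, `tools/fp8_to_nvfp4_streaming.py`):

```python
# Hypothetical sketch: decide per tensor name whether to quantize to NVFP4.
# The substrings mirror the list above; the real converter may match differently.
SKIP_SUBSTRINGS = (
    "model.embed_tokens",     # embeddings
    "lm_head",                # output head
    ".mlp.gate.",             # MoE router gate (note: not the experts' gate_proj)
    "norm",                   # layer norms / RMS norms, including indexer.k_norm
    "indexer.weights_proj",   # DSA indexer projection
)

def should_quantize(tensor_name: str) -> bool:
    """Return True if the tensor should be converted to NVFP4."""
    return not any(s in tensor_name for s in SKIP_SUBSTRINGS)
```
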
---

## Reference Implementation

The `inference/` directory contains a functional reference implementation for CPU inference:

### Quick Start

```bash
cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10
```

### Implementation Details

| File | Description |
|------|-------------|
| `model.py` | DeepSeek-V3.2 architecture with NVFP4 support |
| `generate.py` | Text generation and inference pipeline |
| `nvfp4_kernel.py` | NVFP4 CPU dequantization kernels |
| `kernel.py` | FP8 runtime kernels with CPU fallbacks |
| `encoding_dsv32.py` | DeepSeek-V3.2 message encoding |
| `test_*.py` | Comprehensive test suite |

See `inference/README.md` for complete documentation.

---

## Hardware Requirements

### CPU Inference (Reference Implementation)

- RAM: Minimum 400 GB
- CPU: Multi-core recommended
- Performance: Approximately 2-5 minutes per token

### GPU Inference (Future)

- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU

---

## NVFP4 Format Specification

### E2M1 Floating Point

- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: 2 FP4 values packed per uint8 byte

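
A minimal decode sketch, assuming the first element of each pair sits in the low nibble (the actual ordering is defined by `inference/nvfp4_kernel.py`):

```python
import numpy as np

# 16-entry E2M1 codebook: codes 0-7 are the non-negative values listed above,
# codes 8-15 the same magnitudes with the sign bit set.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_e2m1(packed: np.ndarray) -> np.ndarray:
    """Expand a uint8 array of shape [M, N/2] into decoded values of shape [M, N]."""
    lo = packed & 0x0F                     # assumed: first value of each pair
    hi = packed >> 4                       # assumed: second value of each pair
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return E2M1_LUT[codes]
```
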
### Dual-Level Scaling

- Per-block scale: FP8 E4M3, 16 elements per block
- Global scale: FP32 scalar
- Formula: `value = e2m1(code) * weight_scale * weight_scale_2`, where `e2m1(code)` is the decoded 4-bit value

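
Putting the two scale levels together (a minimal sketch; `decoded` is the output of an E2M1 decode such as the one sketched above, and the FP8 block scales are assumed to have been cast to float32):

```python
import numpy as np

def dequantize(decoded, weight_scale, weight_scale_2, block=16):
    # decoded:        float32 [M, N]     decoded E2M1 values
    # weight_scale:   float32 [M, N/16]  per-block scales (cast from FP8 E4M3)
    # weight_scale_2: float32 scalar     global scale
    return decoded * np.repeat(weight_scale, block, axis=1) * weight_scale_2
```
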
---

## Architecture Notes

### Multi-head Latent Attention (MLA)

- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency

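
A conceptual sketch of the latent KV compression idea (sizes and module names here are illustrative, not the model's actual configuration):

```python
import torch
import torch.nn as nn

hidden, latent, n_heads, head_dim = 1024, 128, 8, 64     # illustrative sizes only

kv_down = nn.Linear(hidden, latent, bias=False)           # compress before caching
kv_up = nn.Linear(latent, n_heads * 2 * head_dim, bias=False)  # expand at attention time

h = torch.randn(1, 4, hidden)            # [batch, seq, hidden]
kv_cache = kv_down(h)                    # only the small latent is cached
# (the repository additionally stores this cache in FP8 for memory efficiency)
k, v = kv_up(kv_cache).chunk(2, dim=-1)  # reconstructed per-head keys and values
```
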
### Sparse Attention (DSA)

- An indexer module computes the attention pattern selection (which tokens each query attends to)
- Top-k sparse pattern for efficient long-context inference

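
A conceptual sketch only: a lightweight indexer scores the cached tokens for each query, and attention is then restricted to the top-k highest-scoring positions (the actual DSA indexer in `model.py` is more involved):

```python
import torch

def select_sparse_indices(q_idx: torch.Tensor, k_idx: torch.Tensor, top_k: int):
    # q_idx: [q_len, d], k_idx: [kv_len, d] -- cheap indexer projections
    scores = q_idx @ k_idx.T                      # [q_len, kv_len]
    k = min(top_k, k_idx.shape[0])
    return scores.topk(k, dim=-1).indices         # positions each query attends to
```
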
### Mixture of Experts (MoE)

- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing

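
A simplified sketch of the routing step (top-8 of 256 routed experts plus a shared expert); the real router also uses grouped selection and load-balancing terms, which are omitted here:

```python
import torch

def moe_forward(x, router_weight, experts, shared_expert, top_k=8):
    # x: [tokens, hidden]; router_weight: [hidden, 256]; experts: list of 256 MLPs
    scores = torch.sigmoid(x @ router_weight)        # per-expert affinity
    weights, idx = scores.topk(top_k, dim=-1)        # pick 8 experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = shared_expert(x)                           # shared expert sees every token
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] = out[t] + w * experts[e](x[t])
    return out
```
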
---

## Conversion Process

This model was converted using a custom FP8-to-NVFP4 streaming converter:

1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
2. Compute NVFP4 scales:
   - Global scale: `scale_2 = amax / (6.0 * 448.0)`
   - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
4. Pack: two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.

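
A simplified, single-tensor sketch of steps 2-4 (nearest-value rounding and low-nibble-first packing are assumptions; handling of all-zero blocks and the shared-scale MoE case above are omitted):

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w: np.ndarray, block: int = 16):
    """Quantize one FP32 [M, N] weight (N divisible by 16) to packed NVFP4."""
    M, N = w.shape
    amax = np.abs(w).max()
    scale_2 = np.float32(amax / (6.0 * 448.0))             # FP32 global scale
    blocks = w.reshape(M, N // block, block)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = block_amax / (6.0 * scale_2)                    # stored as FP8 E4M3 in the checkpoint
    scaled = blocks / np.maximum(scale[..., None] * scale_2, 1e-12)   # now in [-6, 6]
    codes = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1).astype(np.uint8)
    codes[scaled < 0] |= 0x8                                # set the E2M1 sign bit
    codes = codes.reshape(M, N)
    packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale.astype(np.float32), scale_2
```
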
See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.

### Tensor Format

For each quantized weight:

- `*.weight`: packed uint8, shape [M, N/2]
- `*.weight_scale`: FP8 E4M3 per-block scales, shape [M, N/16]
- `*.weight_scale_2`: FP32 global scale, shape [1]

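
For example, the three tensors for one quantized projection can be read from a shard like this (the shard and tensor names below are hypothetical; actual names follow the model's safetensors index):

```python
# Hypothetical example of reading the NVFP4 triplet for one weight.
from safetensors import safe_open

shard = "model-00001-of-000073.safetensors"          # hypothetical shard file name
prefix = "model.layers.4.mlp.experts.7.down_proj"    # hypothetical weight prefix

with safe_open(shard, framework="pt") as f:
    packed = f.get_tensor(f"{prefix}.weight")                  # uint8, [M, N/2]
    weight_scale = f.get_tensor(f"{prefix}.weight_scale")      # FP8 E4M3, [M, N/16]
    weight_scale_2 = f.get_tensor(f"{prefix}.weight_scale_2")  # FP32, [1]
```
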
---

## Validation

Comprehensive testing completed:

- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391 GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: Coherent, semantically correct responses

See `conversion_report.json` for detailed conversion statistics.

---

## Acknowledgments

- Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
- NVFP4 format based on NVIDIA TensorRT Model Optimizer

---

## License

This model inherits the MIT License from the original DeepSeek-V3.2 model.

---

## Citation

```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```

---

## Contact

For issues with the quantized version or reference implementation, please open an issue.

For questions about the original model, contact DeepSeek AI.