---
license: mit
library_name: transformers
base_model: deepseek-ai/DeepSeek-V3.2
tags:
  - nvfp4
  - fp4
  - quantized
  - deepseek
  - moe
---

# DeepSeek-V3.2-NVFP4

NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with a reference CPU inference implementation.

---

## Model Description

DeepSeek-V3.2 is a 685B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token. This quantized version converts the original FP8 weights to NVFP4 format, a nominal 8x reduction relative to FP32 storage (4-bit weights plus block scales).

### Quantization Details

| Property | Value |
|----------|-------|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8 to NVFP4 converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | Approximately 1.6x vs the FP8 source (8x vs FP32, nominal) |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |

### Preserved Components (Not Quantized)

The following sensitive components are kept in their original precision to maintain model quality:

- Embeddings (`model.embed_tokens`)
- Output head (`lm_head`)
- MoE router gates (`*.mlp.gate`)
- Layer norms and RMS norms
- DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)

---

## Reference Implementation

The `inference/` directory contains a functional reference implementation for CPU inference.

### Quick Start

```bash
cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
  --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
  --config config_671B_nvfp4.json \
  --interactive \
  --max-new-tokens 10
```

### Implementation Details

| File | Description |
|------|-------------|
| `model.py` | DeepSeek-V3.2 architecture with NVFP4 support |
| `generate.py` | Text generation and inference pipeline |
| `nvfp4_kernel.py` | NVFP4 CPU dequantization kernels |
| `kernel.py` | FP8 runtime kernels with CPU fallbacks |
| `encoding_dsv32.py` | DeepSeek-V3.2 message encoding |
| `test_*.py` | Comprehensive test suite |

See `inference/README.md` for complete documentation.

---

## Hardware Requirements

### CPU Inference (Reference Implementation)

- RAM: 400 GB minimum
- CPU: multi-core recommended
- Performance: approximately 2-5 minutes per token

### GPU Inference (Future)

- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU

---

## NVFP4 Format Specification

### E2M1 Floating Point

- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: two FP4 values packed per uint8 byte

### Dual-Level Scaling

- Per-block scale: FP8 E4M3, 16 elements per block
- Global scale: FP32 scalar
- Formula: `value = fp4_value * weight_scale * weight_scale_2`, where `fp4_value` is the decoded E2M1 value

---

## Architecture Notes

### Multi-head Latent Attention (MLA)

- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency

### Sparse Attention (DSA)

- Indexer class computes attention pattern selection
- Top-k sparse pattern for efficient long-context attention

### Mixture of Experts (MoE)

- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing

---

## Conversion Process

This model was converted using a custom FP8 to NVFP4 streaming converter:

1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
2. Compute NVFP4 scales:
   - Global scale: `scale_2 = amax / (6.0 * 448.0)`
   - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
4. Pack: two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.

See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.

### Tensor Format

For each quantized weight:

- `*.weight`: packed uint8, shape `[M, N/2]`
- `*.weight_scale`: FP8 E4M3 per-block scale, shape `[M, N/16]`
- `*.weight_scale_2`: FP32 global scale, shape `[1]`
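The snippet below is a minimal NumPy sketch of how these three tensors combine back into an FP32 weight via the dual-level scaling formula above. It is not the shipped kernel (see `inference/nvfp4_kernel.py` for the actual implementation); the helper name `dequantize_nvfp4` and the low-nibble-first unpacking order are assumptions for illustration, and `weight_scale` is assumed to have already been converted from E4M3 into a regular float array.

```python
import numpy as np

# The 16 representable E2M1 values, indexed by the 4-bit code
# (bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa).
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2):
    """Recover an FP32 weight matrix from NVFP4 storage.

    packed         : uint8 [M, N/2], two E2M1 codes per byte
    weight_scale   : float [M, N/16], per-16-element-block scale (decoded from E4M3)
    weight_scale_2 : float scalar, global scale

    value = e2m1_decode(code) * weight_scale[block] * weight_scale_2
    """
    m, half_n = packed.shape

    # Unpack two 4-bit codes per byte (low nibble first is an assumption here;
    # the converter defines the actual packing order).
    codes = np.empty((m, half_n * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4

    # Decode E2M1 codes, then apply both scale levels:
    # each per-block scale is broadcast over its 16 elements.
    values = E2M1_LUT[codes]                                                # [M, N]
    block_scales = np.repeat(np.asarray(weight_scale, np.float32), 16, 1)   # [M, N]
    return values * block_scales * np.float32(weight_scale_2)
```

A production kernel would fuse this decode into the matrix multiply rather than materializing the full FP32 matrix, which is what the planned Triton NVFP4 kernels target on GPU.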
---

## Validation

Comprehensive testing completed:

- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391 GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: coherent, semantically correct responses

See `conversion_report.json` for detailed conversion statistics.

---

## Acknowledgments

- Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
- NVFP4 format based on NVIDIA TensorRT Model Optimizer

---

## License

This model inherits the MIT License from the original DeepSeek-V3.2 model.

---

## Citation

```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```

---

## Contact

For issues with the quantized version or reference implementation, please open an issue. For questions about the original model, contact DeepSeek AI.