# DeepSeek-V3.2-NVFP4

NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with a reference CPU inference implementation.

## Model Description

DeepSeek-V3.2 is a 685B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token. This quantized version converts the original FP8 weights to NVFP4 format, for roughly 8x compression relative to FP32 weights.

## Quantization Details
| Property | Value |
|---|---|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8 to NVFP4 converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | ~8x vs FP32 (4-bit vs 32-bit weights) |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |
### Preserved Components (Not Quantized)

The following sensitive components are kept in their original precision to maintain model quality (a name-based skip check is sketched after the list):
- Embeddings (model.embed_tokens)
- Output head (lm_head)
- MoE router gates (*.mlp.gate)
- Layer norms and RMS norms
- DSA indexer weights (indexer.weights_proj, indexer.k_norm)
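A minimal sketch of how a converter might skip these tensors by name; the substring patterns mirror the list above, and `should_quantize` is an illustrative helper, not the actual converter API:

```python
# Illustrative name-based filter; the patterns mirror the preserved-component list above.
PRESERVED_PATTERNS = (
    "model.embed_tokens",    # embeddings
    "lm_head",               # output head
    ".mlp.gate.",            # MoE router gates
    "norm",                  # layer norms and RMS norms (also catches indexer.k_norm)
    "indexer.weights_proj",  # DSA indexer projection
)

def should_quantize(tensor_name: str) -> bool:
    """Return True if a tensor should be converted to NVFP4, False if preserved."""
    return not any(pattern in tensor_name for pattern in PRESERVED_PATTERNS)

print(should_quantize("model.layers.0.mlp.experts.3.up_proj.weight"))  # True
print(should_quantize("model.layers.0.mlp.gate.weight"))               # False
```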
## Reference Implementation

The `inference/` directory contains a functional reference implementation for CPU inference.
### Quick Start

```bash
cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10
```
### Implementation Details
| File | Description |
|---|---|
| model.py | DeepSeek V3.2 architecture with NVFP4 support |
| generate.py | Text generation and inference pipeline |
| nvfp4_kernel.py | NVFP4 CPU dequantization kernels |
| kernel.py | FP8 runtime kernels with CPU fallbacks |
| encoding_dsv32.py | DeepSeek V3.2 message encoding |
| test_*.py | Comprehensive test suite |
See inference/README.md for complete documentation.
## Hardware Requirements

### CPU Inference (Reference Implementation)

- RAM: Minimum 400 GB
- CPU: Multi-core recommended
- Performance: Approximately 2-5 minutes per token
### GPU Inference (Future)
- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU
## NVFP4 Format Specification

### E2M1 Floating Point
- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: 2 FP4 values packed per uint8 byte
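A minimal sketch of decoding one packed byte into two E2M1 values; the bit layout (sign bit, two exponent bits, one mantissa bit) follows the standard E2M1 encoding, while the low-nibble-first packing order is an assumption about this repository:

```python
# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(code: int) -> float:
    """Decode a 4-bit E2M1 code (bit 3 = sign) into a Python float."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]

def unpack_byte(byte: int) -> tuple[float, float]:
    """Split a packed uint8 into two FP4 values (low nibble first -- an assumption)."""
    return decode_fp4(byte & 0x0F), decode_fp4(byte >> 4)

print(unpack_byte(0b0111_0001))  # (0.5, 6.0)
```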
### Dual-Level Scaling

- Per-block scale: FP8 E4M3, one scale per 16-element block
- Global scale: FP32 scalar
- Formula: `value = fp4_value * weight_scale * weight_scale_2`
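A minimal NumPy sketch of applying the two scale levels to one row of already-decoded FP4 values; tensor names and shapes follow the Tensor Format section below, but the function itself is illustrative, not `nvfp4_kernel.py`:

```python
import numpy as np

def dequantize_row(fp4_values: np.ndarray, weight_scale: np.ndarray,
                   weight_scale_2: float) -> np.ndarray:
    """Dequantize one row: value = fp4 * per-block scale * global scale.

    fp4_values:     [N] decoded E2M1 values
    weight_scale:   [N // 16] per-block scales (FP8 E4M3, cast to float here)
    weight_scale_2: FP32 global scale
    """
    # Broadcast each block scale across its 16 elements, then apply the global scale.
    block_scales = np.repeat(weight_scale.astype(np.float32), 16)
    return fp4_values.astype(np.float32) * block_scales * np.float32(weight_scale_2)
```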
## Architecture Notes

### Multi-head Latent Attention (MLA)
- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency
### Sparse Attention (DSA)

- The Indexer class computes attention-pattern selection
- Top-k sparse pattern for efficient long-context attention
### Mixture of Experts (MoE)
- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing
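For illustration only, a generic top-8 routing step over 256 router logits (select the 8 highest-scoring experts and renormalize their weights); DeepSeek-V3.2's actual gating function and load-balancing details differ, so treat this purely as a sketch of top-k selection:

```python
import numpy as np

def route_top8(router_logits: np.ndarray, k: int = 8):
    """Generic top-k routing for one token: returns (expert indices, normalized weights)."""
    probs = np.exp(router_logits - router_logits.max())      # softmax-style scores
    probs /= probs.sum()
    top_experts = np.argsort(probs)[-k:][::-1]               # k highest-scoring experts
    weights = probs[top_experts] / probs[top_experts].sum()  # renormalize over the selection
    return top_experts, weights

indices, weights = route_top8(np.random.randn(256))  # 256 routed experts per layer
```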
## Conversion Process

This model was converted using a custom FP8-to-NVFP4 streaming converter (the scale computation, quantization, and packing steps are sketched after the list):

- Dequantize: FP8 E4M3 weights to FP32 (using the 128x128 block inverse scales)
- Compute NVFP4 scales:
  - Global scale: `scale_2 = amax / (6.0 * 448.0)`
  - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
- Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
- Pack: two FP4 values per uint8 byte
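A minimal NumPy sketch of those last three steps for a single already-dequantized FP32 row, following the scale formulas above; the rounding rule (nearest E2M1 value) and low-nibble-first packing are assumptions, not a description of `tools/fp8_to_nvfp4_streaming.py`:

```python
import numpy as np

E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_row_nvfp4(w: np.ndarray):
    """Quantize one FP32 row [N] (N divisible by 16) to packed NVFP4 codes plus scales."""
    amax = float(np.abs(w).max())
    scale_2 = amax / (6.0 * 448.0) if amax > 0 else 1.0            # global scale (FP32)
    blocks = w.reshape(-1, 16)
    scale = np.abs(blocks).max(axis=1) / (6.0 * scale_2)           # per-block scales (stored as FP8 E4M3)
    scaled = blocks / np.maximum(scale[:, None] * scale_2, 1e-30)  # now within [-6, 6]
    # Round each magnitude to the nearest representable E2M1 value, then add the sign bit.
    mag_codes = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    codes = (mag_codes + np.where(scaled < 0, 8, 0)).astype(np.uint8).reshape(-1)
    # Pack two FP4 codes per uint8 byte (low nibble first -- an assumption).
    packed = (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)
    return packed, scale.astype(np.float32), np.float32(scale_2)
```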
Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint amax across both tensors to derive the shared global scale.
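A short sketch of that shared-scale computation; the function name is illustrative:

```python
import numpy as np

def shared_global_scale(w1_gate_proj: np.ndarray, w3_up_proj: np.ndarray) -> np.float32:
    """Derive one weight_scale_2 shared by an expert's gate_proj (w1) and up_proj (w3)."""
    joint_amax = max(float(np.abs(w1_gate_proj).max()), float(np.abs(w3_up_proj).max()))
    return np.float32(joint_amax / (6.0 * 448.0))
```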
See tools/fp8_to_nvfp4_streaming.py for the complete conversion implementation.
## Tensor Format

For each quantized weight:

- `*.weight`: packed uint8, shape [M, N/2]
- `*.weight_scale`: FP8 E4M3 per-block scales, shape [M, N/16]
- `*.weight_scale_2`: FP32 global scale, shape [1]
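As a quick check of this layout, a sketch that reads one quantized weight's three tensors with safetensors and verifies the shape relationship; the shard filename and weight name below are placeholders:

```python
from safetensors import safe_open

SHARD = "model-00001-of-000073.safetensors"       # placeholder shard name
NAME = "model.layers.0.mlp.experts.0.gate_proj"   # placeholder quantized weight

with safe_open(SHARD, framework="pt") as f:
    packed = f.get_tensor(f"{NAME}.weight")            # uint8,    [M, N/2]
    scale = f.get_tensor(f"{NAME}.weight_scale")       # FP8 E4M3, [M, N/16]
    scale_2 = f.get_tensor(f"{NAME}.weight_scale_2")   # FP32,     [1]

m, half_n = packed.shape
assert scale.shape == (m, (half_n * 2) // 16)
assert scale_2.numel() == 1
```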
## Validation
Comprehensive testing completed:
- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: Coherent, semantically correct responses
See conversion_report.json for detailed conversion statistics.
## Acknowledgments
- Original model by DeepSeek AI
- NVFP4 format based on NVIDIA TensorRT Model Optimizer
## License
This model inherits the MIT License from the original DeepSeek-V3.2 model.
## Citation

```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```
## Contact
For issues with the quantized version or reference implementation, please open an issue.
For questions about the original model, contact DeepSeek AI.