---
license: mit
library_name: transformers
base_model: deepseek-ai/DeepSeek-V3.2
tags:
  - nvfp4
  - fp4
  - quantized
  - deepseek
  - moe
---

# DeepSeek-V3.2-NVFP4

NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with a reference CPU inference implementation.

---

## Model Description

DeepSeek-V3.2 is a 685B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token. This quantized version converts the original FP8 weights to NVFP4 format, a nominal 8x reduction relative to FP32 storage (4-bit weights plus block scales).

### Quantization Details

| Property | Value |
|----------|-------|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8 to NVFP4 converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | Approximately 1.6x vs the FP8 source (8x vs FP32, nominal) |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |

### Preserved Components (Not Quantized)

The following sensitive components are kept in their original precision to maintain model quality:

- Embeddings (`model.embed_tokens`)
- Output head (`lm_head`)
- MoE router gates (`*.mlp.gate`)
- Layer norms and RMS norms
- DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)

---

## Reference Implementation

The `inference/` directory contains a functional reference implementation for CPU inference.

### Quick Start

```bash
cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
  --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
  --config config_671B_nvfp4.json \
  --interactive \
  --max-new-tokens 10
```

### Implementation Details

| File | Description |
|------|-------------|
| `model.py` | DeepSeek-V3.2 architecture with NVFP4 support |
| `generate.py` | Text generation and inference pipeline |
| `nvfp4_kernel.py` | NVFP4 CPU dequantization kernels |
| `kernel.py` | FP8 runtime kernels with CPU fallbacks |
| `encoding_dsv32.py` | DeepSeek-V3.2 message encoding |
| `test_*.py` | Comprehensive test suite |

See `inference/README.md` for complete documentation.

---

## Hardware Requirements

### CPU Inference (Reference Implementation)

- RAM: 400 GB minimum
- CPU: multi-core recommended
- Performance: approximately 2-5 minutes per token

### GPU Inference (Future)

- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU

---

## NVFP4 Format Specification

### E2M1 Floating Point

- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: two FP4 values packed per uint8 byte

### Dual-Level Scaling

- Per-block scale: FP8 E4M3, 16 elements per block
- Global scale: FP32 scalar
- Formula: `value = fp4_value * weight_scale * weight_scale_2`, where `fp4_value` is the decoded E2M1 value

---

## Architecture Notes

### Multi-head Latent Attention (MLA)

- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency

### Sparse Attention (DSA)

- Indexer class computes attention pattern selection
- Top-k sparse pattern for efficient long-context attention

### Mixture of Experts (MoE)

- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing

---

## Conversion Process

This model was converted using a custom FP8 to NVFP4 streaming converter:

1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
2. Compute NVFP4 scales:
   - Global scale: `scale_2 = amax / (6.0 * 448.0)`
   - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
4. Pack: two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.

See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.

### Tensor Format

For each quantized weight:

- `*.weight`: packed uint8, shape `[M, N/2]`
- `*.weight_scale`: FP8 E4M3 per-block scale, shape `[M, N/16]`
- `*.weight_scale_2`: FP32 global scale, shape `[1]`
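The snippet below is a minimal NumPy sketch of how these three tensors combine back into an FP32 weight via the dual-level scaling formula above. It is not the shipped kernel (see `inference/nvfp4_kernel.py` for the actual implementation); the helper name `dequantize_nvfp4` and the low-nibble-first unpacking order are assumptions for illustration, and `weight_scale` is assumed to have already been converted from E4M3 into a regular float array.

```python
import numpy as np

# The 16 representable E2M1 values, indexed by the 4-bit code
# (bit 3 = sign, bits 2..1 = exponent, bit 0 = mantissa).
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4(packed, weight_scale, weight_scale_2):
    """Recover an FP32 weight matrix from NVFP4 storage.

    packed         : uint8 [M, N/2], two E2M1 codes per byte
    weight_scale   : float [M, N/16], per-16-element-block scale (decoded from E4M3)
    weight_scale_2 : float scalar, global scale

    value = e2m1_decode(code) * weight_scale[block] * weight_scale_2
    """
    m, half_n = packed.shape

    # Unpack two 4-bit codes per byte (low nibble first is an assumption here;
    # the converter defines the actual packing order).
    codes = np.empty((m, half_n * 2), dtype=np.uint8)
    codes[:, 0::2] = packed & 0x0F
    codes[:, 1::2] = packed >> 4

    # Decode E2M1 codes, then apply both scale levels:
    # each per-block scale is broadcast over its 16 elements.
    values = E2M1_LUT[codes]                                                # [M, N]
    block_scales = np.repeat(np.asarray(weight_scale, np.float32), 16, 1)   # [M, N]
    return values * block_scales * np.float32(weight_scale_2)
```

A production kernel would fuse this decode into the matrix multiply rather than materializing the full FP32 matrix, which is what the planned Triton NVFP4 kernels target on GPU.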
---

## Validation

Comprehensive testing completed:

- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391 GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: coherent, semantically correct responses

See `conversion_report.json` for detailed conversion statistics.

---

## Acknowledgments

- Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
- NVFP4 format based on NVIDIA TensorRT Model Optimizer

---

## License

This model inherits the MIT License from the original DeepSeek-V3.2 model.

---

## Citation

```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```

---

## Contact

For issues with the quantized version or reference implementation, please open an issue. For questions about the original model, contact DeepSeek AI.