# DeepSeek-V3.2-NVFP4

NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with reference CPU inference implementation.

---

## Model Description

DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format, reducing the checkpoint from roughly 642 GB (FP8) to 391 GB.

### Quantization Details

| Property | Value |
|----------|-------|
| Source Format | FP8 E4M3 (128x128 block scales) |
| Target Format | NVFP4 E2M1 (16-element block scales) |
| Quantization Method | Custom FP8-to-NVFP4 streaming converter |
| Original Size | Approximately 642 GB (FP8) |
| Quantized Size | 391 GB (NVFP4) |
| Compression | ~1.6x vs the FP8 source; ~7x vs an FP32-equivalent checkpoint |
| Conversion Errors | 0 |
| Weights Converted | 30,769 |

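
As a back-of-the-envelope check on these figures, the nominal cost per quantized element follows from the format itself; the observed ~1.6x ratio is slightly below the nominal ~1.78x because the components listed in the next section are kept at higher precision:

```python
# Nominal storage per NVFP4-quantized element (the per-tensor FP32 global
# scale adds a negligible constant and is ignored here).
fp4_bits = 4                  # one packed E2M1 value
block_scale_bits = 8 / 16     # one FP8 E4M3 scale shared by 16 elements
bits_per_weight = fp4_bits + block_scale_bits   # 4.5 bits per weight

print(8 / bits_per_weight)    # ~1.78x vs the FP8 source format
print(32 / bits_per_weight)   # ~7.1x vs an FP32-equivalent checkpoint
print(642 / 391)              # ~1.64x actually observed for this checkpoint
```
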
### Preserved Components (Not Quantized)

The following sensitive components are preserved in their original precision to maintain model quality:

- Embeddings (`model.embed_tokens`)
- Output head (`lm_head`)
- MoE router gates (`*.mlp.gate`)
- Layer norms and RMS norms
- DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)

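
A hypothetical name filter illustrating these skip rules (the authoritative patterns live in the converter, `tools/fp8_to_nvfp4_streaming.py`):

```python
# Hypothetical sketch: decide per tensor name whether to quantize to NVFP4.
# The substrings mirror the list above; the real converter may match differently.
SKIP_SUBSTRINGS = (
    "model.embed_tokens",     # embeddings
    "lm_head",                # output head
    ".mlp.gate.",             # MoE router gate (note: not the experts' gate_proj)
    "norm",                   # layer norms / RMS norms, including indexer.k_norm
    "indexer.weights_proj",   # DSA indexer projection
)

def should_quantize(tensor_name: str) -> bool:
    """Return True if the tensor should be converted to NVFP4."""
    return not any(s in tensor_name for s in SKIP_SUBSTRINGS)
```
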
---

## Reference Implementation

The `inference/` directory contains a functional reference implementation for CPU inference:

### Quick Start

```bash
cd inference

# Run unit tests (under 30 seconds)
python test_nvfp4_kernel.py

# Run forward pass test (10-15 minutes)
python test_forward_pass.py

# Interactive inference (slow on CPU: 2-5 min/token)
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10
```

### Implementation Details

| File | Description |
|------|-------------|
| `model.py` | DeepSeek-V3.2 architecture with NVFP4 support |
| `generate.py` | Text generation and inference pipeline |
| `nvfp4_kernel.py` | NVFP4 CPU dequantization kernels |
| `kernel.py` | FP8 runtime kernels with CPU fallbacks |
| `encoding_dsv32.py` | DeepSeek-V3.2 message encoding |
| `test_*.py` | Comprehensive test suite |

See `inference/README.md` for complete documentation.

---

## Hardware Requirements

### CPU Inference (Reference Implementation)

- RAM: Minimum 400 GB
- CPU: Multi-core recommended
- Performance: Approximately 2-5 minutes per token

### GPU Inference (Future)

- Requires completion of Triton NVFP4 kernels
- Target: NVIDIA Blackwell GPUs (SM100, SM120)
- Expected speedup: 100-1000x vs CPU

---

## NVFP4 Format Specification

### E2M1 Floating Point

- 4 bits per value (16 representable values)
- Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
- Storage: 2 FP4 values packed per uint8 byte

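
A minimal decode sketch, assuming the first element of each pair sits in the low nibble (the actual ordering is defined by `inference/nvfp4_kernel.py`):

```python
import numpy as np

# 16-entry E2M1 codebook: codes 0-7 are the non-negative values listed above,
# codes 8-15 the same magnitudes with the sign bit set.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def unpack_e2m1(packed: np.ndarray) -> np.ndarray:
    """Expand a uint8 array of shape [M, N/2] into decoded values of shape [M, N]."""
    lo = packed & 0x0F                     # assumed: first value of each pair
    hi = packed >> 4                       # assumed: second value of each pair
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    return E2M1_LUT[codes]
```
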
### Dual-Level Scaling

- Per-block scale: FP8 E4M3, 16 elements per block
- Global scale: FP32 scalar
- Formula: `value = e2m1(code) * weight_scale * weight_scale_2`, where `e2m1(code)` is the decoded 4-bit value

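
Putting the two scale levels together (a minimal sketch; `decoded` is the output of an E2M1 decode such as the one sketched above, and the FP8 block scales are assumed to have been cast to float32):

```python
import numpy as np

def dequantize(decoded, weight_scale, weight_scale_2, block=16):
    # decoded:        float32 [M, N]     decoded E2M1 values
    # weight_scale:   float32 [M, N/16]  per-block scales (cast from FP8 E4M3)
    # weight_scale_2: float32 scalar     global scale
    return decoded * np.repeat(weight_scale, block, axis=1) * weight_scale_2
```
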
---

## Architecture Notes

### Multi-head Latent Attention (MLA)

- Compressed KV cache using latent projection
- FP8 KV cache for memory efficiency

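
A conceptual sketch of the latent KV compression idea (sizes and module names here are illustrative, not the model's actual configuration):

```python
import torch
import torch.nn as nn

hidden, latent, n_heads, head_dim = 1024, 128, 8, 64     # illustrative sizes only

kv_down = nn.Linear(hidden, latent, bias=False)           # compress before caching
kv_up = nn.Linear(latent, n_heads * 2 * head_dim, bias=False)  # expand at attention time

h = torch.randn(1, 4, hidden)            # [batch, seq, hidden]
kv_cache = kv_down(h)                    # only the small latent is cached
# (the repository additionally stores this cache in FP8 for memory efficiency)
k, v = kv_up(kv_cache).chunk(2, dim=-1)  # reconstructed per-head keys and values
```
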
### Sparse Attention (DSA)

- An indexer module computes the attention pattern selection (which tokens each query attends to)
- Top-k sparse pattern for efficient long-context inference

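
A conceptual sketch only: a lightweight indexer scores the cached tokens for each query, and attention is then restricted to the top-k highest-scoring positions (the actual DSA indexer in `model.py` is more involved):

```python
import torch

def select_sparse_indices(q_idx: torch.Tensor, k_idx: torch.Tensor, top_k: int):
    # q_idx: [q_len, d], k_idx: [kv_len, d] -- cheap indexer projections
    scores = q_idx @ k_idx.T                      # [q_len, kv_len]
    k = min(top_k, k_idx.shape[0])
    return scores.topk(k, dim=-1).indices         # positions each query attends to
```
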
### Mixture of Experts (MoE)

- 256 routed experts per layer
- 1 shared expert per layer
- Top-8 routing with load balancing

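
A simplified sketch of the routing step (top-8 of 256 routed experts plus a shared expert); the real router also uses grouped selection and load-balancing terms, which are omitted here:

```python
import torch

def moe_forward(x, router_weight, experts, shared_expert, top_k=8):
    # x: [tokens, hidden]; router_weight: [hidden, 256]; experts: list of 256 MLPs
    scores = torch.sigmoid(x @ router_weight)        # per-expert affinity
    weights, idx = scores.topk(top_k, dim=-1)        # pick 8 experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    out = shared_expert(x)                           # shared expert sees every token
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] = out[t] + w * experts[e](x[t])
    return out
```
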
---

## Conversion Process

This model was converted using a custom FP8-to-NVFP4 streaming converter:

1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
2. Compute NVFP4 scales:
   - Global scale: `scale_2 = amax / (6.0 * 448.0)`
   - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
4. Pack: two FP4 values per uint8 byte

Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.

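
A simplified, single-tensor sketch of steps 2-4 (nearest-value rounding and low-nibble-first packing are assumptions; handling of all-zero blocks and the shared-scale MoE case above are omitted):

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w: np.ndarray, block: int = 16):
    """Quantize one FP32 [M, N] weight (N divisible by 16) to packed NVFP4."""
    M, N = w.shape
    amax = np.abs(w).max()
    scale_2 = np.float32(amax / (6.0 * 448.0))             # FP32 global scale
    blocks = w.reshape(M, N // block, block)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = block_amax / (6.0 * scale_2)                    # stored as FP8 E4M3 in the checkpoint
    scaled = blocks / np.maximum(scale[..., None] * scale_2, 1e-12)   # now in [-6, 6]
    codes = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1).astype(np.uint8)
    codes[scaled < 0] |= 0x8                                # set the E2M1 sign bit
    codes = codes.reshape(M, N)
    packed = (codes[:, 0::2] | (codes[:, 1::2] << 4)).astype(np.uint8)
    return packed, scale.astype(np.float32), scale_2
```
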
See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.

### Tensor Format

For each quantized weight:

- `*.weight`: packed uint8, shape [M, N/2]
- `*.weight_scale`: FP8 E4M3 per-block scales, shape [M, N/16]
- `*.weight_scale_2`: FP32 global scale, shape [1]

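
For example, the three tensors for one quantized projection can be read from a shard like this (the shard and tensor names below are hypothetical; actual names follow the model's safetensors index):

```python
# Hypothetical example of reading the NVFP4 triplet for one weight.
from safetensors import safe_open

shard = "model-00001-of-000073.safetensors"          # hypothetical shard file name
prefix = "model.layers.4.mlp.experts.7.down_proj"    # hypothetical weight prefix

with safe_open(shard, framework="pt") as f:
    packed = f.get_tensor(f"{prefix}.weight")                  # uint8, [M, N/2]
    weight_scale = f.get_tensor(f"{prefix}.weight_scale")      # FP8 E4M3, [M, N/16]
    weight_scale_2 = f.get_tensor(f"{prefix}.weight_scale_2")  # FP32, [1]
```
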
---

## Validation

Comprehensive testing completed:

- NVFP4 kernel unit tests: PASS
- Model loading: PASS (73 shards, 391 GB)
- Forward pass: PASS (valid outputs, no NaN/Inf)
- Output quality: Coherent, semantically correct responses

See `conversion_report.json` for detailed conversion statistics.

---

## Acknowledgments

- Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
- NVFP4 format based on NVIDIA TensorRT Model Optimizer

---

## License

This model inherits the MIT License from the original DeepSeek-V3.2 model.

---

## Citation

```bibtex
@misc{deepseekai2025deepseekv32,
  title={DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models},
  author={DeepSeek-AI},
  year={2025},
}
```

---

## Contact

For issues with the quantized version or reference implementation, please open an issue.

For questions about the original model, contact DeepSeek AI.