eousphoros committed
Commit ce95d0c · verified · 1 Parent(s): d487dcd

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +104 -116
README.md CHANGED
@@ -1,131 +1,103 @@
1
- ---
2
- license: mit
3
- library_name: transformers
4
- base_model:
5
- - deepseek-ai/DeepSeek-V3.2
6
- base_model_relation: quantized
7
- tags:
8
- - nvfp4
9
- - fp4
10
- - quantized
11
- - deepseek
12
- - moe
13
- ---
14
-
15
  # DeepSeek-V3.2-NVFP4
16
 
17
- This is an **NVFP4 (4-bit floating point) quantized** version of [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2).
 
 
18
 
19
  ## Model Description
20
 
21
- DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format for reduced memory footprint and faster inference on NVIDIA Blackwell GPUs.
22
 
23
  ### Quantization Details
24
 
25
  | Property | Value |
26
  |----------|-------|
27
- | **Source Format** | FP8 E4M3 (128x128 block scales) |
28
- | **Target Format** | NVFP4 E2M1 (16-element block scales) |
29
- | **Quantization Method** | modelopt / NVFP4 |
30
- | **Original Size** | ~642 GB |
31
- | **Quantized Size** | ~391 GB |
32
- | **Compression** | ~39% reduction |
 
 
33
 
34
  ### Preserved Components (Not Quantized)
35
 
36
  The following sensitive components are preserved in their original precision to maintain model quality:
37
 
38
- - Embeddings (`model.embed_tokens`)
39
- - Output head (`lm_head`)
40
- - MoE router gates (`*.mlp.gate`)
41
  - Layer norms and RMS norms
42
- - DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)
43
-
44
- ## Hardware Requirements
45
 
46
- - **Recommended**: NVIDIA Blackwell datacenter GPUs (B200, GB200) with native NVFP4 support
47
- - **Minimum VRAM**: ~200GB (with tensor parallelism across 2+ GPUs)
48
- - **Tested on**: 2x NVIDIA RTX Pro 6000 Blackwell (192GB total)
49
 
50
- > **Note**: NVFP4 inference currently has best support on datacenter Blackwell GPUs. Workstation GPUs may fall back to Marlin kernels.
51
 
52
- ## Usage
53
 
54
- ### With vLLM
55
 
56
  ```bash
57
- # Requires vLLM with modelopt NVFP4 support
58
- vllm serve eousphoros/DeepSeek-V3.2-NVFP4 \
59
- --tensor-parallel-size 2 \
60
- --trust-remote-code \
61
- --max-model-len 4096
62
- ```
63
 
64
- ### With TensorRT-LLM
 
65
 
66
- ```python
67
- from tensorrt_llm import LLM
68
 
69
- llm = LLM(
70
- model="eousphoros/DeepSeek-V3.2-NVFP4",
71
- tensor_parallel_size=2,
72
- enable_attention_dp=True
73
- )
 
74
  ```
75
 
76
- ## Chat Template
77
 
78
- This model uses the same chat template as the original DeepSeek-V3.2. See the `inference/` folder for Python scripts demonstrating message encoding.
79
-
80
- ```python
81
- from encoding_dsv32 import encode_messages, parse_message_from_completion_text
82
-
83
- messages = [
84
- {"role": "user", "content": "Hello!"},
85
- ]
86
- encode_config = dict(thinking_mode="thinking", drop_thinking=True, add_default_bos_token=True)
87
- prompt = encode_messages(messages, **encode_config)
88
- ```
89
 
90
- ## Reference Inference Implementation
91
 
92
- The `inference/` directory contains a standalone reference implementation:
93
 
94
- | File | Description |
95
- |------|-------------|
96
- | `model.py` | DeepSeek V3.2 model with MLA + sparse attention |
97
- | `generate.py` | Text generation with HF checkpoint loading |
98
- | `kernel.py` | FP8 runtime kernels (tilelang CUDA + CPU fallbacks) |
99
- | `nvfp4_kernel.py` | NVFP4 GEMM via dequantization |
100
- | `encoding_dsv32.py` | DeepSeek V3.2 chat template encoding |
101
 
102
- ### Running Reference Inference
 
 
 
103
 
104
- ```bash
105
- cd inference
 
 
106
 
107
- # Interactive mode
108
- python generate.py \
109
- --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
110
- --config ../config.json \
111
- --interactive \
112
- --max-new-tokens 200
113
- ```
114
 
115
- ### CPU/CUDA Dispatch
116
 
117
- The `kernel.py` module automatically dispatches to CPU fallbacks when:
118
- - Running on CPU (no CUDA available)
119
- - tilelang library not installed
 
120
 
121
- ```python
122
- from kernel import act_quant, fp8_gemm, fp8_index
 
 
123
 
124
- # These work on both CPU and CUDA
125
- y, scale = act_quant(x) # FP8 activation quantization
126
- output = fp8_gemm(a, a_s, b, b_s) # Block-scaled FP8 matmul
127
- scores = fp8_index(q, q_s, k, k_s) # Sparse attention indexing
128
- ```
129
 
130
  ## Architecture Notes
131
 
@@ -134,51 +106,64 @@ scores = fp8_index(q, q_s, k, k_s) # Sparse attention indexing
134
  - FP8 KV cache for memory efficiency
135
 
136
  ### Sparse Attention (DSA)
137
- - `Indexer` class computes attention pattern selection
138
  - Top-k sparse pattern for efficient long-context
139
 
140
- ### Layer 61 (MTP)
141
- - Multi-Token Prediction head (auxiliary training layer)
142
- - Can be discarded for inference: the main model has 61 layers (0-60), and the auxiliary MTP head is stored as layer 61
143
 
144
  ## Conversion Process
145
 
146
- This model was converted using a custom FP8 → NVFP4 streaming converter:
147
 
148
- 1. **Dequantize**: FP8 E4M3 weights → FP32 (using 128x128 block inverse scales)
149
- 2. **Compute NVFP4 scales**:
150
  - Global scale: `scale_2 = amax / (6.0 * 448.0)`
151
  - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
152
- 3. **Quantize**: FP32 → NVFP4 E2M1 (16-element blocks)
153
- 4. **Pack**: Two FP4 values per uint8 byte
154
-
155
- ### MoE Joint Scale Handling
156
 
157
- For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by:
158
 
159
- 1. Identifying MoE gate/up pairs from the safetensor index
160
- 2. Loading both weights when either is encountered
161
- 3. Computing joint `amax = max(gate_amax, up_amax)`
162
- 4. Using the joint amax for shared `weight_scale_2`
163
- 5. Computing independent per-block `weight_scale` for each tensor
164
-
165
- This ensures fused GEMM compatibility while preserving per-block precision.
166
 
167
  ### Tensor Format
168
 
169
  For each quantized weight:
170
- - `*.weight`: Packed uint8 `[M, N/2]`
171
- - `*.weight_scale`: FP8 E4M3 per-block scale `[M, N/16]`
172
- - `*.weight_scale_2`: FP32 global scale `[1]`
173
 
174
  ## Acknowledgments
175
 
176
  - Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
177
- - NVFP4 format based on [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
 
 
178
 
179
  ## License
180
 
181
- This model inherits the [MIT License](LICENSE) from the original DeepSeek-V3.2 model.
 
 
182
 
183
  ## Citation
184
 
@@ -190,7 +175,10 @@ This model inherits the [MIT License](LICENSE) from the original DeepSeek-V3.2 m
190
  }
191
  ```
192
 
 
 
193
  ## Contact
194
 
195
- For issues with the quantized version, please open an issue on this repository.
196
- For questions about the original model, contact [DeepSeek AI](mailto:[email protected]).

1
  # DeepSeek-V3.2-NVFP4
2
 
3
+ NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with reference CPU inference implementation.
4
+
5
+ ---
6
 
7
  ## Model Description
8
 
9
+ DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format, shrinking the checkpoint from roughly 642 GB to 391 GB.
10
 
11
  ### Quantization Details
12
 
13
  | Property | Value |
14
  |----------|-------|
15
+ | Source Format | FP8 E4M3 (128x128 block scales) |
16
+ | Target Format | NVFP4 E2M1 (16-element block scales) |
17
+ | Quantization Method | Custom FP8 to NVFP4 converter |
18
+ | Original Size | Approximately 642 GB (FP8) |
19
+ | Quantized Size | 391 GB (NVFP4) |
20
+ | Compression | ~8x vs FP32 weights (~39% smaller than the FP8 source) |
21
+ | Conversion Errors | 0 |
22
+ | Weights Converted | 30,769 |
23
 
24
  ### Preserved Components (Not Quantized)
25
 
26
  The following sensitive components are preserved in their original precision to maintain model quality:
27
 
28
+ - Embeddings (model.embed_tokens)
29
+ - Output head (lm_head)
30
+ - MoE router gates (*.mlp.gate)
31
  - Layer norms and RMS norms
32
+ - DSA indexer weights (indexer.weights_proj, indexer.k_norm)
 
 
33
 
34
+ ---
 
 
35
 
36
+ ## Reference Implementation
37
 
38
+ The `inference/` directory contains a functional reference implementation for CPU inference:
39
 
40
+ ### Quick Start
41
 
42
  ```bash
43
+ cd inference
44
 
45
+ # Run unit tests (under 30 seconds)
46
+ python test_nvfp4_kernel.py
47
 
48
+ # Run forward pass test (10-15 minutes)
49
+ python test_forward_pass.py
50
 
51
+ # Interactive inference (slow on CPU: 2-5 min/token)
52
+ python generate.py \
53
+ --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
54
+ --config config_671B_nvfp4.json \
55
+ --interactive \
56
+ --max-new-tokens 10
57
  ```
58
 
59
+ ### Implementation Details
60
 
61
+ | File | Description |
62
+ |------|-------------|
63
+ | model.py | DeepSeek V3.2 architecture with NVFP4 support |
64
+ | generate.py | Text generation and inference pipeline |
65
+ | nvfp4_kernel.py | NVFP4 CPU dequantization kernels |
66
+ | kernel.py | FP8 runtime kernels with CPU fallbacks |
67
+ | encoding_dsv32.py | DeepSeek V3.2 message encoding |
68
+ | test_*.py | Comprehensive test suite |
 
 
 
69
 
70
+ See `inference/README.md` for complete documentation.
71
 
72
+ ---
73
 
74
+ ## Hardware Requirements
75
 
76
+ ### CPU Inference (Reference Implementation)
77
+ - RAM: Minimum 400GB
78
+ - CPU: Multi-core recommended
79
+ - Performance: Approximately 2-5 minutes per token
80
 
81
+ ### GPU Inference (Future)
82
+ - Requires completion of Triton NVFP4 kernels
83
+ - Target: NVIDIA Blackwell GPUs (SM100, SM120)
84
+ - Expected speedup: 100-1000x vs CPU
85
 
86
+ ---
87
 
88
+ ## NVFP4 Format Specification
89
 
90
+ ### E2M1 Floating Point
91
+ - 4 bits per value (16 representable values)
92
+ - Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
93
+ - Storage: 2 FP4 values packed per uint8 byte
94
 
95
+ ### Dual-Level Scaling
96
+ - Per-block scale: FP8 E4M3, 16 elements per block
97
+ - Global scale: FP32 scalar
98
+ - Formula (per unpacked FP4 element; see the sketch below): `value = fp4 * weight_scale * weight_scale_2`
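A minimal NumPy sketch of this dequantization, assuming the low nibble of each packed byte holds the even-indexed element and that `weight_scale` has already been cast from FP8 E4M3 to float32 (both assumptions, not guaranteed by the checkpoint layout):

```python
import numpy as np

# E2M1 code -> value lookup; the nibble's most significant bit is the sign.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_nvfp4(packed, weight_scale, weight_scale_2, block=16):
    """Expand a packed NVFP4 weight [M, N/2] back to FP32 [M, N]."""
    lo = packed & 0x0F                        # assumed even-indexed elements
    hi = packed >> 4                          # assumed odd-indexed elements
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = lo
    codes[:, 1::2] = hi

    values = E2M1_LUT[codes]                  # raw E2M1 values, [M, N]
    scales = np.repeat(weight_scale.astype(np.float32), block, axis=1)
    return values * scales * np.float32(weight_scale_2)
```

Conceptually this is the dequantize-then-GEMM path; the shipped `nvfp4_kernel.py` may differ in the details.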
99
 
100
+ ---
101
 
102
  ## Architecture Notes
103
 
 
106
  - FP8 KV cache for memory efficiency
107
 
108
  ### Sparse Attention (DSA)
109
+ - Indexer class computes attention pattern selection
110
  - Top-k sparse pattern for efficient long-context
111
 
112
+ ### Mixture of Experts (MoE)
113
+ - 256 routed experts per layer
114
+ - 1 shared expert per layer
115
+ - Top-8 routing with load balancing
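A generic top-k routing sketch to make the idea concrete; this is not DeepSeek's exact gating (scoring function, expert grouping, and load-balancing details differ) and the names are illustrative:

```python
import torch

def route_tokens(hidden, gate_weight, top_k=8):
    """Generic top-k MoE routing illustration.

    hidden:      [tokens, d_model] activations
    gate_weight: [n_experts, d_model] router gate (kept unquantized in this repo)
    """
    scores = torch.sigmoid(hidden @ gate_weight.t())      # [tokens, n_experts]
    topk_s, topk_idx = scores.topk(top_k, dim=-1)         # pick 8 experts per token
    weights = topk_s / topk_s.sum(dim=-1, keepdim=True)   # normalized combine weights
    return topk_idx, weights
```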
116
+
117
+ ---
118
 
119
  ## Conversion Process
120
 
121
+ This model was converted using a custom FP8 to NVFP4 streaming converter:
122
 
123
+ 1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
124
+ 2. Compute NVFP4 scales:
125
  - Global scale: `scale_2 = amax / (6.0 * 448.0)`
126
  - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
127
+ 3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
128
+ 4. Pack: Two FP4 values per uint8 byte
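As a concrete illustration of steps 2-4, a NumPy sketch for one [M, N] tensor using the formulas above; rounding to the E2M1 grid is simplified to nearest-value, the FP8 E4M3 cast of `weight_scale` is omitted, and the nibble order is an assumption:

```python
import numpy as np

FP4_MAX, FP8_MAX = 6.0, 448.0
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w, block=16):
    """FP32 [M, N] -> (packed uint8 [M, N/2], weight_scale [M, N/block], weight_scale_2)."""
    m, n = w.shape
    amax = np.abs(w).max()
    scale_2 = amax / (FP4_MAX * FP8_MAX)                    # global scale
    blocks = w.reshape(m, n // block, block)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = block_amax / (FP4_MAX * scale_2)                # per-block scale (stored as FP8 E4M3)

    scaled = blocks / (scale[..., None] * scale_2 + 1e-12)  # bounded by +/- 6 per block
    mag = np.abs(scaled)
    codes = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)
    codes |= (scaled < 0).astype(np.uint8) << 3             # sign in the nibble's MSB
    codes = codes.reshape(m, n)

    packed = (codes[:, 1::2] << 4) | codes[:, 0::2]         # two FP4 codes per byte
    return packed, scale.astype(np.float32), np.float32(scale_2)
```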
 
 
129
 
130
+ Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.
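A sketch of that joint-scale step, reusing the global-scale formula from the conversion steps above; the per-expert pairing logic of the actual converter is not shown and the names are illustrative:

```python
import numpy as np

FP4_MAX, FP8_MAX = 6.0, 448.0

def shared_weight_scale_2(gate_w, up_w):
    """Shared global scale for an expert's gate_proj / up_proj (w1 / w3) pair."""
    joint_amax = max(np.abs(gate_w).max(), np.abs(up_w).max())
    # Per-block weight_scale values are still computed independently for each
    # tensor, but with this shared scale_2 substituted into the block formula.
    return np.float32(joint_amax / (FP4_MAX * FP8_MAX))
```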
131
 
132
+ See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.
133
 
134
  ### Tensor Format
135
 
136
  For each quantized weight:
137
+ - `*.weight`: Packed uint8 [M, N/2]
138
+ - `*.weight_scale`: FP8 E4M3 per-block scale [M, N/16]
139
+ - `*.weight_scale_2`: FP32 global scale [1]
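As a hedged illustration of reading this layout with the `safetensors` library (the shard and tensor names below are hypothetical, and loading the FP8 E4M3 scale as a torch tensor assumes a torch build with float8 dtypes):

```python
import torch
from safetensors import safe_open

# Illustrative names only; consult the checkpoint's safetensors index for real ones.
shard = "model-00001-of-000073.safetensors"
name = "model.layers.0.mlp.experts.0.gate_proj"

with safe_open(shard, framework="pt", device="cpu") as f:
    w  = f.get_tensor(f"{name}.weight")          # uint8, [M, N/2] packed FP4 codes
    ws = f.get_tensor(f"{name}.weight_scale")    # float8_e4m3fn, [M, N/16]
    g  = f.get_tensor(f"{name}.weight_scale_2")  # float32, [1]

print(w.shape, ws.to(torch.float32).shape, g.item())
```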
140
+
141
+ ---
142
+
143
+ ## Validation
144
+
145
+ Comprehensive testing completed:
146
+ - NVFP4 kernel unit tests: PASS
147
+ - Model loading: PASS (73 shards, 391GB)
148
+ - Forward pass: PASS (valid outputs, no NaN/Inf)
149
+ - Output quality: Coherent, semantically correct responses
150
+
151
+ See `conversion_report.json` for detailed conversion statistics.
152
+
153
+ ---
154
 
155
  ## Acknowledgments
156
 
157
  - Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
158
+ - NVFP4 format based on NVIDIA TensorRT Model Optimizer
159
+
160
+ ---
161
 
162
  ## License
163
 
164
+ This model inherits the MIT License from the original DeepSeek-V3.2 model.
165
+
166
+ ---
167
 
168
  ## Citation
169
 
 
175
  }
176
  ```
177
 
178
+ ---
179
+
180
  ## Contact
181
 
182
+ For issues with the quantized version or reference implementation, please open an issue.
183
+
184
+ For questions about the original model, contact DeepSeek AI.