eousphoros committed
Commit ce95d0c · verified · 1 Parent(s): d487dcd

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +104 -116
README.md CHANGED
@@ -1,131 +1,103 @@
1
- ---
2
- license: mit
3
- library_name: transformers
4
- base_model:
5
- - deepseek-ai/DeepSeek-V3.2
6
- base_model_relation: quantized
7
- tags:
8
- - nvfp4
9
- - fp4
10
- - quantized
11
- - deepseek
12
- - moe
13
- ---
14
-
15
  # DeepSeek-V3.2-NVFP4
16
 
17
- This is an **NVFP4 (4-bit floating point) quantized** version of [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2).
 
 
18
 
19
  ## Model Description
20
 
21
- DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format for reduced memory footprint and faster inference on NVIDIA Blackwell GPUs.
22
 
23
  ### Quantization Details
24
 
25
  | Property | Value |
26
  |----------|-------|
27
- | **Source Format** | FP8 E4M3 (128x128 block scales) |
28
- | **Target Format** | NVFP4 E2M1 (16-element block scales) |
29
- | **Quantization Method** | modelopt / NVFP4 |
30
- | **Original Size** | ~642 GB |
31
- | **Quantized Size** | ~391 GB |
32
- | **Compression** | ~39% reduction |
 
 
33
 
34
  ### Preserved Components (Not Quantized)
35
 
36
  The following sensitive components are preserved in their original precision to maintain model quality:
37
 
38
- - Embeddings (`model.embed_tokens`)
39
- - Output head (`lm_head`)
40
- - MoE router gates (`*.mlp.gate`)
41
  - Layer norms and RMS norms
42
- - DSA indexer weights (`indexer.weights_proj`, `indexer.k_norm`)
43
-
44
- ## Hardware Requirements
45
 
46
- - **Recommended**: NVIDIA Blackwell datacenter GPUs (B200, GB200) with native NVFP4 support
47
- - **Minimum VRAM**: ~200GB (with tensor parallelism across 2+ GPUs)
48
- - **Tested on**: 2x NVIDIA RTX Pro 6000 Blackwell (192GB total)
49
 
50
- > **Note**: NVFP4 inference currently has best support on datacenter Blackwell GPUs. Workstation GPUs may fall back to Marlin kernels.
51
 
52
- ## Usage
53
 
54
- ### With vLLM
55
 
56
  ```bash
57
- # Requires vLLM with modelopt NVFP4 support
58
- vllm serve eousphoros/DeepSeek-V3.2-NVFP4 \
59
- --tensor-parallel-size 2 \
60
- --trust-remote-code \
61
- --max-model-len 4096
62
- ```
63
 
64
- ### With TensorRT-LLM
 
65
 
66
- ```python
67
- from tensorrt_llm import LLM
68
 
69
- llm = LLM(
70
- model="eousphoros/DeepSeek-V3.2-NVFP4",
71
- tensor_parallel_size=2,
72
- enable_attention_dp=True
73
- )
 
74
  ```
75
 
76
- ## Chat Template
77
 
78
- This model uses the same chat template as the original DeepSeek-V3.2. See the `inference/` folder for Python scripts demonstrating message encoding.
79
-
80
- ```python
81
- from encoding_dsv32 import encode_messages, parse_message_from_completion_text
82
-
83
- messages = [
84
- {"role": "user", "content": "Hello!"},
85
- ]
86
- encode_config = dict(thinking_mode="thinking", drop_thinking=True, add_default_bos_token=True)
87
- prompt = encode_messages(messages, **encode_config)
88
- ```
89
 
90
- ## Reference Inference Implementation
91
 
92
- The `inference/` directory contains a standalone reference implementation:
93
 
94
- | File | Description |
95
- |------|-------------|
96
- | `model.py` | DeepSeek V3.2 model with MLA + sparse attention |
97
- | `generate.py` | Text generation with HF checkpoint loading |
98
- | `kernel.py` | FP8 runtime kernels (tilelang CUDA + CPU fallbacks) |
99
- | `nvfp4_kernel.py` | NVFP4 GEMM via dequantization |
100
- | `encoding_dsv32.py` | DeepSeek V3.2 chat template encoding |
101
 
102
- ### Running Reference Inference
 
 
 
103
 
104
- ```bash
105
- cd inference
 
 
106
 
107
- # Interactive mode
108
- python generate.py \
109
- --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
110
- --config ../config.json \
111
- --interactive \
112
- --max-new-tokens 200
113
- ```
114
 
115
- ### CPU/CUDA Dispatch
116
 
117
- The `kernel.py` module automatically dispatches to CPU fallbacks when:
118
- - Running on CPU (no CUDA available)
119
- - tilelang library not installed
 
120
 
121
- ```python
122
- from kernel import act_quant, fp8_gemm, fp8_index
 
 
123
 
124
- # These work on both CPU and CUDA
125
- y, scale = act_quant(x) # FP8 activation quantization
126
- output = fp8_gemm(a, a_s, b, b_s) # Block-scaled FP8 matmul
127
- scores = fp8_index(q, q_s, k, k_s) # Sparse attention indexing
128
- ```
129
 
130
  ## Architecture Notes
131
 
@@ -134,51 +106,64 @@ scores = fp8_index(q, q_s, k, k_s) # Sparse attention indexing
134
  - FP8 KV cache for memory efficiency
135
 
136
  ### Sparse Attention (DSA)
137
- - `Indexer` class computes attention pattern selection
138
  - Top-k sparse pattern for efficient long-context
139
 
140
- ### Layer 61 (MTP)
141
- - Multi-Token Prediction head (auxiliary training layer)
142
- - Can be discarded for inference: the main model has 61 layers (0-60), and the auxiliary MTP head is stored as layer 61
143
 
144
  ## Conversion Process
145
 
146
- This model was converted using a custom FP8 → NVFP4 streaming converter:
147
 
148
- 1. **Dequantize**: FP8 E4M3 weights → FP32 (using 128x128 block inverse scales)
149
- 2. **Compute NVFP4 scales**:
150
  - Global scale: `scale_2 = amax / (6.0 * 448.0)`
151
  - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
152
- 3. **Quantize**: FP32 → NVFP4 E2M1 (16-element blocks)
153
- 4. **Pack**: Two FP4 values per uint8 byte
154
-
155
- ### MoE Joint Scale Handling
156
 
157
- For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by:
158
 
159
- 1. Identifying MoE gate/up pairs from the safetensor index
160
- 2. Loading both weights when either is encountered
161
- 3. Computing joint `amax = max(gate_amax, up_amax)`
162
- 4. Using the joint amax for shared `weight_scale_2`
163
- 5. Computing independent per-block `weight_scale` for each tensor
164
-
165
- This ensures fused GEMM compatibility while preserving per-block precision.
166
 
167
  ### Tensor Format
168
 
169
  For each quantized weight:
170
- - `*.weight`: Packed uint8 `[M, N/2]`
171
- - `*.weight_scale`: FP8 E4M3 per-block scale `[M, N/16]`
172
- - `*.weight_scale_2`: FP32 global scale `[1]`
173
 
174
  ## Acknowledgments
175
 
176
  - Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
177
- - NVFP4 format based on [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
 
 
178
 
179
  ## License
180
 
181
- This model inherits the [MIT License](LICENSE) from the original DeepSeek-V3.2 model.
 
 
182
 
183
  ## Citation
184
 
@@ -190,7 +175,10 @@ This model inherits the [MIT License](LICENSE) from the original DeepSeek-V3.2 m
190
  }
191
  ```
192
 
 
 
193
  ## Contact
194
 
195
- For issues with the quantized version, please open an issue on this repository.
196
- For questions about the original model, contact [DeepSeek AI](mailto:[email protected]).

1
  # DeepSeek-V3.2-NVFP4
2
 
3
+ NVFP4 (4-bit floating point) quantized version of DeepSeek-V3.2 with reference CPU inference implementation.
4
+
5
+ ---
6
 
7
  ## Model Description
8
 
9
+ DeepSeek-V3.2 is a 685B parameter Mixture-of-Experts (MoE) model with 37B activated parameters per token. This quantized version converts the original FP8 weights to NVFP4 format, shrinking the checkpoint from roughly 642 GB to 391 GB.
10
 
11
  ### Quantization Details
12
 
13
  | Property | Value |
14
  |----------|-------|
15
+ | Source Format | FP8 E4M3 (128x128 block scales) |
16
+ | Target Format | NVFP4 E2M1 (16-element block scales) |
17
+ | Quantization Method | Custom FP8 to NVFP4 converter |
18
+ | Original Size | Approximately 642 GB (FP8) |
19
+ | Quantized Size | 391 GB (NVFP4) |
20
+ | Compression | ~8x vs FP32 weights (~39% smaller than the FP8 source) |
21
+ | Conversion Errors | 0 |
22
+ | Weights Converted | 30,769 |
23
 
24
  ### Preserved Components (Not Quantized)
25
 
26
  The following sensitive components are preserved in their original precision to maintain model quality:
27
 
28
+ - Embeddings (model.embed_tokens)
29
+ - Output head (lm_head)
30
+ - MoE router gates (*.mlp.gate)
31
  - Layer norms and RMS norms
32
+ - DSA indexer weights (indexer.weights_proj, indexer.k_norm)
 
 
33
 
34
+ ---
 
 
35
 
36
+ ## Reference Implementation
37
 
38
+ The `inference/` directory contains a functional reference implementation for CPU inference:
39
 
40
+ ### Quick Start
41
 
42
  ```bash
43
+ cd inference
44
 
45
+ # Run unit tests (under 30 seconds)
46
+ python test_nvfp4_kernel.py
47
 
48
+ # Run forward pass test (10-15 minutes)
49
+ python test_forward_pass.py
50
 
51
+ # Interactive inference (slow on CPU: 2-5 min/token)
52
+ python generate.py \
53
+ --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
54
+ --config config_671B_nvfp4.json \
55
+ --interactive \
56
+ --max-new-tokens 10
57
  ```
58
 
59
+ ### Implementation Details
60
 
61
+ | File | Description |
62
+ |------|-------------|
63
+ | model.py | DeepSeek V3.2 architecture with NVFP4 support |
64
+ | generate.py | Text generation and inference pipeline |
65
+ | nvfp4_kernel.py | NVFP4 CPU dequantization kernels |
66
+ | kernel.py | FP8 runtime kernels with CPU fallbacks |
67
+ | encoding_dsv32.py | DeepSeek V3.2 message encoding |
68
+ | test_*.py | Comprehensive test suite |
 
 
 
69
 
70
+ See `inference/README.md` for complete documentation.
71
 
72
+ ---
73
 
74
+ ## Hardware Requirements
75
 
76
+ ### CPU Inference (Reference Implementation)
77
+ - RAM: Minimum 400GB
78
+ - CPU: Multi-core recommended
79
+ - Performance: Approximately 2-5 minutes per token
80
 
81
+ ### GPU Inference (Future)
82
+ - Requires completion of Triton NVFP4 kernels
83
+ - Target: NVIDIA Blackwell GPUs (SM100, SM120)
84
+ - Expected speedup: 100-1000x vs CPU
85
 
86
+ ---
87
 
88
+ ## NVFP4 Format Specification
89
 
90
+ ### E2M1 Floating Point
91
+ - 4 bits per value (16 representable values)
92
+ - Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
93
+ - Storage: 2 FP4 values packed per uint8 byte
94
 
95
+ ### Dual-Level Scaling
96
+ - Per-block scale: FP8 E4M3, 16 elements per block
97
+ - Global scale: FP32 scalar
98
+ - Formula (per unpacked FP4 element; see the sketch below): `value = fp4 * weight_scale * weight_scale_2`
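A minimal NumPy sketch of this dequantization, assuming the low nibble of each packed byte holds the even-indexed element and that `weight_scale` has already been cast from FP8 E4M3 to float32 (both assumptions, not guaranteed by the checkpoint layout):

```python
import numpy as np

# E2M1 code -> value lookup; the nibble's most significant bit is the sign.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_nvfp4(packed, weight_scale, weight_scale_2, block=16):
    """Expand a packed NVFP4 weight [M, N/2] back to FP32 [M, N]."""
    lo = packed & 0x0F                        # assumed even-indexed elements
    hi = packed >> 4                          # assumed odd-indexed elements
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = lo
    codes[:, 1::2] = hi

    values = E2M1_LUT[codes]                  # raw E2M1 values, [M, N]
    scales = np.repeat(weight_scale.astype(np.float32), block, axis=1)
    return values * scales * np.float32(weight_scale_2)
```

Conceptually this is the dequantize-then-GEMM path; the shipped `nvfp4_kernel.py` may differ in the details.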
99
 
100
+ ---
101
 
102
  ## Architecture Notes
103
 
 
106
  - FP8 KV cache for memory efficiency
107
 
108
  ### Sparse Attention (DSA)
109
+ - Indexer class computes attention pattern selection
110
  - Top-k sparse pattern for efficient long-context
111
 
112
+ ### Mixture of Experts (MoE)
113
+ - 256 routed experts per layer
114
+ - 1 shared expert per layer
115
+ - Top-8 routing with load balancing
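A generic top-k routing sketch to make the idea concrete; this is not DeepSeek's exact gating (scoring function, expert grouping, and load-balancing details differ) and the names are illustrative:

```python
import torch

def route_tokens(hidden, gate_weight, top_k=8):
    """Generic top-k MoE routing illustration.

    hidden:      [tokens, d_model] activations
    gate_weight: [n_experts, d_model] router gate (kept unquantized in this repo)
    """
    scores = torch.sigmoid(hidden @ gate_weight.t())      # [tokens, n_experts]
    topk_s, topk_idx = scores.topk(top_k, dim=-1)         # pick 8 experts per token
    weights = topk_s / topk_s.sum(dim=-1, keepdim=True)   # normalized combine weights
    return topk_idx, weights
```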
116
+
117
+ ---
118
 
119
  ## Conversion Process
120
 
121
+ This model was converted using a custom FP8 to NVFP4 streaming converter:
122
 
123
+ 1. Dequantize: FP8 E4M3 weights to FP32 (using 128x128 block inverse scales)
124
+ 2. Compute NVFP4 scales:
125
  - Global scale: `scale_2 = amax / (6.0 * 448.0)`
126
  - Per-block scale: `scale = block_amax / (6.0 * scale_2)`
127
+ 3. Quantize: FP32 to NVFP4 E2M1 (16-element blocks)
128
+ 4. Pack: Two FP4 values per uint8 byte
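As a concrete illustration of steps 2-4, a NumPy sketch for one [M, N] tensor using the formulas above; rounding to the E2M1 grid is simplified to nearest-value, the FP8 E4M3 cast of `weight_scale` is omitted, and the nibble order is an assumption:

```python
import numpy as np

FP4_MAX, FP8_MAX = 6.0, 448.0
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4(w, block=16):
    """FP32 [M, N] -> (packed uint8 [M, N/2], weight_scale [M, N/block], weight_scale_2)."""
    m, n = w.shape
    amax = np.abs(w).max()
    scale_2 = amax / (FP4_MAX * FP8_MAX)                    # global scale
    blocks = w.reshape(m, n // block, block)
    block_amax = np.abs(blocks).max(axis=-1)
    scale = block_amax / (FP4_MAX * scale_2)                # per-block scale (stored as FP8 E4M3)

    scaled = blocks / (scale[..., None] * scale_2 + 1e-12)  # bounded by +/- 6 per block
    mag = np.abs(scaled)
    codes = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1).astype(np.uint8)
    codes |= (scaled < 0).astype(np.uint8) << 3             # sign in the nibble's MSB
    codes = codes.reshape(m, n)

    packed = (codes[:, 1::2] << 4) | codes[:, 0::2]         # two FP4 codes per byte
    return packed, scale.astype(np.float32), np.float32(scale_2)
```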
 
 
129
 
130
+ Note: For vLLM's fused MoE kernels, `gate_proj` (w1) and `up_proj` (w3) within each expert must share the same `weight_scale_2`. The converter handles this by computing a joint `amax` across both tensors to derive the shared global scale.
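A sketch of that joint-scale step, reusing the global-scale formula from the conversion steps above; the per-expert pairing logic of the actual converter is not shown and the names are illustrative:

```python
import numpy as np

FP4_MAX, FP8_MAX = 6.0, 448.0

def shared_weight_scale_2(gate_w, up_w):
    """Shared global scale for an expert's gate_proj / up_proj (w1 / w3) pair."""
    joint_amax = max(np.abs(gate_w).max(), np.abs(up_w).max())
    # Per-block weight_scale values are still computed independently for each
    # tensor, but with this shared scale_2 substituted into the block formula.
    return np.float32(joint_amax / (FP4_MAX * FP8_MAX))
```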
131
 
132
+ See `tools/fp8_to_nvfp4_streaming.py` for the complete conversion implementation.
133
 
134
  ### Tensor Format
135
 
136
  For each quantized weight:
137
+ - `*.weight`: Packed uint8 [M, N/2]
138
+ - `*.weight_scale`: FP8 E4M3 per-block scale [M, N/16]
139
+ - `*.weight_scale_2`: FP32 global scale [1]
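As a hedged illustration of reading this layout with the `safetensors` library (the shard and tensor names below are hypothetical, and loading the FP8 E4M3 scale as a torch tensor assumes a torch build with float8 dtypes):

```python
import torch
from safetensors import safe_open

# Illustrative names only; consult the checkpoint's safetensors index for real ones.
shard = "model-00001-of-000073.safetensors"
name = "model.layers.0.mlp.experts.0.gate_proj"

with safe_open(shard, framework="pt", device="cpu") as f:
    w  = f.get_tensor(f"{name}.weight")          # uint8, [M, N/2] packed FP4 codes
    ws = f.get_tensor(f"{name}.weight_scale")    # float8_e4m3fn, [M, N/16]
    g  = f.get_tensor(f"{name}.weight_scale_2")  # float32, [1]

print(w.shape, ws.to(torch.float32).shape, g.item())
```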
140
+
141
+ ---
142
+
143
+ ## Validation
144
+
145
+ Comprehensive testing completed:
146
+ - NVFP4 kernel unit tests: PASS
147
+ - Model loading: PASS (73 shards, 391GB)
148
+ - Forward pass: PASS (valid outputs, no NaN/Inf)
149
+ - Output quality: Coherent, semantically correct responses
150
+
151
+ See `conversion_report.json` for detailed conversion statistics.
152
+
153
+ ---
154
 
155
  ## Acknowledgments
156
 
157
  - Original model by [DeepSeek AI](https://huggingface.co/deepseek-ai)
158
+ - NVFP4 format based on NVIDIA TensorRT Model Optimizer
159
+
160
+ ---
161
 
162
  ## License
163
 
164
+ This model inherits the MIT License from the original DeepSeek-V3.2 model.
165
+
166
+ ---
167
 
168
  ## Citation
169
 
 
175
  }
176
  ```
177
 
178
+ ---
179
+
180
  ## Contact
181
 
182
+ For issues with the quantized version or reference implementation, please open an issue.
183
+
184
+ For questions about the original model, contact DeepSeek AI.