eousphoros
/

DeepSeek-V3.2-NVFP4

Text Generation

Mixture of Experts

8-bit precision

Model card Files Files and versions

DeepSeek-V3.2-NVFP4 / CHANGELOG.md

eousphoros's picture

Upload CHANGELOG.md with huggingface_hub

17a0e58 verified 7 days ago

|

history blame contribute delete

4.06 kB

Changelog - NVFP4 Reference Implementation

[1.0.0] - December 3, 2025

Added

Core Implementation

Complete NVFP4 CPU inference for DeepSeek V3.2 (671B parameters)
NVFP4 dequantization kernels with dual-level scaling
Multi-Head Latent Attention (MLA) with NVFP4 support
Mixture of Experts (MoE) with 256 experts
FP8 and BF16 fallback support
Sharded weight loading (73 shards, 391GB)

Test Suite

test_nvfp4_kernel.py - NVFP4 math unit tests (5 tests, all passing)
test_model_loading.py - Weight loading integration tests
test_forward_pass.py - Forward pass validation tests
test_minimal_generation.py - Token generation tests

Documentation

README.md - User guide with quick start and examples
IMPLEMENTATION_SUMMARY.md - Technical implementation details
COMPLETION_REPORT.md - Project completion summary
CODE_REVIEW_FIXES.md - Code review documentation
FINAL_SUMMARY.md - Final project status

Tools

tools/fp8_to_nvfp4_streaming.py - FP8 to NVFP4 conversion script
tools/README.md - Quantization tools documentation
encoding/README.md - Message encoding documentation

Fixed

Critical Bugs

FP8 scale linking after load_state_dict() (generate.py:213-234)
NVFP4 attribute safety checks for mixed quantization (model.py:685)
NVFP4 decode path value projection computation (model.py:674-706)
- Was computing projections from current token only
- Now correctly projects attended context from KV cache
- Verified correct by Claude Opus 4.5 code review

Performance Issues

Added LUT device caching to eliminate repeated transfers (nvfp4_kernel.py:29-39)
Added wkv_b weight caching for decode optimization (~2GB cache, ~10% CPU improvement)
Improved code clarity and comment accuracy

Code Quality

Added type hints to critical functions
Removed all emojis (60+ instances) for professional tone
Removed environment-specific references
Cleaned up documentation for open-source release

Validated

Testing

All unit tests passing
Model loading: 73 shards, 391GB, no errors
Forward pass: Valid outputs, no NaN/Inf
Token generation: Natural, coherent output ('你好！👋 ')

Code Review

Comprehensive review by Claude Opus 4.5 (2 rounds)
All critical issues identified and fixed
Mathematical correctness verified
Performance optimizations validated

Performance

Measured Metrics

Model loading: 8-10 minutes (73 shards)
Forward pass: 2-5 minutes (single pass)
Tokens/second: 0.003-0.01 (CPU reference)
Memory usage: ~260GB (model + overhead + 2GB cache)
CPU utilization: ~10% improvement after optimizations

Quantization Quality

Compression: 16x vs FP32
Conversion: 30,769 weights, 0 errors
Mean quantization error: 0.14-1.8
Relative error: 18-42%
Output quality: Coherent and semantically appropriate

Technical Details

NVFP4 Format

E2M1 floating point (4 bits per value)
16 representable values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
Dual-level scaling: per-block FP8 E4M3 + global FP32
Packed storage: 2 values per uint8 byte

Architecture Support

DeepSeek V3.2 transformer (61 layers)
MLA with sparse attention indexing
MoE with 256 routed + 1 shared expert
FP8 KV cache
Message encoding for DeepSeek V3.2 format

Known Limitations

CPU-only (GPU Triton kernels incomplete)
Slow performance: 2-5 minutes per token (expected for CPU)
Memory intensive: Requires ~400GB RAM minimum
Single-sample inference only (no batching)

Future Work

Complete Triton NVFP4 kernels for GPU acceleration (100-1000x speedup)
Quality validation against FP8/FP16 baselines
Perplexity benchmarking
Batch inference support
Mixed-precision modes

Acknowledgments

Original DeepSeek V3.2 model by DeepSeek AI
NVFP4 format specification by NVIDIA
Code review and validation by Claude Opus 4.5
Quantization tools based on NVIDIA TensorRT Model Optimizer

License

MIT License (inherited from DeepSeek V3.2)