Changelog - NVFP4 Reference Implementation

[1.0.0] - December 3, 2025

Added

Core Implementation

  • Complete NVFP4 CPU inference for DeepSeek V3.2 (671B parameters)
  • NVFP4 dequantization kernels with dual-level scaling
  • Multi-Head Latent Attention (MLA) with NVFP4 support
  • Mixture of Experts (MoE) with 256 experts
  • FP8 and BF16 fallback support
  • Sharded weight loading (73 shards, 391GB; see the loading sketch after this list)
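
A minimal sketch of the shard-loading step, assuming the standard Hugging Face sharded-safetensors layout (`model.safetensors.index.json` plus per-shard files); the helper name is illustrative, not this repo's API:

```python
import json
from pathlib import Path

import torch
from safetensors.torch import load_file


def load_sharded_weights(ckpt_dir: str) -> dict[str, torch.Tensor]:
    """Stream every shard listed in the safetensors index into one state dict."""
    ckpt = Path(ckpt_dir)
    index = json.loads((ckpt / "model.safetensors.index.json").read_text())
    state_dict: dict[str, torch.Tensor] = {}
    # weight_map maps each tensor name to the shard file that contains it;
    # loading shard-by-shard keeps peak overhead near one shard's size.
    for shard_name in sorted(set(index["weight_map"].values())):
        state_dict.update(load_file(ckpt / shard_name, device="cpu"))
    return state_dict
```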

Test Suite

  • test_nvfp4_kernel.py - NVFP4 math unit tests (5 tests, all passing)
  • test_model_loading.py - Weight loading integration tests
  • test_forward_pass.py - Forward pass validation tests
  • test_minimal_generation.py - Token generation tests

Documentation

  • README.md - User guide with quick start and examples
  • IMPLEMENTATION_SUMMARY.md - Technical implementation details
  • COMPLETION_REPORT.md - Project completion summary
  • CODE_REVIEW_FIXES.md - Code review documentation
  • FINAL_SUMMARY.md - Final project status

Tools

  • tools/fp8_to_nvfp4_streaming.py - FP8 to NVFP4 conversion script (see the quantization sketch after this list)
  • tools/README.md - Quantization tools documentation
  • encoding/README.md - Message encoding documentation
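
A sketch of the core quantization step such a converter performs, shown per 16-value block. The scale convention (the block's absolute max maps to 6.0, the largest E2M1 value) and all names are assumptions for illustration, not the script's exact layout:

```python
import torch

# The 8 non-negative E2M1 magnitudes; the 4th code bit carries the sign.
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_block_nvfp4(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Quantize a 1-D block of 16 floats to packed NVFP4 plus an FP8 block scale."""
    scale = (block.abs().max() / 6.0).to(torch.float8_e4m3fn)  # per-block E4M3
    eff = scale.to(torch.float32).clamp(min=1e-12)             # avoid div by zero
    scaled = block / eff
    # Nearest-value lookup into the E2M1 table (3-bit magnitude + sign bit).
    mag_idx = (scaled.abs().unsqueeze(-1) - E2M1_VALUES).abs().argmin(-1)
    codes = (mag_idx | ((scaled < 0).long() << 3)).to(torch.uint8)
    packed = codes[0::2] | (codes[1::2] << 4)                  # 2 codes per byte
    return packed, scale
```

The second scaling level, a global FP32 scale computed over the whole tensor before blocking, is omitted here for brevity.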

Fixed

Critical Bugs

  • FP8 scale linking after load_state_dict() (generate.py:213-234); see the sketch after this list
  • NVFP4 attribute safety checks for mixed quantization (model.py:685)
  • NVFP4 decode path value projection computation (model.py:674-706)
    • Was computing value projections from the current token only
    • Now correctly projects the attended context from the KV cache
    • Verified correct by Claude Opus 4.5 code review
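
For context on the first fix: when weights are loaded with `load_state_dict(..., assign=True)` (typical for models initialized on the meta device), the loaded tensors replace the originals, so any scale reference captured at construction time goes stale. A hedged sketch of the re-link pass; the attribute names are illustrative, not the repo's exact ones:

```python
import torch
from torch import nn


def relink_fp8_scales(model: nn.Module) -> None:
    """Re-point each FP8 layer's cached scale at the freshly loaded tensor."""
    for module in model.modules():
        weight = getattr(module, "weight", None)
        scale = getattr(module, "weight_scale_inv", None)  # illustrative name
        if (weight is not None and scale is not None
                and weight.dtype == torch.float8_e4m3fn):
            module.scale = scale  # refresh the alias the FP8 matmul reads
```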

Performance Issues

  • Added LUT device caching to eliminate repeated device transfers (nvfp4_kernel.py:29-39); see the sketch after this list
  • Added wkv_b weight caching for decode optimization (~2GB cache, ~10% CPU improvement)
  • Improved code clarity and comment accuracy
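
The LUT caching fix follows a standard pattern: materialize the 16-entry decode table once per device instead of re-creating or re-transferring it on every dequantization call. A sketch with illustrative names:

```python
import torch

_E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)
_LUT_CACHE: dict[torch.device, torch.Tensor] = {}


def get_e2m1_lut(device: torch.device) -> torch.Tensor:
    """Return the 16-entry E2M1 decode table, materialized once per device."""
    lut = _LUT_CACHE.get(device)
    if lut is None:
        lut = _E2M1_LUT.to(device)
        _LUT_CACHE[device] = lut
    return lut
```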

Code Quality

  • Added type hints to critical functions
  • Removed all emojis (60+ instances) for professional tone
  • Removed environment-specific references
  • Cleaned up documentation for open-source release

Validated

Testing

  • All unit tests passing
  • Model loading: 73 shards, 391GB, no errors
  • Forward pass: Valid outputs, no NaN/Inf
  • Token generation: Natural, coherent output ('你好!👋 ', "Hello!")

Code Review

  • Comprehensive review by Claude Opus 4.5 (2 rounds)
  • All critical issues identified and fixed
  • Mathematical correctness verified
  • Performance optimizations validated

Performance

Measured Metrics

  • Model loading: 8-10 minutes (73 shards)
  • Forward pass: 2-5 minutes (single pass)
  • Tokens/second: 0.003-0.01 (CPU reference)
  • Memory usage: ~260GB (model + overhead + 2GB cache)
  • CPU utilization: ~10% improvement after optimizations

Quantization Quality

  • Compression: 16x vs FP32
  • Conversion: 30,769 weights, 0 errors
  • Mean quantization error: 0.14-1.8
  • Relative error: 18-42% (see the metric sketch after this list)
  • Output quality: Coherent and semantically appropriate
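
For reference, one plausible way to reproduce the error figures from a round-tripped tensor (a sketch; the repo's exact metric definitions may differ):

```python
import torch


def quantization_errors(original: torch.Tensor,
                        dequantized: torch.Tensor) -> tuple[float, float]:
    """Return (mean absolute error, relative error in percent)."""
    diff = (original.float() - dequantized.float()).abs()
    mean_abs_err = diff.mean().item()
    # Relative error as ||x - x_hat|| / ||x||, expressed in percent.
    rel_err = 100.0 * diff.norm() / original.float().norm().clamp(min=1e-12)
    return mean_abs_err, rel_err.item()
```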

Technical Details

NVFP4 Format

  • E2M1 floating point (4 bits per value)
  • 16 code points (15 distinct values, since ±0 coincide): {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
  • Dual-level scaling: per-block FP8 E4M3 + global FP32
  • Packed storage: 2 values per uint8 byte (see the dequantization sketch below)
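
A minimal CPU dequantization sketch for this layout. It assumes the sign sits in the top bit of each nibble and that values form contiguous 16-value blocks; both are conventions chosen for illustration, not necessarily the repo's exact bit layout:

```python
import torch

# 16-entry decode table: 3-bit E2M1 magnitude, bit 3 = sign.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

BLOCK = 16  # values per scaling block


def dequant_nvfp4(packed: torch.Tensor,        # uint8, two codes per byte
                  block_scales: torch.Tensor,  # FP8 E4M3, one per block
                  global_scale: torch.Tensor   # single FP32 scalar
                  ) -> torch.Tensor:
    lo = packed & 0x0F                          # first value of each pair
    hi = packed >> 4                            # second value
    codes = torch.stack([lo, hi], dim=-1).flatten()
    values = E2M1_LUT[codes.long()]             # decode E2M1 nibbles
    # Apply the per-block FP8 scale, then the global FP32 scale.
    values = values.view(-1, BLOCK) * block_scales.float().view(-1, 1)
    return (values * global_scale.float()).flatten()
```

Production kernels typically vectorize the nibble unpack and fuse the two scale multiplications, but the structure is the same.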

Architecture Support

  • DeepSeek V3.2 transformer (61 layers)
  • MLA with sparse attention indexing
  • MoE with 256 routed + 1 shared expert (see the routing sketch after this list)
  • FP8 KV cache
  • Message encoding for DeepSeek V3.2 format
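
A shape-level sketch of the MoE dispatch. Top-8 routing follows the published DeepSeek V3 configuration; the softmax scoring and per-token loop are simplifications for clarity, not this repo's implementation:

```python
import torch
from torch import nn

N_ROUTED, TOP_K = 256, 8  # 256 routed experts; top-8 per token (DeepSeek V3)


def moe_forward(x: torch.Tensor, router: nn.Linear,
                experts: nn.ModuleList, shared: nn.Module) -> torch.Tensor:
    """Route each token to its top-k experts and add the shared-expert path."""
    scores = router(x).softmax(dim=-1)             # [tokens, N_ROUTED]
    topk_w, topk_idx = scores.topk(TOP_K, dim=-1)  # [tokens, TOP_K]
    out = shared(x)                                # shared expert sees all tokens
    for slot in range(TOP_K):
        for tok, eid in enumerate(topk_idx[:, slot].tolist()):
            out[tok] = out[tok] + topk_w[tok, slot] * experts[eid](x[tok])
    return out
```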

Known Limitations

  • CPU-only (GPU Triton kernels incomplete)
  • Slow performance: 2-5 minutes per token (expected for CPU)
  • Memory intensive: Requires ~400GB RAM minimum
  • Single-sample inference only (no batching)

Future Work

  • Complete Triton NVFP4 kernels for GPU acceleration (estimated 100-1000x speedup)
  • Quality validation against FP8/FP16 baselines
  • Perplexity benchmarking
  • Batch inference support
  • Mixed-precision modes

Acknowledgments

  • Original DeepSeek V3.2 model by DeepSeek AI
  • NVFP4 format specification by NVIDIA
  • Code review and validation by Claude Opus 4.5
  • Quantization tools based on NVIDIA TensorRT Model Optimizer

License

MIT License (inherited from DeepSeek V3.2)