Is it possible to make a smaller NVFP4 quant at 340-360GB to fit in 4x96GB?
Hi, is it possible to make a smaller NVFP4 quant at 340-360GB to fit in 4x96GB? I've never done a quant before but I'm willing to try. I'm wondering if we can quantize more layers to get the size down a tad more?
You could try quantizing the indexer, but my intuition says you probably don't want to. I think this is about as small as you can get with NVFP4 without really hurting model performance. If you give up on GPU acceleration you could go smaller with llama.cpp-style quantization, though.
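If you do want to try it yourself, the general shape with NVIDIA's TensorRT Model Optimizer (the format vLLM's modelopt_fp4 loader expects) is roughly the sketch below. Treat it as a sketch only: exact config names vary between modelopt versions, the "*indexer*" pattern is just a hypothetical way to toggle the indexer modules in or out, and a real run needs calibration data plus a lot of CPU RAM.

```python
import copy
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer
from transformers import AutoModelForCausalLM

# Placeholder model id; loading a model this size needs a LOT of CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3.2-Exp", trust_remote_code=True
)

# Start from the stock NVFP4 recipe, then opt modules in/out by name pattern.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*indexer*"] = {"enable": False}  # hypothetical pattern for the indexer modules

def calib_loop(m):
    # Feed a few hundred representative prompts through `m` so scales can be calibrated.
    ...

model = mtq.quantize(model, cfg, forward_loop=calib_loop)
```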
It should work in vLLM with sm100; unfortunately, due to how NVIDIA decided to segment their consumer vs. datacenter Blackwell cards, much of the code in Triton/DeepGEMM/etc. doesn't properly support sm120. The vLLM hackery was mostly straightforward, but DeepGEMM (https://github.com/deepseek-ai/DeepGEMM) required extensive work to even get something working and is still a ways off from something I would try to get merged. This is why I only provided the CPU reference impl for validation and experimentation with this model. Hopefully, with time, sm120 (RTX PRO 6000 Blackwell) will get better support from projects like DeepGEMM/Triton/vLLM/SGLang/etc.
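If you want to check which bucket a card falls into, the CUDA compute capability is enough to tell sm100 from sm120. A quick check with PyTorch (assuming torch is installed):

```python
import torch

# (10, 0) -> sm100 (datacenter Blackwell, e.g. B200)
# (12, 0) -> sm120 (RTX PRO 6000 Blackwell / consumer Blackwell)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 is sm{major}{minor}")
if (major, minor) == (12, 0):
    print("sm120: expect gaps in DeepGEMM/Triton/vLLM kernel support for now")
```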
I uploaded https://hub.docker.com/repository/docker/eous/vllm-sm120/general, which has my sm120 hacks. It is very MVP/research-grade and will probably not work. That said, I just tested the model, and with a smaller context you should be able to fit it on 4x 96GB GPUs.
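For a quick sanity check, something like this offline-API launch is what I mean by a smaller context; the path is a placeholder, and you may need to tune max_model_len and gpu_memory_utilization until the weights plus KV cache fit in 4x 96GB:

```python
from vllm import LLM, SamplingParams

# Placeholder path; shrink max_model_len until weights + KV cache fit.
llm = LLM(
    model="/models/DeepSeek-V3.2-NVFP4",
    tensor_parallel_size=4,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
    enforce_eager=True,
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```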
@eousphoros almost!
NVCC compilation failed:
/root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
    2 | #include <deep_gemm/impls/sm120_fp8_mqa_logits.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Is this maybe because I used the AWQ variant instead of NVFP4? It's a bit smaller, that's why.
edit: oooh, I need to install DeepGEMM, I see. OK, I also needed to edit the install script to use python3 instead of just python and add --force-reinstall.
ok Successfully installed deep-gemm-2.2.0+local
edit2: I still get `NVCC compilation failed: /root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory`
Ah whoops, I forgot to copy the decode kernel into the container. I pushed a new container up. Also, no idea if this will work with AWQ; it barely works with my NVFP4 quant.
(APIServer pid=1) INFO 12-05 16:23:56 [loggers.py:248] Engine 000: Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Don't expect this to be fast, but it is faster than CPU inference.
Oh wow, yeah, that's not fast. I'm still getting OOM errors when I get to the KV cache part. I'm using --enforce-eager and trying pipeline parallelism with a few layers offloaded to 5090s, like this: VLLM_PP_LAYER_PARTITION=15,13,13,13,3,4. Will keep trying. By the way, did you update again less than an hour ago?
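For anyone reproducing, that pipeline-parallel attempt is roughly equivalent to this Python-API launch (the model path is a placeholder; the partition string is the one above):

```python
import os

# Uneven split of the layers (15+13+13+13+3+4) across 6 pipeline stages
# (4x 96GB cards plus 2x 5090s); must be set before the engine starts.
os.environ["VLLM_PP_LAYER_PARTITION"] = "15,13,13,13,3,4"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-V3.2-NVFP4",  # placeholder path
    pipeline_parallel_size=6,
    enforce_eager=True,
    max_model_len=2048,
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```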
Are you testing on 8x96GB?
On the AWQ it loads, but I get `AttributeError: '_OpNamespace' '_flashmla_C' object has no attribute 'sparse_prefill_fwd'` when testing a prompt with curl.
Same on NVFP4: with pipeline parallelism I get `AttributeError: '_OpNamespace' '_flashmla_C' object has no attribute 'sparse_prefill_fwd'`.
Yeah, I've been pushing as I do new builds, and I've been working on it all day. Currently up to `speed input: 2.32 toks/s, output: 0.33 toks/s`; I think there's a bit more headroom.
Hah, I wish. I'm GPU-poor and only have 2 GPUs, so I'm offloading 150GB to CPU.
@willfalco I found that by clicking on Tags and then the digest, I can see the Dockerfile, or at least all the commands used to build it.
OMP_NUM_THREADS=64 vllm serve /model --tensor-parallel-size 2 --cpu-offload-gb 150 --enforce-eager
thank you @eousphoros
Do you have the Dockerfile and other things somewhere like GitHub?
No, not yet. I doubt I will bother trying to get any of this merged, but I'll post it up as forks on my GitHub for historical reference once I get it into a better state.
I think I got a timeout error, and I was only offloading 8GB to CPU.
(APIServer pid=1358) INFO 12-05 18:50:38 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1358) INFO: Started server process [1358]
(APIServer pid=1358) INFO: Waiting for application startup.
(APIServer pid=1358) INFO: Application startup complete.
(APIServer pid=1358) INFO 12-05 18:52:02 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(Worker_TP2 pid=1390) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP0 pid=1388) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP3 pid=1391) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP1 pid=1389) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(APIServer pid=1358) INFO 12-05 18:52:08 [loggers.py:248] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1358) INFO 12-05 18:52:18 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore_DP0 pid=1374) INFO 12-05 18:53:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:54:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:55:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:56:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
Ah yeah, I introduced a deadlock/infinite loop in the last build that I am working on tracking down. It appears to be in the decode kernel.
Yeah, it hangs the exact same way on both the AWQ and the NVFP4.
```
(APIServer pid=961) INFO: Started server process [961]
(APIServer pid=961) INFO: Waiting for application startup.
(APIServer pid=961) INFO: Application startup complete.
(APIServer pid=961) INFO 12-05 19:43:51 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(Worker_TP1 pid=992) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP3 pid=994) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP2 pid=993) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP0 pid=991) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(APIServer pid=961) INFO 12-05 19:43:58 [loggers.py:248] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=961) INFO 12-05 19:44:08 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=961) INFO: 127.0.0.1:40102 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=961) INFO: 127.0.0.1:40104 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=961) INFO: 127.0.0.1:40108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=977) INFO 12-05 19:44:52 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
```
@eousphoros did you put your two GPUs into compute mode and run 8x 24GB slices for -tp 8? Or how are you testing it? I'm still getting the deadlock, it seems. Not sure if I'm just not launching it properly?
```
MATH START: Block 70
MATH START: Block 146
MATH START: Block 155
MATH START: Block 134
MATH START: Block 44
TMA EXIT: Block 90, q_iter=0, kv_iter=0
TMA EXIT: Block 83, q_iter=0, kv_iter=0
TMA EXIT: Block 82, q_iter=0, kv_iter=0
TMA EXIT: Block 106, q_iter=0, kv_iter=0
TMA EXIT: Block 91, q_iter=0, kv_iter=0
TMA EXIT: Block 107, q_iter=0, kv_iter=0
TMA EXIT: Block 89, q_iter=0, kv_iter=0
TMA EXIT: Block 99, q_iter=0, kv_iter=0
...
MATH EXIT: Block 54, q_iter=0, kv_iter=0
MATH EXIT: Block 47, q_iter=0, kv_iter=0
(APIServer pid=9) INFO 12-05 22:38:40 [loggers.py:248] Engine 000: Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
```
(EngineCore_DP0 pid=145) ERROR 12-05 22:43:33 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2.dev528+g8aaa81b35) with config: model='/mnt/2king/models/eousphoros/DeepSeek-V3.2-NVFP4/', speculative_config=None, tokenizer='/mnt/2king/models/eousphoros/DeepSeek-V3.2-NVFP4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False), seed=0, served_model_name=deepseek, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None},
The AI says: the FlashMLA SM120 sparse attention kernel needs transaction barrier operations added.
Not sure if it's right, though.
The AI would be wrong; the "deadlock" was due to threads getting stuck in a tight loop retrying attempts to get registers. Something else is broken now after fixing that issue, though. I'm working through it, but at least the "deadlock" is resolved.
- Check if sm120/prefill/sparse/fwd.cu exists
- Compare with sm100/prefill/sparse/fwd.cu (lines 281-330)
- Ensure SM120 has equivalent barrier operations:
  - plan.bar_k_valid_free[k%NUM_BUFS].arrive()
  - plan.bar_p_free[k%NUM_BUFS].arrive(0u)
- SM120 needs different barrier instructions than SM100's tcgen05:
  - SM100: tcgen05.* instructions
  - SM120: mbarrier.* instructions (like in flashinfer's mla_sm120.cu)
Appreciate the help; I already have sm120 implemented with mma.sync and whatnot, it just takes some time. I also just found a vLLM bug in my branch that was breaking things, where vLLM was not passing its config down to the TP workers. The fun we have when dealing with large systems of complex interconnected components.
Nice, I will test the one you just pushed.
It doesn't crash, but there's no output?
~$ curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}' | jq
{
  "id": "chatcmpl-a2d7c2fc852cab6e",
  "object": "chat.completion",
  "created": 1765063244,
  "model": "deepseek",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 65,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}