Is it possible to make a smaller NVFP4 quant at 340-360GB to fit in 4x96GB?
Hi, is it possible to make a smaller NVFP4 quant at 340-360GB to fit in 4x96GB? I've never done a quant before but I'm willing to try. I'm wondering if we can quantize more layers to get the size down a tad more?
You could try quantizing the indexer, but my intuition says you probably don't want to. I think this is about as small as you can get with NVFP4 without really hurting model performance. If you give up on GPU acceleration you could go smaller with llama.cpp-style quantization, though.
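If you do want to try it yourself, the general shape with NVIDIA's TensorRT Model Optimizer (the format vLLM's modelopt_fp4 loader expects) is roughly the sketch below. Treat it as a sketch only: exact config names vary between modelopt versions, the "*indexer*" pattern is just a hypothetical way to toggle the indexer modules in or out, and a real run needs calibration data plus a lot of CPU RAM.

```python
import copy
import modelopt.torch.quantization as mtq  # NVIDIA TensorRT Model Optimizer
from transformers import AutoModelForCausalLM

# Placeholder model id; loading a model this size needs a LOT of CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3.2-Exp", trust_remote_code=True
)

# Start from the stock NVFP4 recipe, then opt modules in/out by name pattern.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*indexer*"] = {"enable": False}  # hypothetical pattern for the indexer modules

def calib_loop(m):
    # Feed a few hundred representative prompts through `m` so scales can be calibrated.
    ...

model = mtq.quantize(model, cfg, forward_loop=calib_loop)
```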
It should work in vLLM with sm100; unfortunately, due to how NVIDIA decided to segment their consumer vs. datacenter Blackwell cards, much of the code in Triton/DeepGEMM/etc. doesn't properly support sm120. The vLLM hackery was mostly straightforward, but DeepGEMM (https://github.com/deepseek-ai/DeepGEMM) required extensive work to even get something working and is still a ways off from something I would try to get merged. This is why I only provided the CPU reference impl for validation and experimentation with this model. Hopefully, with time, sm120 (RTX PRO 6000 Blackwell) will get better support from projects like DeepGEMM/Triton/vLLM/SGLang/etc.
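If you want to check which bucket a card falls into, the CUDA compute capability is enough to tell sm100 from sm120. A quick check with PyTorch (assuming torch is installed):

```python
import torch

# (10, 0) -> sm100 (datacenter Blackwell, e.g. B200)
# (12, 0) -> sm120 (RTX PRO 6000 Blackwell / consumer Blackwell)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 is sm{major}{minor}")
if (major, minor) == (12, 0):
    print("sm120: expect gaps in DeepGEMM/Triton/vLLM kernel support for now")
```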
I uploaded https://hub.docker.com/repository/docker/eous/vllm-sm120/general, which has my sm120 hacks. It is very MVP/research-grade and will probably not work. That said, I just tested the model, and with a smaller context you should be able to fit it on 4x 96GB GPUs.
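For a quick sanity check, something like this offline-API launch is what I mean by a smaller context; the path is a placeholder, and you may need to tune max_model_len and gpu_memory_utilization until the weights plus KV cache fit in 4x 96GB:

```python
from vllm import LLM, SamplingParams

# Placeholder path; shrink max_model_len until weights + KV cache fit.
llm = LLM(
    model="/models/DeepSeek-V3.2-NVFP4",
    tensor_parallel_size=4,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
    enforce_eager=True,
)
out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```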
@eousphoros almost!
NVCC compilation failed:
/root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
    2 | #include <deep_gemm/impls/sm120_fp8_mqa_logits.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Is this maybe because I used the AWQ variant instead of NVFP4? It's a bit smaller, that's why.
edit: oooh, I need to install DeepGEMM, I see. OK, I also needed to edit the install script to use python3 instead of just python and add --force-reinstall.
ok Successfully installed deep-gemm-2.2.0+local
edit2: I still get `NVCC compilation failed: /root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory`
Ah whoops, I forgot to copy the decode kernel into the container. I pushed a new container up. Also, no idea if this will work with AWQ; it barely works with my NVFP4 quant.
(APIServer pid=1) INFO 12-05 16:23:56 [loggers.py:248] Engine 000: Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Don't expect this to be fast, but it is faster than CPU inference.
Oh wow, yeah, that's not fast. I'm still getting OOM errors when I get to the KV cache part. I'm using --enforce-eager and trying pipeline parallelism with a few layers offloaded to 5090s, like this: VLLM_PP_LAYER_PARTITION=15,13,13,13,3,4. Will keep trying. By the way, did you update again less than an hour ago?
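For anyone reproducing, that pipeline-parallel attempt is roughly equivalent to this Python-API launch (the model path is a placeholder; the partition string is the one above):

```python
import os

# Uneven split of the layers (15+13+13+13+3+4) across 6 pipeline stages
# (4x 96GB cards plus 2x 5090s); must be set before the engine starts.
os.environ["VLLM_PP_LAYER_PARTITION"] = "15,13,13,13,3,4"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/DeepSeek-V3.2-NVFP4",  # placeholder path
    pipeline_parallel_size=6,
    enforce_eager=True,
    max_model_len=2048,
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```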
Are you testing on 8x96GB?
On the AWQ it loads, but I get `AttributeError: '_OpNamespace' '_flashmla_C' object has no attribute 'sparse_prefill_fwd'` when testing a prompt with curl.
Same on NVFP4: with pipeline parallelism I get `AttributeError: '_OpNamespace' '_flashmla_C' object has no attribute 'sparse_prefill_fwd'`.
Yeah, I've been pushing as I do new builds, and I've been working on it all day. Currently up to `speed input: 2.32 toks/s, output: 0.33 toks/s`; I think there's a bit more headroom.
Hah, I wish. I'm GPU-poor and only have 2 GPUs, so I'm offloading 150GB to CPU.
@willfalco I found that by clicking on Tags and then the digest, I can see the Dockerfile, or at least all the commands used to build it.
OMP_NUM_THREADS=64 vllm serve /model --tensor-parallel-size 2 --cpu-offload-gb 150 --enforce-eager
thank you @eousphoros
Do you have the Dockerfile and other things somewhere like GitHub?
No, not yet. I doubt I will bother trying to get any of this merged, but I'll post it up as forks on my GitHub for historical reference once I get it into a better state.
I think I got a timeout error, and I was only offloading 8GB to CPU.
(APIServer pid=1358) INFO 12-05 18:50:38 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1358) INFO: Started server process [1358]
(APIServer pid=1358) INFO: Waiting for application startup.
(APIServer pid=1358) INFO: Application startup complete.
(APIServer pid=1358) INFO 12-05 18:52:02 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(Worker_TP2 pid=1390) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP0 pid=1388) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP3 pid=1391) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP1 pid=1389) WARNING 12-05 18:52:03 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(APIServer pid=1358) INFO 12-05 18:52:08 [loggers.py:248] Engine 000: Avg prompt throughput: 1.5 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1358) INFO 12-05 18:52:18 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore_DP0 pid=1374) INFO 12-05 18:53:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:54:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:55:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=1374) INFO 12-05 18:56:03 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
Ah yeah, I introduced a deadlock/infinite loop in the last build that I am working on tracking down. It appears to be in the decode kernel.
Yeah, it hangs the exact same way on both the AWQ and the NVFP4.
```
(APIServer pid=961) INFO: Started server process [961]
(APIServer pid=961) INFO: Waiting for application startup.
(APIServer pid=961) INFO: Application startup complete.
(APIServer pid=961) INFO 12-05 19:43:51 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(Worker_TP1 pid=992) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP3 pid=994) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP2 pid=993) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(Worker_TP0 pid=991) WARNING 12-05 19:43:52 [flashmla_sparse.py:433] padding num_heads to 64 due to sparse attn kernel requirement
(APIServer pid=961) INFO 12-05 19:43:58 [loggers.py:248] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=961) INFO 12-05 19:44:08 [loggers.py:248] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=961) INFO: 127.0.0.1:40102 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=961) INFO: 127.0.0.1:40104 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=961) INFO: 127.0.0.1:40108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=977) INFO 12-05 19:44:52 [shm_broadcast.py:501] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
```
@eousphoros did you put your two GPUs into compute mode and run 8x 24GB slices for -tp 8? Or how are you testing it? I'm still getting the deadlock, it seems. Not sure if I'm just not launching it properly?
```
MATH START: Block 70
MATH START: Block 146
MATH START: Block 155
MATH START: Block 134
MATH START: Block 44
TMA EXIT: Block 90, q_iter=0, kv_iter=0
TMA EXIT: Block 83, q_iter=0, kv_iter=0
TMA EXIT: Block 82, q_iter=0, kv_iter=0
TMA EXIT: Block 106, q_iter=0, kv_iter=0
TMA EXIT: Block 91, q_iter=0, kv_iter=0
TMA EXIT: Block 107, q_iter=0, kv_iter=0
TMA EXIT: Block 89, q_iter=0, kv_iter=0
TMA EXIT: Block 99, q_iter=0, kv_iter=0
...
MATH EXIT: Block 54, q_iter=0, kv_iter=0
MATH EXIT: Block 47, q_iter=0, kv_iter=0
(APIServer pid=9) INFO 12-05 22:38:40 [loggers.py:248] Engine 000: Avg prompt throughput: 0.6 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
```
(EngineCore_DP0 pid=145) ERROR 12-05 22:43:33 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.11.2.dev528+g8aaa81b35) with config: model='/mnt/2king/models/eousphoros/DeepSeek-V3.2-NVFP4/', speculative_config=None, tokenizer='/mnt/2king/models/eousphoros/DeepSeek-V3.2-NVFP4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False), seed=0, served_model_name=deepseek, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None},
The AI says: the FlashMLA SM120 sparse attention kernel needs transaction barrier operations added.
Not sure if it's right, though.
The AI would be wrong; the "deadlock" was due to threads getting stuck in a tight loop retrying attempts to get registers. Something else is broken now after fixing that issue, though. I'm working through it, but at least the "deadlock" is resolved.
- Check if sm120/prefill/sparse/fwd.cu exists
- Compare with sm100/prefill/sparse/fwd.cu (lines 281-330)
- Ensure SM120 has equivalent barrier operations:
  - plan.bar_k_valid_free[k%NUM_BUFS].arrive()
  - plan.bar_p_free[k%NUM_BUFS].arrive(0u)
- SM120 needs different barrier instructions than SM100's tcgen05:
  - SM100: tcgen05.* instructions
  - SM120: mbarrier.* instructions (like in flashinfer's mla_sm120.cu)
Appreciate the help; I already have sm120 implemented with mma.sync and whatnot, it just takes some time. I also just found a vLLM bug in my branch that was breaking things, where vLLM was not passing its config down to the TP workers. The fun we have when dealing with large systems of complex interconnected components.
Nice, I will test the one you just pushed.
It doesn't crash, but there's no output?
~$ curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 50,
  "temperature": 0.7
}' | jq
{
  "id": "chatcmpl-a2d7c2fc852cab6e",
  "object": "chat.completion",
  "created": 1765063244,
  "model": "deepseek",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 15,
    "total_tokens": 65,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}