Does FP8 work for the base model, or is the 16-bit 27B required?
Running vLLM with DFlash on the FP8 27B: with num_speculative_tokens=15 the acceptance rate averages very low, ~12%; at spec=8 it is around 25-30%. Performance at spec=8 is on par with MTP=3.
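One thing worth noting about the reported numbers: the acceptance rate is averaged over all draft positions, and later positions are accepted less often, so a longer draft length mechanically lowers the average even if nothing else changes. A rough illustration (the per-position probabilities below are made up for the example, not measured):

```python
# Illustrative per-position acceptance probabilities: each later draft
# position is accepted less often (assumed geometric decay, not real data).
probs = [0.9 * (0.88 ** i) for i in range(15)]

def mean_acceptance_rate(k):
    """Average acceptance rate reported over a draft of length k."""
    return sum(probs[:k]) / k

# A longer draft dilutes the average with low-probability tail positions,
# so the reported rate drops even though the draft model is unchanged.
print(f"k=8:  {mean_acceptance_rate(8):.1%}")
print(f"k=15: {mean_acceptance_rate(15):.1%}")
```

So spec=15 reading lower than spec=8 is expected to some degree; the question is whether ~12% is lower than that dilution alone explains.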
I believe this draft model can also be used with Qwen3.5-27B-FP8. I benchmarked it with both the BF16 and the FP8 target model on HumanEval, and the acceptance lengths are very close.
Here are the Qwen3.5-27B results on vLLM.
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 389.19
Total input tokens: 24600
Total generated tokens: 165775
Request throughput (req/s): 0.42
Output token throughput (tok/s): 425.95
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 489.16
---------------Time to First Token----------------
Mean TTFT (ms): 66.36
Median TTFT (ms): 65.70
P99 TTFT (ms): 84.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.27
Median TPOT (ms): 2.18
P99 TPOT (ms): 3.68
---------------Inter-token Latency----------------
Mean ITL (ms): 18.30
Median ITL (ms): 18.35
P99 ITL (ms): 20.29
---------------Speculative Decoding---------------
Acceptance rate (%): 47.24
Acceptance length: 8.09
Drafts: 20503
Draft tokens: 307545
Accepted tokens: 145292
Per-position acceptance (%):
Position 0: 92.54
Position 1: 82.47
Position 2: 72.99
Position 3: 64.77
Position 4: 57.66
Position 5: 51.52
Position 6: 46.24
Position 7: 41.87
Position 8: 37.78
Position 9: 34.20
Position 10: 31.07
Position 11: 28.10
Position 12: 25.33
Position 13: 22.50
Position 14: 19.59
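If I'm reading vLLM's stats right, the summary numbers above are internally consistent with the per-position table: the acceptance rate is accepted tokens over draft tokens, and the acceptance length is 1 (the verified bonus token) plus the sum of the per-position acceptance probabilities. A quick check against the printed values:

```python
# Per-position acceptance (%) and counters copied from the BF16 run above.
per_position = [92.54, 82.47, 72.99, 64.77, 57.66, 51.52, 46.24,
                41.87, 37.78, 34.20, 31.07, 28.10, 25.33, 22.50, 19.59]
drafts, draft_tokens, accepted = 20503, 307545, 145292

# Overall acceptance rate = accepted draft tokens / proposed draft tokens.
rate = accepted / draft_tokens
print(f"acceptance rate: {rate:.2%}")

# Acceptance length = 1 bonus token from the verifier step
# + expected accepted draft tokens (sum of per-position rates).
length = 1 + sum(per_position) / 100
print(f"acceptance length: {length:.2f}")
```

Both come out matching the report (~47.24% and ~8.09), which suggests the metrics are being logged correctly and the low-acceptance reports elsewhere in the thread reflect a real behavioral difference, not a counting bug.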
Here are the Qwen3.5-27B-FP8 results:
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 395.08
Total input tokens: 24600
Total generated tokens: 165556
Request throughput (req/s): 0.42
Output token throughput (tok/s): 419.05
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 481.31
---------------Time to First Token----------------
Mean TTFT (ms): 91.81
Median TTFT (ms): 66.50
P99 TTFT (ms): 127.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.23
Median TPOT (ms): 2.10
P99 TPOT (ms): 3.75
---------------Inter-token Latency----------------
Mean ITL (ms): 18.16
Median ITL (ms): 17.93
P99 ITL (ms): 20.07
---------------Speculative Decoding---------------
Acceptance rate (%): 46.52
Acceptance length: 7.98
Drafts: 20754
Draft tokens: 311310
Accepted tokens: 144822
Per-position acceptance (%):
Position 0: 92.42
Position 1: 82.01
Position 2: 72.59
Position 3: 63.95
Position 4: 56.51
Position 5: 50.32
Position 6: 45.27
Position 7: 40.82
Position 8: 36.81
Position 9: 33.44
Position 10: 30.36
Position 11: 27.66
Position 12: 24.73
Position 13: 21.82
Position 14: 19.10
Interesting, it must be a misconfiguration on my SM120 RTX 6000 Blackwell with the vLLM cu130 nightly.
As DFlash was just merged into vLLM, there are probably some issues. I will try to run on an RTX 6000 Blackwell to see if I can reproduce your problem.
Similarly, I'm interested in whether it's possible to use the PARO-quantized model (z-lab/Qwen3.5-27B-PARO) instead of either BF16 or FP8.
I run a 2x3090 setup and am wondering if anyone in the community has tried this, or if Ampere in general has been tested.
Tested again on vLLM 18.2rc1 cu130 nightly, RTX 6000 Blackwell.
vllm/vllm-openai:cu130-nightly \
/models/Qwen3.5-27B-FP8 \
--async-scheduling \
--quantization fp8 \
--served-model-name Qwen3.5 \
--tensor-parallel-size 1 \
--dtype auto \
--kv-cache-dtype auto \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--max-num-seqs 32 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method": "dflash", "model": "/models/Qwen3.5-27B-DFlash", "num_speculative_tokens": 8}' \
--max-model-len 262144
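One quick thing to rule out before digging deeper: the --speculative-config value must survive shell quoting as valid JSON, or the draft model silently won't load the way you expect. A minimal sanity check of the exact string passed above (paths assumed to match the command):

```python
import json

# The exact string passed to --speculative-config in the command above.
spec_config = ('{"method": "dflash", '
               '"model": "/models/Qwen3.5-27B-DFlash", '
               '"num_speculative_tokens": 8}')

# Malformed JSON (e.g. mangled quotes after shell expansion) is a common
# launch failure mode; json.loads raises if the string is broken.
cfg = json.loads(spec_config)
print("speculative-config parses OK:", cfg)
```

This only validates the JSON, not whether the dflash path is exercised correctly on SM120; that part would need vLLM-side logging.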
Acceptance is still averaging ~20%. I tried max-num-batched-tokens of 8192 and 16384, with and without multimodal inputs.
Are there still PRs from z-lab pending merge to master?