Does FP8 work for the base model, or is the 16-bit 27B required?
Running vLLM with DFlash on the FP8 27B: with num_speculative_tokens=15 the acceptance rate averages very low, ~12%; at spec=8 it is around 25-30%. Performance at spec=8 is on par with MTP=3.
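One thing worth noting about the reported numbers: the acceptance rate is averaged over all draft positions, and later positions are accepted less often, so a longer draft length mechanically lowers the average even if nothing else changes. A rough illustration (the per-position probabilities below are made up for the example, not measured):

```python
# Illustrative per-position acceptance probabilities: each later draft
# position is accepted less often (assumed geometric decay, not real data).
probs = [0.9 * (0.88 ** i) for i in range(15)]

def mean_acceptance_rate(k):
    """Average acceptance rate reported over a draft of length k."""
    return sum(probs[:k]) / k

# A longer draft dilutes the average with low-probability tail positions,
# so the reported rate drops even though the draft model is unchanged.
print(f"k=8:  {mean_acceptance_rate(8):.1%}")
print(f"k=15: {mean_acceptance_rate(15):.1%}")
```

So spec=15 reading lower than spec=8 is expected to some degree; the question is whether ~12% is lower than that dilution alone explains.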
I believe this draft model can also be used with Qwen3.5-27B-FP8. I benchmarked it with both the BF16 and the FP8 target model on HumanEval, and the acceptance lengths are very close.
Here are the Qwen3.5-27B results on vLLM.
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 389.19
Total input tokens: 24600
Total generated tokens: 165775
Request throughput (req/s): 0.42
Output token throughput (tok/s): 425.95
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 489.16
---------------Time to First Token----------------
Mean TTFT (ms): 66.36
Median TTFT (ms): 65.70
P99 TTFT (ms): 84.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.27
Median TPOT (ms): 2.18
P99 TPOT (ms): 3.68
---------------Inter-token Latency----------------
Mean ITL (ms): 18.30
Median ITL (ms): 18.35
P99 ITL (ms): 20.29
---------------Speculative Decoding---------------
Acceptance rate (%): 47.24
Acceptance length: 8.09
Drafts: 20503
Draft tokens: 307545
Accepted tokens: 145292
Per-position acceptance (%):
Position 0: 92.54
Position 1: 82.47
Position 2: 72.99
Position 3: 64.77
Position 4: 57.66
Position 5: 51.52
Position 6: 46.24
Position 7: 41.87
Position 8: 37.78
Position 9: 34.20
Position 10: 31.07
Position 11: 28.10
Position 12: 25.33
Position 13: 22.50
Position 14: 19.59
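If I'm reading vLLM's stats right, the summary numbers above are internally consistent with the per-position table: the acceptance rate is accepted tokens over draft tokens, and the acceptance length is 1 (the verified bonus token) plus the sum of the per-position acceptance probabilities. A quick check against the printed values:

```python
# Per-position acceptance (%) and counters copied from the BF16 run above.
per_position = [92.54, 82.47, 72.99, 64.77, 57.66, 51.52, 46.24,
                41.87, 37.78, 34.20, 31.07, 28.10, 25.33, 22.50, 19.59]
drafts, draft_tokens, accepted = 20503, 307545, 145292

# Overall acceptance rate = accepted draft tokens / proposed draft tokens.
rate = accepted / draft_tokens
print(f"acceptance rate: {rate:.2%}")

# Acceptance length = 1 bonus token from the verifier step
# + expected accepted draft tokens (sum of per-position rates).
length = 1 + sum(per_position) / 100
print(f"acceptance length: {length:.2f}")
```

Both come out matching the report (~47.24% and ~8.09), which suggests the metrics are being logged correctly and the low-acceptance reports elsewhere in the thread reflect a real behavioral difference, not a counting bug.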
Here are the Qwen3.5-27B-FP8 results:
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 395.08
Total input tokens: 24600
Total generated tokens: 165556
Request throughput (req/s): 0.42
Output token throughput (tok/s): 419.05
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 481.31
---------------Time to First Token----------------
Mean TTFT (ms): 91.81
Median TTFT (ms): 66.50
P99 TTFT (ms): 127.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.23
Median TPOT (ms): 2.10
P99 TPOT (ms): 3.75
---------------Inter-token Latency----------------
Mean ITL (ms): 18.16
Median ITL (ms): 17.93
P99 ITL (ms): 20.07
---------------Speculative Decoding---------------
Acceptance rate (%): 46.52
Acceptance length: 7.98
Drafts: 20754
Draft tokens: 311310
Accepted tokens: 144822
Per-position acceptance (%):
Position 0: 92.42
Position 1: 82.01
Position 2: 72.59
Position 3: 63.95
Position 4: 56.51
Position 5: 50.32
Position 6: 45.27
Position 7: 40.82
Position 8: 36.81
Position 9: 33.44
Position 10: 30.36
Position 11: 27.66
Position 12: 24.73
Position 13: 21.82
Position 14: 19.10
Interesting, it must be a misconfiguration on my SM120 RTX 6000 Blackwell with the vLLM cu130 nightly.
As DFlash was just merged into vLLM, there are probably some issues. I will try to run on an RTX 6000 Blackwell to see if I can reproduce your problem.
Similarly, I'm interested in whether it's possible to use the PARO-quantized model (z-lab/Qwen3.5-27B-PARO) instead of either BF16 or FP8.
I run a 2x3090 setup and am wondering if anyone in the community has tried this, or if Ampere in general has been tested.
Tested again on vLLM 18.2rc1 cu130 nightly, RTX 6000 Blackwell.
vllm/vllm-openai:cu130-nightly \
/models/Qwen3.5-27B-FP8 \
--async-scheduling \
--quantization fp8 \
--served-model-name Qwen3.5 \
--tensor-parallel-size 1 \
--dtype auto \
--kv-cache-dtype auto \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--max-num-seqs 32 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method": "dflash", "model": "/models/Qwen3.5-27B-DFlash", "num_speculative_tokens": 8}' \
--max-model-len 262144
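One quick thing to rule out before digging deeper: the --speculative-config value must survive shell quoting as valid JSON, or the draft model silently won't load the way you expect. A minimal sanity check of the exact string passed above (paths assumed to match the command):

```python
import json

# The exact string passed to --speculative-config in the command above.
spec_config = ('{"method": "dflash", '
               '"model": "/models/Qwen3.5-27B-DFlash", '
               '"num_speculative_tokens": 8}')

# Malformed JSON (e.g. mangled quotes after shell expansion) is a common
# launch failure mode; json.loads raises if the string is broken.
cfg = json.loads(spec_config)
print("speculative-config parses OK:", cfg)
```

This only validates the JSON, not whether the dflash path is exercised correctly on SM120; that part would need vLLM-side logging.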
Acceptance is still averaging ~20%. I tried max-num-batched-tokens of 8192 and 16384, with and without multimodal inputs.
Are there still PRs from z-lab pending merge to master?