An upgrade in quality and a mixed bag of ... 8x RTX3090

#3
by dehnhaide - opened

So, here we go. Thoughts and rants for v2 of this model release:

with vLLM 0.19.1:

```shell
vllm serve intel/Step-3.5-Flash-int4-mixed-AutoRound/ \
  --served-model-name "intel/Step-3.5-Flash-int4-mixed-AutoRound" \
  --host 0.0.0.0 \
  --port 5005 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4192 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --trust-remote-code \
  --disable-uvicorn-access-log \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
```

BUGS:
"rope_theta": 5000000.0--> has to be added config.json to fix error:
KeyError: "Missing required keys in rope_parameters for 'rope_type'='llama3': {'rope_theta'}"

3x "sliding_attention" references have to be deleted from the config.json to fix error:
ValueError: num_hidden_layers (45) must be equal to the number of layer types (48)
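
The two config.json edits above can be sketched in a few lines of Python. This is only a sketch: the `rope_parameters` and `layer_types` key names and structure are inferred from the error messages, not verified against the actual file, so adjust to whatever config.json really contains.

```python
# Sketch of the two config.json fixes, assuming the structure implied by the
# error messages (a "rope_parameters" dict and a "layer_types" list).
# In practice you would json.load the real config.json, patch it, and dump it back.
config = {
    "num_hidden_layers": 45,
    "rope_parameters": {"rope_type": "llama3"},
    # 48 layer-type entries while the model only has 45 hidden layers:
    "layer_types": ["full_attention"] * 45 + ["sliding_attention"] * 3,
}

# Fix 1: add the rope_theta that rope_type "llama3" requires.
config["rope_parameters"]["rope_theta"] = 5000000.0

# Fix 2: drop the 3 extra "sliding_attention" entries so that
# len(layer_types) matches num_hidden_layers.
config["layer_types"] = [t for t in config["layer_types"] if t != "sliding_attention"]
```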

Enabling MTP makes the launch fail:

```
--hf-overrides '{"num_nextn_predict_layers": 1}'
--speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
```

--> NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the SupportsPP interface.

ANNOYANCES:

  • poor layer partitioning ("Hidden layers were unevenly partitioned: [11,11,12,11]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable") makes the load across the GPUs look weird and imbalanced
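
As the warning itself suggests, the split can be overridden with `VLLM_PP_LAYER_PARTITION` before launching. A minimal sketch; the "12,11,11,11" split is just one plausible choice, not a verified optimum (any comma-separated list with one entry per pipeline stage, summing to the model's 45 hidden layers, should be accepted):

```shell
# One entry per pipeline stage; must sum to num_hidden_layers (45 here).
# "12,11,11,11" is an assumption, not a tuned value.
export VLLM_PP_LAYER_PARTITION="12,11,11,11"
# ...then re-run the `vllm serve` command above in the same shell.
```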

Screenshot from 2026-04-14 19-56-01

Testing conditions:

vibe coding sprint with Opencode, using the following prompt:
"Create a simple Flask application with a simple HTML, CSS and JS frontend, with persistent storage (SQLite). It should manage todos based on due dates (date, without hours).
Add these additional features: 4 levels of priority, categories/tags for todos (including the following predefined: Work, Personal, Shopping, Travel, Fun, Health, Family), drag-and-drop reordering, dark/light theme toggle, export/import functionality (JSON only). Also, go for a posh, elegant design (custom CSS with a "posh" aesthetic - elegant fonts, gradients, shadows)."

Observations:

  • clear upgrade vs. the previous quant (no more unknown characters or stray Chinese)
  • model is fast: ~2k tok/s prompt processing / 80-85 tok/s generation
  • Overall quant consistency: high
  • Overall quant precision (in content generation): low
    Coding sprint final product STATUS:
    --> first iteration: FAILED
    --> second iteration: site up and decent-looking, but main functionalities broken --> FAILED
    --> third iteration: site up and decent-looking, but main functionalities broken --> FAILED

Screenshot from 2026-04-15 18-30-13

Final status: FAILED

Overall:

  • the model feels a bit lightheaded, less precise than similar quants (aessedai/Step-3.5-Flash-Base-Midtrain-Q5_K_M) --> BUT it could be that the GGUF quant comes from a second iteration/release of the model (Base-Midtrain) that might be a bit "smarter"

I would also love to be able to test a similar quant of that version of the model, "stepfun-ai/Step-3.5-Flash-Base-Midtrain".

Thanks for your efforts!

Intel org

Hi @dehnhaide
As far as I remember, step3p5 is not compatible with the latest vLLM, even when using the original model.
That's why I highlighted that I'm using vllm==0.18.0.
Currently, we have no plans to quantize Step-3.5-Flash-Base-Midtrain, as it has a relatively low number of downloads.
