An upgrade in quality and a mixed bag of ... 8x RTX3090

#3
by dehnhaide - opened

So, here we go. Thoughts and rants for v2 of this model release:

with vLLM 0.19.1:

```shell
vllm serve intel/Step-3.5-Flash-int4-mixed-AutoRound/ \
  --served-model-name "intel/Step-3.5-Flash-int4-mixed-AutoRound" \
  --host 0.0.0.0 \
  --port 5005 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4192 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --trust-remote-code \
  --disable-uvicorn-access-log \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
```

BUGS:
"rope_theta": 5000000.0--> has to be added config.json to fix error:
KeyError: "Missing required keys in rope_parameters for 'rope_type'='llama3': {'rope_theta'}"

3x "sliding_attention" references have to be deleted from the config.json to fix error:
ValueError: num_hidden_layers (45) must be equal to the number of layer types (48)
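
The two config.json edits above can be sketched in a few lines of Python. This is only a sketch: the `rope_parameters` and `layer_types` key names and structure are inferred from the error messages, not verified against the actual file, so adjust to whatever config.json really contains.

```python
# Sketch of the two config.json fixes, assuming the structure implied by the
# error messages (a "rope_parameters" dict and a "layer_types" list).
# In practice you would json.load the real config.json, patch it, and dump it back.
config = {
    "num_hidden_layers": 45,
    "rope_parameters": {"rope_type": "llama3"},
    # 48 layer-type entries while the model only has 45 hidden layers:
    "layer_types": ["full_attention"] * 45 + ["sliding_attention"] * 3,
}

# Fix 1: add the rope_theta that rope_type "llama3" requires.
config["rope_parameters"]["rope_theta"] = 5000000.0

# Fix 2: drop the 3 extra "sliding_attention" entries so that
# len(layer_types) matches num_hidden_layers.
config["layer_types"] = [t for t in config["layer_types"] if t != "sliding_attention"]
```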

Enabling MTP makes the launch fail:

```
--hf-overrides '{"num_nextn_predict_layers": 1}'
--speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
```

--> NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the SupportsPP interface.

ANNOYANCES:

  • poor layer partitioning ("Hidden layers were unevenly partitioned: [11,11,12,11]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable") makes the load across the GPUs look weird and imbalanced
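
As the warning itself suggests, the split can be overridden with `VLLM_PP_LAYER_PARTITION` before launching. A minimal sketch; the "12,11,11,11" split is just one plausible choice, not a verified optimum (any comma-separated list with one entry per pipeline stage, summing to the model's 45 hidden layers, should be accepted):

```shell
# One entry per pipeline stage; must sum to num_hidden_layers (45 here).
# "12,11,11,11" is an assumption, not a tuned value.
export VLLM_PP_LAYER_PARTITION="12,11,11,11"
# ...then re-run the `vllm serve` command above in the same shell.
```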

Screenshot from 2026-04-14 19-56-01

Testing conditions:

vibe coding sprint with Opencode, using the following prompt:
"Create a simple Flask application with a simple HTML, CSS and JS frontend, with persistent storage (SQLite). It should manage todos based on due dates (date, without hours).
Add these additional features: 4 levels of priority, categories/tags for todos (including the following predefined: Work, Personal, Shopping, Travel, Fun, Health, Family), drag-and-drop reordering, dark/light theme toggle, export/import functionality (JSON only). Also, go for a posh, elegant design (custom CSS with a "posh" aesthetic - elegant fonts, gradients, shadows)."

Observations:

  • clear upgrade vs. the previous quant (no more unknown characters or stray Chinese)
  • model is fast: ~2k tok/s prompt processing / 80-85 tok/s generation
  • Overall quant consistency: high
  • Overall quant precision (in content generation): low
    Coding sprint final product STATUS:
    --> first iteration: FAILED
    --> second iteration: site up and decent-looking, but main functionalities broken --> FAILED
    --> third iteration: site up and decent-looking, but main functionalities broken --> FAILED

Screenshot from 2026-04-15 18-30-13

Final status: FAILED

Overall:

  • the model feels a bit lightheaded, less precise than similar quants (aessedai/Step-3.5-Flash-Base-Midtrain-Q5_K_M) --> BUT it could be that the GGUF quant comes from a second iteration/release of the model (Base-Midtrain) that might be a bit "smarter"

I would also love to be able to test a similar quant of that version of the model, "stepfun-ai/Step-3.5-Flash-Base-Midtrain".

Thanks for your efforts!

Intel org

Hi @dehnhaide
As far as I remember, step3p5 is not compatible with the latest vLLM, even when using the original model.
That's why I highlighted that I'm using vllm==0.18.0.
Currently, we have no plans to quantize Step-3.5-Flash-Base-Midtrain, as it has a relatively low number of downloads.
