An upgrade in quality and a mixed bag of ... 8x RTX3090
So, here we go. Thoughts and rants for v2 of this model release:
With vLLM 0.19.1:
vllm serve intel/Step-3.5-Flash-int4-mixed-AutoRound/ \
  --served-model-name "intel/Step-3.5-Flash-int4-mixed-AutoRound" \
  --host 0.0.0.0 \
  --port 5005 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 4192 \
  --enable-expert-parallel \
  --disable-cascade-attn \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --trust-remote-code \
  --disable-uvicorn-access-log \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
BUGS:
"rope_theta": 5000000.0--> has to be added config.json to fix error:
KeyError: "Missing required keys in rope_parameters for 'rope_type'='llama3': {'rope_theta'}"
3x "sliding_attention" references have to be deleted from the config.json to fix error:
ValueError: num_hidden_layers (45) must be equal to the number of layer types (48)
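For reference, both config.json edits can be scripted in one go. This is just an untested sketch: the key names rope_parameters and layer_types are inferred from the error messages above (not verified against the actual config), and jq is assumed to be installed:

# apply both fixes; key names are assumptions based on the errors above
jq '.rope_parameters.rope_theta = 5000000.0
  | .layer_types |= map(select(. != "sliding_attention"))' \
  config.json > config.patched.json && mv config.patched.json config.json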
Enabling MTP makes the model fail to load:
--hf-overrides '{"num_nextn_predict_layers": 1}' \
--speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
--> NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the SupportsPP interface.
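Since the error points specifically at pipeline parallelism, a possible (entirely untested) workaround sketch would be to collapse the 2x4 layout into pure tensor parallelism; this assumes the model's head/expert counts are divisible by 8 and that TP=8 is actually viable across the 3090s:

vllm serve intel/Step-3.5-Flash-int4-mixed-AutoRound/ \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --hf-overrides '{"num_nextn_predict_layers": 1}' \
  --speculative_config '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'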
ANNOYANCES:
- poor layer partitioning: "Hidden layers were unevenly partitioned: [11,11,12,11]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable" --> makes the load across the GPUs look weird & imbalanced (see the sketch below)
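The warning's hint can be followed directly: set the env var before launching, with any comma-separated split that sums to the 45 hidden layers (untested, the "12,11,11,11" split is just an example):

export VLLM_PP_LAYER_PARTITION="12,11,11,11"
# then launch vllm serve as above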
Testing conditions:
Vibe-coding sprint with Opencode, using the following prompt:
"Create a simple flask application with a simple HTML, CSS and JS frontend, with persistent storage (SQLite). It should manage todos based on due dates (date, without hours).
Add these aditional features: 4 levels of priority, categories/tags for todos (inluding the following predefined: Work, Personal, Shopping, Travel, Fun, Health, Family), drag-and-drop reordering, dark/light theme toggle, export/import functionality (json only). Also, go for a posh. elegant design (custom CSS with a "posh" aesthetic - elegant fonts, gradients, shadows)."
Observations:
- clear upgrade vs the previous quant (no more unknown characters or stray Chinese)
- model is fast: ~2k tok/s prompt processing / 80-85 tok/s generation
- Overall quant consistency: high
- Overall quant precision (in content generation): low
Coding sprint final product STATUS:
--> first iteration: FAILED
--> second iteration: site up and OK-looking, but main functionalities are broken --> FAILED
--> third iteration: site up and OK-looking, but main functionalities are broken --> FAILED
Final status: FAILED
Overall:
- the model feels a bit lightheaded, less precise than similar quants (aessedai/Step-3.5-Flash-Base-Midtrain-Q5_K_M) --> BUT it could be that the GGUF quant is from a second iteration / release of the model (Base-Midtrain) that is simply a bit "smarter"
I would also love to be able to test a similar quant of that version of the model, "stepfun-ai/Step-3.5-Flash-Base-Midtrain".
Thanks for your efforts!
Hi @dehnhaide
As far as I remember, step3p5 is not compatible with the latest vLLM, even with the original model.
That's why I highlighted that I'm using vllm==0.18.0.
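For reproduction, that just means pinning the version:

pip install vllm==0.18.0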
Currently, we have no plans to quantize Step-3.5-Flash-Base-Midtrain, as it has a relatively low number of downloads.

