Qwen3-4B

This version of Qwen3-4B has been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.2

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-4B

Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Convert the original Hugging Face Qwen3-4B to an axmodel with w8a16 quantization (8-bit weights, 16-bit activations) to produce the final model for the axllm runtime:

export FLOAT_MATMUL_USE_CONV_EU=1 # AX650 only; set this env var before running the conversion command for better performance.

# context window size 2048, prefill chunk length 128
pulsar2 llm_build --input_path Qwen3-4B --output_path <your path> \
--hidden_state_type bf16 --kv_cache_len 2048 --prefill_len 128 --chip AX650 -c 1 --parallel 32 \
--last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512 \
--last_kv_cache_len 640 --last_kv_cache_len 768 --last_kv_cache_len 896 --last_kv_cache_len 1024 -w s8
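
The eight repeated --last_kv_cache_len flags define the history capacities for chunked prefill. As a minimal sketch (inferred from the axllm init log shown below, not from an official Pulsar2 formula), here is how they map to the prefill groups printed at startup:

prefill_len = 128
last_kv_cache_lens = [128, 256, 384, 512, 640, 768, 896, 1024]

# Group 0 starts with no history; each later group can hold one more
# 128-token chunk of history.
history_caps = [0] + last_kv_cache_lens
for grp, history_cap in enumerate(history_caps):
    total_cap = history_cap + prefill_len
    print(f"prefill grp: {grp}, history_cap: {history_cap}, total_cap: {total_cap}")

# Matches "prefill_max_token_num : 1152" in the init log.
print("prefill_max_token_num :", history_caps[-1] + prefill_len)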

Supported Platforms

Chips   w8a16             CMM       Flash
AX650   4.01 tokens/sec   5.1 GiB   5.3 GiB

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

Download the latest CI-exported executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

First create the model directory and enter it, then download into it:

mkdir -p AXERA-TECH/Qwen3-4B
cd AXERA-TECH/Qwen3-4B
hf download AXERA-TECH/Qwen3-4B --local-dir .

# structure of the downloaded files (viewed from the original working directory)
.
└── AXERA-TECH
    └── Qwen3-4B
        ├── README.md
        ├── config.json
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_p128_l0_together.axmodel
...
        ├── qwen3_p128_l9_together.axmodel
        ├── qwen3_post.axmodel
        └── qwen3_tokenizer.txt

2 directories, 42 files
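
Alternatively, the same download can be scripted with huggingface_hub's snapshot_download (a sketch assuming the huggingface_hub package is installed; paths mirror the CLI example above):

from huggingface_hub import snapshot_download

# Download the whole repo into the same local directory layout used above.
snapshot_download(repo_id="AXERA-TECH/Qwen3-4B", local_dir="AXERA-TECH/Qwen3-4B")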

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Run (CLI)

(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-4B/
15:25:51.178 INF Init:890 | LLM init start
tokenizer_type = 1
 97% | ###############################  |  38 /  39 [17.72s<18.19s, 2.14 count/s] init post axmodel ok,remain_cmm(4744 MB)
15:26:08.897 INF Init:1045 | max_token_len : 2048
15:26:08.897 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
15:26:08.897 INF init_groups_from_model:606 | prefill_token_num : 128
15:26:08.897 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
15:26:08.897 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
15:26:08.897 INF init_groups_from_model:831 | prefill_max_token_num : 1152
15:26:08.897 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  39 /  39 [17.72s<17.72s, 2.20 count/s] embed_selector init ok
15:26:08.898 INF load_config:282 | load config:
15:26:08.898 INF load_config:282 | {
15:26:08.898 INF load_config:282 |     "enable_repetition_penalty": false,
15:26:08.898 INF load_config:282 |     "enable_temperature": false,
15:26:08.898 INF load_config:282 |     "enable_top_k_sampling": false,
15:26:08.898 INF load_config:282 |     "enable_top_p_sampling": false,
15:26:08.898 INF load_config:282 |     "penalty_window": 20,
15:26:08.898 INF load_config:282 |     "repetition_penalty": 1.2,
15:26:08.898 INF load_config:282 |     "temperature": 0.9,
15:26:08.898 INF load_config:282 |     "top_k": 10,
15:26:08.898 INF load_config:282 |     "top_p": 0.8
15:26:08.898 INF load_config:282 | }
15:26:08.898 INF Init:1139 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kvcache
  /dd        drop the last dialogue round
  /pp        print the dialogue history
Ctrl+C: stop the current generation
----------------------------------------
prompt >> who are you
15:26:15.337 INF SetKVCache:1437 | decode_grpid:0 prefill_grpid:1 history_cap:0 total_cap:128 symbolic_cap:1 precompute_len:0 input_num_token:22 prefer_symbolic_group:0
15:26:15.337 INF SetKVCache:1458 | current prefill_max_token_num:1152
15:26:15.460 INF SetKVCache:1462 | first run
15:26:15.469 INF Run:1553 | input token num : 22, prefill_split_num : 1
15:26:15.470 INF Run:1640 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
15:26:15.470 INF Run:1665 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
15:26:15.908 INF Run:1837 | ttft: 438.28 ms
<think>
Okay, the user asked, "who are you?" I need to respond appropriately. First, I should introduce myself clearly. I'm Qwen, a large-scale language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and helping with tasks. Also, I should highlight my multilingual support and the fact that I'm designed to be helpful and friendly. I should keep the tone positive and open for further assistance. Let me make sure the response is concise and covers all the key points without being too technical. Alright, that should do it.
</think>

Hello! I'm Qwen, a large-scale language model developed by Alibaba Cloud. I can help with answering questions, creating content, and assisting with various tasks. I support multiple languages and am designed to be helpful and friendly. How can I assist you today? 😊

15:26:59.884 NTC Run:2102 | hit eos,decode avg 4.00 token/s
15:26:59.884 INF GetKVCache:1408 | precompute_len:199, remaining:953
prompt >> /q
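
The sampling settings printed at init appear to be loaded from post_config.json in the model directory (see the file tree above). A hedged sketch of regenerating that file with temperature and top-p sampling enabled; the key names are copied from the log, while the runtime semantics are assumed:

import json

config = {
    "enable_repetition_penalty": False,
    "enable_temperature": True,      # assumption: switches on temperature sampling
    "enable_top_k_sampling": False,
    "enable_top_p_sampling": True,   # assumption: switches on top-p sampling
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8,
}

with open("AXERA-TECH/Qwen3-4B/post_config.json", "w") as f:
    json.dump(config, f, indent=4)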

Start the server (OpenAI-compatible)

(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-4B/
15:31:10.226 INF Init:890 | LLM init start
tokenizer_type = 1
 97% | ###############################  |  38 /  39 [13.45s<13.80s, 2.83 count/s] init post axmodel ok,remain_cmm(4744 MB)
15:31:23.673 INF Init:1045 | max_token_len : 2048
15:31:23.673 INF Init:1048 | kv_cache_size : 1024, kv_cache_num: 2048
15:31:23.673 INF init_groups_from_model:606 | prefill_token_num : 128
15:31:23.673 INF init_groups_from_model:820 | decode grp: 0, gid: 0, max_token_len : 2048
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
15:31:23.673 INF init_groups_from_model:824 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
15:31:23.673 INF init_groups_from_model:831 | prefill_max_token_num : 1152
15:31:23.674 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  39 /  39 [13.45s<13.45s, 2.90 count/s] embed_selector init ok
15:31:23.674 INF load_config:282 | load config:
15:31:23.674 INF load_config:282 | {
15:31:23.674 INF load_config:282 |     "enable_repetition_penalty": false,
15:31:23.674 INF load_config:282 |     "enable_temperature": false,
15:31:23.674 INF load_config:282 |     "enable_top_k_sampling": false,
15:31:23.674 INF load_config:282 |     "enable_top_p_sampling": false,
15:31:23.674 INF load_config:282 |     "penalty_window": 20,
15:31:23.674 INF load_config:282 |     "repetition_penalty": 1.2,
15:31:23.674 INF load_config:282 |     "temperature": 0.9,
15:31:23.674 INF load_config:282 |     "top_k": 10,
15:31:23.674 INF load_config:282 |     "top_p": 0.8
15:31:23.674 INF load_config:282 | }
15:31:23.674 INF Init:1139 | LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-4B'...
API URLs:
  GET  http://127.0.0.1:8000/health
  GET  http://127.0.0.1:8000/v1/models
  POST http://127.0.0.1:8000/v1/chat/completions
  GET  http://10.126.29.54:8000/health
  GET  http://10.126.29.54:8000/v1/models
  POST http://10.126.29.54:8000/v1/chat/completions
  GET  http://172.17.0.1:8000/health
  GET  http://172.17.0.1:8000/v1/models
  POST http://172.17.0.1:8000/v1/chat/completions
Aliases:
  GET  http://127.0.0.1:8000/models
  POST http://127.0.0.1:8000/chat/completions
  GET  http://10.126.29.54:8000/models
  POST http://10.126.29.54:8000/chat/completions
  GET  http://172.17.0.1:8000/models
  POST http://172.17.0.1:8000/chat/completions
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-4B
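
Once the server is up, you can sanity-check it against the endpoints printed above. A minimal sketch using only the Python standard library (assumes the server is reachable at 127.0.0.1:8000):

import json
import urllib.request

BASE = "http://127.0.0.1:8000"

# Liveness check against the /health endpoint.
with urllib.request.urlopen(f"{BASE}/health") as r:
    print("health:", r.status)

# List the served models; the response should mention AXERA-TECH/Qwen3-4B.
with urllib.request.urlopen(f"{BASE}/v1/models") as r:
    print(json.loads(r.read().decode()))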

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-4B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

# The local server does not validate API keys; any placeholder works.
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)

OpenAI streaming example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-4B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print(" ")