Update README.md

README.md CHANGED

@@ -57,20 +57,21 @@ Unlike most compact models, Jamba Reasoning 3B supports extremely long contexts.

## Quickstart

-
-
+### Run the model locally
+
+Please reference the GGUF model card here: https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF

### **Run the model with vLLM**

For best results, we recommend using vLLM version 0.10.2 or higher and enabling `--mamba-ssm-cache-dtype=float32`.

-```
+```bash
pip install "vllm>=0.10.2"
```

Using vLLM in online server mode:

-```
+```bash
vllm serve "ai21labs/AI21-Jamba-Reasoning-3B" --mamba-ssm-cache-dtype float32 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
```

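Once the serve command above is running, vLLM exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`). The snippet below is a minimal sketch of querying it with the `openai` Python client; the endpoint, prompt, and sampling values here are illustrative assumptions rather than part of the README.

```python
# Minimal sketch: query the vLLM server started above.
# Assumes the default OpenAI-compatible endpoint at http://localhost:8000/v1
# and that the `openai` Python package is installed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Reasoning-3B",
    messages=[{"role": "user", "content": "Give two pros and two cons of hybrid SSM-attention models."}],
    max_tokens=512,
    temperature=0.6,
)

message = response.choices[0].message
# With --reasoning-parser enabled, vLLM typically returns the thinking trace
# separately from the final answer.
print(getattr(message, "reasoning_content", None))
print(message.content)
```

Since the server is launched with `--enable-auto-tool-choice` and a tool-call parser, the same endpoint should also accept OpenAI-style `tools` definitions.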
@@ -134,92 +135,7 @@ outputs = model.generate(**tokenizer(prompts, return_tensors="pt").to(model.devi
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
-
-## **How to Run This Model Locally**
-
-You can run Jamba Reasoning 3B on your own machine using popular lightweight runtimes. This makes it possible to experiment with long-context reasoning without relying on cloud infrastructure.
-
-- **Supported runtimes**: [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai/), and [Ollama](https://ollama.com/).
-- **Quantizations**: Multiple quantization levels are provided to shrink the model size.
-  - Full-precision FP16 GGUF - **5.96** GB
-  - 4-bit quantization using Q4_K_M GGUF - **1.80** GB
-  - More GGUF quantizations - TBD
-
-## Deployment
-
-- Support for **Ollama**, **LM Studio**, and **llama.cpp** (for local use)
-1. Run example using the llama.cpp Python SDK (`llama-cpp-python`):
-
-```bash
-pip install --upgrade llama-cpp-python
-```
-
-```python
-from llama_cpp import Llama
-
-# Load a local GGUF build of Jamba Reasoning 3B
-llm = Llama(
-    model_path="path-to-Jamba-Reasoning-3B-gguf",
-    n_ctx=128000,
-    n_threads=10,       # CPU threads
-    n_gpu_layers=-1,    # -1 = all layers on GPU (Metal/CUDA if available)
-    flash_attn=True,
-)
-
-prompt = """<think>
-You are analyzing a stream of customer support tickets to decide which ones require escalation.
-
-Ticket 1: "The new update caused our app to crash whenever users upload a file larger than 50MB."
-Ticket 2: "I can't log in because I forgot my password."
-Ticket 3: "The billing page is missing the new enterprise pricing option."
-
-Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.
-</think>"""
-
-res = llm(
-    prompt,
-    max_tokens=128,
-    temperature=0.6,
-)
-
-print(res["choices"][0]["text"])
-```
-
-2. Run example using the llama.cpp server:
-
-```bash
-git clone https://github.com/ggerganov/llama.cpp.git
-cd llama.cpp
-cmake -S . -B build \
-  -DGGML_METAL=ON \
-  -DGGML_METAL_EMBED_LIBRARY=ON
-cmake --build build --config Release -j
-```
-
-Start the llama.cpp server with a Jamba-Reasoning-3B GGUF:
-
-```bash
-./build/bin/llama-server \
-  -m "path-to-Jamba-Reasoning-3B-gguf" \
-  -c 8192 \
-  -ngl 99 \
-  --host 127.0.0.1 \
-  --port 8000
-```
-
-Quick sanity test using curl:
-
-```bash
-curl -s http://127.0.0.1:8000/v1/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Jamba-Reasoning-3B",
-    "prompt": "<think>\nYou are analyzing customer support tickets to decide which need escalation.\nTicket 1: App crashes when uploading files >50MB.\nTicket 2: Forgot password, cannot log in.\nTicket 3: Billing page missing enterprise pricing.\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n</think>",
-    "max_tokens": 64,
-    "temperature": 0.6
-  }' | jq -r '.choices[0].text'
-```
-
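The curl check above posts to the server's OpenAI-compatible `/v1/completions` route, so the same sanity test can be run from Python. This is a rough sketch that assumes the llama.cpp server from the previous step is listening on 127.0.0.1:8000 and that the `requests` package is installed; the prompt is an illustrative placeholder.

```python
# Rough sketch of the same sanity check as the curl command above.
# Assumes the llama.cpp server is listening on 127.0.0.1:8000.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={
        "model": "Jamba-Reasoning-3B",
        "prompt": "Classify this support ticket as Critical, Medium, or Low: the app crashes on uploads larger than 50MB.",
        "max_tokens": 64,
        "temperature": 0.6,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```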

## Training Details

We trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a [Mamba-specific long-context method](https://arxiv.org/abs/2507.02782), which we found to significantly improve long-context abilities.