EllaNeiman committed on
Commit b5e3dcd · verified · 1 Parent(s): b2c8870

Update README.md

Files changed (1):
  1. README.md +5 -89
README.md CHANGED
@@ -57,20 +57,21 @@ Unlike most compact models, Jamba Reasoning 3B supports extremely long contexts.
 
 ## Quickstart
 
- **Extended version** – reasoning mode example with `<think>` block and recommended sampling params.
- Code Snippet Placeholder
+ ### Run the model locally
+
+ Please reference the GGUF model card here: https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF
 
 ### **Run the model with vLLM**
 
 For best results, we recommend using vLLM version 0.10.2 or higher and enabling `--mamba-ssm-cache-dtype=float32`
 
- ```jsx
+ ```bash
 pip install vllm>=0.10.2
 ```
 
 Using vllm in online server mode:
 
- ```jsx
+ ```bash
 vllm serve "ai21labs/AI21-Jamba-Reasoning-3B" --mamba-ssm-cache-dtype float32 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
 ```
 
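As a usage sketch, not part of the diff itself: once the `vllm serve` command above is running, the model can be queried through vLLM's OpenAI-compatible API. The snippet below assumes the server is listening on the default `http://localhost:8000/v1` endpoint and that the `openai` Python client is installed; with `--reasoning-parser deepseek_r1` enabled, the reasoning trace should be separated from the final answer returned in `message.content`.

```python
# Minimal sketch: query the OpenAI-compatible server started by `vllm serve`.
# Assumes the default host/port (localhost:8000) and `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored unless --api-key is set

response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Reasoning-3B",  # matches the model name passed to `vllm serve`
    messages=[
        {"role": "user", "content": "Classify this support ticket as Critical, Medium, or Low: 'The app crashes when uploading files larger than 50MB.'"}
    ],
    temperature=0.6,  # sampling temperature used in the README's other examples
    max_tokens=256,
)

print(response.choices[0].message.content)
```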
@@ -134,92 +135,7 @@ outputs = model.generate(**tokenizer(prompts, return_tensors="pt").to(model.devi
 generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(generated_text)
 ```
-
- ## **How to Run This Model Locally**
-
- You can run Jamba Reasoning 3B on your own machine using popular lightweight runtimes. This makes it possible to experiment with long-context reasoning without relying on cloud infrastructure.
-
- **Supported runtimes**: [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai/), and [Ollama](https://ollama.com/).
- **Quantizations**: Multiple quantization levels are provided to shrink the model size.
- Full precision FP16 GGUF - **5.96** GB
- 4-bit quantization using Q4-K-M GGUF - **1.80** GB
- More GGUF quantizations - TBD
-
- ## Deployment
-
- Support for **Ollama**, **LM Studio** and **llama.cpp** (for local use)
- 1. llama.cpp using the llama.cpp Python SDK (run example):
-
- ```bash
- pip install --upgrade llama-cpp-python
- ```
-
- ```python
- from llama_cpp import Llama
-
- llm = Llama(
-     model_path="path-to-Jamba-Reasoning-3B-gguf",
-     n_ctx=128000,
-     n_threads=10,       # CPU threads
-     n_gpu_layers=-1,    # -1 = all layers on GPU (Metal/CUDA if available)
-     flash_attn=True,
- )
-
- prompt = """<think>
- You are analyzing a stream of customer support tickets to decide which ones require escalation.
-
- Ticket 1: "The new update caused our app to crash whenever users upload a file larger than 50MB."
- Ticket 2: "I can't log in because I forgot my password."
- Ticket 3: "The billing page is missing the new enterprise pricing option."
-
- Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.
- </think>"""
- res = llm(
-     prompt,
-     max_tokens=128,
-     temperature=0.6,
- )
-
- print(res["choices"][0]["text"])
- ```
-
- 2. llama.cpp using the llama.cpp server:
-
- ```bash
- git clone https://github.com/ggerganov/llama.cpp.git
- cd llama.cpp
- cmake -S . -B build \
-     -DGGML_METAL=ON \
-     -DGGML_METAL_EMBED_LIBRARY=ON
- cmake --build build --config Release -j
- ```
-
- Start the llama.cpp server with the Jamba-Reasoning-3B GGUF:
-
- ```bash
- ./build/bin/llama-server \
-     -m "ai21labs/AI21-Jamba-Reasoning-3B-GGUF" \
-     -c 8192 \
-     -ngl 99 \
-     --host 127.0.0.1 \
-     --port 8000
- ```
-
- Quick sanity test using curl:
-
- ```bash
- curl -s http://127.0.0.1:8000/v1/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "Jamba-Reasoning-3B",
-     "prompt": "<think>\nYou are analyzing customer support tickets to decide which need escalation.\nTicket 1: 'App crashes when uploading files >50MB.'\nTicket 2: 'Forgot password, can’t log in.'\nTicket 3: 'Billing page missing enterprise pricing.'\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n</think>",
-     "max_tokens": 64,
-     "temperature": 0.6
-   }' | jq -r '.choices[0].text'
- ```
-
-
+
 ## Training Details
 
 We trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a [Mamba-specific long-context method](https://arxiv.org/abs/2507.02782), which we found to significantly improve long-context abilities.
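The local-run instructions removed above now live in the GGUF model card linked in the diff, but for readers still following that workflow, the removed curl sanity check can also be reproduced from Python. This is a minimal sketch, assuming a `llama-server` instance is running on `127.0.0.1:8000` as in the removed snippet and that the `requests` package is installed.

```python
# Sketch only: Python equivalent of the removed curl sanity test against llama-server.
# Assumes llama-server is serving the Jamba-Reasoning-3B GGUF on 127.0.0.1:8000.
import requests

payload = {
    "model": "Jamba-Reasoning-3B",
    "prompt": (
        "<think>\n"
        "You are analyzing customer support tickets to decide which need escalation.\n"
        "Ticket 1: 'App crashes when uploading files >50MB.'\n"
        "Ticket 2: 'Forgot password, can't log in.'\n"
        "Ticket 3: 'Billing page missing enterprise pricing.'\n"
        "Classify each ticket as Critical, Medium, or Low and explain your reasoning.\n"
        "</think>"
    ),
    "max_tokens": 64,
    "temperature": 0.6,
}

resp = requests.post("http://127.0.0.1:8000/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```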
 