EllaNeiman committed on
Commit b5e3dcd · verified · 1 Parent(s): b2c8870

Update README.md

Files changed (1):
  1. README.md +5 -89
README.md CHANGED
@@ -57,20 +57,21 @@ Unlike most compact models, Jamba Reasoning 3B supports extremely long contexts.
 
 ## Quickstart
 
- **Extended version** – reasoning mode example with `<think>` block and recommended sampling params.
- Code Snippet Placeholder
+ ### Run the model locally
+
+ Please reference the GGUF model card here: https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B-GGUF
 
 ### **Run the model with vLLM**
 
 For best results, we recommend using vLLM version 0.10.2 or higher and enabling `--mamba-ssm-cache-dtype=float32`
 
- ```jsx
+ ```bash
 pip install vllm>=0.10.2
 ```
 
 Using vllm in online server mode:
 
- ```jsx
+ ```bash
 vllm serve "ai21labs/AI21-Jamba-Reasoning-3B" --mamba-ssm-cache-dtype float32 --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes
 ```
 
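As a usage sketch, not part of the diff itself: once the `vllm serve` command above is running, the model can be queried through vLLM's OpenAI-compatible API. The snippet below assumes the server is listening on the default `http://localhost:8000/v1` endpoint and that the `openai` Python client is installed; with `--reasoning-parser deepseek_r1` enabled, the reasoning trace should be separated from the final answer returned in `message.content`.

```python
# Minimal sketch: query the OpenAI-compatible server started by `vllm serve`.
# Assumes the default host/port (localhost:8000) and `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored unless --api-key is set

response = client.chat.completions.create(
    model="ai21labs/AI21-Jamba-Reasoning-3B",  # matches the model name passed to `vllm serve`
    messages=[
        {"role": "user", "content": "Classify this support ticket as Critical, Medium, or Low: 'The app crashes when uploading files larger than 50MB.'"}
    ],
    temperature=0.6,  # sampling temperature used in the README's other examples
    max_tokens=256,
)

print(response.choices[0].message.content)
```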
@@ -134,92 +135,7 @@ outputs = model.generate(**tokenizer(prompts, return_tensors="pt").to(model.devi
 generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(generated_text)
 ```
-
- ## **How to Run This Model Locally**
-
- You can run Jamba Reasoning 3B on your own machine using popular lightweight runtimes. This makes it possible to experiment with long-context reasoning without relying on cloud infrastructure.
-
- **Supported runtimes**: [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai/), and [Ollama](https://ollama.com/).
- **Quantizations**: Multiple quantization levels are provided to shrink the model size.
- Full precision FP16 GGUF - **5.96** GB
- 4-bit quantization using Q4-K-M GGUF - **1.80** GB
- More GGUF quantizations - TBD
-
- ## Deployment
-
- Support for **Ollama**, **LM Studio** and **llama.cpp** (for local use)
- 1. llama.cpp using the llama.cpp Python SDK (run example):
-
- ```bash
- pip install --upgrade llama-cpp-python
- ```
-
- ```python
- from llama_cpp import Llama
-
- llm = Llama(
-     model_path="path-to-Jamba-Reasoning-3B-gguf",
-     n_ctx=128000,
-     n_threads=10,       # CPU threads
-     n_gpu_layers=-1,    # -1 = all layers on GPU (Metal/CUDA if available)
-     flash_attn=True,
- )
-
- prompt = """<think>
- You are analyzing a stream of customer support tickets to decide which ones require escalation.
-
- Ticket 1: "The new update caused our app to crash whenever users upload a file larger than 50MB."
- Ticket 2: "I can't log in because I forgot my password."
- Ticket 3: "The billing page is missing the new enterprise pricing option."
-
- Classify each ticket as 'Critical', 'Medium', or 'Low' priority and explain your reasoning.
- </think>"""
- res = llm(
-     prompt,
-     max_tokens=128,
-     temperature=0.6,
- )
-
- print(res["choices"][0]["text"])
- ```
-
- 2. llama.cpp using the llama.cpp server:
-
- ```bash
- git clone https://github.com/ggerganov/llama.cpp.git
- cd llama.cpp
- cmake -S . -B build \
-     -DGGML_METAL=ON \
-     -DGGML_METAL_EMBED_LIBRARY=ON
- cmake --build build --config Release -j
- ```
-
- Start the llama.cpp server with the Jamba-Reasoning-3B GGUF:
-
- ```bash
- ./build/bin/llama-server \
-     -m "ai21labs/AI21-Jamba-Reasoning-3B-GGUF" \
-     -c 8192 \
-     -ngl 99 \
-     --host 127.0.0.1 \
-     --port 8000
- ```
-
- Quick sanity test using curl:
-
- ```bash
- curl -s http://127.0.0.1:8000/v1/completions \
-   -H "Content-Type: application/json" \
-   -d '{
-     "model": "Jamba-Reasoning-3B",
-     "prompt": "<think>\nYou are analyzing customer support tickets to decide which need escalation.\nTicket 1: 'App crashes when uploading files >50MB.'\nTicket 2: 'Forgot password, can’t log in.'\nTicket 3: 'Billing page missing enterprise pricing.'\nClassify each ticket as Critical, Medium, or Low and explain your reasoning.\n</think>",
-     "max_tokens": 64,
-     "temperature": 0.6
-   }' | jq -r '.choices[0].text'
- ```
-
-
+
 ## Training Details
 
 We trained the model in multiple stages, each designed to strengthen reasoning and long-context performance. The process began with large-scale pre-training on a diverse corpus of natural documents. We then mid-trained on ~0.5T tokens of math and code, while extending the context length to 32K tokens. During this stage we also applied a [Mamba-specific long-context method](https://arxiv.org/abs/2507.02782), which we found to significantly improve long-context abilities.
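The local-run instructions removed above now live in the GGUF model card linked in the diff, but for readers still following that workflow, the removed curl sanity check can also be reproduced from Python. This is a minimal sketch, assuming a `llama-server` instance is running on `127.0.0.1:8000` as in the removed snippet and that the `requests` package is installed.

```python
# Sketch only: Python equivalent of the removed curl sanity test against llama-server.
# Assumes llama-server is serving the Jamba-Reasoning-3B GGUF on 127.0.0.1:8000.
import requests

payload = {
    "model": "Jamba-Reasoning-3B",
    "prompt": (
        "<think>\n"
        "You are analyzing customer support tickets to decide which need escalation.\n"
        "Ticket 1: 'App crashes when uploading files >50MB.'\n"
        "Ticket 2: 'Forgot password, can't log in.'\n"
        "Ticket 3: 'Billing page missing enterprise pricing.'\n"
        "Classify each ticket as Critical, Medium, or Low and explain your reasoning.\n"
        "</think>"
    ),
    "max_tokens": 64,
    "temperature": 0.6,
}

resp = requests.post("http://127.0.0.1:8000/v1/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```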
 