smol-tools-4b-16k — Long-Context Agentic Tool-Use Model
A 4B parameter model fine-tuned for reliable tool calling with 16K context support. Handles multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 4x larger than the original smol-tools-4b.
Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 7,557 examples including multi-turn tool-use conversations up to 16K tokens.
Architecture:
Qwen3_5ForCausalLM (text-only; 32 layers; hybrid attention with 24 linear-attention and 8 full-attention layers). Qwen3.5's efficient attention makes long-context inference memory-friendly.
Need more context? See smol-tools-4b-32k for 32K context support. Need less? See smol-tools-4b for the highest-accuracy 4K variant.
Available Formats
| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.948 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-16k-q8_0.gguf | 4.9 GB | 0.923 | Near-lossless — Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-16k-q4_k_m.gguf | 2.9 GB | 0.928 | Edge deployment — Jetson Orin Nano, phones, RPi 5 |
All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.
Why 16K Context?
The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:
- Multi-turn conversations — 5-10 rounds of tool calls and results accumulating in context
- Long tool outputs — database queries, file reads, and web pages that consume thousands of tokens
- Complex planning — reasoning over prior results to decide next steps
This model was specifically trained on multi-turn tool-use data (up to 16K tokens) so it maintains tool-calling accuracy across long conversations, not just short single-turn queries.
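To make the context-accumulation point concrete, here is a minimal sketch of how a multi-turn agent session grows: each round appends the assistant's tool call plus a tool-result message to the history. The helper name and message shapes are illustrative assumptions following the common chat-template convention, not this model's required API.

```python
import json

def append_tool_round(messages, tool_call, tool_result):
    """Hypothetical helper: append one round of tool call + result to the history.

    Illustrates how context accumulates across turns; the assistant emits a
    <tool_call> block, and the result comes back as a "tool" role message.
    """
    messages.append({
        "role": "assistant",
        "content": f'<tool_call>\n{json.dumps(tool_call)}\n</tool_call>',
    })
    messages.append({"role": "tool", "content": json.dumps(tool_result)})
    return messages

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "Summarize the sales table."},
]

# Each round adds two messages, so a 10-round session with realistic tool
# outputs (query results, file contents) easily outgrows a 4K window.
for i in range(3):
    append_tool_round(
        messages,
        {"name": "database_query", "arguments": {"sql": f"SELECT * FROM sales LIMIT {i}"}},
        {"rows": i},
    )
```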
Results (200-example held-out eval)
| Metric | smol-tools-4b (4K) | smol-tools-4b-16k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.948 | -0.7% |
| Tool Precision | 0.955 | 0.948 | -0.7% |
| Tool Recall | 0.980 | 0.975 | -0.5% |
| JSON Validity | 100% | 100% | — |
| Argument Correctness | 100% | 100% | — |
| No-Tool Accuracy | 100% | 100% | — |
| Max Context | 4,096 | 16,384 | 4x |
Tool-calling accuracy is near-identical to the 4K model while supporting 4x the context length.
Per-Scenario Breakdown
| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| error_recovery | 1.000 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.962 | 53 | One tool call needed |
| reasoning_heavy | 0.876 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |
Capabilities
- 16K context window — handles multi-turn agent conversations with accumulated tool results
- Tool selection: Picks the right tool(s) from a provided set with 94.8% F1
- Structured output: Produces valid `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` JSON — 100% validity
- Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
- Multi-tool: Handles parallel and sequential multi-tool scenarios perfectly
- Error recovery: Perfect F1 on error recovery scenarios (malformed inputs, missing data)
Available Tools (training set)
The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:
web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command
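Since the model generalizes to unseen tool schemas, you can pass any JSON-Schema-style function definition at inference. The tool below (`get_stock_price`) is a hypothetical example not in the training set, along with a small validation helper for checking model-produced arguments against the schema's `required` list:

```python
# Hypothetical custom tool not in the training set; any JSON-Schema-style
# function definition can be passed via the `tools` argument at inference.
stock_tool = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string"},
                "currency": {"type": "string", "enum": ["USD", "EUR"]},
            },
            "required": ["ticker"],
        },
    },
}

def missing_required(schema, arguments):
    """Return required parameter names absent from a model-produced argument dict."""
    required = schema["function"]["parameters"].get("required", [])
    return [name for name in required if name not in arguments]
```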
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-16k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-16k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
```
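To act on the generated text, you need to pull the JSON payloads out of the `<tool_call>` blocks. Here is one minimal sketch of such a parser; the function name and the skip-on-malformed behavior are my choices, not part of the model's API:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract every <tool_call> JSON payload from generated text.

    Returns a list of {"name": ..., "arguments": ...} dicts. Malformed JSON
    is skipped rather than raised, though the model's eval shows 100% JSON
    validity so in practice this branch should not trigger.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue
    return calls

sample = """<tool_call>
{"name": "web_search", "arguments": {"query": "latest SpaceX news"}}
</tool_call>"""
parsed = parse_tool_calls(sample)
```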
With vLLM (faster)
```python
from vllm import LLM, SamplingParams

# `prompt` is the chat-templated string built in the Quick Start example above
llm = LLM(model="enfuse/smol-tools-4b-16k", dtype="bfloat16", max_model_len=16384, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```
With llama.cpp (GGUF, edge devices)
```bash
# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-16k smol-tools-4b-16k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API); raise -c to 16384 for the full context window
llama-server -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"
```

Or with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-16k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
```
Example Outputs (real model outputs, not hand-written)
These are actual outputs from the model at temperature=0.1:
Example 1 — Single tool call (input: "What's the weather like in Tokyo right now?")
```
<think>
The user is asking about the current weather in Tokyo. I have a weather tool
available that can provide this information. Let me use the get_weather function
with Tokyo as the location. Since Tokyo is in Japan, I'll use Celsius as the
default unit.
</think>
<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo", "units": "celsius"}}
</tool_call>
```
Example 2 — No tool needed (input: "What is the capital of France?")
```
<think>
The user is asking a simple factual question about the capital of France. This
is general knowledge that I can answer directly without needing to use any of the
available tools (get_weather, web_search, or calculate).
</think>
The capital of France is **Paris**.

This is a well-known geographical fact that doesn't require any tools to verify.
Paris has been the capital of France for centuries and is one of the world's most
famous cities, known for landmarks like the Eiffel Tower, the Louvre Museum, and
Notre-Dame Cathedral.
```
Example 3 — Multiple parallel tool calls (input: "I'm planning a trip to Paris next week. What's the weather forecast, and can you calculate how many euros I'd get for 500 USD at a rate of 0.92?")
```
<think>
The user is asking for two things:
1. Weather forecast for Paris next week
2. Currency conversion calculation (500 USD to euros at 0.92 rate)
For the weather, I can use the get_weather function. For the calculation, I can
use the calculate function with the expression "500 * 0.92" to convert USD to
euros. Let me make both function calls.
</think>
I'll help you with both the weather information and the currency calculation for
your Paris trip.
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>
```
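On the caller's side, parallel tool calls like those in Example 3 can be executed by a simple dispatcher. The sketch below uses stub implementations under the training-set tool names (`get_weather`, `calculate`); the bodies are illustrative stand-ins, not the real tools:

```python
import ast

def get_weather(location, units="celsius"):
    """Stub: a real implementation would query a weather API."""
    return {"location": location, "forecast": "mild", "units": units}

def calculate(expression):
    """Evaluate basic arithmetic safely via the AST; a real deployment
    should use a hardened expression evaluator."""
    node = ast.parse(expression, mode="eval")
    allowed = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
               ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)
    if not all(isinstance(n, allowed) for n in ast.walk(node)):
        raise ValueError("unsupported expression")
    return {"result": eval(compile(node, "<expr>", "eval"))}

TOOLS = {"get_weather": get_weather, "calculate": calculate}

def dispatch(tool_calls):
    """Run each parsed tool call and return its JSON-serializable result."""
    return [TOOLS[call["name"]](**call["arguments"]) for call in tool_calls]

results = dispatch([
    {"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}},
    {"name": "calculate", "arguments": {"expression": "500 * 0.92"}},
])
```

The results would then be appended to the conversation as tool messages before the next generation step.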
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 7,557 (6,855 short-context + 702 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 2 (× 16 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 16,384 |
| Eval loss | 0.134 |
| Token accuracy | 95.3% |
| Training time | ~14.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |
Data Pipeline
- Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
- Multi-turn long-context data (702 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
- Combined: 7,557 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤16K tokens
smol-tools Family
All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:
| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA Config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | Rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | Rank 64, α=128 | this repo |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | Rank 64, α=128 | enfuse/smol-tools-4b-32k |
How to choose:
- 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory — also available in GGUF quantized formats
- 16K (this model): Multi-turn conversations (5-10 rounds), moderate tool outputs — best balance of accuracy and context length
- 32K: Extended agent sessions (10-20 rounds), large tool outputs — also available in GGUF quantized formats
When to Use This Model
- You're building an agent that needs multi-turn tool conversations — research assistants, code generators, data analysis pipelines
- Your tool outputs are long — database query results, file contents, web scrapes that push context beyond 4K
- You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
- You want structured output you can trust at long context — 100% JSON validity even with 16K of accumulated context
When NOT to Use This Model
- If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
- If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
- If your conversations routinely exceed 16K tokens, use smol-tools-4b-32k (32K context)
Limitations
- complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
- Trained on synthetic data only — real-world tool-use patterns may differ
- The 16K training data was generated by a 27B teacher model; very long conversations (>12K tokens) may show quality degradation compared to shorter ones
- Inherits Qwen3.5-4B base model limitations (knowledge cutoff)
Quantization Results (200-example eval)
| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.948 | 0.948 | 0.975 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.923 | 0.922 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.928 | 0.927 | 0.950 | 100% | 100% | 100% |
All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.4x smaller) retains 98% of the BF16 model's tool-calling accuracy.
Hardware
- Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~14.4 hours
- Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
- Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
- Inference (Q8_0 GGUF, 16K context): Any device with ≥8 GB RAM — Jetson Orin NX/AGX, consumer GPUs
- Inference (Q4_K_M GGUF, 16K context): Any device with ≥5 GB RAM — Jetson Orin Nano, phones, RPi 5
- Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM
Attribution
- Base model: Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled by Jackrong
- Training framework: TRL + PEFT by HuggingFace
- Inference: vLLM