smol-tools-4b-16k — Long-Context Agentic Tool-Use Model

A 4B parameter model fine-tuned for reliable tool calling with 16K context support. Handles multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 4x larger than the original smol-tools-4b.

Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 7,557 examples including multi-turn tool-use conversations up to 16K tokens.

Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention — 24 linear + 8 full-attention). Qwen3.5's efficient attention makes long-context inference memory-friendly.

Need more context? See smol-tools-4b-32k for 32K context support. Need less? See smol-tools-4b for the highest-accuracy 4K variant.

Available Formats

| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.948 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-16k-q8_0.gguf | 4.9 GB | 0.923 | Near-lossless — Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-16k-q4_k_m.gguf | 2.9 GB | 0.928 | Edge deployment — Jetson Orin Nano, phones, RPi 5 |

All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.

Why 16K Context?

The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:

  • Multi-turn conversations — 5-10 rounds of tool calls and results accumulating in context
  • Long tool outputs — database queries, file reads, and web pages that consume thousands of tokens
  • Complex planning — reasoning over prior results to decide next steps

This model was specifically trained on multi-turn tool-use data (up to 16K tokens) so it maintains tool-calling accuracy across long conversations, not just short single-turn queries.
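
The loop these workflows follow can be sketched in a few lines. Here `call_model` and `run_tool` are hypothetical stand-ins for your inference backend and tool executor; the tag format matches the `<tool_call>` convention this card's examples use:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def agent_loop(messages, call_model, run_tool, max_rounds=10):
    """Run tool-call rounds until the model answers without requesting a tool."""
    for _ in range(max_rounds):
        reply = call_model(messages)                  # model output as a string
        messages.append({"role": "assistant", "content": reply})
        calls = TOOL_CALL_RE.findall(reply)
        if not calls:                                 # plain answer: we're done
            return reply
        for raw in calls:                             # execute each requested tool
            call = json.loads(raw)
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    return None                                       # hit the round limit
```

Each round appends the assistant turn plus one `tool` turn per call, which is exactly how context accumulates toward the 16K window.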

Results (200-example held-out eval)

| Metric | smol-tools-4b (4K) | smol-tools-4b-16k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.948 | -0.7% |
| Tool Precision | 0.955 | 0.948 | -0.7% |
| Tool Recall | 0.980 | 0.975 | -0.5% |
| JSON Validity | 100% | 100% | 0 |
| Argument Correctness | 100% | 100% | 0 |
| No-Tool Accuracy | 100% | 100% | 0 |
| Max Context | 4,096 | 16,384 | 4x |

Tool-calling accuracy is near-identical to the 4K model while supporting 4x the context length.

Per-Scenario Breakdown

| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| error_recovery | 1.000 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.962 | 53 | One tool call needed |
| reasoning_heavy | 0.876 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |

Capabilities

  • 16K context window — handles multi-turn agent conversations with accumulated tool results
  • Tool selection: Picks the right tool(s) from a provided set with 94.8% F1
  • Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON — 100% validity
  • Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
  • Multi-tool: Handles parallel and sequential multi-tool scenarios perfectly
  • Error recovery: Perfect F1 on error recovery scenarios (malformed inputs, missing data)

Available Tools (training set)

The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:

web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-16k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-16k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
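
After you execute the requested tool, its result goes back into the conversation as a `tool`-role message and generation runs again. A sketch of the message list for the next round (the assistant content and search payload below are illustrative, not real model output):

```python
import json

# Conversation state after one tool round (contents are made up for illustration)
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
    # assistant turn: the model's raw output, including its <tool_call> block
    {"role": "assistant", "content": '<tool_call>\n{"name": "web_search", "arguments": {"query": "SpaceX news"}}\n</tool_call>'},
    # tool turn: your executor's result, serialized as JSON
    {"role": "tool", "content": json.dumps({"results": ["SpaceX launched ..."]})},
]

# Re-apply the chat template on this list and generate again for the final answer:
# prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
```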

With vLLM (faster)

from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b-16k", dtype="bfloat16", max_model_len=16384, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)  # prompt: the chat-templated string from the Quick Start
print(outputs[0].outputs[0].text)

With llama.cpp (GGUF, edge devices)

# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-16k smol-tools-4b-16k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API); raise -c toward 16384 if memory allows
llama-server -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"

Or with llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-16k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)  # raise n_ctx up to 16384 as memory allows
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
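
llama-cli and the raw llama-cpp-python call take an already-templated prompt string, so the chat template must be applied by hand. A minimal sketch assuming the Qwen-style ChatML format implied by the `<|im_end|>` stop token; check the template bundled in the GGUF before relying on it, and note that real tool use also requires the tool schemas to be injected per that template:

```python
def chatml_prompt(messages):
    """Render a message list in ChatML, ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant")
    return "\n".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
])
```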

Example Outputs (real model outputs, not hand-written)

These are actual outputs from the model at temperature=0.1:

Example 1 — Single tool call (input: "What's the weather like in Tokyo right now?")

<think>
The user is asking about the current weather in Tokyo. I have a weather tool
available that can provide this information. Let me use the get_weather function
with Tokyo as the location. Since Tokyo is in Japan, I'll use Celsius as the
default unit.
</think>

<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo", "units": "celsius"}}
</tool_call>

Example 2 — No tool needed (input: "What is the capital of France?")

<think>
The user is asking a simple factual question about the capital of France. This
is general knowledge that I can answer directly without needing to use any of the
available tools (get_weather, web_search, or calculate).
</think>

The capital of France is **Paris**.

This is a well-known geographical fact that doesn't require any tools to verify.
Paris has been the capital of France for centuries and is one of the world's most
famous cities, known for landmarks like the Eiffel Tower, the Louvre Museum, and
Notre-Dame Cathedral.

Example 3 — Multiple parallel tool calls (input: "I'm planning a trip to Paris next week. What's the weather forecast, and can you calculate how many euros I'd get for 500 USD at a rate of 0.92?")

<think>
The user is asking for two things:
1. Weather forecast for Paris next week
2. Currency conversion calculation (500 USD to euros at 0.92 rate)

For the weather, I can use the get_weather function. For the calculation, I can
use the calculate function with the expression "500 * 0.92" to convert USD to
euros. Let me make both function calls.
</think>

I'll help you with both the weather information and the currency calculation for
your Paris trip.

<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>
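
Outputs like Example 3 carry multiple `<tool_call>` blocks. A minimal regex-based parser sketch, assuming the exact tag format shown above:

```python
import json
import re

def parse_tool_calls(text):
    """Extract and decode every <tool_call> JSON block from model output."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(text)]

output = """<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>"""

calls = parse_tool_calls(output)  # two calls: get_weather, then calculate
```

Because the model's JSON validity is 100% on this eval, `json.loads` rarely needs a fallback, but production code should still guard against decode errors.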

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 7,557 (6,855 short-context + 702 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 2 (× 16 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 16,384 |
| Eval loss | 0.134 |
| Token accuracy | 95.3% |
| Training time | ~14.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |

Data Pipeline

  1. Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
  2. Multi-turn long-context data (702 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
  3. Combined: 7,557 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤16K tokens
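
The combine-and-filter in step 3 is a single pass over both pools; a sketch where `count_tokens` is a hypothetical stand-in for the tokenizer's length function (not part of the released pipeline):

```python
def build_dataset(short_examples, long_examples, count_tokens, max_tokens=16384):
    """Merge the short- and long-context pools, dropping anything over the window."""
    return [ex for ex in short_examples + long_examples
            if count_tokens(ex) <= max_tokens]
```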

smol-tools Family

All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:

| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | rank 64, α=128 | this repo |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | rank 64, α=128 | enfuse/smol-tools-4b-32k |

How to choose:

  • 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory — also available in GGUF quantized formats
  • 16K (this model): Multi-turn conversations (5-10 rounds), moderate tool outputs — best balance of accuracy and context length
  • 32K: Extended agent sessions (10-20 rounds), large tool outputs — also available in GGUF quantized formats

When to Use This Model

  • You're building an agent that needs multi-turn tool conversations — research assistants, code generators, data analysis pipelines
  • Your tool outputs are long — database query results, file contents, web scrapes that push context beyond 4K
  • You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
  • You want structured output you can trust at long context — 100% JSON validity even with 16K of accumulated context

When NOT to Use This Model

  • If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
  • If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
  • If your conversations routinely exceed 16K tokens, use smol-tools-4b-32k (32K context)

Limitations

  • complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
  • Trained on synthetic data only — real-world tool-use patterns may differ
  • The 16K training data was generated by a 27B teacher model; very long conversations (>12K tokens) may show quality degradation compared to shorter ones
  • Inherits Qwen3.5-4B base model limitations (knowledge cutoff)

Quantization Results (200-example eval)

| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.948 | 0.948 | 0.975 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.923 | 0.922 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.928 | 0.927 | 0.950 | 100% | 100% | 100% |

All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.4x smaller) retains 98% of the BF16 model's tool-calling accuracy.

Hardware

  • Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~14.4 hours
  • Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
  • Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
  • Inference (Q8_0 GGUF, 16K context): Any device with ≥8 GB RAM — Jetson Orin NX/AGX, consumer GPUs
  • Inference (Q4_K_M GGUF, 16K context): Any device with ≥5 GB RAM — Jetson Orin Nano, phones, RPi 5
  • Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM
