smol-tools-4b-16k — Long-Context Agentic Tool-Use Model

A 4B parameter model fine-tuned for reliable tool calling with 16K context support. Handles multi-turn tool-use conversations, long document analysis with tool calls, and complex multi-step agent workflows — all within a single context window 4x larger than the original smol-tools-4b.

Built on Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled, trained with LoRA on 7,557 examples including multi-turn tool-use conversations up to 16K tokens.

Architecture: Qwen3_5ForCausalLM (text-only, 32 layers, hybrid attention — 24 linear + 8 full-attention). Qwen3.5's efficient attention makes long-context inference memory-friendly.

Need more context? See smol-tools-4b-32k for 32K context support. Need less? See smol-tools-4b for the highest-accuracy 4K variant.

Available Formats

| Format | File | Size | Tool F1 | Use Case |
|---|---|---|---|---|
| BF16 safetensors | model.safetensors | 9.7 GB | 0.948 | GPU inference with transformers / vLLM |
| Q8_0 GGUF | smol-tools-4b-16k-q8_0.gguf | 4.9 GB | 0.923 | Near-lossless — Jetson Orin NX/AGX, 8GB+ GPUs |
| Q4_K_M GGUF | smol-tools-4b-16k-q4_k_m.gguf | 2.9 GB | 0.928 | Edge deployment — Jetson Orin Nano, phones, RPi 5 |

All formats maintain 100% JSON validity, 100% argument correctness, and 100% no-tool accuracy. GGUF files run with llama.cpp, ollama, or llama-cpp-python.

Why 16K Context?

The original smol-tools-4b (4K context) works well for single-turn tool calls. But real agent workflows often involve:

  • Multi-turn conversations — 5-10 rounds of tool calls and results accumulating in context
  • Long tool outputs — database queries, file reads, and web pages that consume thousands of tokens
  • Complex planning — reasoning over prior results to decide next steps

This model was specifically trained on multi-turn tool-use data (up to 16K tokens) so it maintains tool-calling accuracy across long conversations, not just short single-turn queries.
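
The loop these workflows follow can be sketched in a few lines. Here `call_model` and `run_tool` are hypothetical stand-ins for your inference backend and tool executor; the tag format matches the `<tool_call>` convention this card's examples use:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def agent_loop(messages, call_model, run_tool, max_rounds=10):
    """Run tool-call rounds until the model answers without requesting a tool."""
    for _ in range(max_rounds):
        reply = call_model(messages)                  # model output as a string
        messages.append({"role": "assistant", "content": reply})
        calls = TOOL_CALL_RE.findall(reply)
        if not calls:                                 # plain answer: we're done
            return reply
        for raw in calls:                             # execute each requested tool
            call = json.loads(raw)
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
    return None                                       # hit the round limit
```

Each round appends the assistant turn plus one `tool` turn per call, which is exactly how context accumulates toward the 16K window.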

Results (200-example held-out eval)

| Metric | smol-tools-4b (4K) | smol-tools-4b-16k | Delta |
|---|---|---|---|
| Tool Selection F1 | 0.955 | 0.948 | -0.7% |
| Tool Precision | 0.955 | 0.948 | -0.7% |
| Tool Recall | 0.980 | 0.975 | -0.5% |
| JSON Validity | 100% | 100% | 0 |
| Argument Correctness | 100% | 100% | 0 |
| No-Tool Accuracy | 100% | 100% | 0 |
| Max Context | 4,096 | 16,384 | 4x |

Tool-calling accuracy is near-identical to the 4K model while supporting 4x the context length.

Per-Scenario Breakdown

| Scenario | F1 | Count | Description |
|---|---|---|---|
| multi_tool_parallel | 1.000 | 18 | Multiple independent tool calls |
| multi_tool_sequential | 1.000 | 36 | Chained tool calls with dependencies |
| no_tool_needed | 1.000 | 18 | Questions answerable without tools |
| error_recovery | 1.000 | 18 | Handling malformed inputs or missing data |
| single_tool | 0.962 | 53 | One tool call needed |
| reasoning_heavy | 0.876 | 35 | Complex reasoning before tool selection |
| complex_multi_step | 0.818 | 22 | Multi-step workflows with planning |

Capabilities

  • 16K context window — handles multi-turn agent conversations with accumulated tool results
  • Tool selection: Picks the right tool(s) from a provided set with 94.8% F1
  • Structured output: Produces valid <tool_call>{"name": "...", "arguments": {...}}</tool_call> JSON — 100% validity
  • Tool refusal: Correctly answers directly when no tool is needed — 100% accuracy
  • Multi-tool: Handles parallel and sequential multi-tool scenarios perfectly
  • Error recovery: Perfect F1 on error recovery scenarios (malformed inputs, missing data)

Available Tools (training set)

The model was trained with these 15 tools but generalizes to new tool schemas provided at inference:

web_search, get_webpage, execute_python, read_file, write_file, list_directory, send_email, get_current_datetime, calculate, translate, get_weather, create_calendar_event, database_query, http_request, shell_command

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "enfuse/smol-tools-4b-16k",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("enfuse/smol-tools-4b-16k", trust_remote_code=True)

tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for information",
        "parameters": {"type": "object", "properties": {
            "query": {"type": "string"}
        }, "required": ["query"]}
    }}
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
]

prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.1, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False))
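
After you execute the requested tool, its result goes back into the conversation as a `tool`-role message and generation runs again. A sketch of the message list for the next round (the assistant content and search payload below are illustrative, not real model output):

```python
import json

# Conversation state after one tool round (contents are made up for illustration)
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the latest news about SpaceX?"},
    # assistant turn: the model's raw output, including its <tool_call> block
    {"role": "assistant", "content": '<tool_call>\n{"name": "web_search", "arguments": {"query": "SpaceX news"}}\n</tool_call>'},
    # tool turn: your executor's result, serialized as JSON
    {"role": "tool", "content": json.dumps({"results": ["SpaceX launched ..."]})},
]

# Re-apply the chat template on this list and generate again for the final answer:
# prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False, add_generation_prompt=True)
```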

With vLLM (faster)

from vllm import LLM, SamplingParams

llm = LLM(model="enfuse/smol-tools-4b-16k", dtype="bfloat16", max_model_len=16384, enforce_eager=True)
sampling = SamplingParams(max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
outputs = llm.generate([prompt], sampling)  # prompt: the chat-templated string from the Quick Start
print(outputs[0].outputs[0].text)

With llama.cpp (GGUF, edge devices)

# Download the Q4_K_M GGUF (2.9 GB) for edge deployment
huggingface-cli download enfuse/smol-tools-4b-16k smol-tools-4b-16k-q4_k_m.gguf --local-dir .

# Run with llama-server (OpenAI-compatible API); raise -c toward 16384 if memory allows
llama-server -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 --port 8080

# Or with llama-cli for one-shot inference
llama-cli -m smol-tools-4b-16k-q4_k_m.gguf -c 4096 -ngl 99 -p "<your prompt>"

Or with llama-cpp-python:

from llama_cpp import Llama

llm = Llama(model_path="smol-tools-4b-16k-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)  # raise n_ctx up to 16384 as memory allows
output = llm(prompt, max_tokens=2048, temperature=0.1, stop=["<|im_end|>"])
print(output["choices"][0]["text"])
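
llama-cli and the raw llama-cpp-python call take an already-templated prompt string, so the chat template must be applied by hand. A minimal sketch assuming the Qwen-style ChatML format implied by the `<|im_end|>` stop token; check the template bundled in the GGUF before relying on it, and note that real tool use also requires the tool schemas to be injected per that template:

```python
def chatml_prompt(messages):
    """Render a message list in ChatML, ending with an open assistant turn."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant")
    return "\n".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What's the weather in Tokyo?"},
])
```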

Example Outputs (real model outputs, not hand-written)

These are actual outputs from the model at temperature=0.1:

Example 1 — Single tool call (input: "What's the weather like in Tokyo right now?")

<think>
The user is asking about the current weather in Tokyo. I have a weather tool
available that can provide this information. Let me use the get_weather function
with Tokyo as the location. Since Tokyo is in Japan, I'll use Celsius as the
default unit.
</think>

<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo", "units": "celsius"}}
</tool_call>

Example 2 — No tool needed (input: "What is the capital of France?")

<think>
The user is asking a simple factual question about the capital of France. This
is general knowledge that I can answer directly without needing to use any of the
available tools (get_weather, web_search, or calculate).
</think>

The capital of France is **Paris**.

This is a well-known geographical fact that doesn't require any tools to verify.
Paris has been the capital of France for centuries and is one of the world's most
famous cities, known for landmarks like the Eiffel Tower, the Louvre Museum, and
Notre-Dame Cathedral.

Example 3 — Multiple parallel tool calls (input: "I'm planning a trip to Paris next week. What's the weather forecast, and can you calculate how many euros I'd get for 500 USD at a rate of 0.92?")

<think>
The user is asking for two things:
1. Weather forecast for Paris next week
2. Currency conversion calculation (500 USD to euros at 0.92 rate)

For the weather, I can use the get_weather function. For the calculation, I can
use the calculate function with the expression "500 * 0.92" to convert USD to
euros. Let me make both function calls.
</think>

I'll help you with both the weather information and the currency calculation for
your Paris trip.

<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>
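
Outputs like Example 3 carry multiple `<tool_call>` blocks. A minimal regex-based parser sketch, assuming the exact tag format shown above:

```python
import json
import re

def parse_tool_calls(text):
    """Extract and decode every <tool_call> JSON block from model output."""
    pattern = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
    return [json.loads(block) for block in pattern.findall(text)]

output = """<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris", "units": "celsius"}}
</tool_call>
<tool_call>
{"name": "calculate", "arguments": {"expression": "500 * 0.92"}}
</tool_call>"""

calls = parse_tool_calls(output)  # two calls: get_weather, then calculate
```

Because the model's JSON validity is 100% on this eval, `json.loads` rarely needs a fallback, but production code should still guard against decode errors.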

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-4B-Claude-4.6-Opus-Reasoning-Distilled |
| Method | LoRA (rank 64, alpha 128) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training examples | 7,557 (6,855 short-context + 702 multi-turn long-context) |
| Epochs | 3 |
| Batch size | 2 (× 16 gradient accumulation = effective 32) |
| Learning rate | 1e-4 (cosine schedule) |
| Max sequence length | 16,384 |
| Eval loss | 0.134 |
| Token accuracy | 95.3% |
| Training time | ~14.4 hours on 1× NVIDIA H200 |
| Framework | TRL SFTTrainer + PEFT |

Data Pipeline

  1. Short-context data (6,855 examples): Quality-filtered synthetic tool-use conversations from smol-tools-4b training, covering all 7 scenario types at up to 4K tokens
  2. Multi-turn long-context data (702 examples): New synthetic conversations generated by Qwen3.5-27B teacher at 64K context, featuring 8-16 rounds of tool calls with realistic tool outputs (database results, file contents, web pages). Cleaned to remove malformed tool calls and ensure proper conversation endings
  3. Combined: 7,557 examples with a mix of short single-turn and long multi-turn conversations, filtered to ≤16K tokens
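
The combine-and-filter in step 3 is a single pass over both pools; a sketch where `count_tokens` is a hypothetical stand-in for the tokenizer's length function (not part of the released pipeline):

```python
def build_dataset(short_examples, long_examples, count_tokens, max_tokens=16384):
    """Merge the short- and long-context pools, dropping anything over the window."""
    return [ex for ex in short_examples + long_examples
            if count_tokens(ex) <= max_tokens]
```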

smol-tools Family

All models share the same base architecture, tool schema, and output format. Choose based on your context length needs:

| Model | Context | Tool F1 | JSON Valid | No-Tool Acc | LoRA config | HF Repo |
|---|---|---|---|---|---|---|
| smol-tools-4b | 4K | 0.955 | 100% | 100% | rank 32, α=64 | enfuse/smol-tools-4b |
| smol-tools-4b-16k | 16K | 0.948 | 100% | 100% | rank 64, α=128 | this repo |
| smol-tools-4b-32k | 32K | 0.940 | 100% | 100% | rank 64, α=128 | enfuse/smol-tools-4b-32k |

How to choose:

  • 4K: Single-turn tool calls, short tool outputs — highest accuracy, lowest memory — also available in GGUF quantized formats
  • 16K (this model): Multi-turn conversations (5-10 rounds), moderate tool outputs — best balance of accuracy and context length
  • 32K: Extended agent sessions (10-20 rounds), large tool outputs — also available in GGUF quantized formats

When to Use This Model

  • You're building an agent that needs multi-turn tool conversations — research assistants, code generators, data analysis pipelines
  • Your tool outputs are long — database query results, file contents, web scrapes that push context beyond 4K
  • You need a small, fast model that can handle extended agent sessions without losing tool-calling accuracy
  • You want structured output you can trust at long context — 100% JSON validity even with 16K of accumulated context

When NOT to Use This Model

  • If all your tool calls are single-turn with short outputs, use smol-tools-4b (4K) — it's slightly more accurate and uses less memory
  • If you need GPT-4-level complex multi-step planning (our weakest category at F1=0.818), use a bigger model
  • If your conversations routinely exceed 16K tokens, use smol-tools-4b-32k (32K context)

Limitations

  • complex_multi_step scenarios (F1=0.818) remain the weakest — the model sometimes struggles with multi-step planning involving 3+ chained tools
  • Trained on synthetic data only — real-world tool-use patterns may differ
  • The 16K training data was generated by a 27B teacher model; very long conversations (>12K tokens) may show quality degradation compared to shorter ones
  • Inherits Qwen3.5-4B base model limitations (knowledge cutoff)

Quantization Results (200-example eval)

| Format | Size | Tool F1 | Precision | Recall | JSON | Args | No-Tool |
|---|---|---|---|---|---|---|---|
| BF16 | 9.7 GB | 0.948 | 0.948 | 0.975 | 100% | 100% | 100% |
| Q8_0 | 4.9 GB | 0.923 | 0.922 | 0.955 | 100% | 100% | 100% |
| Q4_K_M | 2.9 GB | 0.928 | 0.927 | 0.950 | 100% | 100% | 100% |

All formats preserve perfect JSON validity and argument correctness. The Q4_K_M quantization (3.4x smaller) retains 98% of the BF16 model's tool-calling accuracy.

Hardware

  • Training: 1× NVIDIA H200 NVL (141 GB HBM3e), ~14.4 hours
  • Inference (BF16, 16K context): Any GPU with ≥16 GB VRAM
  • Inference (BF16, 4K context): Any GPU with ≥10 GB VRAM
  • Inference (Q8_0 GGUF, 16K context): Any device with ≥8 GB RAM — Jetson Orin NX/AGX, consumer GPUs
  • Inference (Q4_K_M GGUF, 16K context): Any device with ≥5 GB RAM — Jetson Orin Nano, phones, RPi 5
  • Inference (Q4_K_M GGUF, 4K context): Any device with ≥4 GB RAM
