nl-router-klein

Small Dutch intent classifier — 46M parameters, 60-way MASSIVE intent routing for local-first LLM workflows.

nl-router-klein ("klein" = small in Dutch) is a single-label classifier that reads a short Dutch utterance and predicts one of the 60 canonical MASSIVE intents. It's the fast, deterministic "decide what the user wants" step that sits before an LLM call — routing cheap intents to local tools, picking the right RAG index, selecting the right agent skill, or escalating ambiguous utterances to a full LLM.

Before you use this model, read the Limitations section carefully. The 0.856 MASSIVE benchmark reflects performance on MASSIVE-style voice-assistant utterances in the 60-intent schema; real-world queries that don't fit those 60 intents will be mis-routed unless you use a confidence threshold (see Confidence-based routing below).

It is the routing tier of the LokaalHub Dutch LLM-workflow family.

At a glance

Base model: DTAI-KULeuven/robbertje-1-gb-bort (RobBERTje-BORT, distilled from RobBERT)
Parameters: 46M
Disk size: 178.0 MB (fp32)
Language: Dutch (nl)
Task: 60-way intent classification (MASSIVE schema)
License: Apache-2.0
Training data: MASSIVE nl-NL (CC-BY-4.0)
MASSIVE nl-NL test intent accuracy: 0.856 (3-seed mean)
Macro-F1: 0.813 (3-seed mean)
Scenario accuracy (18-way): 0.900
ECE (calibration): 0.033
CPU throughput: ~179 utterances/sec (single M1 P-core, fp32, batch 64, torch.set_num_threads(1))
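
The throughput figure can be sanity-checked with a minimal timing loop along these lines (a sketch, not the repo's benchmark script; it assumes `tokenizer`, `model`, and a list of test utterances `utts` are already loaded as in the Quick start below):

```python
import time
import torch

torch.set_num_threads(1)           # single CPU core, as reported above
model.eval()

batch = utts[:64]                  # batch 64, fp32 (PyTorch's default dtype)
inputs = tokenizer(batch, truncation=True, max_length=64,
                   padding="longest", return_tensors="pt")
with torch.no_grad():
    model(**inputs)                # warm-up pass
    t0 = time.perf_counter()
    n_batches = 50
    for _ in range(n_batches):
        model(**inputs)
    dt = time.perf_counter() - t0
print(f"{n_batches * len(batch) / dt:.0f} utt/sec")
```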

What to use this for

Any Dutch LLM workflow where a tiny, fast, deterministic "decide what to do" step makes the whole pipeline better, cheaper, and more auditable.

Primary use cases

  1. Local-vs-cloud gating. Route cheap intents (alarm_set, weather_query, calendar_query) to hardcoded local tools; escalate open-ended intents (qa_factoid, general_quirky) to a full local or cloud LLM. A single inference on this 46M model takes a few milliseconds on CPU vs. 500–5000 ms for even a local 7B LLM.

  2. Tool / skill selection in agent frameworks. Drop-in replacement for LLM-based function-calling when the tool taxonomy matches MASSIVE's intents. Deterministic, auditable, and ~100× faster than a full LLM round-trip.

  3. RAG index selection. Map intent → knowledge source: cooking_recipe → cookbook index, news_query → news index, qa_factoid → encyclopedia index, etc. (see the sketch after this list).

  4. Confidence-based LLM escalation. Use max(softmax) < threshold to fall back to an LLM on out-of-distribution queries. Calibration (ECE = 0.033) is tracked so the threshold is meaningful.

  5. Deterministic routing logs. Unlike LLM-based routing, every decision is reproducible and auditable. Critical for governance-sensitive Dutch deployments (healthcare, legal, HR, government).
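
For use case 3, the intent → index mapping can be a plain dispatch table. A minimal sketch (the index names are hypothetical, the table covers only three of the 60 intents, and `route` is the confidence-gated helper defined under Confidence-based routing below):

```python
# Hypothetical intent → RAG-index dispatch table (covers only 3 of 60 intents).
INTENT_TO_INDEX = {
    "cooking_recipe": "cookbook",
    "news_query": "news",
    "qa_factoid": "encyclopedia",
}

def pick_index(utterance: str, default: str = "general") -> str:
    decision = route(utterance)  # defined under "Confidence-based routing"
    if decision["action"] == "escalate_to_llm":
        return default           # low confidence: fall back to a generic index
    return INTENT_TO_INDEX.get(decision["intent"], default)
```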

What makes it suitable for these

  • Local and private. Runs fully on-device. No data leaves the laptop / VM / cluster. Critical for GDPR-sensitive Dutch workloads.
  • Fast. ~179 utt/sec on a single M1 CPU core. You can route every incoming user message without perceptible latency.
  • Small. 178.0 MB fits alongside an agent runtime, a local vector DB, and a lightweight generation model.
  • Deterministic. Same input → same output → auditable routing logs.
  • Cheap. Zero marginal cost per inference.

Read the Limitations section before committing — this is the klein (fast) tier, trained only on the 60 MASSIVE intents. Anything outside that taxonomy will be mis-routed unless you gate with a confidence threshold.

Quick start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("LokaalHub/nl-router-klein")
model = AutoModelForSequenceClassification.from_pretrained("LokaalHub/nl-router-klein")

utterance = "Zet de wekker op zeven uur morgenochtend"  # "set the alarm for seven tomorrow morning"
inputs = tokenizer(utterance, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(-1).squeeze()
intent_id = int(probs.argmax())
confidence = float(probs[intent_id])
intent_name = model.config.id2label[intent_id]
print(f"intent: {intent_name}  confidence: {confidence:.3f}")
# intent: alarm_set  confidence: 0.97
```

Confidence-based routing

```python
CONFIDENCE_THRESHOLD = 0.60  # below this, escalate to LLM

def route(utterance: str):
    inputs = tokenizer(utterance, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1).squeeze()
    intent_id = int(probs.argmax())
    confidence = float(probs[intent_id])
    if confidence < CONFIDENCE_THRESHOLD:
        return {"action": "escalate_to_llm", "reason": "low_confidence", "confidence": confidence}
    return {"action": "route", "intent": model.config.id2label[intent_id], "confidence": confidence}
```

Benchmark — MASSIVE nl-NL test split (2,974 utterances)

All models below were evaluated under an identical preprocessing protocol: raw utt field fed to the tokenizer with no lowercasing, punctuation stripping, or NFC normalisation (per FitzGerald et al. 2022's "default spacing provided by annotators" rule), fp32, batch 64, max_length=64, CPU. Every competitor's predictions are remapped to our canonical intent ordering via config.id2label.

Head-to-head

| Model | Params (M) | Disk (MB) | Intent Acc | Macro-F1 | Throughput (utt/s, CPU) | Dutch MASSIVE in training? |
|---|---|---|---|---|---|---|
| LokaalHub/nl-router-klein (ours) | 46 | 178 | 0.856 | 0.815 | 178 | Yes (nl-NL per-language FT) |
| qanastek/XLMRoberta-Alexa-Intents-Classification | 278 | 2143 | 0.881 | 0.831 | 33 | Yes (full MASSIVE 51-lang FT) |
| cartesinus/xlm-r-base-amazon-massive-intent | 278 | 1077 | 0.806 | 0.687 | 32 | No (en-US only) |

Pinned revisions: see competitor_revisions.json at project root.

vs. published reference numbers (apples-to-apples with the MASSIVE paper)

The original MASSIVE paper (FitzGerald et al. 2022, arXiv:2204.08582, per-language results table) reports the following Dutch numbers on the test split under per-language fine-tuning, averaged over 3 seeds:

| Model | Params | Training setup | nl-NL intent accuracy |
|---|---|---|---|
| mT5-base | 580M | per-language FT | 0.872 ± 0.012 |
| XLM-R-base | 278M | per-language FT | 0.868 ± 0.012 |
| nl-router-klein (this repo) | 46M | per-language FT | 0.856 ± 0.001 |
| XLM-R-base | 278M | zero-shot from en-US | 0.821 ± 0.014 |

nl-router-klein sits within 1σ of XLM-R-base's published variance band at 17% of its parameter count. Our 3-seed standard deviation (±0.0014) is roughly 8× tighter than the paper's (±0.012) — a function of the smaller, more stable base model and deterministic training on fixed data. We clearly beat the 278M zero-shot baseline by 3.5 points, and we lose to per-language XLM-R-base by only 1.2 points (within their variance band) while being 6× smaller and 6× faster on CPU.

Training

Data

| Source | License | Role | Size | Weight |
|---|---|---|---|---|
| AmazonScience/massive (config nl-NL) | CC-BY-4.0 | Only training source | 11,514 train / 2,033 val / 2,974 test | 100% |

No data augmentation via translation — mixing in translated data from other languages would break apples-to-apples comparability with the FitzGerald 2022 baseline. The dataset is already human-annotated Dutch.
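
Loading the training source with the datasets library (a sketch; recent datasets versions may require trust_remote_code=True for this script-based dataset):

```python
from datasets import load_dataset

# MASSIVE ships one config per locale; nl-NL is the only one used here.
massive_nl = load_dataset("AmazonScience/massive", "nl-NL")
print({split: len(massive_nl[split]) for split in massive_nl})
# {'train': 11514, 'validation': 2033, 'test': 2974}

example = massive_nl["train"][0]
print(example["utt"], example["intent"])  # raw utterance + integer intent id
```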

Method

60-way single-label classification fine-tune on RobBERTje-BORT. Per-language fine-tuning on MASSIVE nl-NL train only (no translation augmentation), AdamW (LR 3e-5, weight decay 0.01), cosine schedule with 10% warmup, 12 epochs, batch size 32, fp32. Training capped at 20 minutes wall-clock on an M1 Pro.

Inverse-frequency class weighting in the cross-entropy loss to recover the long-tail intents (some MASSIVE intents have <20 training examples in the nl-NL split). Without this, 5 intents collapsed to F1=0 in early experiments; with it, the minimum per-class F1 sits at 0.43 across statistically meaningful test classes.
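
A sketch of one standard way to build such weights (sklearn-style "balanced" normalisation; the exact normalisation used in training is not spelled out here, and `train_labels` is assumed to be the list of integer intent ids from the nl-NL train split):

```python
from collections import Counter
import torch

NUM_INTENTS = 60
counts = Counter(train_labels)                      # assumes every intent occurs at least once in train
freq = torch.tensor([counts[i] for i in range(NUM_INTENTS)], dtype=torch.float)
class_weights = freq.sum() / (NUM_INTENTS * freq)   # weight ∝ 1 / frequency
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
```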

R-Drop regularisation (α=1.0) — paired forward passes with a symmetric KL consistency loss. Doubles training cost but adds ~0.5 points of accuracy on this small-data task. Ablated explicitly: R-Drop alone won the sweep by +0.005 over baseline; combined with extra epochs it over-regularised.
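
A sketch of the R-Drop objective as described above (two dropout-perturbed forward passes, symmetric KL between them, α = 1.0; `loss_fn` is the weighted cross-entropy from the previous sketch):

```python
import torch.nn.functional as F

def rdrop_loss(model, inputs, labels, loss_fn, alpha=1.0):
    # Two forward passes in train mode give two different dropout masks.
    logits1 = model(**inputs).logits
    logits2 = model(**inputs).logits
    ce = 0.5 * (loss_fn(logits1, labels) + loss_fn(logits2, labels))
    # Symmetric KL between the two predictive distributions.
    logp, logq = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (F.kl_div(logp, logq, reduction="batchmean", log_target=True)
                + F.kl_div(logq, logp, reduction="batchmean", log_target=True))
    return ce + alpha * kl
```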

Default HuggingFace RobertaClassificationHead (1 hidden layer + tanh). A deeper 2-hidden-layer GELU head was ablated and overfit on 11.5k examples — keep the default.

Preprocessing rule (critical). We feed the raw utt field to the tokenizer — no lowercasing, no punctuation stripping, no Unicode normalisation. This matches the FitzGerald 2022 protocol ("default spacing provided by annotators") and is what makes the benchmark numbers directly comparable to the published baselines. A silent lowercasing layer would shift the numbers ~0.5–1.5 points in either direction and break the apples-to-apples claim.

Three-seed reproduction: the shipped numbers are the mean across seeds 42, 1, and 7 — matching the MASSIVE paper's own variance convention.

Scope of the comparison

Our comparison with FitzGerald 2022's XLM-R-base and mT5-base numbers is same-evaluation parity, not training reproduction. We do not replicate the paper's exact training recipe (their LR, batch size, epoch count, etc.); their result uses XLM-R's own tuning, ours uses RobBERTje's own tuning. What is held invariant across all rows in the table:

  1. The test set (MASSIVE nl-NL test, 2,974 utterances).
  2. The metric (top-1 intent accuracy).
  3. The preprocessing (raw utt, no normalisation).
  4. The label semantics (matched via config.id2label name-resolution, not by index).
  5. The reported variance structure (mean over 3 seeds).

The claim being made is therefore a capability claim — that a 46M Dutch-specialised model can approach a 278M multilingual model on this benchmark — not that we reproduced the paper's exact pipeline.

Reproducibility (evaluation protocol)

To reproduce the numbers above, follow these rules exactly (enforced in benchmark.py):

  1. Test set: AmazonScience/massive config nl-NL, split test. No filtering.
  2. Input: raw utt field. Do not lowercase, strip punctuation, NFC-normalise, or whitespace-collapse.
  3. Per-model tokenizer: AutoTokenizer.from_pretrained(id, revision=<sha>). Never share tokenizers across models.
  4. Tokenisation: tokenizer(utt, truncation=True, max_length=64, padding='longest').
  5. Precision: fp32 for all models.
  6. Model loading: AutoModelForSequenceClassification + raw logits + argmax. Never pipeline("text-classification").
  7. Device: CPU for both accuracy and throughput (fair across model sizes).
  8. Label remap: for every competitor, build src_id → canonical_id from config.id2label by matching intent names — do NOT hardcode indices. Fail loudly on unknown labels (see the sketch after this list).
  9. Deterministic order: sort test examples by MASSIVE id before batching.
  10. Revision pinning: every competitor is loaded with an explicit revision=<sha> argument. SHAs committed in competitor_revisions.json; benchmark fails loudly if HF serves a different SHA.
  11. Variance: ship numbers are the mean over 3 seeds, matching FitzGerald 2022's convention.
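
A sketch of rules 3, 8, and 10 together (`load_competitor` and `canonical` are illustrative names; the canonical dict maps the 60 intent names to our ids, and `sha` comes from competitor_revisions.json):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def load_competitor(model_id: str, sha: str, canonical: dict):
    # Rule 10: pin the exact revision. Rule 3: a per-model tokenizer.
    tok = AutoTokenizer.from_pretrained(model_id, revision=sha)
    mdl = AutoModelForSequenceClassification.from_pretrained(model_id, revision=sha)
    # Rule 8: remap by intent *name*, never by index; fail loudly on unknowns.
    remap = {}
    for src_id, name in mdl.config.id2label.items():
        if name not in canonical:
            raise ValueError(f"unknown intent label {name!r} in {model_id}")
        remap[int(src_id)] = canonical[name]
    return tok, mdl, remap
```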

Limitations

Important — please read before using. The 0.856 MASSIVE benchmark reflects the model's behaviour on the distribution it was trained on — the 60 MASSIVE intents in voice-assistant Dutch. It does not mean the model correctly classifies arbitrary Dutch utterances at this rate.

What the model handles well

  • Voice-assistant-style commands — alarms, calendar, weather, music, IoT, lists, news, email, transport queries that match the MASSIVE schema.
  • Short utterances (≤20 tokens) — MASSIVE is predominantly short user commands, and the model is trained on those distributions.
  • Standard Netherlands Dutch (nl-NL) — the training set is nl-NL; Flemish (BE-NL), Afrikaans, and Frisian are out of distribution.

What the model does poorly

These are reproducible failure modes:

  • Out-of-distribution intents. Any query that doesn't match one of the 60 MASSIVE intents — e.g., philosophical questions, long-form requests, therapy-style conversation, compound multi-step asks — will be confidently routed to the nearest-looking MASSIVE intent. Use a confidence threshold (e.g., max(softmax) < 0.6 → escalate to LLM) to catch these. The ECE of 0.033 means the confidences are reasonably well-calibrated.
  • Multi-intent utterances. MASSIVE is single-label. "Zet de wekker voor 7 uur en stuur een mail naar Jan" ("set the alarm for 7 and send an email to Jan") contains two intents; the model will return exactly one. Multi-intent decomposition is a separate upstream step.
  • Slot / argument extraction. This model only predicts the intent. It does not extract slots (recipient, time, quantity, etc.). Pair with a separate slot-filler, use the LLM for slot extraction after routing, or use regex/NER for structured arguments.
  • Long-form queries. Max input length is 64 tokens. Longer utterances are truncated — which will hurt accuracy for discursive inputs.
  • Code-switched Dutch-English. "Kun je mijn calendar checken" ("can you check my calendar") is common in Dutch tech speech but only lightly represented in MASSIVE. Expect degraded accuracy on heavily code-switched inputs.
  • Long-tail intents. Some MASSIVE intents have <50 training examples. The minimum per-class F1 on test is 0.43; the macro-F1 of 0.813 captures this. Don't assume all 60 intents are equally reliable.
  • Regional Dutch variants. Trained on nl-NL. BE-NL phrasing (Flemish idioms, different greeting conventions, different vocabulary for common concepts) is out of distribution.

Known systemic limitations

  • Voice-assistant taxonomy, not general agent taxonomy. The 60 MASSIVE intents were designed for smart-speaker use cases (Alexa-era). They do NOT include many modern LLM-agent tool categories — code execution, file operations, web search, document Q&A over a custom corpus, etc. If your app's tool taxonomy looks different, either (a) map your tools onto MASSIVE intents as a coarse router, or (b) use an embeddings-based semantic router over your own tool descriptions instead.
  • Closed-set classification. There is no "other" / "unknown" class. Every prediction is one of the 60 MASSIVE intents; OOD detection must be done by the caller via confidence thresholding.
  • Single language. Dutch only. Cross-language routing in a multilingual app needs a language-ID step upstream.
  • Calibration is indicative, not guaranteed. ECE = 0.033 is measured on MASSIVE test (a computation sketch follows this list). On out-of-distribution inputs, calibration degrades in hard-to-predict ways.
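
For reference, ECE with 10 equal-width confidence bins (a sketch; the exact binning used for the reported 0.033 is not specified here, and `confs` / `correct` are assumed to be numpy arrays of max-softmax confidences and 0/1 correctness over the test set):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confs > lo) & (confs <= hi)
        if in_bin.any():
            # |mean accuracy - mean confidence| in the bin, weighted by bin mass
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confs[in_bin].mean())
    return ece
```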

When to pick klein vs. something else

klein is the fast, small Dutch-specialised tier — 46M parameters, 178.0 MB, ~179 utt/sec on a single CPU core. It's the right choice when:

  • You need routing for a Dutch-first app where utterances look like MASSIVE's voice-assistant schema.
  • You need millisecond latency and offline operation.
  • You have a downstream LLM escalation path for low-confidence / OOD queries.

Pick something else when:

  • Your tool taxonomy doesn't map to MASSIVE's 60 intents → use an embeddings-based semantic router over your tool docs.
  • You need slot filling → add a separate slot extractor or use an LLM for full NLU.
  • You need multi-intent or discourse-level understanding → use an LLM.
  • You need multilingual routing → use qanastek/XLMRoberta-Alexa-Intents-Classification (278M, multilingual) instead.

Ethical considerations

  • Outputs are probabilities over 60 intents; no claims about the world.
  • For production routing with high-stakes downstream actions (financial transactions, medical triage, legal workflows), combine with human-in-the-loop review and confidence gating — do not route based on this model alone.
  • Runs fully locally; no data leaves the device at inference time.

Citation

```bibtex
@misc{nl_router_klein_2026,
  title  = {nl-router-klein: A Tiny Dutch Intent Classifier for Local LLM Agents},
  author = {LokaalHub},
  year   = {2026},
  url    = {https://huggingface.co/LokaalHub/nl-router-klein}
}
```

Please also cite the base model and the training data:

```bibtex
@inproceedings{fitzgerald2023massive,
  title     = {{MASSIVE}: A 1{M}-example multilingual natural language understanding dataset
               with 51 typologically-diverse languages},
  author    = {FitzGerald, Jack and others},
  booktitle = {Proceedings of ACL 2023},
  year      = {2023},
  url       = {https://arxiv.org/abs/2204.08582}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of EMNLP 2020},
  year      = {2020}
}

@inproceedings{delobelle2021robbertje,
  title     = {{RobBERTje}: a Distilled {Dutch} {BERT} Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Proceedings of CLIN 31},
  year      = {2021}
}
```

Built in the Netherlands. Trained 2026-04-21.
