nl-router-klein
Small Dutch intent classifier — 46M parameters, 60-way MASSIVE intent routing for local-first LLM workflows.
nl-router-klein ("klein" = small in Dutch) is a single-label classifier that reads a short Dutch utterance and predicts one of the 60 canonical MASSIVE intents. It's the fast, deterministic "decide what the user wants" step that sits before an LLM call — routing cheap intents to local tools, picking the right RAG index, selecting the right agent skill, or escalating ambiguous utterances to a full LLM.
Before you use this model, read the Limitations section carefully. The 0.856 MASSIVE benchmark reflects performance on MASSIVE-style voice-assistant utterances in the 60-intent schema; real-world queries that don't fit those 60 intents will be mis-routed unless you use a confidence threshold (see Confidence-based routing below).
It is the routing tier of the LokaalHub Dutch LLM-workflow family.
At a glance
| | |
|---|---|
| Base model | DTAI-KULeuven/robbertje-1-gb-bort (RobBERTje-BORT, distilled from RobBERT) |
| Parameters | 46M |
| Disk size | 178.0 MB (fp32) |
| Language | Dutch (nl) |
| Task | 60-way intent classification (MASSIVE schema) |
| License | Apache-2.0 |
| Training data | MASSIVE nl-NL (CC-BY-4.0) |
| MASSIVE nl-NL test intent acc | 0.856 (3-seed mean) |
| Macro-F1 | 0.813 (3-seed mean) |
| Scenario accuracy (18-way) | 0.900 |
| ECE (calibration) | 0.033 |
| CPU throughput | ~179 utterances/sec (single M1 P-core, fp32, batch 64, torch.set_num_threads(1)) |
What to use this for
Any Dutch LLM workflow where a tiny, fast, deterministic "decide what to do" step makes the whole pipeline better, cheaper, and more auditable.
Primary use cases
- **Local-vs-cloud gating.** Route cheap intents (`alarm_set`, `weather_query`, `calendar_query`) to hardcoded local tools; escalate open-ended intents (`qa_factoid`, `general_quirky`) to a full local or cloud LLM. A single inference on this 46M model takes ~2 ms on CPU vs. 500–5000 ms for even a local 7B LLM.
- **Tool / skill selection in agent frameworks.** Drop-in replacement for LLM-based function-calling when the tool taxonomy matches MASSIVE's intents. Deterministic, auditable, and ~100× faster than a full LLM round-trip.
- **RAG index selection.** Map intent → knowledge source: `cooking_recipe` → cookbook index, `news_query` → news index, `qa_factoid` → encyclopedia index, etc.
- **Confidence-based LLM escalation.** Use `max(softmax) < threshold` to fall back to an LLM on out-of-distribution queries. Calibration (ECE = 0.033) is tracked so the threshold is meaningful.
- **Deterministic routing logs.** Unlike LLM-based routing, every decision is reproducible and auditable. Critical for governance-sensitive Dutch deployments (healthcare, legal, HR, government).
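The local-vs-cloud gating pattern can be sketched as a small dispatch table. This is a minimal illustration, not part of this repo: the handler functions and the 0.60 threshold are assumptions you would replace with your own tools and tuning.

```python
# Minimal local-vs-cloud gating sketch. Handler names and the 0.60
# escalation threshold are illustrative assumptions, not part of this repo.
LOCAL_TOOLS = {
    "alarm_set": lambda utt: f"[local] set alarm from: {utt}",
    "weather_query": lambda utt: f"[local] weather lookup: {utt}",
}

def dispatch(intent: str, confidence: float, utterance: str, threshold: float = 0.60):
    """Route high-confidence local intents to cheap tools; escalate everything else."""
    if confidence < threshold or intent not in LOCAL_TOOLS:
        return {"action": "escalate_to_llm", "utterance": utterance}
    return {"action": "local", "result": LOCAL_TOOLS[intent](utterance)}
```

Unknown intents and low-confidence predictions both fall through to the LLM, so the classifier never has to be right about queries outside its taxonomy.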
What makes it suitable for these
- Local and private. Runs fully on-device. No data leaves the laptop / VM / cluster. Critical for GDPR-sensitive Dutch workloads.
- Fast. ~179 utt/sec on a single M1 CPU core. You can route every incoming user message without perceivable latency.
- Small. 178.0 MB fits alongside an agent runtime, a local vector DB, and a lightweight generation model.
- Deterministic. Same input → same output → auditable routing logs.
- Cheap. Zero marginal cost per inference.
Read the Limitations section before committing — this is the klein (fast) tier, trained only on the 60 MASSIVE intents. Anything outside that taxonomy will be mis-routed unless you gate with a confidence threshold.
Quick start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("LokaalHub/nl-router-klein")
model = AutoModelForSequenceClassification.from_pretrained("LokaalHub/nl-router-klein")
model.eval()

# "Set the alarm for seven o'clock tomorrow morning"
utterance = "Zet de wekker op zeven uur morgenochtend"
inputs = tokenizer(utterance, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(-1).squeeze()
intent_id = int(probs.argmax())
confidence = float(probs[intent_id])
intent_name = model.config.id2label[intent_id]
print(f"intent: {intent_name} confidence: {confidence:.3f}")
# intent: alarm_set confidence: 0.97
```
Confidence-based routing
```python
CONFIDENCE_THRESHOLD = 0.60  # below this, escalate to LLM

def route(utterance: str):
    inputs = tokenizer(utterance, truncation=True, max_length=64, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1).squeeze()
    intent_id = int(probs.argmax())
    confidence = float(probs[intent_id])
    if confidence < CONFIDENCE_THRESHOLD:
        return {"action": "escalate_to_llm", "reason": "low_confidence", "confidence": confidence}
    return {"action": "route", "intent": model.config.id2label[intent_id], "confidence": confidence}
```
Benchmark — MASSIVE nl-NL test split (~2974 utterances)
All models below were evaluated under an identical preprocessing protocol: raw utt field fed to the tokenizer with no lowercasing, punctuation stripping, or NFC normalisation (per FitzGerald et al. 2022's "default spacing provided by annotators" rule), fp32, batch 64, max_length=64, CPU. Every competitor's predictions are remapped to our canonical intent ordering via config.id2label.
Head-to-head
| Model | Params (M) | Disk (MB) | Intent Acc | Macro-F1 | Throughput (utt/s, CPU) | Dutch MASSIVE in training? |
|---|---|---|---|---|---|---|
| LokaalHub/nl-router-klein (ours) | 46 | 178 | 0.856 | 0.815 | 178 | Yes (nl-NL per-language FT) |
| qanastek/XLMRoberta-Alexa-Intents-Classification | 278 | 2143 | 0.881 | 0.831 | 33 | Yes (full MASSIVE 51-lang FT) |
| cartesinus/xlm-r-base-amazon-massive-intent | 278 | 1077 | 0.806 | 0.687 | 32 | No (en-US only) |
Pinned revisions: see competitor_revisions.json at project root.
vs. published reference numbers (apples-to-apples with the MASSIVE paper)
The original MASSIVE paper (FitzGerald et al. 2022, arXiv:2204.08582, Table of per-language results) reports the following Dutch numbers under per-language fine-tuning on the test split, averaged over 3 seeds:
| Model | Params | Training setup | nl-NL intent accuracy |
|---|---|---|---|
| mT5-base | 580M | per-language FT | 0.872 ± 0.012 |
| XLM-R-base | 278M | per-language FT | 0.868 ± 0.012 |
| nl-router-klein (this repo) | 46M | per-language FT | 0.856 ± 0.001 |
| XLM-R-base | 278M | zero-shot from en-US | 0.821 ± 0.014 |
nl-router-klein sits within 1σ of XLM-R-base's published variance band at 17% of its parameters. Our 3-seed standard deviation (±0.0014) is roughly 8× tighter than the paper's (±0.012) — a function of the smaller, more stable base model and deterministic training on fixed data. We clearly beat the 278M zero-shot baseline by 3.5 points, and we lose to per-language XLM-R-base by only 1.2 points (within their variance) while being 6× smaller and 6× faster on CPU.
Training
Data
| Source | License | Role | Size | Weight |
|---|---|---|---|---|
| `AmazonScience/massive` (config `nl-NL`) | CC-BY-4.0 | only training source | 11,514 train / 2,033 val / 2,974 test | 100% |
No data augmentation via translation — mixing in translated data from other languages would break apples-to-apples comparability with the FitzGerald 2022 baseline. The dataset is already human-annotated Dutch.
Method
60-way single-label classification fine-tune on RobBERTje-BORT. Per-language fine-tuning on MASSIVE nl-NL train only (no translation augmentation), AdamW (LR 3e-5, weight decay 0.01), cosine schedule with 10% warmup, 12 epochs, batch size 32, fp32. Training capped at 20 minutes wall-clock on an M1 Pro.
Inverse-frequency class weighting in the cross-entropy loss to recover the long-tail intents (some MASSIVE intents have <20 training examples in the nl-NL split). Without this, 5 intents collapsed to F1=0 in early experiments; with it, the minimum per-class F1 sits at 0.43 across statistically meaningful test classes.
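The inverse-frequency weighting can be sketched as follows. This is a minimal illustration (the helper name and the normalise-to-mean-1 choice are ours, not necessarily the exact training code):

```python
import torch
from collections import Counter

def inverse_frequency_weights(labels: list[int], num_classes: int) -> torch.Tensor:
    """Weight each class by 1/frequency, normalised so weights average to 1.
    Unseen classes default to count 1 to avoid division by zero."""
    counts = Counter(labels)
    freq = torch.tensor([counts.get(c, 1) for c in range(num_classes)], dtype=torch.float)
    weights = 1.0 / freq
    return weights * num_classes / weights.sum()

# Usage: pass to the loss so rare intents contribute more per example, e.g.
# loss_fn = torch.nn.CrossEntropyLoss(weight=inverse_frequency_weights(train_labels, 60))
```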
R-Drop regularisation (α=1.0) — paired forward passes with a symmetric KL consistency loss. Doubles training cost but adds ~0.5 points of accuracy on this small-data task. Ablated explicitly: R-Drop alone won the sweep by +0.005 over baseline; combined with extra epochs it over-regularised.
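The R-Drop objective described above can be sketched like this: two forward passes through the same model (different dropout masks) produce two logit tensors, and a symmetric KL term penalises their disagreement. Function name and reduction choices are illustrative, not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def rdrop_loss(logits1: torch.Tensor, logits2: torch.Tensor,
               labels: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Cross-entropy on both dropout passes plus a symmetric KL consistency term."""
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    # F.kl_div(log_q, p) computes KL(p || q); sum both directions for symmetry.
    kl = (F.kl_div(F.log_softmax(logits1, -1), F.softmax(logits2, -1), reduction="batchmean")
          + F.kl_div(F.log_softmax(logits2, -1), F.softmax(logits1, -1), reduction="batchmean"))
    return ce + alpha * kl / 2
```

When the two passes agree exactly, the KL term vanishes and the loss reduces to plain (doubled) cross-entropy.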
Default HuggingFace RobertaClassificationHead (1 hidden layer + tanh). A deeper 2-hidden-layer GELU head was ablated and overfit on 11.5k examples — keep the default.
Preprocessing rule (critical). We feed the raw utt field to the tokenizer — no lowercasing, no punctuation stripping, no Unicode normalisation. This matches the FitzGerald 2022 protocol ("default spacing provided by annotators") and is what makes the benchmark numbers directly comparable to the published baselines. A silent lowercasing layer would shift the numbers ~0.5–1.5 points in either direction and break the apples-to-apples claim.
Three-seed reproduction: ship run is reported as the mean across seeds 42, 1, 7 — matching the MASSIVE paper's own variance convention.
Scope of the comparison
Our comparison with FitzGerald 2022's XLM-R-base and mT5-base numbers is same-evaluation parity, not training reproduction. We do not replicate the paper's exact training recipe (their LR, batch size, epoch count, etc.); their result uses XLM-R's own tuning, ours uses RobBERTje's own tuning. What is held invariant across all rows in the table:
- The test set (MASSIVE nl-NL test, 2974 utterances).
- The metric (top-1 intent accuracy).
- The preprocessing (raw `utt`, no normalisation).
- The label semantics (matched via `config.id2label` name-resolution, not by index).
- The reported variance structure (mean over 3 seeds).
The claim being made is therefore a capability claim — that a 46M Dutch-specialised model can approach a 278M multilingual model on this benchmark — not that we reproduced the paper's exact pipeline.
Reproducibility (evaluation protocol)
To reproduce the numbers above, follow these rules exactly (enforced in benchmark.py):
- Test set: `AmazonScience/massive` config `nl-NL`, split `test`. No filtering.
- Input: raw `utt` field. Do not lowercase, strip punctuation, NFC-normalise, or whitespace-collapse.
- Per-model tokenizer: `AutoTokenizer.from_pretrained(id, revision=<sha>)`. Never share tokenizers across models.
- Tokenisation: `tokenizer(utt, truncation=True, max_length=64, padding='longest')`.
- Precision: fp32 for all models.
- Model loading: `AutoModelForSequenceClassification` + raw logits + argmax. Never `pipeline("text-classification")`.
- Device: CPU for both accuracy and throughput (fair across model sizes).
- Label remap: for every competitor, build `src_id → canonical_id` from `config.id2label` by matching intent names — do NOT hardcode indices. Fail loudly on unknown labels.
- Deterministic order: sort test examples by MASSIVE `id` before batching.
- Revision pinning: every competitor is loaded with an explicit `revision=<sha>` argument. SHAs committed in `competitor_revisions.json`; benchmark fails loudly if HF serves a different SHA.
- Variance: ship numbers are the mean over 3 seeds, matching FitzGerald 2022's convention.
Limitations
Important — please read before using. The 0.856 MASSIVE benchmark reflects the model's behaviour on the distribution it was trained on — the 60 MASSIVE intents in voice-assistant Dutch. It does not mean the model correctly classifies arbitrary Dutch utterances at this rate.
What the model handles well
- Voice-assistant-style commands — alarms, calendar, weather, music, IoT, lists, news, email, transport queries that match the MASSIVE schema.
- Short utterances (≤20 tokens) — MASSIVE is predominantly short user commands, and the model is trained on those distributions.
- Standard Netherlands Dutch (nl-NL) — the training set is nl-NL; Flemish (BE-NL), Afrikaans, and Frisian are out of distribution.
What the model does poorly
These are reproducible failure modes:
- Out-of-distribution intents. Any query that doesn't match one of the 60 MASSIVE intents — e.g., philosophical questions, long-form requests, therapy-style conversation, compound multi-step asks — will be confidently routed to the nearest-looking MASSIVE intent. Use a confidence threshold (e.g., `max(softmax) < 0.6` → escalate to LLM) to catch these. The ECE of 0.033 means the confidences are reasonably well-calibrated.
- Multi-intent utterances. MASSIVE is single-label. "Zet de wekker voor 7 uur en stuur een mail naar Jan" ("Set the alarm for 7 o'clock and send an email to Jan") contains two intents; the model will return exactly one. Multi-intent decomposition is a separate upstream step.
- Slot / argument extraction. This model only predicts the intent. It does not extract slots (recipient, time, quantity, etc.). Pair with a separate slot-filler, use the LLM for slot extraction after routing, or use regex/NER for structured arguments.
- Long-form queries. Max input length is 64 tokens. Longer utterances are truncated — which will hurt accuracy for discursive inputs.
- Code-switched Dutch-English. "Kun je mijn calendar checken" is common in Dutch tech speech but only lightly represented in MASSIVE. Expect degraded accuracy on heavily code-switched inputs.
- Long-tail intents. Some MASSIVE intents have <50 training examples. The minimum per-class F1 on test is 0.43; the macro-F1 of 0.813 captures this. Don't assume all 60 intents are equally reliable.
- Regional Dutch variants. Trained on nl-NL. BE-NL phrasing (Flemish idioms, different greeting conventions, different vocabulary for common concepts) is out of distribution.
Known systemic limitations
- Voice-assistant taxonomy, not general agent taxonomy. The 60 MASSIVE intents were designed for smart-speaker use cases (Alexa-era). They do NOT include many modern LLM-agent tool categories — code execution, file operations, web search, document Q&A over a custom corpus, etc. If your app's tool taxonomy looks different, either (a) map your tools onto MASSIVE intents as a coarse router, or (b) use an embeddings-based semantic router over your own tool descriptions instead.
- Closed-set classification. There is no "other" / "unknown" class. Every prediction is one of the 60 MASSIVE intents; OOD detection must be done by the caller via confidence thresholding.
- Single language. Dutch only. Cross-language routing in a multilingual app needs a language-ID step upstream.
- Calibration is indicative, not guaranteed. ECE = 0.033 is measured on MASSIVE test. On out-of-distribution inputs, calibration degrades in hard-to-predict ways.
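If you want to sanity-check calibration on your own traffic, a standard equal-width-binned ECE estimate looks like this. This is a sketch; the 10-bin choice is an assumption and not necessarily what the repo's evaluation code uses:

```python
def expected_calibration_error(confidences: list, correct: list, n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, then take the weighted mean of
    |bin accuracy - bin mean confidence| across non-empty bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece
```

A rising ECE on production logs relative to the 0.033 measured on MASSIVE test is a signal that your traffic has drifted away from the training distribution and the escalation threshold should be re-tuned.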
When to pick klein vs. something else
klein is the fast, single Dutch-specialised tier — 46M parameters, 178.0 MB, 179 utt/sec on a single CPU core. It's the right choice when:
- You need routing for a Dutch-first app where utterances look like MASSIVE's voice-assistant schema.
- You need millisecond latency and offline operation.
- You have a downstream LLM escalation path for low-confidence / OOD queries.
Pick something else when:
- Your tool taxonomy doesn't map to MASSIVE's 60 intents → use an embeddings-based semantic router over your tool docs.
- You need slot filling → add a separate slot extractor or use an LLM for full NLU.
- You need multi-intent or discourse-level understanding → use an LLM.
- You need multilingual routing → use `qanastek/XLMRoberta-Alexa-Intents-Classification` (278M, multilingual) instead.
Ethical considerations
- Outputs are probabilities over 60 intents; no claims about the world.
- For production routing with high-stakes downstream actions (financial transactions, medical triage, legal workflows), combine with human-in-the-loop review and confidence gating — do not route based on this model alone.
- Runs fully locally; no data leaves the device at inference time.
Related models
- `LokaalHub/nl-nli-klein` — sibling Dutch NLI model (groundedness / zero-shot classifier / summary faithfulness). Combine with this router for a full Dutch agentic-NLU pipeline.
- `LokaalHub/nl-lokaal-klein` and `nl-lokaal-middel` — Dutch PII NER family.
Citation
```bibtex
@misc{nl_router_klein_2026,
  title  = {nl-router-klein: A Tiny Dutch Intent Classifier for Local LLM Agents},
  author = {LokaalHub},
  year   = {2026},
  url    = {https://huggingface.co/LokaalHub/nl-router-klein}
}
```
Please also cite the base model and the training data:
```bibtex
@inproceedings{fitzgerald2023massive,
  title     = {{MASSIVE}: A 1{M}-example multilingual natural language understanding dataset
               with 51 typologically-diverse languages},
  author    = {FitzGerald, Jack and others},
  booktitle = {Proceedings of ACL 2023},
  year      = {2023},
  url       = {https://arxiv.org/abs/2204.08582}
}

@inproceedings{delobelle2020robbert,
  title     = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Findings of EMNLP 2020},
  year      = {2020}
}

@inproceedings{delobelle2021robbertje,
  title     = {{RobBERTje}: a Distilled {Dutch} {BERT} Model},
  author    = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
  booktitle = {Proceedings of CLIN 31},
  year      = {2021}
}
```
Built in the Netherlands. Trained 2026-04-21.